
Monocular 3D Object Detection for Autonomous Driving
Xiaozhi Chen1, Kaustav Kundu2, Ziyu Zhang2, Huimin Ma1, Sanja Fidler2, Raquel Urtasun2
1 Department of Electronic Engineering, Tsinghua University
2 Department of Computer Science, University of Toronto
{chenxz12@mails., mhmpub@}tsinghua.edu.cn, {kkundu, zzhang, fidler, urtasun}@cs.toronto.edu
Abstract
The goal of this paper is to perform 3D object detection from a single monocular image in the domain of autonomous driving. Our method first generates a set of candidate class-specific object proposals, which are then run through a standard CNN pipeline to obtain high-quality object detections. The focus of this paper is on proposal generation. In particular, we propose an energy minimization approach that places object candidates in 3D using the fact that objects should be on the ground-plane. We then score each candidate box projected to the image plane via several intuitive potentials encoding semantic segmentation, contextual information, size and location priors, and typical object shape. Our experimental evaluation demonstrates that our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best detection performance on the challenging KITTI benchmark among published monocular competitors.
1. Introduction
In recent years, autonomous driving has been a focus of
attention of both industry as well as the research community. Most initial efforts rely on expensive LIDAR systems,
such as the Velodyne, and hand-annotated maps of the environment. In contrast, recent efforts try to replace the LIDAR with cheap on-board cameras, which are readily available in most modern cars. This is an exciting time for the
vision community, as this application domain provides us
with many interesting challenges.
The focus of this paper is on high-performance 2D and
3D object detection from monocular imagery in the context of autonomous driving. Most of the recent object detection pipelines [19, 20] typically proceed by generating a
diverse set of object proposals that have a high recall and
are relatively fast to compute [45, 2]. By doing this, computationally more intense classifiers such as CNNs [28, 42]
can be devoted to a smaller subset of promising image regions, avoiding computation on a large set of futile candidates. Our paper follows this line of work.
Different types of object proposal methods have been
developed in the past few years. A common approach
is to over-segment the image into superpixels and group
these using several similarity measures [45, 2]. Approaches
that efficiently explore an exhaustive set of windows using
simple “objectness” features [1, 11], or contour information [55] have also been proposed. The most recent line
of work aims to learn how to propose promising object
candidates using either ensembles of binary segmentation
models [27], parametric energies [29] or window classifiers
based on CNN features [18].
These proposal generation approaches have been shown
to be very effective in the context of the PASCAL VOC
challenge, which requires a rather loose notion of localization, i.e., a detection is said to be correct if it overlaps
more than 50% with the ground truth. In the context of autonomous driving, however, a much stricter overlap is
required, in order to provide a more accurate estimate of the
distance from the ego-car to the potential obstacles. As a
consequence, popular approaches, such as R-CNN [20] fall
significantly behind the competitors on autonomous driving benchmarks such as KITTI [16]. The current leader on
KITTI is Chen et al. [10], which exploits stereo imagery to
create accurate 3D proposals. However, most cars are currently equipped with a single camera, and thus monocular
object detection is of crucial importance.
Inspired by this approach, this paper proposes a method
that learns to generate class-specific 3D object proposals
with very high recall by exploiting contextual models as
well as semantics. These proposals are generated by exhaustively placing 3D bounding boxes on the ground-plane
and scoring them via simple and efficiently computable image features. In particular, we use semantic and object
instance segmentation, context, as well as shape features
and location priors to score our boxes. We learn per-class
weights for these features using S-SVM [24], adapting to
each individual object class. The top object candidates are
then scored with a CNN, resulting in the final set of detections.
Figure 1: Overview of our approach: We sample candidate bounding boxes with typical physical sizes in the 3D space by assuming a
prior on the ground-plane. We then project the boxes to the image plane, thus avoiding multi-scale search in the image. We score candidate
boxes by exploiting multiple features: class semantics, instance semantics, contours, object shape, context, and a location prior. A final set of
object proposals is obtained after non-maximum suppression.
Our experiments show that our approach is able to
perform really well on KITTI, outperforming all published
monocular object detectors and being almost on par with
the leader [10], which exploits stereo imagery.
2. Related Work
Our work is related to methods for object proposal generation, as well as monocular 3D object detection. We will
mainly focus our literature review on the domain of autonomous driving.
Significant progress in deep neural nets [28, 42] has
brought increased interest in methods for object proposal
generation, since deep nets are typically computationally demanding, making sliding-window search challenging [20]. Most
of the existing work on proposal generation uses RGB [45,
55, 9, 2, 11, 29], RGB-D [4, 21, 31, 25], or video [35].
In RGB, most methods combine superpixels into larger regions via several similarity functions using e.g. color and
texture [45, 2]. These approaches prune the exhaustive set
of windows down to about 2K proposals per image achieving almost perfect recall on PASCAL VOC [12]. [9] defines
parametric affinities between pixels and finds the regions
using parametric min-cut. The resulting regions are then
scored via simple features, and the top-ranked proposals are
used in recognition tasks [8, 15, 53]. Exhaustively sampled
boxes are scored using several “objectness” features in [1].
BING proposals [11] score boxes based on an object closure
measure as a proxy for “objectness”. Edgeboxes [55] score
an exhaustive set of windows based on contour information
inside and on the boundary of each window.
The most related approaches to ours are recent methods
that aim to learn how to propose objects. [29] learns parametric energies in order to propose multiple diverse regions.
In [27], an ensemble of figure-ground segmentation models
is learnt. Joint learning of the ensemble of local and global
binary CRFs enables the individual predictors to specialize in different ways. [26] learns how to place promising object seeds and employs the geodesic distance transform to obtain candidate regions. In parallel to our work, [18] introduced
a method that generates object proposals by cascading the
layers of the convolutional neural network. The method is
efficient since it explores an exhaustive set of windows via
integral images over the CNN responses. Our approach also
exploits integral images to score the candidates, however,
in our work we exploit domain priors to place 3D bounding
boxes and score them with semantic features. We use pixel-level class scores from the output layer of the grid CNN, as
well as contextual and shape features.
In RGB-D, [10] exploited stereo imagery to exhaustively
score 3D bounding boxes using a conditional random field
with several depth-informed potentials. Our work also evaluates 3D bounding boxes, but uses semantic object and instance segmentation and 3D priors to place proposals on
the ground plane. Our RGB potentials are partly inspired
by [15, 53], which exploit efficiently computed segmentation potentials for 2D object detection.
Our work is also related to detection approaches for autonomous driving. [54] first detects a candidate set of objects via a poselet-like approach and then fits a deformable
wireframe model within the box. [38] extends DPM [13] to
3D by linking parts across different viewpoints, while [14]
extends DPM to reason about deformable 3D cuboids. [34]
uses an ensemble of models derived from visual and geometrical clusters of object instances. Regionlets [32] proposes boxes via Selective Search and re-localizes them using a top-down approach. [46] introduced a holistic model
that re-reasons about DPM object candidates via cartographic priors. The recently proposed 3DVP [47] learns occlusion patterns in order to significantly improve performance
of occluded cars on KITTI.
3. Monocular 3D Object Detection
In this paper, we present an approach that exploits segmentation, context, and location priors to perform accurate 3D object detection. In particular,
we first make use of the ground plane in order to propose
objects that lie close to it. Since our input is a single monocular image, our ground-plane is assumed to be orthogonal to
the image plane and at a distance below the camera, the value of which we assume to be known from calibration. Since this ground plane may not perfectly hold in every image, we do not force objects to lie exactly on the ground, but only encourage them to be close to it. The 3D object candidates are then exhaustively scored in the image plane by utilizing class segmentation, instance-level segmentation, shape, contextual features, and location priors. We refer the reader to Fig. 1 for an illustration. The resulting 3D candidates are then sorted according to their score, and only the most promising ones (after non-maxima suppression) are further scored via a Convolutional Neural Network (CNN). This results in a fast and accurate approach to 3D detection.

Figure 2: CNN architecture adopted from [10], used to score our proposals for object detection and orientation estimation.

Figure 3: AP vs. #proposals on Car for the moderate setting.
3.1. Generating 3D Object Proposals
We represent each object with a 3D bounding box, y =
(x, y, z, θ, c, t), where (x, y, z) is the center of the 3D box,
θ denotes the azimuth angle and c ∈ C is the object class
(Cars, Pedestrians and Cyclists on KITTI). We represent
the size of the bounding box with a set of representative 3D
templates t, which are learnt from the training data. We
use 3 templates per class and two orientations θ ∈ {0, 90}.
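For illustration, a minimal sketch of this parameterization (hypothetical names and illustrative template sizes, not the authors' code):

```python
from dataclasses import dataclass

# One 3D box hypothesis y = (x, y, z, theta, c, t).
@dataclass
class Box3D:
    x: float        # 3D center in camera coordinates (meters)
    y: float
    z: float
    theta: float    # azimuth angle, here restricted to {0, 90} degrees
    cls: str        # object class: "Car", "Pedestrian" or "Cyclist"
    template: int   # index into the learned per-class 3D size templates

# Illustrative (height, width, length) templates; the real ones are learned from training data.
TEMPLATES = {"Car": [(1.5, 1.6, 3.9), (1.4, 1.6, 4.0), (1.7, 1.8, 4.5)]}
```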
We then define our scoring function by combining semantic
cues (both class and instance level segmentation), location
priors, context as well as shape:
E(x, y) = w_{c,\mathrm{sem}}^{\top} \phi_{c,\mathrm{sem}}(x, y) + w_{c,\mathrm{inst}}^{\top} \phi_{c,\mathrm{inst}}(x, y) + w_{c,\mathrm{cont}}^{\top} \phi_{c,\mathrm{cont}}(x, y) + w_{c,\mathrm{loc}}^{\top} \phi_{c,\mathrm{loc}}(x, y) + w_{c,\mathrm{shape}}^{\top} \phi_{c,\mathrm{shape}}(x, y)
We next discuss each of these potentials in more detail.
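Before turning to the individual potentials, here is a minimal sketch of how the class-specific weighted sum is evaluated (hypothetical function and dictionary names; a simplification, not the released implementation):

```python
import numpy as np

def score_candidate(weights_c, potentials):
    """E(x, y) = sum_k w_{c,k}^T phi_{c,k}(x, y) for one candidate box of class c.

    weights_c and potentials are dicts keyed by potential name
    ("sem", "inst", "cont", "loc", "shape"); each entry holds a weight
    vector and a feature vector of matching dimension."""
    return sum(float(np.dot(weights_c[k], potentials[k])) for k in weights_c)

# Example with made-up 1D potentials, just to show the call:
w = {"sem": np.array([2.0]), "loc": np.array([0.5])}
phi = {"sem": np.array([0.8]), "loc": np.array([0.3])}
print(score_candidate(w, phi))  # 1.75
```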
Semantic segmentation: This potential takes as input a
pixelwise semantic segmentation containing multiple semantic classes such as car, pedestrian, cyclist and road. We
incorporate two types of features encoding semantic segmentation. The first feature encourages the presence of an
object inside the bounding box by counting the percentage
of pixels labeled as the relevant class:
\phi_{c,\mathrm{seg}}(x, y) = \frac{\sum_{i \in \Omega(y)} S_c(i)}{|\Omega(y)|},
with Ω(y) the set of pixels in the 2D box generated by projecting the 3D box y to the image plane, and Sc the segmentation mask for class c. The second feature computes
the fraction of pixels that belong to classes other than the
object class:

\phi_{c,\mathrm{non\text{-}seg},c'}(x, y) = \frac{\sum_{i \in \Omega(y)} S_{c'}(i)}{|\Omega(y)|}.
This feature is two dimensional, as one dimension contains
the road and the other aggregates all other classes (except the
class of the proposal). Hence this potential tries to minimize
the fraction of pixels inside the bounding box belonging to
other classes. Note that these features can be computed very
efficiently using as many integral images as classes. In this
paper we use [41, 52, 3] to compute the semantic segmentation features. [41, 52] jointly learn the convolutional features as well as the pairwise Gaussian MRF potentials to
smooth the output labeling. SegNet [3] performs semantic labeling via a fully convolutional encoder-decoder. In
particular, we use the pre-trained model on PASCAL VOC
+ COCO from [52] for Car segmentation. To reduce discrepancies of surrogate classes, we use the pre-trained SegNet model from [3] for Pedestrian and Cyclist segmentation. Note that very few semantic annotations are available
for KITTI and thus we did not fine-tune their models. Additionally, we exploited the annotations in the road benchmark
of KITTI, and fine-tuned the network of [41] for road.
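These integral-image computations can be sketched as follows (a simplified standalone example, not the authors' implementation): one cumulative-sum table per class lets the class-pixel fraction of any axis-aligned box be read off in constant time.

```python
import numpy as np

def integral_image(mask):
    # Zero-padded 2D cumulative sum so that any box sum needs only 4 lookups.
    ii = np.zeros((mask.shape[0] + 1, mask.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(mask, axis=0), axis=1)
    return ii

def box_fraction(ii, x1, y1, x2, y2):
    # Fraction of positive pixels inside the box spanning columns [x1, x2) and rows [y1, y2).
    total = ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]
    area = (x2 - x1) * (y2 - y1)
    return total / max(area, 1)

# Example: phi_seg for class "car" on a toy 4x6 segmentation mask.
seg_car = np.zeros((4, 6)); seg_car[1:3, 2:5] = 1.0
ii_car = integral_image(seg_car)
print(box_fraction(ii_car, 2, 1, 5, 3))  # 1.0: the box exactly covers the car pixels
```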
Shape: This feature captures the shape of the objects.
Specifically, we first compute the contours in the output of
the segmentation (instead of the original image). We then
create two grids for the 2D candidate box, one containing
only a single cell and one that has K × K cells. For each
cell, we count the number of contour pixels inside it. Overall, this gives us a (1 + K × K) feature vector across all
cells. This potential tries to place a bounding box tightly
around the object, encouraging the spatial distribution of
contours within its grid to match the expected shape of a
specific class. These features can be computed very efficiently using an integral image (counting contour pixels).
Instance Segmentation: Similar to [15, 53], we exploit instance level segmentation features, which score the
amount of segment inside the box and outside the box.
However, we simply choose the best segment for each
bounding box based on the IoU overlap, and do not reason
about the segment ID at inference time. This speeds up
computation. This feature helps us to detect objects that are
occluded as they form different instances. Note that these
features can be very efficiently computed using as many
integral images as instances that compose the segmentation. To compute instance-level segmentation we exploit
the approach of [51, 50], which uses a CNN to create both
instance-level pixel labeling as well as ordering in depth.
We re-trained their model so that no overlap (not even in
terms of sequences) exists between our training and validation sets. Note that this approach is only available for Cars.
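A simplified sketch of this selection (here the IoU test uses the mask's tight bounding box, and the returned features are the fractions of the chosen segment falling inside and outside the candidate box; both choices are assumptions made for illustration):

```python
import numpy as np

def best_instance_features(instance_masks, x1, y1, x2, y2):
    """For a 2D candidate box (integer pixel coordinates), pick the instance segment
    with the highest IoU and return (inside_fraction, outside_fraction)."""
    best = (0.0, 0.0, 0.0)  # (iou, inside_fraction, outside_fraction)
    box_area = (x2 - x1) * (y2 - y1)
    for mask in instance_masks:                      # list of binary HxW arrays
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        mx1, my1, mx2, my2 = xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
        iw = max(0, min(x2, mx2) - max(x1, mx1))
        ih = max(0, min(y2, my2) - max(y1, my1))
        inter = iw * ih
        iou = inter / (box_area + (mx2 - mx1) * (my2 - my1) - inter)
        if iou > best[0]:
            inside = mask[y1:y2, x1:x2].sum()
            best = (iou, inside / mask.sum(), (mask.sum() - inside) / mask.sum())
    return best[1:]
```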
Context: This feature encodes the presence of contextual
labels, e.g. cars are on the road, and thus we can see road
below them. We use a rectangle below the 2D projection
of the 3D bounding box as the contextual region. We set
its height to 1/3 of the height of the box, and use the same
width, as in [33]. We then compute the semantic segmentation features in the contextual region. We refer the reader to
Fig. 1 for an illustration.
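The contextual region itself is just a rectangle directly below the projected box; a minimal sketch (pixel coordinates (x1, y1, x2, y2) with y growing downwards):

```python
def context_region(x1, y1, x2, y2):
    """Rectangle directly below the 2D box: same width, height = 1/3 of the box height.
    The semantic segmentation features are then computed inside this region."""
    h = y2 - y1
    return (x1, y2, x2, y2 + h / 3.0)
```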
Location: This feature encodes a location prior of objects, both in the bird's-eye view and in the image plane. We learn the prior using kernel density estimation (KDE) with a fixed standard deviation of 4 m for the 3D prior and 2 pixels for the image domain. The 3D prior is learned using the 3D ground-truth bounding boxes available in [16]. We visualize the prior in Fig. 1.

3.2. 3D Proposal Learning and Inference

We use exhaustive search as inference to create our candidate proposals. This can be done efficiently as all the features can be computed with integral images. In particular, inference takes 1.8 s on a single core, but it can be trivially parallelized to run in real time. We learn the weights of the model with structured SVM [44], using the parallel cutting plane implementation of [40], and employ 3D Intersection-over-Union (IoU) as our task loss.

3.3. CNN Scoring of Top Proposals

In this section, we describe how the top candidates (after non-maxima suppression) are further scored via a CNN. We employ the same network as in [10], which for completeness we briefly describe here. The network is built using the Fast R-CNN [19] implementation. It computes convolutional features from the whole image and splits into two branches after the last convolutional layer, i.e., conv5. One branch encodes features from the proposal regions, while the other is specific to context regions, which are obtained by enlarging the proposal regions by a factor of 1.5, following [53]. Both branches are composed of a RoI pooling layer and two fully-connected layers. RoIs are obtained by projecting the proposals or context regions onto the conv5 feature maps. We obtain the final feature vector by concatenating the output features from the two branches. The network architecture is illustrated in Fig. 2.

We use a multi-task loss to jointly predict category labels, bounding box offsets, and object orientation. For background boxes, only the category label loss is employed. We weight each loss equally, and define the category loss as cross entropy, the orientation loss as a smooth ℓ1 loss, and the bounding box offset loss as a smooth ℓ1 loss over the 4 coordinates that parameterize the 2D bounding box, as in [20].
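A compact NumPy sketch of this loss structure for a single RoI (hypothetical argument names; the actual training is carried out within the Fast R-CNN framework rather than with this code):

```python
import numpy as np

def smooth_l1(pred, target):
    # Standard smooth L1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise, summed over coordinates.
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def multitask_loss(cls_scores, cls_label, box_pred, box_target, ori_pred, ori_target):
    """Equally weighted sum of the three losses described above.
    Cross entropy on the class scores; smooth L1 on the 4 box offsets and on the
    orientation. Background RoIs (assumed here to be cls_label == 0) only incur
    the classification term."""
    p = np.exp(cls_scores - cls_scores.max())
    p /= p.sum()
    loss = -np.log(p[cls_label] + 1e-12)
    if cls_label != 0:  # foreground
        loss += smooth_l1(box_pred, box_target) + smooth_l1(ori_pred, ori_target)
    return loss
```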
3.4. Implementation Details

Sampling Strategy: We discretize the 3D space such that the voxel size is 0.2 m along each dimension. To reduce the search space during inference in our proposal generation model, we place 3D candidate boxes on the ground plane. As we only use one monocular image as input, we cannot estimate an accurate road plane. Instead, since the camera location is known in KITTI, we use a fixed ground plane for all images, with the normal of the plane facing up along the camera's Y axis (assuming that the image plane is orthogonal to the ground plane), and the distance of the camera from the plane set to hcam = 1.65 m. To be robust to ground plane errors (e.g., if the road has a slope), we also sample candidate boxes on additional planes obtained by deviating from the default plane by several offsets. In particular, we fix the normal of the plane and set the height to hcam = 1.65 + δ, with δ ∈ {0, ±σ} for Car and δ ∈ {0, ±σ, ±2σ} for Pedestrian and Cyclist, where σ is the MLE estimate of the standard deviation under a Gaussian model of the distances from the objects to the default ground plane. We use more planes for Pedestrian and Cyclist as small objects are more sensitive to errors. We further reduce the number of sampled boxes by removing boxes inside which all pixels are labeled as road, and those with very low prior probability of 3D location. This results in around 14K candidate boxes per ground plane, template, and image; our sampling strategy reduces this to 28%, thus speeding up inference significantly.
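To make the sampling concrete, the following sketch (with simplifying assumptions: a single size template, a single orientation θ = 0, an exact ground plane, illustrative x/z ranges, and a generic 3×4 camera projection matrix P) enumerates ground-plane locations and projects each 3D box to a 2D candidate window:

```python
import numpy as np

def sample_candidates(P, template=(1.5, 1.6, 3.9), h_cam=1.65, voxel=0.2,
                      x_range=(-20, 20), z_range=(2, 50)):
    """Place 3D boxes of one (h, w, l) template on the plane y = h_cam
    (camera coordinates, Y pointing down) and return their projected 2D boxes."""
    h, w, l = template
    boxes_2d = []
    for x in np.arange(*x_range, voxel):
        for z in np.arange(*z_range, voxel):
            # 8 corners of an axis-aligned 3D box standing on the ground plane.
            dx, dy, dz = w / 2, h, l / 2
            corners = np.array([[x + sx * dx, h_cam - sy * dy, z + sz * dz, 1.0]
                                for sx in (-1, 1) for sy in (0, 1) for sz in (-1, 1)])
            proj = corners @ P.T                  # project with the 3x4 calibration matrix
            proj = proj[:, :2] / proj[:, 2:3]
            x1, y1 = proj.min(axis=0)
            x2, y2 = proj.max(axis=0)
            boxes_2d.append((x1, y1, x2, y2))
    return boxes_2d
```

In the full method, candidates whose prior probability is too low or that fall entirely on road pixels would then be discarded before scoring.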
Network Setup: We use the VGG16 model from [42], trained on ImageNet, to initialize our network, and initialize the two branches with the weights of the fully-connected layers of VGG16. To handle the particularly small objects in KITTI images, we upscale the input image by a factor of 3.5 following [10], which we found to be crucial to achieve very good performance. We employ a single scale for the images during both training and testing. We use a batch size of N = 1 for images and a batch size of R = 128 for proposals. We run SGD with an initial learning rate of 0.001 for 30K iterations and then reduce it to 0.0001 for another 10K iterations.
Figure 4: Proposal recall vs. #candidates for Car, Pedestrian, and Cyclist (rows) in the (a) Easy, (b) Moderate, and (c) Hard regimes (columns), comparing BING, SS, EB, MCG, MCG-D, 3DOP, and our approach. We use an overlap threshold of 0.7 for Car, and 0.5 for Pedestrian and Cyclist. Methods that use depth information are indicated with dashed lines. Note that the comparison to 3DOP [10] and MCG-D [21] is unfair, as we use a monocular image while they use a stereo pair.
4. Experimental Evaluation
We evaluate our approach on the challenging KITTI
dataset [16]. The KITTI object detection benchmark has
three classes: Car, Pedestrian, and Cyclist, with 7,481 training and 7,518 test images. Detection for each class is evaluated in three regimes: easy, moderate, hard, which are
defined according to the occlusion and truncation levels of
objects. We use the train/val split provided by [10] to evaluate the performance of our class-dependent proposals. The
split ensures that images from the same sequence do not exist in both training and validation sets. We then evaluate our
full detection pipeline on the test set of KITTI. We refer the
reader to the supplementary material for many additional
results.
Metrics: We evaluate our class-dependent proposals using best achievable (oracle) recall following [22, 45]. Oracle recall computes the percentage of ground-truth objects covered by proposals with IoU overlap above a certain
threshold. We set the threshold to 70% for Car and 50%
for Pedestrian and Cyclist, following the KITTI setup. We
also report average recall (AR), which has been shown to be
highly correlated with detection performance. We also evaluate the whole pipeline of our 3D object detection model on
KITTI’s two tasks: object detection, and object detection
and orientation estimation. Following the standard KITTI
setup, we use the Average Precision (AP) metric for the
object detection task, and Average Orientation Similarity
(AOS) for the object detection and orientation estimation task.
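For reference, the oracle recall used here can be computed as in the following sketch (2D IoU between axis-aligned boxes; function names are ours):

```python
def iou_2d(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def oracle_recall(gt_boxes, proposals, thresh=0.7):
    """Fraction of ground-truth boxes covered by at least one proposal with IoU >= thresh."""
    covered = sum(any(iou_2d(g, p) >= thresh for p in proposals) for g in gt_boxes)
    return covered / max(len(gt_boxes), 1)
```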
Baselines: We compare our proposal generation method
to several top-performing approaches on the validation set:
3DOP [10], MCG-D [21], MCG [2], Selective Search
(SS) [45], BING [11], and Edge Boxes (EB) [55]. Note
that 3DOP and MCG-D exploit depth information, while the remaining methods, as well as our approach, only use a single RGB image. Note that all of the above approaches except 3DOP are class independent (trained to detect any foreground object), while we use class-specific weights as well as semantic segmentation in our features.
Figure 5: Recall vs IoU using 500 proposals. The number next to the labels indicates the average recall (AR). Note that 3DOP and
MCG-D exploit stereo imagery, while the remaining methods as well as our approach use a single monocular image.
Figure 6: Ablation study of features on Car proposals in the moderate setting. From left to right: average recall (AR) vs. #candidates, recall vs. #candidates at an IoU threshold of 0.7, and recall vs. IoU for 500 proposals. We start from the basic model (Loc), which only uses the location prior feature, and then gradually add the other types of features: class semantics, context, shape, and instance semantics.
Proposal Recall: We evaluate the oracle recall for the
generated proposals on the validation set. Fig. 4 shows recall as a function of the number of proposals. Our approach
achieves significantly higher recall than all baselines when
using fewer than 500 proposals on Car and Pedestrian. In particular, our approach requires only 100 proposals for Car and 300 proposals for Pedestrian to achieve 90% recall in the easy regime. Note that the other 2D methods require orders of magnitude more proposals to reach the same recall. When using 2K proposals, we achieve recall on par with the best 3D approach, 3DOP [10], while being more than 20% higher than the other baselines. Note that the comparison to 3DOP [10] and MCG-D [21] is unfair, as we use a monocular image while they use depth information. We also show recall as a function of the overlap threshold for the top 500 proposals in Fig. 5. Our approach outperforms all baselines except 3DOP (which uses stereo) across all IoU thresholds. Compared with 3DOP, we get lower recall at high IoU thresholds on Pedestrian and Cyclist.
Methods compared (in order): LSVM-MDPM-sv [17, 13], SquaresICF [5], ACF-SC [6], MDPM-un-BB [13], DPM-VOC+VP [38], OC-DPM [37], SubCat [34], DA-DPM [48], R-CNN [23], pAUCEnsT [36], FilteredICF [49], DeepParts [43], CompACT-Deep [7], 3DVP [47], AOG [30], Regionlets [32], Faster R-CNN [39], Ours. Each row below lists Easy/Moderate/Hard AP for the subset of these methods evaluated on that class, in the above order; the last two entries of each row are Faster R-CNN and Ours.

Cars: 68.02/56.48/44.18, 69.11/58.66/45.95, 71.19/62.16/48.43, 74.95/64.71/48.76, 74.94/65.95/53.86, 84.14/75.46/59.71, 87.46/75.77/65.38, 84.80/75.94/60.70, 84.75/76.45/59.70, 86.71/81.84/71.12 (Faster R-CNN), 92.33/88.66/78.96 (Ours)
Pedestrians: 47.74/39.36/35.95, 57.33/44.42/40.08, 51.53/44.49/40.38, 59.48/44.86/40.37, 54.67/42.34/37.95, 56.36/45.51/41.08, 61.61/50.13/44.79, 65.26/54.49/48.60, 67.65/56.75/51.12, 70.49/58.67/52.78, 70.69/58.74/52.71, 73.14/61.15/55.21, 78.86/65.90/61.18 (Faster R-CNN), 80.35/66.68/63.44 (Ours)
Cyclists: 35.04/27.50/26.21, 42.43/31.08/28.23, 51.62/38.03/33.38, 70.41/58.72/51.83, 72.26/63.35/55.90 (Faster R-CNN), 76.04/66.36/58.87 (Ours)

Table 1: Average Precision (AP) (in %) on the test set of the KITTI Object Detection Benchmark.
Cars (AOS, Easy / Moderate / Hard):
AOG [30]               33.79 / 30.77 / 24.75
LSVM-MDPM-sv [17, 13]  67.27 / 55.77 / 43.59
DPM-VOC+VP [38]        72.28 / 61.84 / 46.54
OC-DPM [37]            73.50 / 64.42 / 52.40
SubCat [34]            83.41 / 74.42 / 58.83
3DVP [47]              86.92 / 74.59 / 64.11
Ours                   91.01 / 86.62 / 76.84

Pedestrians (Easy / Moderate / Hard), for the subset of the above methods with pedestrian results, in the same order with Ours last: 43.58/35.49/32.42, 53.55/39.83/35.73, 44.32/34.18/30.76, 71.15/58.15/54.94 (Ours).
Cyclists (Easy / Moderate / Hard), likewise: 27.54/22.07/21.45, 30.52/23.17/21.58, 65.56/54.97/48.77 (Ours).

Table 2: AOS scores (in %) on the test set of KITTI's Object Detection and Orientation Estimation Benchmark.
Metric  Proposals   Type        Cars (E / M / H)         Pedestrians (E / M / H)   Cyclists (E / M / H)
AP      SS [45]     Monocular   75.91 / 60.00 / 50.98    54.06 / 47.55 / 40.56     56.26 / 39.16 / 38.83
AP      EB [55]     Monocular   86.81 / 70.47 / 61.16    57.79 / 49.99 / 42.19     55.01 / 37.87 / 35.80
AP      3DOP [10]   Stereo      93.08 / 88.07 / 79.39    71.40 / 64.46 / 60.39     83.82 / 63.47 / 60.93
AP      Ours        Monocular   93.89 / 88.67 / 79.68    72.20 / 65.10 / 60.97     84.26 / 64.25 / 61.94
AOS     SS [45]     Monocular   73.91 / 58.06 / 49.14    44.55 / 39.05 / 33.15     39.82 / 28.20 / 28.40
AOS     EB [55]     Monocular   83.91 / 67.89 / 58.34    46.80 / 40.22 / 33.81     43.97 / 30.36 / 28.50
AOS     3DOP [10]   Stereo      91.58 / 85.80 / 76.80    61.57 / 54.79 / 51.12     73.94 / 55.59 / 53.00
AOS     Ours        Monocular   91.90 / 86.28 / 77.09    62.20 / 55.77 / 51.78     71.95 / 53.10 / 51.32

(E / M / H = Easy / Moderate / Hard.)

Table 3: Object detection and orientation estimation results on the validation set of KITTI. We use 2000 proposals for all methods.
Ablation Study: We study the effects of different features
on the object proposal recall in Fig. 6. It can be seen that
adding each potential improves performance, particularly in the regime of fewer proposals. The instance semantic feature improves recall especially when using fewer proposals
(e.g., < 300). Without the instance feature, we still achieve
90% recall using 1000 proposals. By removing both instance and shape features, we would need twice the number
of proposals (i.e., 2000) to reach 90% recall.
Object Detection and Orientation Estimation: We use
the network described in Sec. 3.3 to score our proposals
for object detection. We test our full detection pipeline
on the KITTI test set. Results are reported and compared
with state-of-the-art monocular methods in Table 1 and Table 2. Our approach significantly outperforms all published
monocular methods. In terms of AP, we outperform the
second best method Faster R-CNN [39] by a significant
margin of 7.84%, 2.26%, and 2.97% for Car, Pedestrian,
and Cyclist, respectively, in the hard regime. For orientation estimation, we achieve 12.73% AOS improvement over
3DVP [47] on Car in the hard regime.
Comparison with Baselines: As strong baselines, we
also use our CNN scoring on top of three other proposal
methods, 3DOP [10], EdgeBoxes (EB) [55], and Selective
Search (SS) [45], where we re-train the network accordingly. Table 3 shows detection and orientation estimation
results on KITTI validation. We can see that our approach
outperforms Edge Boxes and Selective Search by around
20% in terms of AP and AOS, while being competitive with
the best method, 3DOP. Note that this comparison is not
fair as 3DOP uses stereo imagery, while we employ a single monocular image. Nevertheless it is interesting to see
that we achieve similar performance. We also report AP as
a function of the number of proposals for Car in the mod-
erate setting, in Fig. 3. When using only 10 proposals per
image, our approach already achieves AP of 53.7%, while
3DOP achieves 35.7%. With more than 100 proposals, our AP is
almost the same as 3DOP. EdgeBoxes reaches its best performance (78.7%) with 5000 proposals, while we need only
200 proposals to achieve AP of 80.6%.
Qualitative Results: Examples of our 3D detection results are shown in Fig. 7. Notably, our approach produces highly accurate detections in 2D and 3D even for very small or occluded objects.

Figure 7: Qualitative examples of car detection results: (left) top 50 scoring proposals (color from blue to red indicates increasing score), (middle) 2D detections, (right) 3D detections.
5. Conclusions
We have proposed an approach to monocular 3D object
detection, which generates a set of candidate class-specific
object proposals that are then run through a standard CNN
pipeline to obtain high-quality object detections. Towards
this goal, we have proposed an energy minimization approach that places object candidates in 3D using the fact
that objects should be on the ground-plane, and then scores
each candidate box via several intuitive potentials encoding
semantic segmentation, contextual information, size and location priors and typical object shape. We have shown that
our object proposal generation approach significantly outperforms all monocular approaches, and achieves the best
detection performance on the challenging KITTI benchmark.
Acknowledgements. The work was partially supported
by NSFC 61171113, NSERC and Toyota Motor Corporation.
References
[1] B. Alexe, T. Deselares, and V. Ferrari. Measuring the objectness of image windows. PAMI, 2012. 1, 2
[2] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and
J. Malik. Multiscale combinatorial grouping. In CVPR.
2014. 1, 2, 5, 7
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A
deep convolutional encoder-decoder architecture for image
segmentation. arXiv preprint arXiv:1511.00561, 2015. 3
[4] D. Banica and C. Sminchisescu. Cpmc-3d-o2p: Semantic
segmentation of rgb-d images using cpmc and second order
pooling. In CoRR abs/1312.7715, 2013. 2
[5] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool.
Seeking the strongest rigid detector. In CVPR, 2013. 7
[6] C. Cadena, A. Dick, and I. Reid. A fast, modular scene understanding system using context-aware object detection. In
ICRA, 2015. 7
[7] Z. Cai, M. Saberian, and N. Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In ICCV, 2015. 7
[8] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV.
2012. 2
[9] J. Carreira and C. Sminchisescu. Cpmc: Automatic object
segmentation using constrained parametric min-cuts. PAMI,
34(7):1312–1328, 2012. 2
[10] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler,
and R. Urtasun. 3d object proposals for accurate object class
detection. In NIPS, 2015. 1, 2, 3, 4, 5, 7, 8
[11] M. Cheng, Z. Zhang, M. Lin, and P. Torr. BING: Binarized
normed gradients for objectness estimation at 300fps. In
CVPR, 2014. 1, 2, 5
[12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and
A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. 2
[13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part
based models. PAMI, 2010. 2, 7
[14] S. Fidler, S. Dickinson, and R. Urtasun. 3d object detection and viewpoint estimation with a deformable 3d cuboid
model. In NIPS, 2012. 2
[15] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up
segmentation for top-down detection. In CVPR, 2013. 2, 4
[16] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In
CVPR, 2012. 1, 4, 5
[17] A. Geiger, C. Wojek, and R. Urtasun. Joint 3d estimation of
objects and scene layout. In NIPS, 2011. 7
[18] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. V.
Gool. Deepproposal: Hunting objects by cascading deep
convolutional layers. In arXiv:1510.04445, 2015. 1, 2
[19] R. Girshick. Fast R-CNN. In ICCV, 2015. 1, 4
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic
segmentation. CVPR, 2014. 1, 2, 4
[21] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning
rich features from RGB-D images for object detection and
segmentation. In ECCV. 2014. 2, 5
[22] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What
makes for effective detection proposals? arXiv:1502.05082,
2015. 5
[23] J. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking
a deeper look at pedestrians. In arXiv, 2015. 7
[24] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural svms. JMLR, 2009. 1
[25] A. Karpathy, S. Miller, and L. Fei-Fei. Object discovery in
3d scenes via shape analysis. In ICRA, 2013. 2
[26] P. Krähenbühl and V. Koltun. Geodesic object proposals. In
ECCV, 2014. 2
[27] P. Krähenbühl and V. Koltun. Learning to propose objects.
In CVPR, 2015. 1, 2
[28] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS,
2012. 1, 2
[29] T. Lee, S. Fidler, and S. Dickinson. A learning framework for
generating region proposals with mid-level cues. In ICCV,
2015. 1, 2
[30] B. Li, T. Wu, and S. Zhu. Integrating context and occlusion
for car detection by hierarchical and-or model. In ECCV,
2014. 7
[31] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3d object detection with rgbd cameras. In ICCV,
2013. 2
[32] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object detection with location relaxation and regionlets
relocalization. In ACCV, 2014. 2, 7
[33] R. Mottaghi, X. Chen, X. Liu, S. Fidler, R. Urtasun, and
A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014. 4
[34] E. Ohn-Bar and M. M. Trivedi. Learning to detect vehicles
by clustering appearance patterns. IEEE Transactions on Intelligent Transportation Systems, 2015. 2, 7
[35] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatiotemporal object detection proposals. In ECCV, 2014. 2
[36] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Pedestrian detection with spatially pooled features and structured
ensemble learning. In arXiv:1409.5209, 2014. 7
[37] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Occlusion
patterns for object class detection. In CVPR, 2013. 7
[38] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Multi-view
and 3d deformable part models. PAMI, 2015. 2, 7
[39] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 7, 8
[40] A. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box
in the box: Joint 3d layout and object reasoning from single
images. In ICCV, 2013. 4
[41] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. 2015. 3
[42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556, 2014. 1, 2, 4
[43] Y. Tian, P. Luo, X. Wang, and X. Tang. Deep learning strong
parts for pedestrian detection. In ICCV, 2015. 7
[44] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun.
Support Vector Learning for Interdependent and Structured
Output Spaces. In ICML, 2004. 4
[45] K. Van de Sande, J. Uijlings, T. Gevers, and A. Smeulders.
Segmentation as selective search for object recognition. In
ICCV, 2011. 1, 2, 5, 7, 8
[46] S. Wang, S. Fidler, and R. Urtasun. Holistic 3d scene understanding from a single geo-tagged image. In CVPR, 2015.
2
[47] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven
3d voxel patterns for object category recognition. In CVPR,
2015. 2, 7, 8
[48] J. Xu, S. Ramos, D. Vázquez, and A. López. Hierarchical Adaptive Structural SVM for Domain Adaptation. In
arXiv:1408.5400, 2014. 7
[49] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. In arXiv:1501.05759, 2015.
7
[50] Z. Zhang, S. Fidler, and R. Urtasun. Instance-level segmentation with deep densely connected mrfs. In CVPR, 2016.
4
[51] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular Object Instance Segmentation and Depth Ordering with
CNNs. In ICCV, 2015. 4
[52] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet,
Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015. 3
[53] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler.
SegDeepM: Exploiting segmentation and context in deep
neural networks for object detection. In CVPR, 2015. 2,
4
[54] M. Zia, M. Stark, and K. Schindler. Towards scene understanding with detailed 3d object representations. IJCV, 2015.
2
[55] L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV. 2014. 1, 2, 5, 7, 8