Survey of Object Classification in 3D Range Scans
Allan Zelener, The Graduate Center, CUNY. January 8th, 2015.

Overview
1. Introduction
◦ Problem definition: Object Recognition, Object Classification, and Semantic Segmentation
◦ Problem domains: LiDAR scanners for outdoor scenes and RGB-D sensors for indoor scenes
2. Urban object classification
◦ Case study: Vehicle object detection and classification
3. Indoor object classification
◦ Cluttered scenes with a large variety of objects
4. Related Works
5. Comparison and Conclusions
◦ Criteria for evaluation: classification accuracy, range of classes, use of data
◦ Context through structured prediction and learned 3D feature representations

Object Recognition
[Figure: query model matched against a scene; candidate matches shown in decreasing score order]
Lai and Fox (IJRR 08); Mian, Bennamoun, Owens (IJCV 09)

Object Classification
o Segmentation or a sliding template is used to find candidate regions for classification
o Feature-based classification may be invariant to pose and intra-class variation
o More compressed representation than an entire database of object models
o Detection and recognition may still work better in practice for controlled applications
Golovinskiy, Kim, and Funkhouser (ICCV 2009)

Semantic Segmentation
o Every point in the scene is labeled, including both objects of interest and background
o Typically a joint optimization of segmentation and classification
o Formally utilizes context in an MRF/CRF model, where by context we mean nearby regions
Wu, Lenz, and Saxena (RSS 2014)

LiDAR Scans for Outdoor/Urban Scenes
o Long-range sensors for outdoor scenes
o Fast scans at low resolution or slow scans at high resolution, depending on the number of individual sensors
o Moving sensors and registration of multiple scans result in unstructured point cloud data with no adjacency grid
o RGB imagery tends to be low quality, challenging to align, or simply unavailable

RGB-D Images for Indoor Scenes
o Short-range sensors for indoor scenes
o Real-time 30
FPS depth maps based on structured light or time of flight in infrared
o The integrated RGB camera is better aligned and provides better image quality under indoor conditions than LiDAR systems
o The RGB-D image grid makes these sensors well suited to traditional 2D computer vision techniques on image frames from a single view

Patterson et al.
• Object Detection from Large-Scale 3D Datasets Using Bottom-up and Top-down Descriptors. Patterson, Mordohai, and Daniilidis. (ECCV 2008)
[Figure: spin image and extended Gaussian image (EGI) descriptors]
1. Compute normals for all points and spin images for a subset of sampled points
2. Classify spin image features as either positive (object) or negative (background) points using a nearest neighbor classifier
3. Greedy region growing of positively classified points gives an object hypothesis
4. Compute the EGI and constellation EGI for the object hypothesis and compute alignment and similarity with database model objects
• Rotation hypotheses are based on angles subtended by pairs of points
• Translation is based on the maximum frequency of the Fourier transform of the best rotation hypothesis
• Similarity is based on the fraction of inliers, defined as query points that are near model points with small cosine similarity between normals after alignment
5. If the similarity is above a threshold then the object is positively detected, and points that overlap with the database model after alignment are labeled to obtain a segmentation

Patterson et al.
o Precision 0.92 and recall 0.74 for the chosen inlier threshold parameter
o Computation and comparison of EGIs is slow due to alignment
o Cost of object detection grows linearly in the size of the database
[Figure: precision-recall curve]

Huber et al.
• Parts-based 3D Object Classification. Huber, Kapuria, Donamukkala, and Hebert. (CVPR 2004)
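Both Patterson et al. and Huber et al. build on the spin image descriptor: each neighbor of an oriented basis point is projected into (radial distance, signed elevation) coordinates about the point's normal and accumulated into a 2D histogram. A minimal sketch in Python follows; the bin size and image width are illustrative parameters, not values from either paper:

```python
import numpy as np

def spin_image(p, n, points, bin_size=0.1, width=16):
    """Accumulate a spin image for the oriented basis point (p, n).

    alpha: radial distance from the axis through p along normal n.
    beta:  signed elevation along the normal.
    """
    d = points - p                      # offsets to neighboring points
    beta = d @ n                        # projection onto the normal
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta**2, 0.0))
    # Bin (alpha, beta) into a 2D histogram; beta is shifted so the
    # basis point sits at the vertical center of the image.
    a = (alpha / bin_size).astype(int)
    b = ((beta / bin_size) + width // 2).astype(int)
    img = np.zeros((width, width))
    keep = (a >= 0) & (a < width) & (b >= 0) & (b < width)
    np.add.at(img, (b[keep], a[keep]), 1.0)
    return img
```

Because the histogram is accumulated in a cylindrical frame around the normal, the descriptor is invariant to rotation about that normal, which is what makes it usable on unstructured point clouds without a full pose estimate.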
o Vehicles are segmented into front/middle/back parts, and part classes are generated as follows:
◦ For each part r_i, the distance between spin image features in r_i and all r_j, i ≠ j, is computed to produce p(m_{r_i} = r_j | r_i), where the event denotes a nearest neighbor match from a feature of part r_i to a feature in part r_j
◦ A symmetric similarity matrix is computed as the average of the matching probabilities between all pairs of parts. Part classes are determined by agglomerative clustering, and the features for each part class are clustered by k-means to produce a class representation.

Huber et al.
o The relationship between object class O_j and part class R_i is determined by Bayes' theorem,
  p(O_j | R_i) = p(R_i | O_j) p(O_j) / Σ_j p(R_i | O_j) p(O_j)
o p(R_i | O_j) is determined empirically from the training data and p(O_j) is assumed uniform
o The object class is determined by maximizing the likelihood over all parts,
  arg max_j Σ_{R_i ∈ ℛ} π_ℛ(R_i) p(O_j | R_i)
o Here π_ℛ(R_i) is determined by matching features between the query part R_i and the set of part classes ℛ, as described in the part class generation stage

Huber et al.
o Excellent accuracy on simulated scans, but lacks experiments on real data
o Consistent part segmentation requires recovery of the vehicle pose
o Improvement over a classifier that does not use parts
[Figure: ROC curves; solid line: parts-based, dashed line: object-based]

Golovinskiy et al.
• Shape-based Recognition of 3D Point Clouds. Golovinskiy, Kim, and Funkhouser. (ICCV 2009)
o Localization and segmentation are based on a k-NN graph weighted by point distances
o Localization is performed by agglomerative clustering
o Segmentation is performed by min-cut using a virtual background vertex and a background radius parameter
o Contextual features use geolocation alignment with a street map and an occupancy grid of neighboring objects
o Relatively poor classification performance, perhaps due to a lack of local features

Stamos et al.
• Online Algorithms for Classification of Urban Objects in 3D Point Clouds.
Stamos, Hadjiliadis, Zhang, and Flynn. (3DIMPVT 2012)
o Online classification of scan lines using HMMs and CUSUM hypothesis testing,
  S_{n+1} = max(0, S_n + x_n − ω_n)
o ω_n is the likelihood of observation x_n under the null hypothesis HMM
o A change is detected at a large value of S_k

Stamos et al.
o Simple features between pairs of points, e.g. signed angles sgn(D_{i,k} ⋅ D_{i,k−1}) and projections z^T ⋅ D_{i,p}
o Line angles are consistent for collinear points
o A sequence of online classifications is performed to refine from coarse to fine classes
o Each additional classifier incorporates more prior knowledge about the target class, e.g., cars should be on the street at a certain height

Xiong et al.
• 3D Scene Analysis via Sequenced Predictions over Points and Regions. Xiong, Munoz, Bagnell, and Hebert. (ICRA 2011)
[Figure: context is accumulated from neighboring segments, sent down from a segment to its individual points, and averaged from points back up to the segment]
o Multi-Round Stacking (MRS) generates contextual features by using a sequence of weak classifiers to predict the class labels of neighbors
o Two-level hierarchy of regions: segments and points. MRS is run on one level of the hierarchy and the results are passed to the other level.
o Sensitive to the quality of labeling in training, particularly if there is a "misc" class
[Figure: contextual features for the tree-trunk class]

Silberman and Fergus
• Indoor Scene Segmentation Using a Structured Light Sensor. Silberman and Fergus. (ICCV 2011)
o Conditional Random Field energy,
  E(y) = Σ_i φ(x_i, i; θ) + Σ_{i,j} ψ(y_i, y_j) η(i, j)
o φ(⋅): color/depth features and a location prior
o ψ(y_i, y_j) = 0 if y_i = y_j, 3 otherwise
o η(⋅): spatial transition based on the image gradient
o The location prior improves performance for classes in consistent configurations with respect to the camera, but decreases it otherwise, e.g., bookshelves in an office vs. a library
[Figure: 3D location priors]

Couprie et al.
• Indoor Semantic Segmentation Using Depth Information. Couprie, Farabet, Najman, and LeCun. (ICLR 2013)
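The CUSUM recursion from Stamos et al. above, S_{n+1} = max(0, S_n + x_n − ω_n), can be sketched as follows. This is a minimal illustration: the per-observation scores are passed in directly rather than computed from HMM likelihoods, and the threshold and reset-after-detection behavior are illustrative assumptions:

```python
def cusum_changes(scores, null_scores, threshold=5.0):
    """CUSUM change detection: S_{n+1} = max(0, S_n + x_n - w_n).

    scores:      x_n, evidence for the alternative hypothesis
    null_scores: w_n, evidence for the null hypothesis
    Returns the indices where the running statistic crosses the threshold.
    """
    s, changes = 0.0, []
    for n, (x, w) in enumerate(zip(scores, null_scores)):
        s = max(0.0, s + x - w)  # the max(0, .) clamp forgets old evidence
        if s >= threshold:
            changes.append(n)
            s = 0.0  # restart accumulation after a detected change
    return changes
```

The clamp at zero is what makes the test online: stretches where the null hypothesis dominates cannot build up negative credit, so a genuine change is detected quickly regardless of how long the preceding segment was.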
o A simple application of the CNN framework improves accuracy on classes at consistent depths, such as walls and floors, but performance on objects of interest degrades
o Depth gradients alone are not informative; depth information must be normalized or interpreted to be invariant to such variations

Anand et al.
• Contextually Guided Semantic Labeling and Search for 3D Point Clouds. Anand, Koppula, Joachims, and Saxena. (IJRR 2012)
o MRF trained by a structured SVM,
  f_w(x, y) = Σ_{i∈V} Σ_{k=1}^{K} y_i^k (w_n^k ⋅ φ_n(i)) + Σ_{(i,j)∈E} Σ_{T_t∈T} Σ_{(l,k)∈T_t} y_i^l y_j^k (w_t^{lk} ⋅ φ_t(i, j))
o φ_n(i): unary features
o φ_t(i, j): pairwise features, which may be associative or non-associative depending on T_t
◦ Associative: features between neighboring segments of the same class; T_t only has self loops
◦ Non-associative: features between related class labels of neighboring segments
[Figure: example geometric relation between neighboring segments, (r_i − r_j)^T n_i ≥ 0 and (r_j − r_i)^T n_j ≥ 0]

Anand et al.
o Object part categories better exploit relationships than object categories alone
o Registered 3D scenes provide more coverage and context than single-view scenes
o Common errors include objects that lie on top of other objects, e.g. a book on a table, caused either by poor segmentation or by the smoothing effect of the pairwise potentials

Related Works
• Unsupervised Feature Learning for RGB-D Based Object Detection. Bo, Ren, and Fox. (ISER 2012)
• Convolutional-Recursive Deep Learning for 3D Object Classification. Socher, Huval, Bhat, Manning, and Ng. (NIPS 2012)
• Kahler and Reid. (ICCV 2013)
• Müller and Behnke. (ICRA 2014)
• Sliding Shapes for 3D Object Detection in Depth Images. Song and Xiao. (ECCV 2014)
• Instance Segmentation of Indoor Scenes Using a Coverage Loss. Silberman, Sontag, and Fergus. (ECCV 2014)
[Figure: input, perfect semantic segmentation, naïve region growing, correct instance segmentation]
• Hierarchical Semantic Labeling for Task-Relevant RGB-D Perception. Wu, Lenz, and Saxena.
(RSS 2014)
◦ NO-CT: non-overlapping constraints
◦ HR-CT: hierarchical relation constraints
• Classification of Vehicle Parts in Unstructured 3D Point Clouds. Zelener, Mordohai, and Stamos. (3DV 2014)
◦ Unsupervised segmentation of parts by RANSAC plane fitting
◦ Structured prediction over parts and object class by an HMM and a structured perceptron
◦ Does not require pose estimation; experiments performed on real data
[Figure: chain model over part labels p_1, …, p_n with observed features x_1, …, x_n and object class c]

Comparison
o Fine-tuned object recognition methods still appear to work best for specific tasks, e.g. car detection in urban scenes
o Indoor scenes have many potential objects of interest; it is difficult to scale the number of classes
o Object classification requires 3D shape features that are discriminative
◦ Simple accumulators like the spin image are still competitive choices for features
◦ Learned representations may do better, but how to construct them is a challenge
◦ There are differences in representation between point clouds and RGB-D images
o Errors in segmentation may propagate to classification
◦ Semantic segmentation jointly optimizes segmentation and classification
o Structured prediction provides useful context-based relationships, but can lead to false assumptions
◦ Context relationships are also often fixed and manually engineered

Conclusions
o 3D shape and context-based features provide consistent improvements to classification systems
o Learned 3D representations that are aware of the unique properties of 3D shape features may improve over simple application of 2D techniques
o Structured prediction that models relationships between objects, their parts, and their environment also improves performance
o Sparse or hierarchical structured relationships are desirable for computational efficiency
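As a closing illustration of the structured prediction methods surveyed above: decoding an HMM over a chain of parts, as in Zelener et al., reduces to a standard Viterbi pass. This is a generic sketch, not any surveyed paper's exact formulation, and the log-probability tables it takes are hypothetical inputs:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely part-label sequence under a chain-structured HMM.

    log_init:  (S,)   log p(p_1) over S part labels
    log_trans: (S, S) log p(p_{t+1} | p_t)
    log_emit:  (T, S) log p(x_t | p_t) for each observed part feature x_t
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]          # best score ending in each label
    back = np.zeros((T, S), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        cand = score[:, None] + log_trans   # (S, S): previous label -> current
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):           # walk backpointers to the start
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The dynamic program is O(T·S²), which is why chain or otherwise sparse structures are attractive for the computational-efficiency point in the conclusions, compared with inference in densely connected MRFs.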