Computer Vision and Image Understanding 120 (2014) 81–90

Efficient semantic image segmentation with multi-class ranking prior

Deli Pei a,b,c,d, Zhenguo Li e, Rongrong Ji f,*, Fuchun Sun b,c,d,*

a Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
c State Key Laboratory of Intelligent Technology and Systems, Beijing 100084, China
d Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China
e Huawei Noah's Ark Lab, Hong Kong, China
f Department of Cognitive Science, Xiamen University, Xiamen 361005, China

This paper has been recommended for acceptance by Nicu Sebe.
* Corresponding authors. Addresses: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China (F. Sun); Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China (R. Ji). E-mail addresses: derrypei@gmail.com (D. Pei), li.zhenguo@huawei.com (Z. Li), rrji@xmu.edu.cn (R. Ji), fcsun@mail.tsinghua.edu.cn (F. Sun).
http://dx.doi.org/10.1016/j.cviu.2013.10.005

Article history: Received 3 October 2012; Accepted 6 October 2013; Available online 25 October 2013.

Keywords: Computer vision; Machine learning; Semantic segmentation; Structural SVMs

Abstract

Semantic image segmentation, which aims to decompose an image into semantically consistent regions, is of fundamental importance in a wide variety of computer vision tasks, such as scene understanding, robot navigation and image retrieval. Most existing works address it as a structured prediction problem, combining contextual information with low-level cues in conditional random fields (CRFs) that are often learned by heuristic search based on maximum likelihood estimation. In this paper, we use a maximum-margin structural support vector machine (S-SVM) to combine multiple levels of cues, attenuating the ambiguity of appearance similarity, and we propose a novel multi-class ranking based global constraint that confines the object classes to be considered when labeling regions within an image. Compared with existing global cues, our method strikes a better balance between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations. We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the predictions from local and global cues in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. We evaluate our algorithm on two challenging datasets widely used for semantic segmentation evaluation, the MSRC-21 dataset and the Stanford Background dataset. Experimental results show that we obtain highly competitive performance compared with state-of-the-art methods, even though our model is much simpler and more efficient.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Semantic segmentation is a fundamental but challenging problem in computer vision, which aims to assign each pixel in an image a pre-defined semantic label. It can be seen as an extension of traditional object detection, which aims at detecting prominent objects in the foreground of an image, and it is closely related to other fundamental computer vision tasks such as image segmentation and image classification. Semantic segmentation has many applications in practice, including scene understanding, robot navigation, and image retrieval.

Early semantic image segmentation algorithms typically approached the problem from a pixel-wise labeling perspective [1,2]. Although using pixels as labeling units is simple and straightforward,
a pixel by itself contains limited and ambiguous information that is not always discriminative enough to determine its correct label. On the other hand, the proliferation of unsupervised image segmentation algorithms, such as mean shift [3], graph-based segmentation [4,38], quick shift [5], TurboPixel [6] and SLIC [7], enables higher-order feature representations of regions. Therefore, semantic segmentation approaches based on region-wise labeling [8–13] have recently been well investigated; they make use of region-level features that are not only more informative but also more robust to noise, clutter, illumination variance, etc. In such a setting, an initial unsupervised segmentation is commonly adopted as pre-processing. However, image segmentation is still far from perfect, despite the extensive attempts of the last several decades. From this point of view, how to make the best use of these imperfect unsupervised image segmentation algorithms for semantic segmentation is of fundamental importance, yet still unclear.

Although higher-order features extracted from regions are more expressive and informative than those from pixels, semantic ambiguity still exists because of appearance similarity. A general consensus is that contextual information within an image is a very useful cue for attenuating this ambiguity, and it can be used to suppress or encourage the presence of object classes during labeling. Context refers to any information that is not extracted directly from local appearance and can be summarized into two categories: pairwise constraints and global cues. Pairwise constraints, such as smoothness based on contrast [14,9], relative location [10,11] and co-occurrence [8,11,15], are used to model the pairwise relationship between regions within an image. Global constraints are usually used to enforce higher-level consistency at the level of region sets or the whole image. Several potentials have been proposed to model these cues, such as image classification results [13], the Potts potential [12], the P^N Potts potential [16] and its improved versions, the robust P^N potential [14], P^N-based hierarchical CRFs [17], and the harmony potential [9]. These models are further discussed in Section 2.

In terms of methodology, most existing methods [16,12,14,10,11,15,9] use conditional random fields (CRFs) to combine these constraints from different levels and to make a joint inference of the labeling within an image, which is also known as structured prediction. In contrast to the many sophisticated algorithms for inference, these models [10,11,9,14,15] are usually learned by gradient descent or by heuristic search on a validation set based on maximum likelihood estimation. On the other hand, Zhu et al. [18] showed that in many machine learning applications, max-margin learning is more robust for structured prediction than learning based on maximum likelihood estimation.
In this paper, we use a maximum-margin structural support vector machine (S-SVM) to combine multiple levels of cues, attenuating the ambiguity of appearance similarity, and we propose a multi-class ranking based global constraint that confines the object classes to be considered when labeling regions within an image. For the global cues, we first rank all object classes for an image (a class that is more likely to be present in the image gets a larger score) using a multi-class ranking algorithm [20] and transform the ranking scores into an image-level soft constraint on the possible classes present in the image. The advantages of this global cue can be seen from two aspects. On the one hand, compared with the robust P^N potential [14], which limits the parent node to take only a single label, our method ranks all classes for an image and is thus more representative for heterogeneous regions. On the other hand, since we compute ranking scores for all classes and transform them into a soft constraint, we do not need to make a hard decision for every class, and thus avoid searching the exponential space of possible label combinations as the harmony potential does [9]. The global cues are integrated with the prediction obtained from region features and logistic regression, encouraging the more likely classes while suppressing the others. We then introduce inter-class co-occurrence statistics as pairwise constraints and combine them with the predictions from the previous stage in the S-SVM framework. This enables joint inference of the labeling within an image for better consistency. Moreover, our model can be learned efficiently with the cutting plane algorithm [19] instead of the heuristic search used in CRF learning. Experimental results on two challenging datasets, MSRC-21 and the Stanford Background Dataset, show that we obtain highly competitive performance compared with state-of-the-art methods with a much simpler and more efficient model.

Probably the most related work is [21], which discussed the application of structural SVMs to semantic image segmentation and compared them with the alternative maximum likelihood method. However, our model differs from theirs in the design of the pairwise and global constraints as well as in the loss function used for parameter learning. They used the standard contrast-dependent Potts model as the pairwise constraint, in contrast to our co-occurrence property. With regard to global constraints, they simply used the K image-level one-vs-all classification results; the advantage of multi-class ranking over one-vs-all classifiers is discussed in [20].

The remainder of the paper is organized as follows. In the next section we review related work. Our model is presented in Section 3, including the problem formulation and model details. Sections 4 and 5 describe the inference and learning methods. Implementation details and performance evaluation are given in Section 6, and conclusions are drawn in Section 7.

2. Related work

Despite the success of methods inferring pixel labels [1,2], more recent methods tend to infer labels over regions or superpixels, for the sake of lower computational complexity and the incorporation of higher-level semantic cues. In these approaches, traditional image segmentation algorithms such as Normalized Cut [8], mean shift [14,17,13], graph-based image segmentation [10] and quick shift [9,22] are adopted to obtain initial segments.
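As a concrete illustration, such initial segments can be produced with off-the-shelf implementations of the algorithms cited above. The following is a minimal sketch using the scikit-image library (not used in the paper itself); the parameter values are illustrative, not the paper's settings:

```python
# Sketch: generating ~150 initial superpixels with scikit-image implementations
# of SLIC, Felzenszwalb-Huttenlocher (FH) and quick shift.
import numpy as np
from skimage import data
from skimage.segmentation import slic, felzenszwalb, quickshift

image = data.astronaut()  # any RGB image as an (H, W, 3) array

# SLIC: grid-like superpixels of roughly equal size.
sp_slic = slic(image, n_segments=150, compactness=10)

# FH: graph-based, region size adapts to image content;
# 'scale' is tuned until roughly 150 regions remain.
sp_fh = felzenszwalb(image, scale=200, sigma=0.8, min_size=100)

# Quick shift: mode seeking in a joint color-spatial space.
sp_qs = quickshift(image, kernel_size=5, max_dist=10, ratio=0.5)

for name, sp in [("SLIC", sp_slic), ("FH", sp_fh), ("QuickShift", sp_qs)]:
    print(name, "->", len(np.unique(sp)), "superpixels")
```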
More recently, several over-segmentation algorithms [6,7] have been developed to bypass the problems of traditional segmentation algorithms, such as semantic ambiguity (regions spanning multiple object classes) and the difficulty of determining the optimal number of segmented regions. These algorithms seek a trade-off between reducing image complexity through pixel grouping and avoiding under-segmentation [6]. Images are decomposed into regions much smaller than object size, e.g. 100–300 regions. Many traditional segmentation algorithms can also be adopted to generate superpixels by setting a finer level of region segments. Qualitative results of different segmentation algorithms are given in Fig. 1, where each image is decomposed into approximately 150 superpixels. It can be seen that over-segmentation algorithms tend to segment an image into regions of approximately equal size, while the region size of traditional segmentation may vary greatly with the complexity of the content.

Fig. 1. Over-segmentation examples of different segmentation algorithms, with approximately 150 superpixels for each image.

Although various powerful features have been proposed recently (e.g. color histograms, texture and SIFT), these features are still not informative enough to achieve high classification performance, because of appearance similarity. To attenuate this ambiguity of the feature representation, pairwise constraints such as smoothness [14,9,23], relative location [10,11] and co-occurrence [8,11,15] have been introduced: (i) The assumption behind the pairwise smoothing term is that adjacent regions tend to have the same label, so spatially adjacent regions with different labels are penalized. To preserve boundaries, appearance contrast is incorporated into the smoothing term, so that regions with larger appearance contrast are penalized less for inconsistent labels. The dilemma of this smoothing term is that regions with similar appearance naturally tend to take the same label anyway; what the term is actually needed for, encouraging spatially adjacent regions with differing appearance to take the same label, is exactly what the contrast-sensitive weighting discourages. (ii) Co-occurrence statistics exploit the property that some classes (e.g. boat and water) are more likely to appear together in an image than others (e.g. car and water). The existence of one class can thus be used as evidence for the presence of highly related classes and against the presence of unlikely ones. For instance, Rabinovich et al. [8,11] construct context matrices by counting the co-occurrence frequency of object labels in the training set to incorporate semantic contextual information (a minimal construction is sketched below). Ladicky et al. [15] argued that the co-occurrence cost should depend only on the labels present in an image and should be invariant to the number and location of the pixels an object occupies. (iii) Gould et al. [10] encoded the inter-class spatial relationship as a local feature in a two-stage classification process. However, because of the 2D projection, relative location in images is usually uninformative and hence degenerates to a co-occurrence constraint.
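The context-matrix construction of Rabinovich et al. [8,11] referenced in item (ii) can be sketched as follows. This is a minimal illustration assuming per-image ground-truth label sets are available as Python lists; the function name and normalization choice are ours, not taken from the cited papers:

```python
# Sketch: a class co-occurrence matrix counted from training annotations,
# in the spirit of Rabinovich et al. [8,11].
import numpy as np

def cooccurrence_matrix(image_label_sets, K):
    """Count how often class pairs appear in the same image (K classes)."""
    counts = np.zeros((K, K))
    for labels in image_label_sets:
        labels = list(set(labels))      # each class counted once per image
        for a in labels:
            for b in labels:
                if a != b:
                    counts[a, b] += 1
    # normalize rows to get P(class b present | class a present)
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)

# toy example with K = 3: classes 0 and 1 co-occur often, class 2 appears alone
toy = [[0, 1], [0, 1], [0, 1, 2], [2]]
print(cooccurrence_matrix(toy, K=3))
```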
Pairwise constraints, however, can only capture local context between regions. A more recent trend is therefore to build a hierarchical model by adding an extra global constraint to the pairwise framework, incorporating constraints at a higher level such as groups of segments or the image level. Plath et al. [12] proposed a Potts potential to model the label consistency of regions in a hierarchical tree structure, which penalizes all nodes whose labels are inconsistent with their parent's label. Kohli et al. [14] adapted the P^N Potts potential proposed in [16] into a segment-quality-sensitive higher-order potential, named the robust P^N potential, in which inconsistent labeling costs less in high-contrast regions than in low-contrast regions. However, a drawback of both higher-order potentials [12,14] is that they limit the parent node to take only a single label, which is often not the case and makes them unable to handle heterogeneous regions. Csurka and Perronnin [13] proposed to use image classification results to reduce the number of classes considered in an image. But this hard-constraint scheme does not take the classification accuracy into account, and classification errors can propagate to the following stages and affect overall performance. The work in [17] proposed a novel hierarchical CRF framework that allows the integration of features computed at different levels, avoiding a single choice of quantization. Gonfaus et al. [9] proposed a more expressive constraint named the harmony potential, which first restricts the power set over all possible labels at the image level and then uses it as a higher-order constraint. However, the exponentially sized power set makes exact inference infeasible, and heuristic methods such as branch-and-bound sampling have to be applied to approximate the best assignment, which results in taking only a small subset into account.

Besides context cues extracted directly from the image, priors from various vision tasks have also been introduced to improve performance. Several approaches consider joining object detection and multi-class image segmentation by feeding information from one task to the other [24–26,15,9]. Heitz et al. [24] developed Cascaded Classification Models (CCM) to combine the subtasks of scene categorization, object detection and multi-class image segmentation for holistic scene understanding. However, since these subtasks are coupled only loosely through their input/output variables, each of them is still optimized separately; information sharing is limited and may cause inconsistent representations. Gould et al. [25] proposed a hierarchical region-based approach that combines joint object detection with image segmentation to reason simultaneously about pixels, regions and objects in an image. Ladicky et al. [15] integrated the results of sliding-window detectors with low-level pixel-based unary and pairwise relations into a conditional random field (CRF) framework for joint reasoning about regions, objects and their attributes; a similar idea appears in [9].

3. Model

In this section we specify our model for structured prediction. We consider superpixels obtained by an unsupervised image segmentation, and use $x_i$, $i = 1, \dots, N$, to denote the feature vector of superpixel $i$ and $y_i \in \mathcal{C} = \{c_1, c_2, \dots, c_K\}$ its corresponding label, where $N$ and $K$ are the numbers of superpixels and classes, respectively. The whole image can then be represented as the collection of superpixel feature vectors, $X = \{x_i \mid i = 1, \dots, N\}$, and an assignment of labels to the set of superpixels is referred to as a labeling of the image, denoted by $Y = \{y_i \mid i = 1, \dots, N\}$.
Our objective is to learn a function $F(X, Y)$ that captures the compatibility of a prediction $Y$ with the observation $X$, such that the better the prediction $Y$ describes the image content $X$, the higher the value $F(X, Y)$. Given the observation $X$, the optimal prediction $\hat{Y}$ can then be found by maximizing $F(X, Y)$ over all possible labelings:

$$\hat{Y} = \arg\max_{Y} F(X, Y). \qquad (1)$$

Following structural SVMs [27], we assume the compatibility function $F$ is linear in a combined feature representation of inputs and outputs $\varphi(X, Y)$ (also known as the joint feature map):

$$F(X, Y) = \langle \omega, \varphi(X, Y) \rangle. \qquad (2)$$

The joint feature map $\varphi(X, Y)$ can be designed to capture multi-scale, multi-layer and contextual cues. Given the joint feature map, the learning task is to train the optimal model parameter $\omega$ on a training set. The local constraint, also called the unary potential, captures the local appearance evidence for labeling superpixels; the mid-level constraint usually exploits pairwise relationships between superpixels, such as smoothness, relative location and co-occurrence. In some approaches, a global constraint is also applied to infer the possible labeling at the image level rather than from superpixels. We specify how we define these constraints and combine them in the following sections.

3.1. The unary potential

First we detail our feature representation for superpixels. The raw features of a superpixel consist of two ingredients: appearance-based descriptors and a bag-of-words (BoW) representation. Following [10], the appearance-based descriptors include color and texture features, computed as the mean, standard deviation, skewness and kurtosis statistics of the superpixel's color distribution and filter responses. In addition, we also extract location and geometry features of the superpixel; for more details we refer the reader to [10]. The BoW representation has been shown useful in many state-of-the-art vision systems, so we also incorporate it for superpixel representation. Moreover, as shown in [22,28], BoW features extracted not only inside a superpixel but also in its neighborhood describe the superpixel more effectively. Thus, for each superpixel we extract BoW features from both the superpixel itself and its adjacent regions and concatenate them. The final representation of the raw features becomes

$$s_i = (\theta_a s_i^a,\; \theta_b s_i^b)^\top, \qquad (3)$$

where $s_i^a$ is the appearance descriptor, $s_i^b$ is the concatenated BoW feature, and $\theta_a$, $\theta_b$ are weight parameters to be learned by cross validation.
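A minimal sketch of the raw feature of Eq. (3) follows, assuming the superpixel's pixels and a precomputed BoW histogram are already available; the moment statistics mirror the appearance descriptor described above, while the exact filter bank and BoW pipeline of the paper are omitted:

```python
# Sketch of Eq. (3): moment statistics of the color channels inside one
# superpixel, concatenated with a (here hypothetical) bag-of-words histogram
# and weighted by the cross-validated factors theta_a, theta_b.
import numpy as np
from scipy.stats import skew, kurtosis

def appearance_stats(pixels):
    # pixels: (n, C) array of color/filter responses inside one superpixel
    return np.concatenate([pixels.mean(0), pixels.std(0),
                           skew(pixels, axis=0), kurtosis(pixels, axis=0)])

def raw_feature(pixels, bow_hist, theta_a=1.0, theta_b=1.0):
    return np.concatenate([theta_a * appearance_stats(pixels),
                           theta_b * bow_hist])

rng = np.random.default_rng(0)
pixels = rng.random((500, 3))            # 500 RGB pixels of one superpixel
bow = rng.random(800); bow /= bow.sum()  # 2 x 400 visual-word histogram
print(raw_feature(pixels, bow).shape)    # (12 + 800,) in this toy setting
```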
Instead of using the above raw features directly, we compute an intermediate representation from them via logistic regression, which makes the features more compact. Given the raw feature representation $s_i$ of a superpixel, the probability of taking label $l \in \mathcal{C} = \{c_1, c_2, \dots, c_K\}$ is computed by the following logistic regression model:

$$P(l \mid s_i) = \begin{cases} \dfrac{\exp(\beta_{l0} + \beta_l^\top s_i)}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^\top s_i)} & \text{if } l = c_1, \dots, c_{K-1}, \\[2mm] \dfrac{1}{1 + \sum_{t=c_1}^{c_{K-1}} \exp(\beta_{t0} + \beta_t^\top s_i)} & \text{if } l = c_K, \end{cases} \qquad (4)$$

where $\beta$ are the learned parameters of the logistic regression. We concatenate the class probabilities to form the K-dimensional intermediate representation

$$x_i = (P(c_1 \mid s_i), P(c_2 \mid s_i), \dots, P(c_K \mid s_i))^\top. \qquad (5)$$

Moreover, we assign the most probable label to each superpixel as an initial guess for the subsequent joint inference:

$$l_i = \arg\max_{l \in \mathcal{C}} P(l \mid s_i). \qquad (6)$$

As a baseline, the performance of the raw features under various over-segmentation algorithms is systematically evaluated in Section 6 and compared with that obtained from structured prediction using contextual information. The unary potential can be written as

$$F_{unary}(X, Y) = \sum_i \omega_{y_i}^\top x_i, \qquad (7)$$

where $\omega_{c_1}, \dots, \omega_{c_K} \in \mathbb{R}^K$ are the model parameters of the unary potential.

3.2. The pairwise potential

The unary stage provides not only the intermediate representation of superpixels but also an initial labeling of each superpixel based on local features. However, the quality of such a labeling may not be satisfactory, due to the ambiguity of the low-level representation. To leverage the semantic context between superpixels and attenuate this ambiguity, we introduce a voting strategy that exploits the co-occurrence property of objects within an image. Based on the initial labels obtained in Section 3.1, each superpixel casts a vote of support for every other superpixel's class label, weighted by its region size and the confidence of its initial guess:

$$\varphi_{s_i,s_j}(y_i, y_j) = \frac{P(y_i \mid s_i)\, S_i + P(y_j \mid s_j)\, S_j}{\sum_i S_i}, \qquad (8)$$

where $P(y_i \mid s_i)$ is the probability of superpixel $s_i$ taking label $y_i$, defined in (4), and $S_i$ is the size of superpixel $i$. Thus, each superpixel $i$ receives $N-1$ votes from all other superpixels for its label assignment $y_i$:

$$V_{s_i}(y_i) = \sum_{j=1,\, j \neq i}^{N} \mu_{y_i,y_j}\, \varphi_{s_i,s_j}(y_i, y_j). \qquad (9)$$

We define the pairwise potential by aggregating the votes of all superpixels for their label assignments $Y$:

$$F_{pair}(X, Y) = \sum_i V_{s_i}(y_i) = \sum_i \sum_{j \neq i} \mu_{y_i,y_j}\, \varphi_{s_i,s_j}(y_i, y_j), \qquad (10)$$

where the $\mu_{c_i,c_j}$ are the $K^2$ model parameters of the pairwise potential, describing the preference for co-occurring class pairs in the data.
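The pairwise potential of Eqs. (8)–(10) can be sketched directly. This is a minimal, unoptimized illustration under our own variable naming (P, sizes, mu), not the paper's implementation:

```python
# Sketch of Eqs. (8)-(10): every superpixel votes for the labels of all others,
# weighted by its area and the confidence of its logistic-regression guess.
# 'mu' is the K x K co-occurrence parameter matrix learned in the S-SVM.
import numpy as np

def pairwise_energy(P, sizes, labels, mu):
    # P: (N, K) probabilities P(y|s_i); sizes: (N,); labels: (N,) a labeling Y
    N = len(labels)
    total = sizes.sum()
    conf = P[np.arange(N), labels] * sizes / total  # P(y_i|s_i) * S_i / sum(S)
    F = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                phi = conf[i] + conf[j]             # phi_{si,sj} of Eq. (8)
                F += mu[labels[i], labels[j]] * phi # mu-weighted vote, Eq. (9)
    return F                                        # F_pair of Eq. (10)
```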
3.3. The global constraint

When most superpixels are correctly labeled, the co-occurrence property helps rectify the minority of superpixels that are mislabeled. However, as the proportion of mislabeled superpixels increases, errors become more likely to propagate to other superpixels through the voting scheme. To resolve this problem, a global constraint at the image level is further introduced to confine the possible classes present in an image. Existing global consistency potentials, however, are either too simple in expressive power, allowing regions only a single class label, such as the Potts [12] and robust P^N-based [14] potentials, or too complicated, having to search an exponential space of likely label combinations, such as the harmony potential [9]. We propose a new, efficient global constraint with a better trade-off between expressive power for heterogeneous regions and the efficiency of searching the exponential space of possible label combinations.

With the help of the multi-class ranking algorithm of [20], we first rank all object classes at the image level and then transform the ranking into a soft constraint. To obtain the multi-class ranking scores, each image is represented by kernel descriptors [29] and its corresponding binary label vector $l_i = \{l_i^1, \dots, l_i^K\} \in \{-1, +1\}^K$, where $K$ is the total number of object classes, $l_i^j = +1$ denotes the presence of class $j$ in image $I_i$ and $l_i^j = -1$ denotes its absence. We aim to learn $K$ classification functions $f_t(I_i): \mathbb{R}^d \to \mathbb{R}$, $t \in \mathcal{C} = \{c_1, c_2, \dots, c_K\}$, one for each class, such that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$. (We use the code available at http://www.cse.msu.edu/bucakser/software.html.)

The ranking score indicates the confidence of assigning a specific label to a given image. Although informative, the result is still rough. Therefore, rather than thresholding the score vector into a binary vector to obtain a possible label set, as in [13,30], we transform the ranking scores into a soft constraint using a sigmoid function:

$$h_t(I) = \frac{1}{1 + a \exp(-b\, f_t(I))} + q, \qquad (11)$$

where $a$, $b$, $q$ are parameters to be learned on a validation set. Each image $I_j$ can then be represented by the K-dimensional vector $r_j = \{h_{c_1}(I_j), \dots, h_{c_K}(I_j)\}$. This soft constraint is integrated with the unary potential to impose an image-level label prior on the superpixels within image $I_j$. The intermediate representation of a superpixel defined in (5) is thus revised as

$$\tilde{x}_i = (h_{c_1}(I_j)\, P(c_1 \mid s_i),\; h_{c_2}(I_j)\, P(c_2 \mid s_i),\; \dots,\; h_{c_K}(I_j)\, P(c_K \mid s_i))^\top. \qquad (12)$$

To illustrate the benefit of the proposed soft constraint strategy, we compare it with two alternative global constraint strategies: a Top-n hard constraint and a Threshold-t constraint. The Top-n constraint keeps the n most probable labels for each image according to the ranking score vector $f$ computed above; all other labels are simply discarded. The Threshold-t strategy, instead of selecting a fixed number of labels per image, filters out unlikely labels whose ranking scores fall below a threshold $t$. We compare against these strategies in Section 6.2 to show the effectiveness of our approach.

The advantages of transforming the multi-label ranking into a global constraint are two-fold. On the one hand, the multi-label ranking score inferred at the image level is more representative for heterogeneous regions, since it encourages multiple labels, in contrast to the robust P^N model [14], which limits the parent node to a single label. On the other hand, instead of inferring the possible label set of an image from the exponentially sized power set of labels as in [22], which is intractable and can only be handled by sampling, we directly compute the ranking score of every label for an image, and these scores integrate directly with the predictions obtained from local features and logistic regression.
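The soft constraint of Eqs. (11)–(12) reduces to a few lines of code. A minimal sketch follows, using the parameter values reported later in Section 6.2 (a = 3, b = 3, q = 0.4); variable names are ours:

```python
# Sketch of the soft global constraint: ranking scores f_t(I) are squashed by
# the sigmoid of Eq. (11) and then modulate the per-superpixel class
# probabilities as in Eq. (12).
import numpy as np

def soft_constraint(f, a=3.0, b=3.0, q=0.4):
    # f: (K,) multi-class ranking scores for one image -> h in (q, 1 + q)
    return 1.0 / (1.0 + a * np.exp(-b * f)) + q

def apply_global_prior(P, f):
    # P: (N, K) logistic-regression probabilities; returns x_tilde of Eq. (12)
    h = soft_constraint(f)
    return P * h[None, :]

f = np.array([2.1, -0.5, 0.3])    # toy ranking scores for K = 3 classes
P = np.array([[0.6, 0.3, 0.1]])   # one superpixel's class probabilities
print(apply_global_prior(P, f))   # likely classes boosted, others damped
```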
3.4. Overall compatibility function

Combining all of the above, we propose the following compatibility function:

$$F(X, Y) = F_{unary}(X, Y) + F_{pair}(X, Y) = \sum_i \omega_{y_i}^\top \tilde{x}_i + \sum_i \sum_{j \neq i} \mu_{y_i,y_j}\, \varphi_{s_i,s_j}(y_i, y_j). \qquad (13)$$

The compatibility function combines local and global cues and contextual information in a unified framework and makes a joint labeling inference, which efficiently attenuates the ambiguity of local appearance similarity and makes the labeling more consistent. We systematically evaluate our model on two challenging semantic segmentation datasets and compare it with state-of-the-art methods in Section 6.

4. Inference

The inference process defined in (1) seeks the most compatible labeling $Y$ for a given observation $X$. Maximizing this compatibility function can typically be formulated as an integer program, which is NP-hard in general except in some special cases (e.g. $K = 1$), and consequently can only be solved approximately. In this paper, we adopt a greedy search algorithm in an iterative style because of its simplicity. First we rewrite the compatibility function as

$$F(X, Y) = \sum_i \omega_{y_i}^\top \tilde{x}_i + \sum_i \sum_{j \neq i} \mu_{y_i,y_j}\, \varphi_{s_i,s_j}(y_i, y_j) = \sum_i \Big\{ \omega_{y_i}^\top \tilde{x}_i + \sum_{j \neq i} \mu_{y_i,y_j}\, \varphi_{s_i,s_j}(y_i, y_j) \Big\} = \sum_i g(y_i \mid \tilde{x}_i, y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_N), \qquad (14)$$

where $g(\cdot)$ is the potential of superpixel $x_i$ being labeled $y_i$ while the rest are labeled $y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_N$. The inference process is then as follows: (1) in each iteration we randomly choose one superpixel and fix the labels of all others; (2) we compute the score function $g$ for all $K$ possible classes of this superpixel; (3) if the label with the largest score differs from the previous label, we update the label; (4) the iteration stops when no more labels change or the maximum number of iterations is reached.

Like most greedy search algorithms, the initialization is crucial to performance. In our case, we found that the local prediction obtained from logistic regression serves as a natural and good starting point: we initialize our prediction with the logistic regression results of Eq. (12) instead of random values. The pseudocode of the above operations is given as follows:

Algorithm 1. Inference algorithm
1: Input:
2:   image features $I_j$, superpixels $s_i$, $i = 1, \dots, N$
3: Initialization:
4:   $\hat{y}_i = \arg\max_{y_i \in \mathcal{C}} h_{y_i}(I_j)\, P(y_i \mid s_i)$
5:   $\tilde{x}_i = (h_{c_1}(I_j)\, P(c_1 \mid s_i), \dots, h_{c_K}(I_j)\, P(c_K \mid s_i))$
6: repeat
7:   for all superpixels $s_i$ do
8:     $y_i = \arg\max_{y_i \in \mathcal{C}} g(y_i \mid \tilde{x}_i, \hat{y}_1, \dots, \hat{y}_{i-1}, \hat{y}_{i+1}, \dots, \hat{y}_N)$
9:     if $y_i \neq \hat{y}_i$ then
10:      update $\hat{y}_i \leftarrow y_i$
11:    end if
12:  end for
13: until no label changes OR max iterations reached
14: return $Y = \{\hat{y}_i \mid i = 1, \dots, N\}$

Because the inference is conducted directly on superpixels instead of pixels, the number of variables is significantly reduced, typically from tens of thousands (e.g. an image of 400 × 300 pixels) to several hundred (usually 100–300 superpixels per image). The inference algorithm therefore converges very fast, typically in fewer than 15 iterations.
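A runnable sketch of Algorithm 1 is given below. It reuses the quantities introduced above (x_tilde from Eq. (12), the voting terms of Eq. (8), the unary weights omega and pairwise weights mu); the O(N²K) inner loop is a straightforward, unoptimized transcription under our own naming:

```python
# Sketch of Algorithm 1: iterated greedy search over superpixel labels,
# initialized from the globally modulated logistic-regression scores.
import numpy as np

def greedy_inference(x_tilde, sizes, P, omega, mu, max_iters=15):
    # x_tilde: (N, K) Eq. (12); P: (N, K) Eq. (4); omega: (K, K); mu: (K, K)
    N, K = x_tilde.shape
    total = sizes.sum()
    y = x_tilde.argmax(axis=1)               # initialization, line 4
    for _ in range(max_iters):
        changed = False
        for i in np.random.permutation(N):   # visit superpixels in random order
            scores = np.empty(K)
            for c in range(K):
                unary = omega[c] @ x_tilde[i]
                phi_i = P[i, c] * sizes[i] / total
                pair = sum(mu[c, y[j]] *
                           (phi_i + P[j, y[j]] * sizes[j] / total)
                           for j in range(N) if j != i)
                scores[c] = unary + pair     # g(.) of Eq. (14)
            best = scores.argmax()
            if best != y[i]:
                y[i], changed = best, True
        if not changed:                      # stop when labels are stable
            break
    return y
```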
5. Learning

In this section, we discuss how to learn the proposed model of (13), i.e., the model parameters $\omega$. To find the optimal solution $\omega^*$, we follow the idea of [27] for structured output prediction and consider the following maximum-margin optimization problem:

$$\min_{\omega,\, \xi}\; \frac{1}{2}\|\omega\|^2 + \frac{C}{n} \sum_i \xi_i \quad \text{s.t.} \quad \forall i:\; \xi_i \ge 0,\;\; \forall \hat{Y} \in \mathcal{X} \setminus Y_i:\; \langle \omega, \delta\varphi(X_i, Y_i, \hat{Y}) \rangle \ge \Delta(\hat{Y}, Y_i) - \xi_i, \qquad (15)$$

where $\delta\varphi(X, Y, \hat{Y}) = \varphi(X, Y) - \varphi(X, \hat{Y})$, $\xi_i$ is a slack variable which becomes non-zero when the margin is violated, $Y_i$ is the ground-truth labeling of the given image, and $\mathcal{X}$ is the structured output space. $\Delta(\hat{Y}, Y)$ is the loss function that quantifies how incorrect the prediction $\hat{Y}$ is when $Y$ is the correct output.

One intuitive form of the loss function is the 0–1 loss on each superpixel:

$$\Delta(\hat{Y}, Y) = \sum_{i}^{N} \big(1 - \delta(\hat{y}_i, y_i)\big), \qquad (16)$$

where $\delta$ takes the value 1 when its two arguments are identical and 0 otherwise. However, the loss function defined in (16) penalizes incorrect superpixel labelings equally, without taking the region size into account: the loss of a large mislabeled superpixel equals that of a very small one. We therefore derive a more appropriate loss function:

$$\Delta(\hat{Y}, Y) = \frac{\sum_i^N \eta\, S_i \big(1 - \delta(\hat{y}_i, y_i)\big)}{\sum_i S_i}, \qquad (17)$$

where $S_i$ is the area of superpixel $i$ and $\eta$ is a weight factor to be learned by cross validation.

Because the structured output space $\mathcal{X}$ grows exponentially with the number of superpixels $N$ and the number of object classes $K$, the number of constraints in (15) is also exponentially large, which makes direct optimization impossible. Current state-of-the-art approaches typically use the cutting plane algorithm proposed by Joachims et al. [19] and their SVMstruct implementation (the code is available at http://svmlight.joachims.org/svm_struct.html). For better efficiency, we follow a variant implementation of the cutting plane algorithm presented in [31]. The learning algorithm aims to find a small set of constraints that ensures a sufficiently accurate solution. It starts from an unconstrained optimization problem, a relaxation of the original problem, and maintains a working set $W_i$. In each pass through the training set, the "most violated" constraint is selected and added to the working set if a certain condition is satisfied. Once a constraint is added, the problem is optimized again to obtain a new solution. Iteration stops when no constraint has been added or the objective precision has been reached.
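The two losses of Eqs. (16)–(17), which the cutting-plane learner evaluates when searching for the most violated constraint, can be sketched as follows (a minimal illustration under our own naming; eta is the cross-validated weight factor):

```python
# Sketch of Eqs. (16)-(17): plain 0-1 superpixel loss versus the area-weighted
# variant that penalizes large mislabeled superpixels more.
import numpy as np

def loss_01(y_pred, y_true):
    return np.sum(y_pred != y_true)                    # Eq. (16)

def loss_weighted(y_pred, y_true, sizes, eta=1.0):
    wrong = (y_pred != y_true).astype(float)
    return eta * np.sum(sizes * wrong) / sizes.sum()   # Eq. (17)

y_true = np.array([0, 1, 2]); y_pred = np.array([0, 1, 1])
sizes = np.array([1000.0, 50.0, 5000.0])               # superpixel areas
print(loss_01(y_pred, y_true), loss_weighted(y_pred, y_true, sizes))
# one mislabeled superpixel, but the weighted loss reflects its large area
```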
6. Experimental results

In this section, we evaluate the proposed method on two benchmark datasets widely used for semantic image segmentation evaluation: the MSRC-21 dataset [32] and the Stanford Background Dataset (SBD) [33]. MSRC-21 consists of 591 images of 21 classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat, with ground truth provided at the pixel level. A void label is included to avoid the membership ambiguity of pixels on object boundaries; it is typically ignored in training and evaluation. Following [2], we divide MSRC-21 into 45% for training, 10% for validation, and 45% for testing. SBD is mostly used for background understanding; various foreground objects, including car, cow, book, boat, chair, person, etc., are merged into one foreground class. It contains 715 images chosen from the following public datasets: LabelMe [34], MSRC-21 [32], PASCAL VOC [35], and Geometric Context [36]. Eight category labels were obtained using Amazon Mechanical Turk (AMT): sky, tree, road, grass, water, building, mountain, and foreground.

6.1. Influence of over-segmentation

Though over-segmentation is widely adopted as a key pre-processing step in semantic segmentation, its impact on the subsequent learning is rarely evaluated. In this section, we test the influence of four popular over-segmentation techniques: Mean Shift (MS) [3], Felzenszwalb and Huttenlocher's efficient graph-based segmentation (FH) [4], SLIC [7], and TurboPixel (TP) [6]. Note that the different methods segment quite differently, as shown in Fig. 1, where each image is segmented into about 150 superpixels: MS and FH tend to generate larger superpixels in coherent regions and smaller superpixels in complex regions, while SLIC and TurboPixel produce grid-style, balanced superpixels. The question is how such differences in over-segmentation affect the subsequent superpixel labeling. To this end, we consider the task of labeling the superpixels from the different over-segmentation methods using logistic regression. In particular, to stress the role of over-segmentation, no pairwise or global contextual information is incorporated. We use MSRC-21 in this experiment.

For the feature representation of superpixels, we combine appearance-based and bag-of-words descriptors (see Section 3.1). The appearance-based descriptor has 238 features, consisting of (1) color features computing the mean, standard deviation, skewness, and kurtosis statistics of the RGB, Lab, and YCrCb color-space channels and the gray image (4 × 10 dimensions); (2) texture features computing the same statistics of 48 filter responses (4 × 48 dimensions), including first and second derivatives of Gaussians and Laplacians-of-Gaussians at various orientations and scales; (3) shape features (3 dimensions); and (4) location features (3 dimensions). To build the bag-of-words (BoW) representation, we divide an image into 16 × 16 pixel cells with 75% overlap. Each cell is captured by a 128-dimensional SIFT descriptor. The dictionary of 400 visual words is built with K-means clustering, and the descriptors are quantized by nearest neighbor. To represent a superpixel, we concatenate the BoW representations of the superpixel and the region around it, giving a BoW feature vector of length 2 × 400 = 800. The overall representation of each superpixel is thus of 238 + 800 = 1038 dimensions. A logistic regressor is trained on the training set obtained by the standard split of the MSRC-21 dataset, with the cost parameter set to C = 25.

For the evaluation metric, we follow [17] and use the global accuracy, the proportion of correctly labeled pixels among all pixels considered (excluding pixels with the void label):

$$\text{accuracy} = \frac{\sum_i N_{ii}}{\sum_{i,j} N_{ij}}, \qquad (18)$$

where $N_{ij}$ is the number of pixels with ground-truth label $i$ that are labeled $j$ (a small code sketch of this metric is given at the end of this section).

The results are shown in Fig. 2(a), where different numbers of superpixels are tested. We can see that the results of FH, SLIC and TP are relatively robust when the number of superpixels is greater than 50, compared with those of MS. Overall, FH performs slightly better than the rest and is adopted in our structured prediction model in the following experiments. Moreover, we compute the achievable performance of the different segmentations by assigning the dominant ground-truth labels to superpixels, as shown in Fig. 2(b). The accuracies increase with the number of superpixels, as expected: the finer the granularity, the better the segmentation coincides with object borders. On the other hand, the performance of semantic segmentation (Fig. 2(a)) does not increase monotonically with the number of superpixels.

Fig. 2. Influence of the initial segmentation (SLIC, TurboPixel, Mean Shift, graph-based; 50–300 superpixels) on semantic segmentation performance. (a) Semantic segmentation performance with unary features and a linear classifier. (b) Performance obtained by assigning the dominant labels to superpixels.
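The global accuracy of Eq. (18) amounts to the trace of a confusion matrix divided by its total; a minimal sketch (our own function name, with void pixels marked by a hypothetical sentinel value) follows:

```python
# Sketch of Eq. (18): global pixel accuracy from a confusion matrix N, where
# N[i, j] counts pixels of ground-truth label i predicted as j and void
# pixels are excluded beforehand.
import numpy as np

def global_accuracy(gt, pred, K, void_label=-1):
    mask = gt != void_label                  # drop void pixels
    conf = np.zeros((K, K), dtype=np.int64)
    np.add.at(conf, (gt[mask], pred[mask]), 1)
    return np.trace(conf) / conf.sum()

gt = np.array([0, 0, 1, 1, 2, -1])           # -1 marks void pixels
pred = np.array([0, 1, 1, 1, 2, 0])
print(global_accuracy(gt, pred, K=3))        # 4/5 = 0.8
```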
6.2. Influence of global constraints

In this section, we evaluate the impact of multi-label ranking as a global constraint on semantic segmentation by integrating the global constraints into the experiment of the previous section. We compare three ways of using the label ranking: the Top-n constraint, the Threshold constraint, and our proposed soft constraint (see Section 3.3). In order to capture image properties complementary to the local superpixel features (see Section 3.1), we use different features for the multi-label ranking. We adopt kernel descriptors [29] for the holistic image representation, constructing kernel descriptors from gradient, color, and local binary pattern match kernels using kernel principal component analysis (KPCA). Following the settings in [29], an image is divided into 16 × 16 pixel patches with 50% overlap to extract low-level features. We compute image-level features using efficient match kernels (EMK) on 1 × 1, 2 × 2, and 4 × 4 pyramid sub-regions, and perform constrained kernel singular value decomposition (CKSVD) with 1000 visual words learned by K-means. Overall, each image is represented by an 84,000-dimensional feature vector. We adopt the efficient multi-class ranking algorithm [20] to learn K classification functions $f_t(I_i): \mathbb{R}^d \to \mathbb{R}$, $t \in \mathcal{C} = \{c_1, c_2, \dots, c_K\}$, one for each class, with the goal that for any image $I$, $f_{c_i}(I)$ scores higher than $f_{c_j}(I)$ when $I$ is more likely to belong to class $c_i$ than to class $c_j$. We compare the kernel descriptor with the widely used spatial pyramid matching (SPM) representation under similar settings. The results, measured by ROC curves, are shown in Fig. 3: the area under the curve (AUC) of SPM is 90.3%, while the AUC of the kernel descriptor reaches a higher 94.3%.

Fig. 3. ROC curves of the two feature representations (KDES and SPM) in multi-class ranking. The area under the curve is 94.3% for the kernel descriptor and 90.3% for spatial pyramid matching.

Now we are ready to report results after the integration of the multi-label ranking. Recall that the Top-n constraint considers only the top n labels of each image according to the ranking scores, while the Threshold-t constraint retains those with scores greater than t. In contrast, our method converts the ranking scores into a soft constraint using the sigmoid function defined in Eq. (11) (here we set a = 3, b = 3, q = 0.4). Either hard or soft constraint is combined with the logistic regression as in Eq. (12), and the labels of superpixels are inferred by Eq. (6). The results are shown in Table 1. The proposed soft constraint method outperforms the two hard constraint alternatives under all tested parameters.

Table 1. Comparison of different global constraint methods.

Method                               Global    Average
Local feature                         72.8      58.6
Local feature + Top 3 labels          72.6      59.8
Local feature + Top 4 labels          76.1      64.5
Local feature + Top 5 labels          77.5      65.0
Local feature + Top 6 labels          76.8      64.9
Local feature + Top 7 labels          76.3      63.1
Local feature + Threshold = 0         75.9      63.0
Local feature + Threshold = 0.2       76.7      64.5
Local feature + Threshold = 0.4       77.8      66.0
Local feature + Threshold = 0.6       78.0      66.7
Local feature + Threshold = 0.8       74.3      60.8
Local feature + Soft constraint       79.1      67.7
6.3. Results for MSRC-21

In this section, we report our structured prediction results on MSRC-21. We also report the results obtained by combining local unary features and logistic regression, with and without the global constraint, where no pairwise co-occurrence information is incorporated. For comparison, we show the results of state-of-the-art methods taken from [22,30,9,37,17,10,21]. The overall results are summarized in Table 2.

Table 2. Quantitative results on the MSRC-21 dataset (per-class, global, and average accuracy in %), computed following the protocol defined in [17]. Per-class results of the compared methods, Gould et al. [10], Ladicky et al. [17], Gonfaus et al. [9], Munoz et al. [37], Csurka and Perronnin [30], Lucchi et al. [21], and Boix et al. [22], are those reported in the respective publications.

Method           Build. Grass Tree Cow Sheep Sky Aero Water Face Car Bicycle Flower Sign Bird Book Chair Road Cat Dog Body Boat  Global Average
LR w/o global    66     94    83   50  52    93  77   70    64   51  84      73     54   25   69   40    82   39  41  42   19    73     59
LR w/ global     76     98    87   68  66    68  96   70    73   82  74      77     57   32   79   58    86   18  74  57   20    79     68
Structural SVMs  70     98    87   76  79    90  81   75    86   60  88      96     72   36   90   79    87   65  60  54   35    84     76

From Table 2, we can see that using local unary features and logistic regression yields a baseline of 73% pixel-wise global accuracy and 59% average per-class accuracy. Note that in this baseline, the label of a region (i.e., a superpixel) is decided by its appearance alone. By integrating the multi-label ranking results, we improve the global accuracy by 6% and the average accuracy by 9%. This shows that global cues can effectively guide the labeling of local regions by substantially reducing the candidate classes considered during labeling: region labels consistent with the global ranking are strengthened, and the others are suppressed. By further refining the labeling with pairwise co-occurrence information in the structural SVM framework, we achieve 84% global accuracy and 76% average accuracy, which is highly competitive with the results reported by previous methods, although our model is much simpler and more efficient, in that we decouple the global constraint from the pairwise potential in the joint inference and instead integrate it with the local prediction from logistic regression (Section 3.1).

Considering the per-class accuracy, we obtain very good performance on classes such as grass, sky and flower, which can be easily inferred from local appearance; their accuracies are above 95%. For some difficult classes, such as bird and boat, the accuracies are below 40%, due to similar appearance, varying sizes, and complex backgrounds.

Fig. 4 shows example results of our model. Consider the images shown in Fig. 4(a); the labeling results obtained by applying logistic regression to the local appearance features are shown in Fig. 4(b), where the label of a region is decided by its appearance feature alone. Taking the first image as an example, partial regions of the bird are mislabeled as dog, cat, sheep or even road because of the ambiguity of local appearance. The multi-label ranking results then give higher confidence to labels such as grass, bird and dog and suppress the presence of road, sheep and cat, as in Fig. 4(c).

Fig. 4. Example results on the MSRC-21 dataset by our model. (a) Original images. (b) Logistic regression prediction. (c) Logistic regression prediction with the global constraint. (d) Structured prediction with the multi-class labeling prior and contextual information. (e) Ground-truth labeling.
Finally, with the co-occurrence property introduced, most of the regions labeled bird suppress the presence of dog within the image, because these two classes rarely appear at the same time. Fig. 4(d) shows the labeling results obtained by our final structured prediction, post-processed by grouping superpixels into larger groups; the final results are much cleaner and more consistent.

The proposed method is very efficient. Training the structural SVM on a training set of 335 images from MSRC-21 takes about 800 s, and labeling one test image takes about 1 s. These timings were measured in MATLAB 7.10.0 (R2010a), 64-bit, on a laptop with a 2.67 GHz i5 CPU and 8 GB of RAM.

6.4. Results on the Stanford Background dataset

In this section, we report our results on SBD. We follow [33] and perform 5-fold cross-validation, with the dataset randomly divided into 572 training images and 143 test images in each fold. The results are shown in Table 3. We can see that our structured prediction model performs favorably compared with other state-of-the-art methods. We also observe that the incorporation of the global label ranking, although useful, does not improve the performance significantly. This can probably be explained from two aspects. First, in SBD the foreground class includes a wide range of object classes, such as person, car, cow, sheep and bicycle, whose appearances vary drastically among classes, making the appearance very difficult to model. Second, the number of classes in SBD is much smaller than in MSRC-21, so the multi-label ranking and co-occurrence statistics may be less informative. Besides foreground, another challenging class is mountain, which has few instances in the dataset, making it very hard to label correctly.

Table 3. Quantitative results on the Stanford Background dataset (accuracy in %).

Method             Sky   Tree  Road  Grass Water Build. Mountain Foreground  Global Average
Gould et al. [33]  92.6  61.4  89.6  82.4  47.9  82.4   13.8     53.7        76.4   65.5
Munoz et al. [37]  91.6  66.3  86.7  83.0  59.8  78.4    5.0     63.5        76.9   66.2
LR w/o global      91.9  70.0  88.9  78.3  54.9  79.6    6.3     54.2        76.6   65.5
LR w/ global       92.4  70.0  89.3  77.8  51.5  79.3    3.0     57.9        77.0   65.2
Structural SVMs    94.9  69.7  90.0  81.4  60.9  79.9   13.5     54.1        77.8   68.0

Some example results of our model are shown in Fig. 5, where we can see that local labeling alone is not sufficient to resolve the appearance ambiguity (Fig. 5(b)). In the presence of pairwise and global cues, the labeling becomes more robust, as shown in Fig. 5(d).

Fig. 5. Example results on the Stanford Background dataset by our model. (a) Original images. (b) Logistic regression prediction. (c) Logistic regression prediction with the global constraint. (d) Structured prediction with the multi-class labeling prior and contextual information. (e) Ground-truth labeling. Note that objects such as cars, persons and horses are merged into one foreground class.

7. Conclusion

We have presented a new structured prediction model for semantic segmentation. Traditional structured prediction frameworks using pairwise constraints alone degrade when a notable number of regions within an image are wrongly labeled in the early-stage prediction by logistic regression, because the wrong contextual information of mislabeled regions may propagate to correctly labeled ones. It is therefore necessary to confine the possible labels at the image level.
We utilized the multi-label ranking score and converted it into a soft global constraint, which encourages the presence of likely labels while suppressing unlikely ones. Compared with existing global constraint schemes, we decoupled the global constraint from the pairwise constraint and integrated it directly with the unary potential, making the model much simpler while remaining efficient. The proposed model was evaluated on two challenging datasets, and experiments showed that it obtains highly competitive performance compared with state-of-the-art results.

In future work, we plan to integrate multi-source cues such as depth into the structural SVM framework. So far we have only considered extracting multi-scale cues from a single source, the optical image. Features from multiple sources could contain complementary information and be potentially useful for improving performance.

Acknowledgments

This work was supported by the National Key Project for Basic Research of China (2013CB329403), the Nature Science Foundation of China (No. 61373076), the Fundamental Research Funds for the Central Universities (No. 2013121026), and the 985 Project of Xiamen University.

References

[1] J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: joint appearance, shape and context modeling for multi-class object recognition and segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2006, pp. 1–15.
[2] J. Shotton, M. Johnson, R. Cipolla, Semantic texton forests for image categorization and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[3] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 603–619.
[4] P. Felzenszwalb, D. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision 59 (2) (2004) 167–181.
[5] A. Vedaldi, S. Soatto, Quick shift and kernel methods for mode seeking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2008, pp. 705–718.
[6] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, K. Siddiqi, TurboPixels: fast superpixels using geometric flows, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (12) (2009) 2290–2297.
[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC Superpixels, Technical Report 149300, EPFL (June).
[8] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, S. Belongie, Objects in context, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[9] J. Gonfaus, X. Boix, J. van de Weijer, A. Bagdanov, J. Serrat, J. Gonzàlez, Harmony potentials for joint classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3280–3287.
[10] S. Gould, J. Rodgers, D. Cohen, G. Elidan, D. Koller, Multi-class segmentation with relative location prior, International Journal of Computer Vision 80 (3) (2008) 300–316.
[11] C. Galleguillos, A. Rabinovich, S. Belongie, Object categorization using co-occurrence, location and appearance, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[12] N. Plath, M. Toussaint, S. Nakajima, Multi-class image segmentation using conditional random fields and global classification, in: Proceedings of the International Conference on Machine Learning (ICML), 2009, pp. 817–824.
[13] G. Csurka, F. Perronnin, A simple high performance approach to semantic segmentation, in: Proceedings of the British Machine Vision Conference (BMVC), 2008.
[14] P. Kohli, L. Ladický, P. Torr, Robust higher order potentials for enforcing label consistency, International Journal of Computer Vision 82 (3) (2009) 302–324.
[15] L. Ladicky, C. Russell, P. Kohli, P. Torr, Graph cut based inference with co-occurrence statistics, in: Proceedings of the European Conference on Computer Vision (ECCV), 2010, pp. 239–253.
[16] P. Kohli, M. Kumar, P. Torr, P³ & beyond: solving energies with higher order cliques, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[17] L. Ladicky, C. Russell, P. Kohli, P. Torr, Associative hierarchical CRFs for object class image segmentation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 739–746.
[18] J. Zhu, E. Xing, B. Zhang, Laplace maximum margin Markov networks, in: Proceedings of the International Conference on Machine Learning (ICML), 2008, pp. 1256–1263.
[19] T. Joachims, T. Finley, C. Yu, Cutting-plane training of structural SVMs, Machine Learning 77 (1) (2009) 27–59.
[20] S. Bucak, P. Kumar Mallapragada, R. Jin, A. Jain, Efficient multi-label ranking for multi-class learning: application to object recognition, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 2098–2105.
[21] A. Lucchi, Y. Li, X. Boix, K. Smith, P. Fua, Are spatial and global constraints really necessary for segmentation?, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011, pp. 9–16.
[22] X. Boix, J. Gonfaus, J. van de Weijer, A. Bagdanov, J. Serrat, J. Gonzàlez, Harmony potentials, International Journal of Computer Vision 96 (1) (2012) 83–102.
[23] S. Nowozin, P. Gehler, C. Lampert, On parameter learning in CRF-based approaches to object class image segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2010, pp. 98–111.
[24] G. Heitz, S. Gould, A. Saxena, D. Koller, Cascaded classification models: combining models for holistic scene understanding, in: Proceedings of Neural Information Processing Systems (NIPS), 2008.
[25] S. Gould, T. Gao, D. Koller, Region-based segmentation and object detection, in: Proceedings of Neural Information Processing Systems (NIPS), 2009.
[26] D. Hoiem, A. Efros, M. Hebert, Closing the loop in scene interpretation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
[27] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun, Support vector machine learning for interdependent and structured output spaces, in: Proceedings of the International Conference on Machine Learning (ICML), 2004, pp. 104–111.
[28] B. Fulkerson, A. Vedaldi, S. Soatto, Class segmentation and object localization with superpixel neighborhoods, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 670–677.
[29] L. Bo, X. Ren, D. Fox, Kernel descriptors for visual recognition, in: Proceedings of Neural Information Processing Systems (NIPS), 2010.
[30] G. Csurka, F. Perronnin, An efficient approach to semantic segmentation, International Journal of Computer Vision 95 (2) (2011) 198–212.
[31] C. Desai, D. Ramanan, C. Fowlkes, Discriminative models for multi-class object layout, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 229–236.
[32] A. Criminisi, Microsoft Research Cambridge Object Recognition Image Database. <http://research.microsoft.com/en-us/projects/objectclassrecognition>.
[33] S. Gould, R. Fulton, D. Koller, Decomposing a scene into geometric and semantically consistent regions, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009, pp. 1–8.
[34] B. Russell, A. Torralba, K. Murphy, W. Freeman, LabelMe: a database and web-based tool for image annotation, International Journal of Computer Vision 77 (1–3) (2008) 157–173.
[35] M. Everingham, L. Van Gool, C. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[36] D. Hoiem, A. Efros, M. Hebert, Recovering surface layout from an image, International Journal of Computer Vision 75 (1) (2007) 151–172.
[37] D. Munoz, J. Bagnell, M. Hebert, Stacked hierarchical labeling, in: Proceedings of the European Conference on Computer Vision (ECCV), 2010, pp. 57–70.
[38] Z. Li, X.-M. Wu, S.-F. Chang, Segmentation using superpixels: a bipartite graph partitioning approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 789–796.