Jamie Shotton, Machine Intelligence Laboratory, University of Cambridge
John Winn, Carsten Rother, Antonio Criminisi, Microsoft Research Cambridge, UK
Presenter: Kuang-Jui Hsu
Date: 2011/7/04 (Mon.) 1
• Introduction
• A Conditional Random Field Model of Object Classes
• Boosted Learning of Texture, Layout, and Context
• Results and Comparisons 2
• Goal: automatic detection, recognition, and segmentation of object classes in photographs
• The concern is not only the accuracy of segmentation and recognition, but also the efficiency of the algorithm
• At a local level, the appearance of an image patch leads to ambiguities in its class label
• To overcome these ambiguities, it is necessary to incorporate longer-range information
• To achieve this, a discriminative model for labeling images is constructed which exploits all three types of information: textural appearance, layout, and context 3
• Overcomes problems associated with object recognition techniques that rely on sparse features
• The authors' technique, based on dense features, is capable of coping with both textured and untextured objects, and with multiple objects which inter- or self-occlude 4
• Three contributions:
• A novel type of feature called the texture-layout filter
• A new discriminative model that combines texture-layout filters with lower-level image features
• A demonstration of how to train this model efficiently on a very large dataset by exploiting both boosting and piecewise training methods 5
• Use a conditional random field (CRF) model to learn the conditional distribution over the class labeling given an image
• Incorporates texture, layout, color, location, and edge cues 6
• Definition:
log P(c | x, θ) = Σ_i [ ψ_i(c_i, x; θ_ψ) + π(c_i, x_i; θ_π) + λ(c_i, i; θ_λ) ] + Σ_{(i,j)∈ε} φ(c_i, c_j, g_ij(x); θ_φ) − log Z(θ, x)
ψ: texture-layout potential; π: color potential; λ: location potential; φ: edge potential; log Z(θ, x): the log partition function
c: the class labels
x: an image
i: a node (pixel) in the graph
θ = {θ_ψ, θ_π, θ_λ, θ_φ}: the parameters to learn
ε: the set of edges in a 4-connected grid structure 7
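As a concrete reading of the definition above, a minimal sketch of the (unnormalized) CRF log-score on a 4-connected grid is given below; the `unary` table stands in for the summed texture-layout, color, and location potentials, and `edge_weight` for the learned contrast term — both are illustrative placeholders, not the learned values:

```python
def crf_log_score(labels, unary, edge_weight, height, width):
    """Unnormalized log-score of a labeling, in the spirit of slide 7:
    per-pixel unary terms plus contrast-sensitive Potts terms on a
    4-connected grid. unary[i][c] stands in for the summed texture-layout,
    color, and location potentials of pixel i and class c;
    edge_weight[(i, j)] stands in for theta_phi^T g_ij(x)."""
    score = sum(unary[i][labels[i]] for i in range(height * width))
    for y in range(height):
        for x in range(width):
            i = y * width + x
            # right and down neighbours enumerate each 4-connected edge once
            neighbours = ([i + 1] if x + 1 < width else []) + \
                         ([i + width] if y + 1 < height else [])
            for j in neighbours:
                if labels[i] != labels[j]:
                    score -= edge_weight[(i, j)]  # Potts penalty [c_i != c_j]
    return score
```

Subtracting log Z(θ, x) would turn this score into a log-probability; inference only needs relative scores, which is what the alpha-expansion step below maximizes.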
• Definition: ψ_i(c_i, x; θ_ψ) = log P̃(c_i | x, i)
P̃(c_i | x, i): the normalized distribution given by a boosted classifier
• This classifier models the texture, layout, and textural context of the object classes by combining novel discriminative features called texture-layout filters 8
• Color is represented with Gaussian mixture models (GMMs) in CIELab color space, where the mixture coefficients depend on the class label
• Conditional probability of the color x of a pixel:
P(x | c) = Σ_k P(x | k) P(k | c)
with color clusters (mixture components) P(x | k) = N(x | μ_k, Σ_k)
k: color cluster
μ_k, Σ_k: the mean and covariance, respectively, of color cluster k 9
P(x | c) = Σ_k P(x | k) P(k | c)
The color models use Gaussian mixture models where the mixture coefficients P(k | c) are conditioned on the class label c 10
• But we want to predict the class label c given the color x_i of a pixel and the color cluster k
• Use a simple inference method via Bayes' rule
• For pixel x_i:
P(c | x_i) = P(c, x_i) / P(x_i) = P(x_i | c) P(c) / P(x_i) = [P(c) / P(x_i)] P(x_i | c)
So, with a uniform prior over classes, P(c | x_i) ∝ P(x_i | c)
• Definition: π(c_i, x_i; θ_π) = log Σ_k θ_π(c_i, k) P(k | x_i) 11
• Definition: π(c_i, x_i; θ_π) = log Σ_k θ_π(c_i, k) P(k | x_i)
• The learned parameters θ_π(c_i, k) represent the distribution P(c_i | k)
For discriminative inference, the arrows in the graphical model are reversed using Bayes' rule 12
• Definition: λ(c_i, i; θ_λ) = log θ_λ(c_i, î)
î: the normalized version of the pixel index i, where the normalization allows for images of different sizes
• The parameters θ_λ are also learned 13
• Definition: φ(c_i, c_j, g_ij(x); θ_φ) = −θ_φᵀ g_ij(x) [c_i ≠ c_j]
• g_ij: the edge feature, measuring the difference in color between neighboring pixels:
g_ij = [exp(−β ‖x_i − x_j‖²), 1]ᵀ
x_i, x_j: three-dimensional vectors representing the colors of pixels i, j
β = (2⟨‖x_i − x_j‖²⟩)⁻¹, with ⟨·⟩ the average over the image 14
• Given the CRF model and its learned parameters, find the most probable labeling c*
• The optimal labeling is found by applying the alpha-expansion graph-cut algorithm 15
• Take a current configuration (set of labels) c and a fixed label α ∈ {1, …, C},
where C is the number of classes
• Each pixel i makes a binary decision: it can either keep its old label or switch to label α
• A binary vector s ∈ {0, 1}^N defines the auxiliary configuration c[s] as
c_i[s] = c_i if s_i = 0, and c_i[s] = α if s_i = 1
• Start with an initial configuration c⁰, given by the mode of the texture-layout potentials
• Compute optimal alpha-expansion moves for the labels α in some order, accepting a move only if it increases the objective function 16
• There are two methods to learn the parameters:
• Maximum a-posteriori (MAP) – poor results
• Piecewise training
• Only θ_π, θ_λ, θ_φ are learned by these methods
• θ_ψ is learned during boosted learning 17
• MAP training maximizes the conditional likelihood of the labels given the training data:
L(θ) = Σ_n log P(cⁿ | xⁿ, θ) + log P(θ)
(cⁿ, xⁿ): the training data (output and input)
log P(θ): a prior term that prevents overfitting
• The maximization of L(θ) with respect to θ can be achieved using a gradient ascent algorithm 18
• Conjugate gradient ascent did eventually converge to a solution, but evaluating the learned parameters against validation data gave poor results, with almost no improvement
• The lack of alignment between object edges and label boundaries in the roughly labeled training set forced the learned parameters to tend toward zero 19
• Based on the piecewise training method of "Piecewise Training of Undirected Models" [C. Sutton et al., 2005]
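Piecewise training rests on bounding the partition function of the combined model by per-term partition functions; a toy numeric check of that bound (with made-up factor values) can be sketched as:

```python
import math

# Toy check of the piecewise bound: for a model p(c) proportional to
# prod_r f_r(c) with nonnegative factors, the true partition function
# Z = sum_c prod_r f_r(c) never exceeds the product of the per-term
# partition functions Z_r = sum_c f_r(c).  Factor values are made up.
factors = [
    {0: 1.5, 1: 0.5},  # f_1(c) for classes c = 0, 1
    {0: 0.2, 1: 2.0},  # f_2(c)
]
classes = [0, 1]

Z = sum(math.prod(f[c] for f in factors) for c in classes)
log_Z = math.log(Z)
bound = sum(math.log(sum(f[c] for c in classes)) for f in factors)
print(log_Z <= bound)  # True: log Z <= sum_r log Z_r
```

The inequality holds because expanding the product of the per-term sums produces every term of Z plus additional nonnegative cross terms.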
• The terms are trained independently, then recombined
• The training method minimizes an upper bound on the log partition function: let z(θ, x) = log Z(θ, x), and index the terms in the model by r; then
z(θ, x) ≤ Σ_r z_r(θ_r, x)
θ_r: the parameters of the r-th term
z_r(θ_r, x): the log partition function for a model containing only the r-th term 20
z(θ, x) ≤ Σ_r z_r(θ_r, x)
Proof: use Jensen's inequality: for positive weights a_i,
f(Σ_i a_i x_i / Σ_i a_i) ≤ Σ_i a_i f(x_i) / Σ_i a_i if f is convex,
f(Σ_i a_i x_i / Σ_i a_i) ≥ Σ_i a_i f(x_i) / Σ_i a_i if f is concave;
z(θ, x) = log Z(θ, x) is concave 21
• Replacing z(θ, x) with the bound Σ_r z_r(θ_r, x) gives a lower bound on the conditional likelihood
• The bound can be loose, especially if the terms in the model are correlated
• Performing piecewise parameter training leads to over-counting during inference in the combined model
• Because of over-counting, λ_new = 2λ_ideal
• To avoid this, weight the logarithm of each duplicated term by a factor of 0.5, i.e. raise the term to the power 0.5 22
23
• Four types of parameters have to be learned:
• Texture-layout potential parameters
• Color potential parameters
• Location potential parameters
• Edge potential parameters
• The first is learned during boosted learning; the others are learned by piecewise training 24
• The color potentials are learned at test time for each image independently
• First, the color clusters P(x | k) = N(x | μ_k, Σ_k) are learned in an unsupervised manner using K-means
• Then an iterative algorithm, reminiscent of EM, alternates between inferring the class labeling c* and updating the color potential parameters as
θ_π(c, k) = [ ( Σ_i [c_i* = c] P(k | x_i) + α_π ) / ( Σ_i P(k | x_i) + α_π ) ]^{w_π} 25
• The location potential parameters are trained by maximizing the likelihood of the normalized model containing just that potential, raising the result to a fixed power w_λ to compensate for over-counting:
θ_λ(c, î) = [ (N_{c,î} + α_λ) / (N_î + α_λ) ]^{w_λ}
N_{c,î}: the number of pixels of class c at normalized location î in the training set
N_î: the total number of pixels at location î 26
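A minimal sketch of the location-potential estimate above; the grid resolution, smoothing constant, and power are illustrative stand-ins for the normalized-location discretization, α_λ, and w_λ:

```python
def learn_location_potentials(label_maps, num_classes, grid=10, alpha=1.0, w=0.1):
    """Estimate theta_lambda(c, i~) on a coarse normalized grid.
    label_maps: list of 2-D lists of class labels (images may differ in
    size); each pixel coordinate is normalized into a grid x grid cell
    index i~.  alpha and w stand in for alpha_lambda and w_lambda."""
    counts = [[0] * (grid * grid) for _ in range(num_classes)]  # N_{c,i~}
    totals = [0] * (grid * grid)                                # N_{i~}
    for labels in label_maps:
        h, w_img = len(labels), len(labels[0])
        for y in range(h):
            for x in range(w_img):
                cell = (y * grid // h) * grid + (x * grid // w_img)
                counts[labels[y][x]][cell] += 1
                totals[cell] += 1
    # smoothed, power-tempered ratio (N_{c,i~} + alpha) / (N_{i~} + alpha)
    return [[((counts[c][cell] + alpha) / (totals[cell] + alpha)) ** w
             for cell in range(grid * grid)]
            for c in range(num_classes)]
```

For a class never observed at a location, the estimate falls back to the smoothed value α_λ / (N_î + α_λ) rather than zero, which keeps the log-potential finite.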
• The values of the two contrast-related edge parameters were manually selected to minimize the error on the validation set 27
• Based on a novel set of features called texture-layout filters
• Capable of jointly capturing texture, spatial layout, and textural context 28
1. The training images are convolved with a 17-dimensional filter bank at scale κ
2. The 17-D responses for all training pixels are whitened
3. An unsupervised clustering is performed
4. Each pixel in each image is assigned to the nearest cluster center, producing the texton map
» Denote the texton map as T, where pixel i has value T_i ∈ {1, …, K} 29
• Each texture-layout filter is a pair (r, t) of an image region r and a texton t
• r: defined in coordinates relative to the pixel i being classified
• For simplicity, a set R of candidate rectangles is chosen at random, such that their top-left and bottom-right corners lie within a fixed bounding box covering about half the image area 30
Feature response:
v_[r,t](i) = (1 / area(r)) Σ_{j ∈ (r+i)} [T_j = t]
i: the pixel location 31
• Feature responses are efficiently computed over a whole image with integral images [P. Viola et al., 2001]
Process:
1. The texton map is separated into K channels (one for each texton)
2. For each channel, a separate integral image is calculated
3. Feature response: v_[r,t](i) = ( T̂⁽ᵗ⁾(r̄_br) − T̂⁽ᵗ⁾(r̄_bl) − T̂⁽ᵗ⁾(r̄_tr) + T̂⁽ᵗ⁾(r̄_tl) ) / area(r)
T̂⁽ᵗ⁾: the integral image of T for texton channel t; r̄_br, r̄_bl, r̄_tr, r̄_tl: the bottom-right, bottom-left, top-right, and top-left corners of rectangle r translated to pixel i 32
• Some classes may have large within-class textural differences, but a repeatable layout of texture within a particular object instance
• The texton-dependent layout filter therefore uses the texton at the pixel i being classified, T_i, rather than a particular learned texton 33
P(i | c, θ) ≈ P(i_x | c, θ_x) × P(i_y | c, θ_y),
a separable approximation that factorizes over the x and y image coordinates 34
• Employ an adapted version of the Joint Boost algorithm [A. Torralba et al., 2007]
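The texton-map channels and integral images used above to evaluate texture-layout responses can be sketched as follows; the names are illustrative, and for brevity the sketch assumes the translated rectangle stays inside the image:

```python
def texton_integral_images(T, K):
    """One summed-area table per texton channel: I[t][y][x] holds the
    count of pixels with texton t in the rectangle [0, y) x [0, x)."""
    h, w = len(T), len(T[0])
    I = [[[0] * (w + 1) for _ in range(h + 1)] for _ in range(K)]
    for t in range(K):
        for y in range(h):
            for x in range(w):
                I[t][y + 1][x + 1] = (I[t][y][x + 1] + I[t][y + 1][x]
                                      - I[t][y][x] + (T[y][x] == t))
    return I

def texture_layout_response(I, t, rect, i):
    """v_[r,t](i): fraction of pixels with texton t inside the rectangle
    rect = (x0, y0, x1, y1), given in coordinates relative to pixel
    i = (x, y).  Assumes the translated rectangle lies in the image."""
    x0, y0, x1, y1 = rect
    x, y = i
    x0, y0, x1, y1 = x0 + x, y0 + y, x1 + x, y1 + y
    It = I[t]
    area = (x1 - x0) * (y1 - y0)
    # four-corner summed-area lookup, then normalize by area(r)
    return (It[y1][x1] - It[y0][x1] - It[y1][x0] + It[y0][x0]) / area
```

Building the tables costs O(K · pixels) once per image, after which each response is four lookups regardless of rectangle size.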
• It iteratively selects discriminative texture-layout filters as "weak learners"
• These are combined into a powerful classifier P̃(c | x, i), used by the texture-layout potentials
• Joint Boost shares each weak learner between a set of classes C 35
• Strong classifier: H(c, i) = Σ_{m=1}^{M} h_iᵐ(c)
• Use the multiclass logistic transformation P̃(c | x, i) ∝ exp H(c, i) [J. Friedman et al., 2000]
• Each weak learner is based on a feature response v_[r,t](i):
h_i(c) = a [v_[r,t](i) > θ] + b if c ∈ C, and h_i(c) = k^c otherwise
with parameters (a, b, {k^c}_{c ∉ C}, θ, C, r, t) 36
• Each training example i (a pixel in a training image) is paired with a target value z_i^c ∈ {−1, +1} and assigned a weight w_i^c specifying its classification accuracy for class c after m − 1 rounds
• Round m chooses a new weak learner by minimizing the error function J_wse:
J_wse = Σ_c Σ_i w_i^c (z_i^c − h_i(c))²
• Re-weighting: w_i^c ← w_i^c e^{−z_i^c h_i(c)} 37
• Minimizing the error function J_wse requires an expensive brute-force search over the possible weak learners h_i(c)
• However, given the sharing set C, the feature (r, t), and the threshold θ, closed forms exist for a, b, and the k^c (c ∉ C):
b = Σ_{c ∈ C} Σ_i w_i^c z_i^c [v_[r,t](i) ≤ θ] / Σ_{c ∈ C} Σ_i w_i^c [v_[r,t](i) ≤ θ]
a + b = Σ_{c ∈ C} Σ_i w_i^c z_i^c [v_[r,t](i) > θ] / Σ_{c ∈ C} Σ_i w_i^c [v_[r,t](i) > θ]
k^c = Σ_i w_i^c z_i^c / Σ_i w_i^c
obtained by minimizing J_wse 38
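A sketch of the closed-form weak-learner fit above, for one fixed sharing set, feature, and threshold; the array layout is illustrative, and it assumes both sides of the threshold receive nonzero weight:

```python
def fit_weak_learner(v, w, z, share, theta):
    """Closed-form a, b, and k^c for a fixed sharing set, feature, and
    threshold.  v[i]: feature response v_[r,t](i); w[c][i], z[c][i]:
    per-class weights and +/-1 targets; share: the sharing set C.
    Assumes both branches of the threshold carry some weight."""
    num = den = num_gt = den_gt = 0.0
    for c in share:
        for i, vi in enumerate(v):
            if vi > theta:
                num_gt += w[c][i] * z[c][i]
                den_gt += w[c][i]
            else:
                num += w[c][i] * z[c][i]
                den += w[c][i]
    b = num / den                # weighted mean target where v <= theta
    a = num_gt / den_gt - b      # the closed form gives a + b directly
    # constants k^c for the classes outside the sharing set
    k = {c: sum(wi * zi for wi, zi in zip(w[c], z[c])) / sum(w[c])
         for c in range(len(w)) if c not in share}
    return a, b, k
```

The outer search then only loops over candidate (C, r, t, θ), evaluating J_wse with these closed-form values instead of optimizing a, b, and k^c numerically.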
• Employ the quadratic-cost greedy algorithm to speed up the search [A. Torralba et al., 2007]
• The optimization over θ ∈ Θ can be made efficient by careful use of histograms of weighted feature responses:
• Treating Θ as an ordered set, histograms of the values v_[r,t](i), weighted appropriately by w_i^c z_i^c and w_i^c, are built over bins corresponding to the thresholds in Θ
• These histograms are accumulated to give the thresholded sums needed to calculate a and b for all values of θ ∈ Θ at once 39
• Employ a random feature selection procedure to speed up the minimization over features
• The algorithm examines only a randomly chosen fraction τ ≪ 1 of the possible features [S. Baluja et al.] 40
41
Adding more texture-layout filters improves classification 42
• The effect of the different model potentials:
(a) the original input image
(b) only the texture-layout potentials
(c) without color modeling
(d) the full CRF model 43
• Texton-dependent layout filter 44
• MSRC 21-class database results 45
• Accuracy of segmentation for the MSRC 21-class database 46
• Comparison with He et al. 47
• TV sequences 48