Jamie Shotton, Machine Intelligence Laboratory, University of Cambridge
John Winn, Carsten Rother, Antonio Criminisi, Microsoft Research Cambridge, UK
Presenter: Kuang-Jui Hsu
Date: 2011/7/04 (Mon.) 1
• Introduction
• A Conditional Random Field Model of Object Classes
• Boosted Learning of Texture, Layout, and Context
• Results and Comparisons 2
• Goal: automatic detection, recognition, and segmentation of object classes in photographs
• The concern is not only the accuracy of segmentation and recognition, but also the efficiency of the algorithm
• At a local level, the appearance of an image patch leads to ambiguities in its class label
• To overcome these ambiguities, it is necessary to incorporate longer-range information
• To achieve this, a discriminative model for labeling images is constructed which exploits all three types of information: textural appearance, layout, and context 3
• Overcomes problems associated with object recognition techniques that rely on sparse features
• The authors' technique, based on dense features, is capable of coping with both textured and untextured objects, and with multiple objects which inter- or self-occlude 4
• Three contributions:
• A novel type of feature called the texture-layout filter
• A new discriminative model that combines texture-layout filters with lower-level image features
• A demonstration of how to train this model efficiently on a very large dataset by exploiting both boosting and piecewise training methods 5
• Use a conditional random field (CRF) model to learn the conditional distribution over the class labeling given an image
• Incorporates texture, layout, color, location, and edge cues 6
• Definition:
log P(c | x, θ) = Σ_i [ ψ_i(c_i, x; θ_ψ) + π(c_i, x_i; θ_π) + λ(c_i, i; θ_λ) ] + Σ_{(i,j)∈ε} φ(c_i, c_j, g_ij(x); θ_φ) − log Z(θ, x)
ψ: texture-layout potential; π: color potential; λ: location potential; φ: edge potential; log Z(θ, x): the log partition function
c: the class labels
x: an image
i: a node (pixel) in the graph
θ = {θ_ψ, θ_π, θ_λ, θ_φ}: the parameters to learn
ε: the set of edges in a 4-connected grid structure 7
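As a concrete reading of the definition above, a minimal sketch of the (unnormalized) CRF log-score on a 4-connected grid is given below; the `unary` table stands in for the summed texture-layout, color, and location potentials, and `edge_weight` for the learned contrast term — both are illustrative placeholders, not the learned values:

```python
def crf_log_score(labels, unary, edge_weight, height, width):
    """Unnormalized log-score of a labeling, in the spirit of slide 7:
    per-pixel unary terms plus contrast-sensitive Potts terms on a
    4-connected grid. unary[i][c] stands in for the summed texture-layout,
    color, and location potentials of pixel i and class c;
    edge_weight[(i, j)] stands in for theta_phi^T g_ij(x)."""
    score = sum(unary[i][labels[i]] for i in range(height * width))
    for y in range(height):
        for x in range(width):
            i = y * width + x
            # right and down neighbours enumerate each 4-connected edge once
            neighbours = ([i + 1] if x + 1 < width else []) + \
                         ([i + width] if y + 1 < height else [])
            for j in neighbours:
                if labels[i] != labels[j]:
                    score -= edge_weight[(i, j)]  # Potts penalty [c_i != c_j]
    return score
```

Subtracting log Z(θ, x) would turn this score into a log-probability; inference only needs relative scores, which is what the alpha-expansion step below maximizes.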
• Definition: ψ_i(c_i, x; θ_ψ) = log P̃(c_i | x, i)
P̃(c_i | x, i): the normalized distribution given by a boosted classifier
• This classifier models the texture, layout, and textural context of the object classes by combining novel discriminative features called texture-layout filters 8
• Color is represented with Gaussian mixture models (GMMs) in CIELab color space, where the mixture coefficients depend on the class label
• Conditional probability of the color x of a pixel:
P(x | c) = Σ_k P(x | k) P(k | c)
with color clusters (mixture components) P(x | k) = N(x | μ_k, Σ_k)
k: color cluster
μ_k, Σ_k: the mean and covariance, respectively, of color cluster k 9
P(x | c) = Σ_k P(x | k) P(k | c)
The color models use Gaussian mixture models where the mixture coefficients P(k | c) are conditioned on the class label c 10
• But we want to predict the class label c given the color x_i of a pixel and the color cluster k
• Use a simple inference method via Bayes' rule
• For pixel x_i:
P(c | x_i) = P(c, x_i) / P(x_i) = P(x_i | c) P(c) / P(x_i) = [P(c) / P(x_i)] P(x_i | c)
So, with a uniform prior over classes, P(c | x_i) ∝ P(x_i | c)
• Definition: π(c_i, x_i; θ_π) = log Σ_k θ_π(c_i, k) P(k | x_i) 11
• Definition: π(c_i, x_i; θ_π) = log Σ_k θ_π(c_i, k) P(k | x_i)
• The learned parameters θ_π(c_i, k) represent the distribution P(c_i | k)
For discriminative inference, the arrows in the graphical model are reversed using Bayes' rule 12
• Definition: λ(c_i, i; θ_λ) = log θ_λ(c_i, î)
î: the normalized version of the pixel index i, where the normalization allows for images of different sizes
• The parameters θ_λ are also learned 13
• Definition: φ(c_i, c_j, g_ij(x); θ_φ) = −θ_φᵀ g_ij(x) [c_i ≠ c_j]
• g_ij: the edge feature, measuring the difference in color between neighboring pixels:
g_ij = [exp(−β ‖x_i − x_j‖²), 1]ᵀ
x_i, x_j: three-dimensional vectors representing the colors of pixels i, j
β = (2⟨‖x_i − x_j‖²⟩)⁻¹, with ⟨·⟩ the average over the image 14
• Given the CRF model and its learned parameters, find the most probable labeling c*
• The optimal labeling is found by applying the alpha-expansion graph-cut algorithm 15
• Take a current configuration (set of labels) c and a fixed label α ∈ {1, …, C},
where C is the number of classes
• Each pixel i makes a binary decision: it can either keep its old label or switch to label α
• A binary vector s ∈ {0, 1}^N defines the auxiliary configuration c[s] as
c_i[s] = c_i if s_i = 0, and c_i[s] = α if s_i = 1
• Start with an initial configuration c⁰, given by the mode of the texture-layout potentials
• Compute optimal alpha-expansion moves for the labels α in some order, accepting a move only if it increases the objective function 16
• There are two methods to learn the parameters:
• Maximum a-posteriori (MAP) – poor results
• Piecewise training
• Only θ_π, θ_λ, θ_φ are learned by these methods
• θ_ψ is learned during boosted learning 17
• MAP training maximizes the conditional likelihood of the labels given the training data:
L(θ) = Σ_n log P(cⁿ | xⁿ, θ) + log P(θ)
(cⁿ, xⁿ): the training data (output and input)
log P(θ): a prior term that prevents overfitting
• The maximization of L(θ) with respect to θ can be achieved using a gradient ascent algorithm 18
• Conjugate gradient ascent did eventually converge to a solution, but evaluating the learned parameters against validation data gave poor results, with almost no improvement
• The lack of alignment between object edges and label boundaries in the roughly labeled training set forced the learned parameters to tend toward zero 19
• Based on the piecewise training method of "Piecewise Training of Undirected Models" [C. Sutton et al., 2005]
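Piecewise training rests on bounding the partition function of the combined model by per-term partition functions; a toy numeric check of that bound (with made-up factor values) can be sketched as:

```python
import math

# Toy check of the piecewise bound: for a model p(c) proportional to
# prod_r f_r(c) with nonnegative factors, the true partition function
# Z = sum_c prod_r f_r(c) never exceeds the product of the per-term
# partition functions Z_r = sum_c f_r(c).  Factor values are made up.
factors = [
    {0: 1.5, 1: 0.5},  # f_1(c) for classes c = 0, 1
    {0: 0.2, 1: 2.0},  # f_2(c)
]
classes = [0, 1]

Z = sum(math.prod(f[c] for f in factors) for c in classes)
log_Z = math.log(Z)
bound = sum(math.log(sum(f[c] for c in classes)) for f in factors)
print(log_Z <= bound)  # True: log Z <= sum_r log Z_r
```

The inequality holds because expanding the product of the per-term sums produces every term of Z plus additional nonnegative cross terms.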
• The terms are trained independently, then recombined
• The training method minimizes an upper bound on the log partition function: let z(θ, x) = log Z(θ, x), and index the terms in the model by r; then
z(θ, x) ≤ Σ_r z_r(θ_r, x)
θ_r: the parameters of the r-th term
z_r(θ_r, x): the log partition function for a model containing only the r-th term 20
z(θ, x) ≤ Σ_r z_r(θ_r, x)
Proof: use Jensen's inequality: for positive weights a_i,
f(Σ_i a_i x_i / Σ_i a_i) ≤ Σ_i a_i f(x_i) / Σ_i a_i if f is convex,
f(Σ_i a_i x_i / Σ_i a_i) ≥ Σ_i a_i f(x_i) / Σ_i a_i if f is concave;
z(θ, x) = log Z(θ, x) is concave 21
• Replacing z(θ, x) with the bound Σ_r z_r(θ_r, x) gives a lower bound on the conditional likelihood
• The bound can be loose, especially if the terms in the model are correlated
• Performing piecewise parameter training leads to over-counting during inference in the combined model
• Because of over-counting, λ_new = 2λ_ideal
• To avoid this, weight the logarithm of each duplicated term by a factor of 0.5, i.e. raise the term to the power 0.5 22
23
• Four types of parameters have to be learned:
• Texture-layout potential parameters
• Color potential parameters
• Location potential parameters
• Edge potential parameters
• The first is learned during boosted learning; the others are learned by piecewise training 24
• The color potentials are learned at test time for each image independently
• First, the color clusters P(x | k) = N(x | μ_k, Σ_k) are learned in an unsupervised manner using K-means
• Then an iterative algorithm, reminiscent of EM, alternates between inferring the class labeling c* and updating the color potential parameters as
θ_π(c, k) = [ ( Σ_i [c_i* = c] P(k | x_i) + α_π ) / ( Σ_i P(k | x_i) + α_π ) ]^{w_π} 25
• The location potential parameters are trained by maximizing the likelihood of the normalized model containing just that potential, raising the result to a fixed power w_λ to compensate for over-counting:
θ_λ(c, î) = [ (N_{c,î} + α_λ) / (N_î + α_λ) ]^{w_λ}
N_{c,î}: the number of pixels of class c at normalized location î in the training set
N_î: the total number of pixels at location î 26
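A minimal sketch of the location-potential estimate above; the grid resolution, smoothing constant, and power are illustrative stand-ins for the normalized-location discretization, α_λ, and w_λ:

```python
def learn_location_potentials(label_maps, num_classes, grid=10, alpha=1.0, w=0.1):
    """Estimate theta_lambda(c, i~) on a coarse normalized grid.
    label_maps: list of 2-D lists of class labels (images may differ in
    size); each pixel coordinate is normalized into a grid x grid cell
    index i~.  alpha and w stand in for alpha_lambda and w_lambda."""
    counts = [[0] * (grid * grid) for _ in range(num_classes)]  # N_{c,i~}
    totals = [0] * (grid * grid)                                # N_{i~}
    for labels in label_maps:
        h, w_img = len(labels), len(labels[0])
        for y in range(h):
            for x in range(w_img):
                cell = (y * grid // h) * grid + (x * grid // w_img)
                counts[labels[y][x]][cell] += 1
                totals[cell] += 1
    # smoothed, power-tempered ratio (N_{c,i~} + alpha) / (N_{i~} + alpha)
    return [[((counts[c][cell] + alpha) / (totals[cell] + alpha)) ** w
             for cell in range(grid * grid)]
            for c in range(num_classes)]
```

For a class never observed at a location, the estimate falls back to the smoothed value α_λ / (N_î + α_λ) rather than zero, which keeps the log-potential finite.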
• The values of the two contrast-related edge parameters were manually selected to minimize the error on the validation set 27
• Based on a novel set of features called texture-layout filters
• Capable of jointly capturing texture, spatial layout, and textural context 28
1. The training images are convolved with a 17-dimensional filter bank at scale κ
2. The 17-D responses for all training pixels are whitened
3. An unsupervised clustering is performed
4. Each pixel in each image is assigned to the nearest cluster center, producing the texton map
» Denote the texton map as T, where pixel i has value T_i ∈ {1, …, K} 29
• Each texture-layout filter is a pair (r, t) of an image region r and a texton t
• r: defined in coordinates relative to the pixel i being classified
• For simplicity, a set R of candidate rectangles is chosen at random, such that their top-left and bottom-right corners lie within a fixed bounding box covering about half the image area 30
Feature response:
v_[r,t](i) = (1 / area(r)) Σ_{j ∈ (r+i)} [T_j = t]
i: the pixel location 31
• Feature responses are efficiently computed over a whole image with integral images [P. Viola et al., 2001]
Process:
1. The texton map is separated into K channels (one for each texton)
2. For each channel, a separate integral image is calculated
3. Feature response: v_[r,t](i) = ( T̂⁽ᵗ⁾(r̄_br) − T̂⁽ᵗ⁾(r̄_bl) − T̂⁽ᵗ⁾(r̄_tr) + T̂⁽ᵗ⁾(r̄_tl) ) / area(r)
T̂⁽ᵗ⁾: the integral image of T for texton channel t; r̄_br, r̄_bl, r̄_tr, r̄_tl: the bottom-right, bottom-left, top-right, and top-left corners of rectangle r translated to pixel i 32
• Some classes may have large within-class textural differences, but a repeatable layout of texture within a particular object instance
• The texton-dependent layout filter therefore uses the texton at the pixel i being classified, T_i, rather than a particular learned texton 33
P(i | c, θ) ≈ P(i_x | c, θ_x) × P(i_y | c, θ_y),
a separable approximation that factorizes over the x and y image coordinates 34
• Employ an adapted version of the Joint Boost algorithm [A. Torralba et al., 2007]
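The texton-map channels and integral images used above to evaluate texture-layout responses can be sketched as follows; the names are illustrative, and for brevity the sketch assumes the translated rectangle stays inside the image:

```python
def texton_integral_images(T, K):
    """One summed-area table per texton channel: I[t][y][x] holds the
    count of pixels with texton t in the rectangle [0, y) x [0, x)."""
    h, w = len(T), len(T[0])
    I = [[[0] * (w + 1) for _ in range(h + 1)] for _ in range(K)]
    for t in range(K):
        for y in range(h):
            for x in range(w):
                I[t][y + 1][x + 1] = (I[t][y][x + 1] + I[t][y + 1][x]
                                      - I[t][y][x] + (T[y][x] == t))
    return I

def texture_layout_response(I, t, rect, i):
    """v_[r,t](i): fraction of pixels with texton t inside the rectangle
    rect = (x0, y0, x1, y1), given in coordinates relative to pixel
    i = (x, y).  Assumes the translated rectangle lies in the image."""
    x0, y0, x1, y1 = rect
    x, y = i
    x0, y0, x1, y1 = x0 + x, y0 + y, x1 + x, y1 + y
    It = I[t]
    area = (x1 - x0) * (y1 - y0)
    # four-corner summed-area lookup, then normalize by area(r)
    return (It[y1][x1] - It[y0][x1] - It[y1][x0] + It[y0][x0]) / area
```

Building the tables costs O(K · pixels) once per image, after which each response is four lookups regardless of rectangle size.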
• It iteratively selects discriminative texture-layout filters as "weak learners"
• These are combined into a powerful classifier P̃(c | x, i), used by the texture-layout potentials
• Joint Boost shares each weak learner between a set of classes C 35
• Strong classifier: H(c, i) = Σ_{m=1}^{M} h_iᵐ(c)
• Use the multiclass logistic transformation P̃(c | x, i) ∝ exp H(c, i) [J. Friedman et al., 2000]
• Each weak learner is based on a feature response v_[r,t](i):
h_i(c) = a [v_[r,t](i) > θ] + b if c ∈ C, and h_i(c) = k^c otherwise
with parameters (a, b, {k^c}_{c ∉ C}, θ, C, r, t) 36
• Each training example i (a pixel in a training image) is paired with a target value z_i^c ∈ {−1, +1} and assigned a weight w_i^c specifying its classification accuracy for class c after m − 1 rounds
• Round m chooses a new weak learner by minimizing the error function J_wse:
J_wse = Σ_c Σ_i w_i^c (z_i^c − h_i(c))²
• Re-weighting: w_i^c ← w_i^c e^{−z_i^c h_i(c)} 37
• Minimizing the error function J_wse requires an expensive brute-force search over the possible weak learners h_i(c)
• However, given the sharing set C, the feature (r, t), and the threshold θ, closed forms exist for a, b, and the k^c (c ∉ C):
b = Σ_{c ∈ C} Σ_i w_i^c z_i^c [v_[r,t](i) ≤ θ] / Σ_{c ∈ C} Σ_i w_i^c [v_[r,t](i) ≤ θ]
a + b = Σ_{c ∈ C} Σ_i w_i^c z_i^c [v_[r,t](i) > θ] / Σ_{c ∈ C} Σ_i w_i^c [v_[r,t](i) > θ]
k^c = Σ_i w_i^c z_i^c / Σ_i w_i^c
obtained by minimizing J_wse 38
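A sketch of the closed-form weak-learner fit above, for one fixed sharing set, feature, and threshold; the array layout is illustrative, and it assumes both sides of the threshold receive nonzero weight:

```python
def fit_weak_learner(v, w, z, share, theta):
    """Closed-form a, b, and k^c for a fixed sharing set, feature, and
    threshold.  v[i]: feature response v_[r,t](i); w[c][i], z[c][i]:
    per-class weights and +/-1 targets; share: the sharing set C.
    Assumes both branches of the threshold carry some weight."""
    num = den = num_gt = den_gt = 0.0
    for c in share:
        for i, vi in enumerate(v):
            if vi > theta:
                num_gt += w[c][i] * z[c][i]
                den_gt += w[c][i]
            else:
                num += w[c][i] * z[c][i]
                den += w[c][i]
    b = num / den                # weighted mean target where v <= theta
    a = num_gt / den_gt - b      # the closed form gives a + b directly
    # constants k^c for the classes outside the sharing set
    k = {c: sum(wi * zi for wi, zi in zip(w[c], z[c])) / sum(w[c])
         for c in range(len(w)) if c not in share}
    return a, b, k
```

The outer search then only loops over candidate (C, r, t, θ), evaluating J_wse with these closed-form values instead of optimizing a, b, and k^c numerically.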
• Employ the quadratic-cost greedy algorithm to speed up the search [A. Torralba et al., 2007]
• The optimization over θ ∈ Θ can be made efficient by careful use of histograms of weighted feature responses:
• Treating Θ as an ordered set, histograms of the values v_[r,t](i), weighted appropriately by w_i^c z_i^c and w_i^c, are built over bins corresponding to the thresholds in Θ
• These histograms are accumulated to give the thresholded sums needed to calculate a and b for all values of θ ∈ Θ at once 39
• Employ a random feature selection procedure to speed up the minimization over features
• The algorithm examines only a randomly chosen fraction τ ≪ 1 of the possible features [S. Baluja et al.] 40
41
Adding more texture-layout filters improves classification 42
• The effect of the different model potentials:
(a) the original input image
(b) only the texture-layout potentials
(c) without color modeling
(d) the full CRF model 43
• Texton-dependent layout filter 44
• MSRC 21-class database results 45
• Accuracy of segmentation for the MSRC 21-class database 46
• Comparison with He et al. 47
• TV sequences 48