Lecture 3 & 4: Modelling vision problems: Applications & Insights
Carsten Rother

Comments and corrections from last lecture
All slides are now online: http://research.microsoft.com/en-us/um/cambridge/projects/tutorials/course5F16/

Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j Є N4} exp{-10|xi-xj|}
It was incorrect: P(x) = 1/f ∏_{i,j Є N4} 10^{-|xi-xj|} = 1/f ∏_{i,j Є N4} exp{-2.3 |xi-xj|}
(figure: probability distribution over all 2^16 configurations, with samples)

Recap: Decision Theory
Assume w has been learned and P(x|z,w) is given.
(figure: probability of all 2^16 configurations; best and worst solutions sorted by probability)
Which solution x* would you choose?

Recap: How to make a decision
Assume the model P(x|z,w) is known.
Goal: choose x* which minimises the risk R: x* = argmin_x R(x)
The risk R is the expected loss: R(x) = ∑_{x'} P(x'|z,w) Δ(x',x), with Δ the "loss function".

Recap: Decision Theory
(figure: best and worst solutions sorted by probability)
Risk: R(x) = ∑_{x'} P(x'|z,w) Δ(x',x)
0/1 loss: Δ(x,x') = 0 if x'=x, 1 otherwise
x* = argmin_x R(x) = argmin_x ∑_{x'} P(x'|z,w) Δ(x',x) = argmin_x [1 - P(x|z,w)]
Therefore x* = argmax_x P(x|z,w), which is the MAP solution.

Question from last lecture: Loss-based learning and Marginals
Minimise R = 1/|T| ∑_t Δ(xt, x*t)
Prediction via MAP: x* = argmax_x P(x|z,w)
Prediction via marginals: x*i = argmax_{xi} P(xi|z,w)
Testing (search with Hamming loss):
1. Hamming loss, MAP (w=0.1): Hamming error 0.093
2. Hamming loss, marginals (w=0.4): Hamming error 0.1039

Big Picture: Statistical Models in Computer Vision
Model: discrete or continuous variables? discrete or continuous space? dependence between variables? …
Applications: 2D/3D image segmentation, object recognition, 3D reconstruction, stereo matching, image denoising, texture synthesis, pose estimation, panoramic stitching, …
Optimisation/Prediction/Inference: combinatorial optimization (e.g. graph cut), message passing (e.g. BP, TRW), iterated conditional modes (ICM), LP-relaxation (e.g. cutting-plane), problem decomposition + subgradient, …
Learning: maximum likelihood learning, pseudo-likelihood approximation, loss-minimising parameter learning, exhaustive search, constraint generation, …

Graphical models
Write a probability distribution as a graphical model:
- Directed graphical model (also called Bayesian network)
- Undirected graphical model (also called Markov random field)
- Factor graph (most explicit representation)
Basic idea:
• A model represents a family of distributions
• The key concept is conditional independence
References:
- Pattern Recognition and Machine Learning [Bishop '08, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)
- ML Lecture

Factor Graph: formal definition
P(x) = 1/Z ∏_{F ∊ 𝔽} ΨF(xN(F))  with  Z = ∑_x ∏_{F ∊ 𝔽} ΨF(xN(F))
Z: partition function
F: factor; 𝔽: set of all factors
N(F): neighbourhood of a factor
ΨF: a function (not a distribution) depending on xN(F)
This can be written as a Gibbs distribution (last lecture): setting ΨF(xN(F)) = exp{-θF(xN(F))} gives
P(x) = 1/f exp{-E(x)}  with energy  E(x) = ∑_{F ∊ 𝔽} θF(xN(F))

Example – Factor Graphs
P(x) ~ exp{-E(x)}  "Gibbs distribution with 4 factors"
E(x) = θ1(x1,x2,x3) + θ2(x2,x4) + θ3(x3,x4) + θ4(x3,x5)
(figure: factor graph over the unobserved variables x1,…,x5; each factor node connects the variables that appear in the same factor)
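To make the recap concrete, here is a minimal brute-force sketch (not from the lecture; the factor tables are made-up examples) that enumerates all 2^5 states of the small factor graph above, computes the partition function Z and the Gibbs distribution, and contrasts the MAP labelling (optimal under 0/1 loss) with the max-marginal labelling (optimal under Hamming loss).

```python
# Minimal sketch: brute-force inference on the small factor graph
# E(x) = th1(x1,x2,x3) + th2(x2,x4) + th3(x3,x4) + th4(x3,x5), binary variables.
import itertools
import math

def th_pair(a, b):       # example pairwise cost (Ising-like smoothness)
    return abs(a - b)

def th_triple(a, b, c):  # example third-order cost
    return abs(a - b) + abs(b - c)

def energy(x):           # x = (x1, x2, x3, x4, x5)
    x1, x2, x3, x4, x5 = x
    return th_triple(x1, x2, x3) + th_pair(x2, x4) + th_pair(x3, x4) + th_pair(x3, x5)

states = list(itertools.product([0, 1], repeat=5))
Z = sum(math.exp(-energy(x)) for x in states)       # partition function
P = {x: math.exp(-energy(x)) / Z for x in states}   # Gibbs distribution

x_map = min(states, key=energy)                     # MAP = argmax P(x) = argmin E(x)

# Marginals P(x_i = 1) and the max-marginal labelling (minimises the expected Hamming loss).
marginals = [sum(p for x, p in P.items() if x[i] == 1) for i in range(5)]
x_marg = tuple(int(m > 0.5) for m in marginals)

print("Z =", round(Z, 4))
print("MAP labelling:", x_map)
print("marginals P(xi=1):", [round(m, 3) for m in marginals])
print("max-marginal labelling:", x_marg)
```

This brute-force enumeration obviously only works for a handful of variables; for grid models with 2^16 or more configurations, the optimization and message-passing techniques listed above are needed.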
Definitions
E(x) = θ1(x1,x2,x3) + θ2(x2,x4) + θ3(x3,x4) + θ4(x3,x5)
(figure: the corresponding factor graph over x1,…,x5; θ1 has arity 3, θ2 has arity 2 — a factor graph of order 3)
Definitions:
• Order: the arity (number of variables) of the largest factor
• Markov Random Field: random field with low-order factors
• Some people use "cliques" for "factors" (not the same, since a clique may or may not be decomposable into smaller factors)

Comment: Undirected Graphical models
An undirected model with a triple clique over x1, x2, x3 stands for the general energy
E(x) = θ1(x1,x2,x3) + θ2(x1,x2) + θ3(x2,x3) + θ4(x1,x3) + θ5(x1) + θ6(x2) + θ7(x3)
Possible factor graph with a triple factor: E(x) = θ1(x1,x2,x3) + θ2(x2)
Possible factor graph with only pairwise factors: E(x) = θ1(x1,x2) + θ2(x2,x3) + θ3(x1,x3)

Discrete domain, discrete label space (x ∊ 𝓛)

Random field models (examples by order)
4-connected, pairwise MRF: E(x) = ∑_{i,j Є N4} θij(xi,xj) — order 2, "pairwise energy"
higher(8)-connected, pairwise MRF: E(x) = ∑_{i,j Є N8} θij(xi,xj) — order 2
Higher-order RF: E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn) — order n, "higher-order energy"

Example: Image segmentation
P(x|z) ~ exp{-E(x)},  E(x): {0,1}^n → ℝ
E(x) = ∑_i θi(xi,zi) + ∑_{i,j Є N4} θij(xi,xj)
(figure: factor graph with observed variables zi and unobserved (latent) variables xi)

Stereo matching
(figure: left image (a), right image (b), ground truth depth; disparities d=0 … d=4)
• Images are rectified
• Ignore occlusion for now
Energy: E(d): {0,…,D-1}^n → ℝ; labels: d (depth/shift)

Stereo matching – Energy
E(d) = ∑_i θi(di) + ∑_{i,j Є N4} θij(di,dj)
Unary: θi(di) = ∑_{j Є window(i)} ||lj - r_{j-di}||²  "SSD; sum of squared differences" (many others possible: NCC, …)
(figure: a window around pixel i in the left image is compared with the window around pixel i-di in the right image, e.g. di=2)
A window helps to overcome differences in lighting, etc.
Pairwise: θij(di,dj) = g(|di-dj|)
(a code sketch of this energy follows at the end of this section)

Stereo matching – prior
θij(di,dj) = g(|di-dj|)
(figure: cost as a function of |di-dj|)
• No truncation: the global minimum can be found [Olga Veksler PhD thesis, Daniel Cremers et al.]
• With truncation: discontinuity-preserving potentials [Blake & Zisserman '83, '87]; NP-hard optimization

Stereo matching
see http://vision.middlebury.edu/stereo/
(figure: results) No MRF, pixel-independent (WTA); no vertical links, efficient since independent chains; pairwise MRF [Boykov et al. '01]; ground truth

Texture synthesis
(figure: input, output; good and bad seams between source patches a and b)
E: {0,1}^n → ℝ,  xi ∊ {0,1} selects whether pixel i is copied from source a or b
E(x) = ∑_{i,j Є N4} |xi-xj| [ |ai-bi| + |aj-bj| ]
[Kwatra et al. Siggraph '03]

Video Synthesis
(figure: input video, output video (duplicated))

Panoramic stitching
(figure: example panoramas)

Large alphabet
Label space has some ordering:
- Examples: depth (stereo), motion (optical flow)
- Pairwise costs have a (simple) parametric form, e.g. cost as a function of |di-dj| (stereo example)
- Optical flow estimation: often continuous labels are used for this task
Label space has no ordering:
- Examples: panoramic stitching, in-painting, super-resolution
- Label space can get very large (very hard problems)
- θij(di,dj) is an arbitrary table

Exemplar-based in-painting
[Freeman et al., MRF book, ch. 10]
E(x) = ∑_{i,j ∊ N4} |Patch(xi) - Patch(xj)|²,  xi ∊ 𝓛
(figure: input and various local minima)
Higher-order model [Rother et al. TR '11]
Global techniques [Komodakis et al. CVPR '06]; greedy techniques [Criminisi et al. CVPR '03]
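As referenced above, here is a minimal sketch (illustrative only, not the lecture's implementation) of the stereo energy: a window-based SSD data term plus a truncated linear pairwise term g(|di-dj|) on the 4-connected grid. The helper names (unary_ssd, pairwise_trunc, stereo_energy) and all parameter values are assumptions; the sketch only evaluates the energy and the pixel-independent WTA solution, while minimising it requires the optimization techniques discussed later.

```python
# Minimal sketch of the stereo energy: window SSD unary + truncated linear pairwise (N4 grid).
import numpy as np
from scipy.ndimage import uniform_filter

def unary_ssd(left, right, D, radius=1):
    """theta_i(d): window SSD between left pixel i and right pixel i shifted by d."""
    H, W = left.shape
    cost = np.zeros((H, W, D))
    k = 2 * radius + 1
    for d in range(D):
        shifted = np.empty_like(right)
        shifted[:, d:] = right[:, :W - d]
        if d:
            shifted[:, :d] = right[:, :1]                       # crude border replication
        diff2 = (left - shifted) ** 2
        cost[:, :, d] = uniform_filter(diff2, size=k) * k * k   # window sum
    return cost

def pairwise_trunc(di, dj, lam=2.0, tau=3):
    """theta_ij(di, dj) = lam * min(|di - dj|, tau), a truncated linear g(|di - dj|)."""
    return lam * np.minimum(np.abs(di - dj), tau)

def stereo_energy(d_map, unary):
    """E(d) = sum_i theta_i(d_i) + sum_{i,j in N4} theta_ij(d_i, d_j)."""
    H, W = d_map.shape
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], d_map].sum()
    e += pairwise_trunc(d_map[:, 1:], d_map[:, :-1]).sum()      # horizontal N4 edges
    e += pairwise_trunc(d_map[1:, :], d_map[:-1, :]).sum()      # vertical N4 edges
    return e

# Usage with synthetic images, just to show the call pattern.
rng = np.random.default_rng(0)
left = rng.random((40, 60))
right = np.roll(left, -2, axis=1)                               # true shift of 2 pixels
u = unary_ssd(left, right, D=8)
d_wta = u.argmin(axis=2)                                        # pixel-independent "winner takes all"
print("E(d_wta) =", stereo_energy(d_wta, u))
```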
Recap: 4-connected MRFs
• A lot of useful vision systems are based on 4-connected pairwise MRFs.
• Possible reason (see optimization part): many fast and good (globally optimal) optimization techniques exist.

Random field models
4-connected, pairwise MRF: E(x) = ∑_{i,j Є N4} θij(xi,xj) — order 2, "pairwise energy"
higher(8)-connected, pairwise MRF: E(x) = ∑_{i,j Є N8} θij(xi,xj) — order 2
Higher-order RF: E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn) — order n, "higher-order energy"

Why larger connectivity?
We have seen…
• the "knock-on" effect (each pixel influences every other pixel)
• many good systems
What is missing:
1. Modelling real-world texture (images)
2. Reducing discretization artefacts
3. Encoding complex prior knowledge
4. Using non-local parameters

Reason 1: Texture modelling
(figure: training images; test image and result with a 4-connected MRF; test image with 60% noise and result with a 4-connected MRF; result with a 9-connected MRF (7 attractive, 2 repulsive neighbours))

Reason 2: Discretization (metrication) artefacts
(figure/table: lengths of example paths measured with the 4-connected, 8-connected and Euclidean metrics)
Larger connectivity can model the true Euclidean length (other metrics are also possible) [Boykov et al. '03, '05]

Reason 2: Discretization (metrication) artefacts
(figure: segmentations with 4-connected Euclidean, 8-connected Euclidean (MRF), and 8-connected geodesic (CRF) weights)
(last lecture: θij(xi,xj,zi,zj) = |xi-xj| exp{-β||zi-zj||})
[Boykov et al. '03, '05]

3D reconstruction
Metrication artefacts are more visible when no observation z is available.
[slide credits: Daniel Cremers]

Reason 3: Encode complex prior knowledge: Stereo with occlusion
E(d): {1,…,D}^{2n} → ℝ
Each pixel is connected to D pixels in the other image.
Pairwise term θlr(dl,dr) between left disparity dl and right disparity dr
(figure: D×D cost table between left and right views; e.g. d=1 (∞ cost), d=10 (match), d=20 (0 cost))

Stereo with occlusion
(figure: ground truth; stereo with occlusion [Kolmogorov et al. '02]; stereo without occlusion [Boykov et al. '01])

Reason 4: Use non-local parameters: Interactive Segmentation (GrabCut)
Model segmentation and colour model w jointly:
E(x,w): {0,1}^n x {GMMs} → ℝ
E(x,w) = ∑_i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)
An object is a compact set of colours (figure: foreground/background colour distributions modelled as GMMs).
[Rother et al. Siggraph '04]
(a code sketch of the colour unaries and edge weights follows at the end of this section)

Reason 4: Use non-local parameters: Object recognition & segmentation
xi ∊ {1,…,K} for K object classes (e.g. sky, building, tree, grass)
E(x,ω) = ∑_i θi(ω,xi) (colour) + ∑_i θi(xi) (location) + ∑_i θi(xi) (class, boosted textons) + ∑_{i,j} θij(xi,xj) (edge-aware Ising prior)
(figure: class and location potentials; result for sky (class) + grass)
[TextonBoost; Shotton et al. '06]

Reason 4: Use non-local parameters: Object recognition & segmentation
(figure: results using class + location + edges + colour)
[TextonBoost; Shotton et al. '06]

Reason 4: Use non-local parameters: Object recognition & segmentation
Good results … (figure)
[TextonBoost; Shotton et al. '06]

Reason 4: Use non-local parameters: Object recognition & segmentation
Failure cases… (figure)

Reason 4: Use non-local parameters: Object recognition & segmentation
Goal: detect and segment a test image.
Large set of example segmentations T(1), T(2), T(3), … (up to 2,000,000 shape templates)
E(x,w): {0,1}^n x {Exemplar} → ℝ
E(x,w) = ∑_i |T(w)i - xi| + ∑_{i,j Є N4} θij(xi,xj)
The first term is the Hamming distance to the selected template T(w).
[Lempitsky et al. ECCV '08]

Reason 4: Use non-local parameters: Object recognition & segmentation
(figure: results on the UIUC dataset; 98.8% accuracy)
[Lempitsky et al. ECCV '08]

Reason 4: Use non-local parameters: Recognition with Latent/Hidden CRFs
(figure: "instance", "instance label", "parts")
[LayoutCRF; Winn et al. '06]
• Many other examples:
  • ObjCut [Kumar et al. '05]
  • Deformable Part Model [Felzenszwalb et al. CVPR '08]
  • PoseCut [Bray et al. '06]
  • Branch&Mincut [Lempitsky et al. ECCV '08]
• Minimization over hidden variables vs. marginalization over hidden variables
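As referenced in the GrabCut slide above, here is a minimal sketch of two ingredients of the joint energy E(x,w): GMM colour unaries θi(xi,w) and contrast-sensitive pairwise weights. It assumes scikit-learn is available, all helper names and parameter values are made up, and the squared norm in the edge weight follows common practice (the slides write ||zi-zj||). The actual GrabCut algorithm alternates between fitting the foreground/background GMMs given the current segmentation and minimising E(x,w) over x with a graph cut; the max-flow solver is external and not shown.

```python
# Minimal sketch of GrabCut-style ingredients: GMM colour unaries and
# contrast-sensitive pairwise weights (assumes scikit-learn; helper names are made up).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_colour_models(img, fg_mask, n_components=5):
    """Fit one GMM to the foreground pixels and one to the background pixels (this is 'w')."""
    pix = img.reshape(-1, 3).astype(float)
    fg = GaussianMixture(n_components).fit(pix[fg_mask.ravel()])
    bg = GaussianMixture(n_components).fit(pix[~fg_mask.ravel()])
    return fg, bg

def colour_unaries(img, fg, bg):
    """theta_i(x_i, w) = -log p(z_i | GMM of label x_i); returned as an (H, W, 2) array."""
    H, W, _ = img.shape
    pix = img.reshape(-1, 3).astype(float)
    u_bg = -bg.score_samples(pix)            # cost of labelling a pixel background (x_i = 0)
    u_fg = -fg.score_samples(pix)            # cost of labelling a pixel foreground (x_i = 1)
    return np.stack([u_bg, u_fg], axis=-1).reshape(H, W, 2)

def contrast_weights(img, beta=None):
    """Edge-aware N4 weights w_ij = exp(-beta * ||z_i - z_j||^2)."""
    img = img.astype(float)
    dh = ((img[:, 1:] - img[:, :-1]) ** 2).sum(axis=-1)   # horizontal neighbour differences
    dv = ((img[1:, :] - img[:-1, :]) ** 2).sum(axis=-1)   # vertical neighbour differences
    if beta is None:                                      # heuristic choice: beta = 1 / (2 <||z_i - z_j||^2>)
        beta = 1.0 / (2.0 * np.mean(np.concatenate([dh.ravel(), dv.ravel()])) + 1e-8)
    return np.exp(-beta * dh), np.exp(-beta * dv)

# In GrabCut these ingredients are used inside an alternation:
#   repeat: (1) fit_colour_models on the current segmentation x
#           (2) minimise E(x, w) over x with a graph cut (external max-flow solver, not shown)
```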
Random field models
4-connected, pairwise MRF: E(x) = ∑_{i,j Є N4} θij(xi,xj) — order 2, "pairwise energy"
higher(8)-connected, pairwise MRF: E(x) = ∑_{i,j Є N8} θij(xi,xj) — order 2
Higher-order RF: E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn) — order n, "higher-order energy"

Why higher-order functions?
In general θ(x1,x2,x3) ≠ θ(x1,x2) + θ(x1,x3) + θ(x2,x3)
Reasons for higher-order RFs:
1. Even better image (texture) models:
   - Field of Experts [FoE; Roth et al. '05]
   - Curvature [Woodford et al. '08]
2. Global priors:
   - Connectivity [Vicente et al. '08, Nowozin et al. '09]
   - Better encoding of label statistics [Woodford et al. '09]
   - Converting global variables into global factors [Vicente et al. '09]

Reason 1: Better texture modelling
Higher-order structure is not preserved by pairwise models.
(figure: training images; test image; test image with 60% noise; result with a 9-connected pairwise MRF; result with a higher-order RF)
[Rother et al. CVPR '09]

Reason 2: Use a global prior
The foreground object must be connected:
E(x) = P(x) + h(x), a standard MRF plus a connectivity term,
with h(x) = ∞ if the foreground in x is not 4-connected, and 0 otherwise
(figure: user input and results)
[Vicente et al. '08]

Reason 2: Use a global prior
Remember the bias of the pairwise prior |xi-xj| (figure: results with increased pairwise strength).
Remedy: introduce a global term which controls the label statistics.
(figure: pairwise model vs. global-statistics model; ground truth; noisy input)
[Woodford et al. ICCV '09]

Summary: Discrete labels and discrete domain
• Order of the random field: from 4-connected to higher-order
• Global, auxiliary variables (e.g. the colour model of an object)
• Number of labels
  - Can be very large, e.g. stereo, optical flow, in-painting => discretization artefacts / hierarchical approaches
• Do the labels have a structure?
  - Depth (or pixel intensity) is ordered => pairwise potential functions can have a simpler form (e.g. a function of |xi-xj|)
  - No order for labels: panoramic stitching, in-painting, object recognition
• Comment: discrete labels are often relaxed to continuous ones for "better" optimization (… will come in later lectures)

Discrete labels: Optimization
• Combinatorial optimization
  - Binary, pairwise MRF: graph cut, BHS (QPBO)
  - Multi-label, pairwise: move-making; transformation
  - Binary, higher-order factors: transformation
  - Multi-label, higher-order factors: move-making + transformation
• Dual/problem decomposition
  - Decompose the (NP-)hard problem into tractable ones; solve with e.g. a subgradient technique
• Local search / genetic algorithms
  - ICM, simulated annealing (a small ICM sketch follows below)

Discrete labels: Optimization
• Message-passing techniques
  - In theory applicable to any model (higher order, multi-label, etc.)
  - BP, TRW, TRW-S
• Relaxations (e.g. LP)
  - Relax the original problem (e.g. {0,1} to [0,1]) and solve with existing techniques (e.g. subgradient)
  - Can be applied to any model (depending on the solver used)
  - Connections to message passing (TRW) and combinatorial optimization (QPBO)
• Commercial solvers (e.g. CPLEX)

Discrete labels: Higher-order models
• Arbitrary potentials are only tractable for order < 7 (memory, computation time)
• For order ≥ 7, the potentials need some structure that can be exploited to make them tractable (e.g. a cost that depends only on the number of labels)
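Here is a minimal sketch of ICM (iterated conditional modes), the local-search method mentioned in the list above, for a multi-label pairwise MRF on a 4-connected grid with a Potts smoothness term. The Potts choice, the helper name and the parameters are assumptions, and real systems would rather use graph cuts or TRW-S; ICM simply sweeps the pixels and greedily picks the label with the lowest local energy.

```python
# Minimal sketch of ICM for a multi-label pairwise MRF on a 4-connected grid
# with a Potts smoothness term (the Potts choice and parameters are assumptions).
import numpy as np

def icm(unary, lam=1.0, n_iters=10):
    """unary: (H, W, L) array of data costs theta_i(x_i); returns an (H, W) labelling."""
    H, W, L = unary.shape
    x = unary.argmin(axis=2)                       # initialise with the unary-only minimum
    labels = np.arange(L)
    for _ in range(n_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                # local energy of every candidate label at pixel (i, j)
                cost = unary[i, j].astype(float)
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        cost += lam * (labels != x[ni, nj])   # Potts: 0 if labels agree, lam otherwise
                best = int(cost.argmin())
                if best != x[i, j]:
                    x[i, j] = best
                    changed = True
        if not changed:                            # coordinate-wise local minimum reached
            break
    return x

# Usage: noisy 3-label image, unaries = squared distance to each label value.
rng = np.random.default_rng(0)
gt = np.repeat(np.arange(3), 20)[None, :] * np.ones((30, 1), dtype=int)
obs = gt + rng.normal(0, 0.6, gt.shape)
unary = (obs[..., None] - np.arange(3)) ** 2
print("pixels changed by ICM:", int((icm(unary, lam=2.0) != unary.argmin(axis=2)).sum()))
```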
Discrete domain, continuous label space (x ∊ ℝ^n)
• Gaussian random fields
• Arbitrary continuous-label random fields

Gaussian MRF (GMRF)
P(x) = 1/√(det(2πΣ)) exp{-½ (x-μ)^T Σ^{-1} (x-μ)},  x ∊ ℝ^n — one big Gaussian system
Gibbs distribution: P(x) = 1/f exp{-E(x)} with E(x) = ½ x^T A x - x^T b
Solution: set Ax - b = 0, e.g. x* = (A^T A)^{-1} A^T b (or other solvers)

Example – GMRF
We would like to model E(x) = ∑_{i,j} (xi-xj)² on the chain x1–x2–x3–x4:
E(x) = (x1-x2, x2-x3, x3-x4) (x1-x2, x2-x3, x3-x4)^T
     = (x1, x2, x3, x4) M (x1, x2, x3, x4)^T
with
M = [  1 -1  0  0
      -1  2 -1  0
       0 -1  2 -1
       0  0 -1  1 ] =: ½ A
so E(x) = ½ x^T A x - x^T b with b = 0.

Example GMRF – image denoising
GMRF: E(x) = ½ x^T A x - x^T b
Prior term Ep encodes prior knowledge: Ep(x) = ∑_{i,j ∊ N4} (xi-xj)²
Data term Eu encodes the dependency on the data z: Eu(x) = ∑_i (xi-zi)² = (x-z)^T I (x-z) = x^T I x - 2 x^T z + z^T z
(figure: original; noisy input z; result with continuous labels x, HBF [Szeliski '06], ~15 times faster; result with 256 discrete labels x, TRW-S)

Remember from last lecture
(figure: ideal output, label x, noisy input image z)
Unary potential: |zi-xi|; pairwise potential: |xi-xj|
Learning these potentials (loss-minimising parameter learning) gives non-Gaussian potentials.
[Putting MAP back on the map, Pletscher et al. DAGM 2010]

Detour: the data term is often arbitrary (non-Gaussian)
Unary: θi(di) = ∑_{j Є window(i)} ||lj - r_{j-di}||² — not quadratic in di!
(figure: left image (a), right image (b), ground truth depth; matching cost along a scanline as a function of the disparity d, black = low cost)
[Scharstein et al. IJCV '02]

Detour: the data term is expensive to evaluate everywhere
Unary: θi(di) = ∑_{j Є window(i)} ||lj - r_{j-di}||²
(figure: discrete scanline vs. continuous-valued 3D-plane scanline)
Insight: 3D-aligned windows give an improved data cost, and bigger windows can be used (3x3 discrete vs. 35x35 continuous 3D plane).
Optimizing this is hard: here PatchMatch [Barnes et al. '09] (even more challenging for optical flow, i.e. 2D motion).

Detour: PatchMatch stereo
[Bleyer et al. BMVC '11]

Gaussian MRF for matting
[Levin et al. CVPR '06]
(figure: input z, output α)
Input: z; output: α, F, B
Constraint: zi = αi Fi + (1-αi) Bi — each pixel has 3 constraints and 7 unknowns
E(α) = α^T L α;  α* = argmin_α E(α) s.t. the brush strokes
L is called the matting Laplacian: L(i,j) = 0 if pixels i and j do not share a window (e.g. 3x3); otherwise it is given by a closed-form expression (see paper).
The true solution is a global optimum if, in each window, the foreground (F) and background (B) colours lie on an arbitrary line in RGB space.

… a more physics-based view: [Rhemann et al. CVPR '10]
Problem: E(α) has many solutions! A spatially varying PSF-based prior.
Matting database and competition: http://www.alphamatting.com/

Non-Gaussian (arbitrary) RFs: Image segmentation
[Singaraju, Grady et al. MRF book, ch. 8]
As with matting: no data term, just hard constraints (brush strokes).
Energy: E(x) = ∑_{i,j Є N4} (wij |xi-xj|)^p
Here: 4-connected grid; wij = exp{-β||zi-zj||}; xi ∊ ℝ (as before)
(figure: foreground brush, background brush)
Optimize: x* = argmin_x E(x) s.t. xi = 1 for foreground brush pixels and xi = 0 for background brush pixels
Output: xo = round(x*), i.e. xo_i = 0 if x*_i < 0.5, 1 otherwise
For any p ≥ 1 this can be solved globally optimally (but the solution is not guaranteed to be unique). Here it is optimized with IRLS (iteratively reweighted least squares).
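A minimal sketch (assuming SciPy; all names and parameters are made up) of the p=2 special case of the segmentation energy above: for p=2 the energy is a Gaussian MRF, so the constrained minimiser is obtained by solving a single sparse linear system (the random-walker solution); general p ≥ 1 would be handled by IRLS as stated above.

```python
# Minimal sketch of the p = 2 case: a Gaussian MRF, solved as one sparse linear
# system with the brush pixels fixed (random-walker style). Assumes SciPy.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def segment_p2(img, fg_seeds, bg_seeds, beta=10.0):
    """img: (H, W) grayscale in [0,1]; fg_seeds/bg_seeds: boolean (H, W) brush masks."""
    H, W = img.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)
    z = img.ravel().astype(float)

    # N4 edges with contrast-sensitive weights w_ij = exp(-beta * |z_i - z_j|)
    ei = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    ej = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    w2 = np.exp(-beta * np.abs(z[ei] - z[ej])) ** 2        # p = 2: edges enter as w_ij^2

    # graph Laplacian L = D - A of the weighted N4 grid
    A = sp.coo_matrix((np.r_[w2, w2], (np.r_[ei, ej], np.r_[ej, ei])), shape=(n, n)).tocsr()
    L = (sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A).tocsr()

    seeded = (fg_seeds | bg_seeds).ravel()
    seed_idx = np.flatnonzero(seeded)
    free_idx = np.flatnonzero(~seeded)
    xs = fg_seeds.ravel()[seed_idx].astype(float)          # fixed values: 1 = foreground, 0 = background

    # minimise x^T L x subject to the seeds: L_uu x_u = -L_us x_s
    Luu = L[free_idx][:, free_idx].tocsc()
    Lus = L[free_idx][:, seed_idx]
    xu = spsolve(Luu, -(Lus @ xs))

    x = np.empty(n)
    x[seed_idx] = xs
    x[free_idx] = xu
    return (x.reshape(H, W) > 0.5).astype(np.uint8)        # rounded output x_o

# Usage: synthetic two-region image with one brush stroke in each region.
img = np.zeros((40, 40)); img[:, 20:] = 1.0
fg = np.zeros_like(img, bool); fg[20, 30] = True
bg = np.zeros_like(img, bool); bg[20, 5] = True
mask = segment_p2(img + 0.05 * np.random.default_rng(0).standard_normal(img.shape), fg, bg)
```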
Varying "p"
Energy: E(x) = ∑_{i,j Є N4} (wij |xi-xj|)^p
• p=1: the rounded solution is the same as the solution of the discrete problem xi ∊ {0,1} (i.e. GraphCut [Boykov, Jolly ICCV '01])
• p=2: Gaussian MRF (known as the random walker solution [Grady PAMI '06])
• p → ∞: the solution gets very ambiguous; only one edge is important if x is discrete:
  x* = argmin_x ∑_{i,j Є N4} (wij|xi-xj|)^{p→∞} = argmin_x max_{i,j Є N4} (wij|xi-xj|)
  (one solution is the shortest geodesic path [Bai et al. ICCV '07])

Metrication (discretization) artefacts
see [Singaraju, Grady et al. MRF book, ch. 8]
"Blockiness" of the segmentation due to the underlying pixel grid — larger p is better.

Proximity bias for the p-norm
Sensitivity of the segmentation to the placement of the brush strokes — small p may be better … but the user can adapt to the system!

Shrinking bias
Shortening of the segmentation boundary due to regularization — larger p may be better … again, the user can adapt to the system!
(recall: for p → ∞ only one edge is important if x is discrete)

Field of Experts (FoE)
[Roth et al. '05]
Generative model for a noisy image: P(x|z,w) = P(z|x) P(x|w) / P(z)
MAP: x* = argmax_x P(z|x) P(x|w)
Likelihood: P(z|x,w) ~ ∏_i exp{-1/(2σ²) (zi-xi)²} (pixel-independent Gaussian noise)
Prior for real images:
So far: E(x) = ∑_{i,j} θ(xi-xj), a pairwise potential as a function of |xi-xj|
FoE: P(x|w) = 1/f ∏_{F ∊ 𝔽} ∏_k (1 + ½ (Jk^T xN(F))²)^{-wk}
i.e. a product over windows (e.g. 5x5) and over linear filters Jk (with weights wk); each term is a "heavy-tailed" Student-t distribution.

Results
(figure: denoising results)
Even better:
• Discriminative models [Tappen, MRF book, ch. 13]
• (hand-crafted) discriminative functions such as BM3D

In-painting: take advantage of a generative model
P(x|z,w) ~ p(z|x) p(x|w)
Likelihood: p(z|x) is either given (observed area) or uniform (unobserved area)

In-painting results
(figure: image (zoom); FoE; smoothing MRF)
[Bertalmio et al. Siggraph '00]

Summary: Continuous labels
• Mostly used for problems where the underlying discrete labels are ordered:
  - image intensity, depth, motion, transparency (matting)
  - the regularization term (pairwise, higher-order) can be written in some parametric form (Gaussian, filters, etc.)
• The data term can be quite arbitrary and expensive to evaluate everywhere (e.g. 3D planes for depth estimation)
• Often done in the context of continuous-domain methods (comes next)

Continuous labels: Optimization
• Closed form for the Gaussian case
• Gradient-descent style optimization (e.g. BFGS, the Broyden–Fletcher–Goldfarb–Shanno method)
• Convex relaxation
• Message passing
• Fusion move (see later lectures)

Continuous domain, continuous label space
… just a few comments

From continuous domain to discrete domain
[Schelten and Roth CVPR '11]
Domain: Ω ⊂ ℝ²; data u: Ω → ℝ; function f: Ω → ℝ, e.g. piecewise linear (many other realizations possible)
Energy: E(f) = ∫_Ω (f - u)²  (convex data term)  +  ∫_Ω |∇f|  (total-variation smoothness)
The smoothness term can be written as an MRF factor graph (figure: factors for the smoothness term; results).
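A minimal sketch (illustrative only; the parameters and the ε-smoothing are assumptions) of minimising a discretised version of the total-variation energy above by plain gradient descent. The lecture points to primal-dual solvers as the practical choice; this is only meant to show the discrete-domain form of the energy.

```python
# Minimal sketch: gradient descent on a discretised, eps-smoothed TV energy
# sum (f - u)^2 + lam * sum sqrt(|grad f|^2 + eps^2). Parameters are assumptions.
import numpy as np

def tv_denoise(u, lam=0.2, eps=1e-2, step=0.05, n_iters=500):
    f = u.astype(float).copy()
    for _ in range(n_iters):
        # forward differences with replicated (Neumann) boundary
        fx = np.diff(f, axis=1, append=f[:, -1:])
        fy = np.diff(f, axis=0, append=f[-1:, :])
        mag = np.sqrt(fx ** 2 + fy ** 2 + eps ** 2)        # smoothed |grad f|
        px, py = fx / mag, fy / mag

        # divergence of (px, py): the (negative) adjoint of the forward differences
        div = np.zeros_like(f)
        div[:, 0] += px[:, 0]
        div[:, 1:] += px[:, 1:] - px[:, :-1]
        div[0, :] += py[0, :]
        div[1:, :] += py[1:, :] - py[:-1, :]

        grad = 2.0 * (f - u) - lam * div                   # dE/df of the smoothed energy
        f -= step * grad
    return f

# Usage: denoise a noisy step image (synthetic example).
rng = np.random.default_rng(0)
clean = np.zeros((64, 64)); clean[:, 32:] = 1.0
noisy = clean + 0.2 * rng.standard_normal(clean.shape)
denoised = tv_denoise(noisy, lam=0.3)
```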
Continuous domain (my take)
• Main advantage: do continuous math and then commit to a certain discretization of the image.
• One argument is that the world is continuous … but then you have to model the image formation process (e.g. the camera PSF), which leads to a higher-order model.
• No probabilistic view is available (only loss-based learning is possible).
• Fast (primal-dual) GPU solvers are available.
• The relationship to the discrete counterparts would be good to establish.
• Most continuous-domain methods use first-order derivatives (e.g. total variation). This is inferior to higher-order derivatives; it would be interesting to write an FoE model in the continuous domain.
• The data term often has to be discretized.
• In general: the connection to discrete-domain and/or discrete-label techniques has to be explored better.

Lecture Summary (my take)
• Discrete domain, discrete labels:
  - many applications (low-level and high-level vision)
  - a lot of research on optimization and learning
• Discrete domain, continuous labels:
  - less often used in vision
  - the data term often has to be discretized
• Continuous domain, continuous labels:
  - the connections to discrete domain/labels have to be better understood
  - the data term often has to be discretized