Paper implementation, Optimization Class by Prof. Yu-Wing Tai
EE 20105034 Seong-Heum Kim

Contents
• Introduction to MVOS (Multi-View Object Segmentation)
• Algorithm overview
• Contribution of the paper
  - Optimizing MVOS in space and time
  - Efficient 3D sampling with a 2D superpixel representation
• Implementation issues
• Evaluation
• Conclusion

Introduction to MVOS
• What is "Multi-View Object Segmentation"?

  Method                         | Conditions              | Key ideas
  Multi-view object segmentation | More than 2 views       | Sharing a common geometric model
  Interactive segmentation      | Single image with seeds | Bounding-box (or stroke) priors
  Image co-segmentation         | More than 2 images      | Sharing a common appearance model

• What distinguishes MVOS: the projection relations (matrices) are known, camera poses provide
  bounding boxes, and no common appearance model is needed.

• Problem definition
  Given  1) Images, I = {I_1, I_2, I_3, ..., I_n}
         2) Projection matrices, KRT = {KRT_1, KRT_2, KRT_3, ..., KRT_n}
            (known intrinsic and extrinsic viewpoints)
  Take   segmentation maps, X = {X_1, X_2, X_3, ..., X_n}
  Where  I^n = {I^n_1, I^n_2, I^n_3, ..., I^n_k}, I^n_k: colors (R, G, B) at the k-th pixel of the n-th viewpoint,
         X^n = {X^n_1, X^n_2, X^n_3, ..., X^n_k}, X^n_k ∈ {F, B}: binary label at the k-th pixel of the n-th image.
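Since the projection matrices KRT are given, each 3D point maps to a pixel in each view by x ~ K(RX + t). A minimal sketch of that relation, with made-up calibration values (the `project` helper and the toy K, R, t below are illustrative, not the paper's data):

```python
def project(K, R, t, X):
    """Project a 3D point X to pixel coordinates with intrinsics K and pose (R, t)."""
    # Camera-space coordinates: Xc = R @ X + t (written out without numpy).
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous image coordinates: u = K @ Xc.
    u = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Perspective division gives the pixel position.
    return (u[0] / u[2], u[1] / u[2])

# A toy pinhole camera: focal length 500, principal point (320, 240),
# identity rotation, translated 5 units along the optical axis.
K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = [0, 0, 5]

print(project(K, R, t, [0, 0, 0]))   # a point on the optical axis -> (320.0, 240.0)
```

With fixed poses, this mapping is what links a 3D sample to the superpixel it falls into in every view.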
Related works
• Building segmentations consistent with a single 3D object
  - Zeng04accv: Silhouette extraction from multiple images of an unknown background
  - Yezzi03ijcv: Stereoscopic segmentation
• Joint optimization of segmentation and 3D reconstruction
  - Xiao07iccv: Joint affinity propagation for multiple view segmentation
  - Campbell07bmvc: Automatic 3D object segmentation in multiple views using volumetric graph-cuts
  - Guillemaut11ijcv: Joint multi-layer segmentation and reconstruction for free-viewpoint video applications
• Recent formulations for better results
  - Djelouah12eccv: N-tuple color segmentation for multi-view silhouette extraction
  - Kowdle12eccv: Multiple view object co-segmentation using appearance and stereo cues
  - Lee11pami: Silhouette segmentation in multiple views
• Optimizing MVOS in space and time
  - Djelouah13iccv: Multi-view object segmentation in space and time (this paper)

Background: MRF-based segmentation
(Slides from "GraphCut-based Optimisation for Computer Vision", Ľubor Ladický's tutorial at CVPR12 [2])
• The segmentation energy is the sum of a data term and a smoothness term.
• Data term (region): estimated from foreground / background colour models.
• Smoothness term (boundary): intensity-dependent smoothness between neighbouring pixels.
• How to solve this optimization problem?
  - Transform the MAP problem into MRF energy minimization.
  - Solve it with a min-cut / max-flow algorithm.

Background: Graph model (undirected)
• Regularization pushes "strongly linked nodes" to take the same label.
• The key questions are: 1) how do we define the nodes, 2) how are they linked to each other, and 3) with what strength?
  [Figure: a toy five-node graph with source and sink terminals. Energy terms define the edge capacities; solving and taking the residual graph (no regularization) yields Source = {3, 4}, Sink = {1, 2, 5}.]

Background: Graph model (undirected)
• Maxflow algorithm (Ford & Fulkerson algorithm, 1956)
• Iteratively: 1) find the active nodes along an augmenting path, 2) push the bottleneck capacity along it, and 3) check whether any active flow remains.
  [Figures: pixels linked by their similarity in the toy graph; successive augmenting paths raise the total flow from 1 up to 7.]
• When no further path exists, the result is globally optimal in the two-terminal case (because
any cut upper-bounds the flow, and here the maximum flow reaches that bound of 8).
  [Figure: final residual graph; Source = {3, 4, 5}, Sink = {1, 2}, maxflow = 8.]

Background: Graph-cut
• Duality of the min-cut problem: any cut ≥ the maxflow of 8 (the sub-optimal cut shown in the figure costs 8 + 1).
• Sub-modularity: E(0,1) + E(1,0) ≥ E(0,0) + E(1,1) (= 0 here).
• Recipe: 1) design an energy function over nodes and edges (linkages); 2) solve it as an MRF (here Source = {3, 4, 5}, Sink = {1, 2}, min-cut = maxflow = 8).

Contribution of the paper
1. MRF optimization over all viewpoints and all frames at the same time
   - Linkages between 3D samples and the superpixels they project to
   - Linkages between corresponding superpixels across frames
2. Sparse 3D sampling with a superpixel representation
   - Fast and simple 3D model
   - Richer representation of texture information (appearance): a bag-of-words (BoW) model over small patches

MVOS in space and time
• Multi-view object segmentation (MVOS) in space and time: known projection relations
  (bounding boxes from camera poses), more than 2 viewpoints sharing a common set of 3D samples,
  and temporal motions (SIFT-flow) linking matched superpixels between frames.
• Problem definition
  Given  1) Sets of superpixels in the images at time t: P^t = {P^t_1, P^t_2, P^t_3, ..., P^t_n}, p ∈ P^t_i
         2) Projection matrices, KRT = {KRT_1, KRT_2, KRT_3, ..., KRT_n} (fixed camera poses)
  Take   superpixel segmentations X^{n,t} = {X^{n,t}_1, X^{n,t}_2, X^{n,t}_3, ..., X^{n,t}_k} for every viewpoint n and time t,
  Where  X^{n,t}_k ∈ {F, B}: binary label of the k-th superpixel of the n-th image at time t,
         R^{n,t}_p = {I^{n,t}_1, I^{n,t}_2, I^{n,t}_3, ..., I^{n,t}_r}: set of pixels in superpixel p,
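The augmenting-path procedure from the maxflow background above (find a path, push its bottleneck capacity, stop when no path remains) can be sketched as follows. This is a generic Edmonds-Karp sketch on a made-up toy graph, not the paper's solver (the paper uses Boykov's maxflow code):

```python
from collections import deque

def maxflow(cap, s, t):
    """cap: dict {u: {v: capacity}}; returns the maximum s-t flow value."""
    # Build a mutable residual graph, adding zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in cap:
        for v in cap[u]:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # 1) Find active nodes: BFS for a shortest augmenting path.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow              # 3) no augmenting path left: flow is maximal
        # 2) Bottleneck capacity along the path found.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        # Push the bottleneck, updating forward and reverse residual capacities.
        for u, v in path:
            res[u][v] -= b
            res[v][u] += b
        flow += b

# Toy graph: the maximum s-t flow here is 3 (1 via s-a-t, 2 via s-b-t).
g = {'s': {'a': 2, 'b': 2}, 'a': {'b': 1, 't': 1}, 'b': {'t': 2}, 't': {}}
print(maxflow(g, 's', 't'))   # -> 3
```

By min-cut/max-flow duality, the same value is the cost of the cheapest source/sink cut, which is what assigns the binary F/B labels in the MRF.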
         s ∈ S^t: set of 3D samples at time t.

Big picture of the paper
• Three physical concepts are formulated as energy terms: time consistency, the appearance model, and the geometric constraint.
• The energy terms:
  - Appearance data term: color + texture
  - Appearance smoothness term: spatially neighboring superpixels
  - Appearance smoothness term: non-locally connected superpixels
  - 3D sample data term: probabilistic occupancy
  - Sample-superpixel junction term: sharing a coherent geometric model
  - Sample-projection data term: imposing a projection constraint

Overview
[Figures: one of the eight input images and its superpixels; neighboring and non-local superpixel linkages; the constraint from camera poses; updating the geometric model; final mean accuracy 95% (±1%).]

Superpixel linkages
• Directed graph linking 3D samples to superpixels (sample-superpixel junction term: sharing a coherent geometric model).
  [Figure: a source/sink graph connecting the 3D samples S^t to the superpixel sets P^t_1, P^t_2; the junction edge between a sample s and a superpixel p carries an effectively infinite capacity (1000 in the implementation), paired with a zero-capacity reverse edge.]
• Linking temporal correspondences: motion fields from KLT or SIFT-flow connect the superpixels of P^t_n to those of P^{t+1}_n (time consistency term).

Sparse 3D samples with superpixel representation
• Why do we need superpixels (groups of pixels) for segmentation?
  - Superpixels require fewer 3D samples, so quick, rough segmentations can be computed efficiently
    (fewer samples are needed to describe the scene, even at lower resolution).
  - The colors of a single pixel are not enough information to encode texture.
• Texture is, by definition, a vector or a histogram of certain measures (e.g.
gradient) over a local "patch".
• Gradient magnitude responses at 4 scales, Laplacian at 2 scales.
• K-means builds the texture vocabulary (60-150 words) used to create the superpixel descriptors.
• Texture similarity is modeled by the chi-squared distance between the two normalized histograms of the superpixels.

Implementation issues
• Work in progress
  - Initializing an MVOS system
  - Finding reliable matches between frames
  - Sampling and keeping 3D points
  - Making a better appearance model
• Software, as used in the paper
  - Getting datasets: VisualSFM (by Changchang Wu, http://ccwu.me/vsfm/)
  - Making superpixels: SLIC (by Radhakrishna Achanta, http://ivrg.epfl.ch/research/superpixels)
  - Finding temporal correspondences: SIFT, SIFT-flow (by Ce Liu, http://people.csail.mit.edu/celiu/SIFTflow/)
  - Solving the constructed MRF: Maxflow (by Yuri Boykov, http://www.csd.uwo.ca/~yuri/)

Implementation issues: initializing an MVOS system
• The object should lie in the intersection of all views; camera poses give a kind of bounding box
  (an initial prior), eliminating about 20-25% of the pixels.
• If that is not enough: 1) 5-10 pixels along the frame boundary can additionally be removed;
  2) user-given points in a few views may be required as an initial constraint.
• More views intersect the space more tightly.

Implementation issues: finding reliable matches between frames
• Accurate foreground correspondences are few, while SIFT matches in background clutter are effectively connected between frames.
• Not every superpixel is temporally linked in the current implementation.
• KLT and SIFT-flow work well on textured backgrounds.
• Some blobs (e.g. a human head) or a few strong points can be linked, but wrong pairs may degrade the overall performance.

Implementation issues: sampling and keeping 3D samples
• Low-resolution images and the superpixel representation reduce processing time and the number of points needed.
• The visibility of 3D samples also removes unnecessary 3D points and helps correct linking across views.
  Method                                       | Processing time
  3D reconstruction (SfS-based) [3]            | 3 min
  3D ray (2D samples along epipolar lines) [4] | 1 min
  3D sparse samples [1]                        | 5 sec
  3D visible points                            | 12 sec

  [3] Campbell07bmvc: Automatic 3D object segmentation in multiple views using volumetric graph-cuts
  [4] Lee11pami: Silhouette segmentation in multiple views
  [1] Djelouah13iccv: Multi-view object segmentation in space and time (this paper)

Implementation issues: making a better appearance model
• Simple gradient magnitudes are not very discriminative because they lose directional information.
• Slightly modified [5] for defining colors and textures. Given I_k, the color at the k-th pixel of an image, take:
  For color at a pixel (GMM):
    1) the normalized L, a, b in Lab color space
    2) Gaussians of the R, G, B channels at two different scales
  For texture at a superpixel (BoW model):
    3) derivatives of L (dx, dy, dxy, dyx) and derivatives of the Gaussian of L
    4) Laplacian of L at three different scales
• On a 3x3 neighborhood I_1..I_9 centered at I_5:
    dx = I5 - I6,  dy = I5 - I8,  dxy = I5 - I9,  dyx = I5 - I7
    Laplacian of L = 4*I5 - I2 - I4 - I6 - I8
  [5] Shotton07ijcv: "TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context"

Implementation issues: making a better appearance model
• Superpixel segmentation of single images using ground-truth masks:
  1) Given the ground-truth masks, build the appearance models and find the solutions again with MRF regularization.
  2) [Mean, Std.]
of 27 images [6], for the energy "Color(GMM) + b*Texture(BoW) + lambda*Regularization", over (b, lambda):

  lambda \ b | 0.0         | 0.2         | 0.4         | 0.6         | 0.8         | 1.0
  0          | .9136/.0681 | .9304/.0499 | .9335/.0499 | .9329/.0509 | .9286/.0510 | .9224/.0517
  1          | .9164/.0697 | .9382/.0432 | .9415/.0407 | .9415/.0417 | .9418/.0417 | .9379/.0469
  2          | .9137/.0713 | .9357/.0457 | .9414/.0400 | .9420/.0385 | .9447/.0357 | .9413/.0435
  3          | .9097/.0772 | .9319/.0520 | .9359/.0486 | .9416/.0384 | .9449/.0345 | .9438/.0378
  4          | .9084/.0783 | .9296/.0537 | .9339/.0509 | .9424/.0381 | .9436/.0356 | .9443/.0358

  Adding texture and regularization raises the mean by 3.1% and lowers the std by 3.4%
  in IoU (intersection-over-union metric) = (mask & gt) / (mask | gt).
  [6] Christoph Rhemann, cvpr09, http://www.alphamatting.com/datasets.php

Experimental results
• Implementation details
  - About 25% of the pixels are eliminated by the initial constraint
  - λ1 = 2, λ2 = 4 (2D smoothness), λ3 = 0.05 (3D data term) in the iterative optimization
  - Fewer than 10 iterations to converge, each taking only about 10 sec
• Datasets
  - COUCH, BEAR, CAR, CHAIR1 [7] for qualitative and quantitative evaluation
  - BUSTE, PLANT [4] for qualitative evaluation
  - DANCERS [8], HALF-PIPE [9] for video segmentation
• Comparisons
  - N-tuple color segmentation for multi-view silhouette extraction, Djelouah12eccv [10]
  - Multiple view object cosegmentation using appearance and stereo cues, Kowdle12eccv [7]
  - Object co-segmentation (without any multi-view constraints), Vicente11cvpr [11]

Experimental results
[Figure: qualitative results ("good enough").]

Experimental results
• Evaluation: mean and std. in IoU (intersection-over-union metric) = (mask & gt) / (mask | gt).
• Little sensitivity to the number of viewpoints: the visual hull constraint remains strong even with few viewpoints.
• Still, more accurate depth information plus plane detection gives better results within the SfM framework [7].

Experimental results
• Evaluation: mean and std.
in IoU (intersection-over-union metric) = (mask & gt) / (mask | gt).
• Superpixel segmentations from my initial implementation (not refined at pixel level):

  Name             | # of images | Mean   | Std.  | GT (Photoshop)
  1. Lion1         | 12          | 94.81% | 0.89% | Matte
  2. Lion2         | 8           | 92.30% | 1.21% | Matte
  3. Rabbit        | 8           | 92.51% | 2.05% | Matte
  4. Tree          | 10          | 90.49% | 1.90% | Matte
  5. Kimono        | 10          | 93.92% | 2.87% | Matte
  6. Earth         | 8           | 96.66% | 1.71% | Binary mask
  7. Person        | 8           | 93.23% | 1.75% | Binary mask
  8. Person (seq.) | 8x3         | 95.14% | 1.19% | Binary mask
  9. Bear [1]      | 8           | 92.48% | 2.08% | [1]
  Avg.             |             | 93.5%  | 1.74% |

  [1] An executable was not available (the authors say it is the property of Technicolor),
  but the author sent me their datasets and ground truths (11/4), on which I am still
  evaluating the current implementation.

Experimental results
[Figures: qualitative results for 2. Lion2 and 4. Tree.]

Experimental results
[Figures: qualitative results for 5. Kimono and 9. Bear [1].]

Experimental results
[Figures: 8. Person (seq.) at t1, t2, t3.]

Discussion & conclusion
• An approach that solves video MVOS with iterated joint graph cuts.
• Efficient superpixel segmentations (with sparse 3D samples) in a short time.
• It works well even when far fewer viewpoints are available.
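The IoU metric used throughout the evaluation above, (mask & gt) / (mask | gt), can be sketched on binary masks as follows; the toy 2x2 masks are made up for illustration:

```python
def iou(mask, gt):
    """Intersection over union, (mask & gt) / (mask | gt), for binary 2D masks."""
    # Count pixels labeled foreground in both masks (intersection)...
    inter = sum(m and g for row_m, row_g in zip(mask, gt)
                        for m, g in zip(row_m, row_g))
    # ...and pixels labeled foreground in at least one mask (union).
    union = sum(m or g for row_m, row_g in zip(mask, gt)
                       for m, g in zip(row_m, row_g))
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0

# Toy example: prediction and ground truth share 1 of 3 foreground pixels.
pred = [[1, 1], [0, 0]]
gt   = [[1, 0], [1, 0]]
print(iou(pred, gt))   # 1 / 3 -> 0.333...
```

The tables above report the mean and standard deviation of this score over the images of each dataset.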
References
[1] Djelouah13iccv: "Multi-view object segmentation in space and time" (this paper)
[2] Ľubor Ladický, tutorial at CVPR12: "GraphCut-based Optimisation for Computer Vision"
[3] Campbell07bmvc: "Automatic 3D object segmentation in multiple views using volumetric graph-cuts"
[4] Lee11pami: "Silhouette segmentation in multiple views"
[5] Shotton07ijcv: "TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context"
[6] Christoph Rhemann, cvpr09: http://www.alphamatting.com/datasets.php
[7] Kowdle12eccv: "Multiple view object cosegmentation using appearance and stereo cues"
[8] Guillemaut11ijcv: "Joint multi-layer segmentation and reconstruction for free-viewpoint video applications"
[9] Hasler09cvpr: "Markerless motion capture with unsynchronized moving cameras"
[10] Djelouah12eccv: "N-tuple color segmentation for multi-view silhouette extraction"
[11] Vicente11cvpr: "Object cosegmentation"
[12] Marco Alexander Treiber, Springer 2013: "Optimization for Computer Vision: An Introduction to Core Concepts and Methods"