Video Segment Proposals With Painless Occlusion Handling
Zhengyang Wu^1, Fuxin Li^1, Rahul Sukthankar^2, James M. Rehg^1

Motivation
What is the definition of an object in video? The definition is ambiguous: even human annotators differ among themselves, so a pool of segment proposals is required for video. Prior work merges superpixels/supervoxels; we instead track a pool of image segment proposals to generate video segment proposals, each with a learned appearance model.

Goal
Discover holistic video segment proposals that:
1. capture all definitions of objects in video automatically (unsupervised);
2. handle long-term complete occlusion efficiently;
3. simultaneously learn an appearance model for each tracked segment.

Least Square Tracker
Each track trains an appearance model W by regularized least squares:

    min_W ||XW - V||_F^2 + λ||W||_F^2

where X collects segment features (training examples) and V is the spatial overlap matrix (targets). A well-trained model (track) can generalize across the video: tested on new frames, it assigns high weights to segments of its own object and low weights to others (e.g., for track 2: x_1 with w_2 ≈ 0, x_2 with w_2 ≈ 1, x_3 with w_2 ≈ 0.55).

Merge Move and Backtracking
Problem: maintaining Least Square Objects (LSOs) is very time-consuming, and there are too many LSOs to maintain if we do not merge them. Two adjacent LSOs differ by only one frame - e.g. L_1 = (H_1, C_1) and L_2 = (H_2, C_2) can be merged into L*_1 = (H_1, C_1, C_2 + X_t^T V_t). Why not merge LSOs in every frame? Because K frames are required to determine that a track is reliable. Solution: backtrack to merge LSOs every K frames - e.g. at frame 20, merge L_10 into L*_9, then L*_9 into L*_8, ..., until L*_2 into L*_1.

Occlusion Handling
Problem: occlusions happen frequently and break tracks into parts. Intuition: occluded tracks (targets) have zero overlap with all segments, so their target columns are all zero and the corresponding columns of C require no calculation at all:

    C_t^v = C_{t-1}^v + X_t^T V_t   (visible track),
    C_t^o = C_{t-1}^o               (occluded track).

Experimental Setup
Comparison approaches:
[31] F. Li et al., Video segmentation by tracking many figure-ground segments.
[19] M. Grundmann et al., Efficient hierarchical graph-based video segmentation.
[15] F. Galasso et al., Spectral graph reduction for efficient image and streaming video segmentation.
Detailed performance and time-complexity metrics: Sv = averaged per video; So = averaged per ground-truth (GT) object; A=X means X annotators agree on this GT object; L=X means this GT object lasts for X GT frames.
Cross-video segmentation: 10 object classes, each with 4-13 videos.
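The least-squares bookkeeping can be made concrete with a minimal NumPy sketch (the class and method names below are illustrative, not from the authors' released code): the sufficient statistics H = Σ_t X_t^T X_t and C = Σ_t X_t^T V_t are accumulated online, and the model solves (H + λI)W = C only when predictions are needed.

```python
import numpy as np

class LSO:
    """Least Square Object: sufficient statistics (H, C) of a ridge regressor.

    Illustrative sketch only. H accumulates X^T X over frames, C accumulates
    X^T V; the model W solves (H + lam*I) W = C and predicts segment-to-track
    overlaps as X W.
    """

    def __init__(self, dim, n_tracks, lam=1.0):
        self.H = np.zeros((dim, dim))
        self.C = np.zeros((dim, n_tracks))
        self.lam = lam

    def update(self, X_t, V_t):
        # Online update with frame t's segment features and overlap targets.
        self.H += X_t.T @ X_t
        self.C += X_t.T @ V_t

    def solve(self):
        # W = (H + lam*I)^{-1} C, solved as a linear system for stability.
        d = self.H.shape[0]
        return np.linalg.solve(self.H + self.lam * np.eye(d), self.C)

    def predict(self, X):
        # Predicted spatial overlap of each segment (row of X) with each track.
        return X @ self.solve()
```

Because only (H, C) are stored, two LSOs can be merged by adding their statistics, which is what makes the merge move cheap.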
Least Square Object (LSO)
Let x_i be the i-th row of X (a segment's features, the training example) and v_i the i-th row of V (its spatial overlaps, the target). The least-squares solution satisfies

    (H + λI) W = C,   where   H = X^T X = Σ_i x_i^T x_i,   C = X^T V = Σ_i x_i^T v_i.

The pair L = (H, C) is a sufficient statistic of the trained appearance model; we call it a Least Square Object (LSO). It is updated online frame by frame:

    H_t = H_{t-1} + X_t^T X_t,   C_t = C_{t-1} + X_t^T V_t.

Each column w_j of the trained model W predicts the overlap x_i w_j between segment i (feature) and track j (target). Spurious tracks are eliminated to prevent them from contaminating the LSOs.

Occlusion Detection and Re-discovery
Write C_y^x for track x's column of the C matrix at frame y, and suppose track 2 is occluded from frame 14 and re-discovered at frame 26:

Frames 2-13 (both tracks normal): (C_{t-1}^1, C_{t-1}^2) + X_t^T V_t
Frame 14 (occlusion detected using frames 3-13): C_13^1 + X_14^T V_14,  C_13^2 + 0
Frames 15-25 (track 2 occluded): C_{t-1}^1 + X_t^T V_t,  C_13^2 + 0
Frame 26 (re-discovery): C_25^1 + X_26^T V_26,  C_13^2 + X_26^T V_26
Frame 27 onward: (C_{t-1}^1, C_{t-1}^2) + X_t^T V_t

The occluded column's update is exactly zero, so no calculation is needed while a track is occluded, and the track resumes seamlessly upon re-discovery.

Experiments on the VSB-100 Dataset
VSB-100 contains 100 videos with pixel-level annotation in every 20th frame, from 4 annotators. Overall, our result is the best among the competitors. Selected top-3 retrievals per training video and precision-recall curves are shown in the poster figures (figure panels omitted here).

Conclusion
1. Performance is already close to that of image segmentation.
2. The method can handle long-term, multiple, complete occlusions.
Code will be available.
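The frame-by-frame bookkeeping above can be sketched as a single update function (a minimal illustration under assumed names; in practice the `occluded` flags come from the occlusion detector run over a window of frames):

```python
import numpy as np

def update_C(C, X_t, V_t, occluded):
    """Per-frame update of the C matrix with frozen occluded columns.

    C        : (dim, n_tracks) accumulated X^T V statistic, updated in place
    X_t      : (n_segments, dim) segment features at frame t
    V_t      : (n_segments, n_tracks) segment-track overlaps at frame t
    occluded : boolean mask of length n_tracks; an occluded track has zero
               overlap with every segment, so its column of X_t^T V_t is
               zero and can simply be skipped ("painless").
    """
    visible = ~np.asarray(occluded)
    # Update only the visible columns; occluded columns stay unchanged,
    # ready to resume when the track is re-discovered.
    C[:, visible] += X_t.T @ V_t[:, visible]
    return C
```

Skipping the occluded columns changes nothing numerically (their targets are all zero) but avoids the wasted computation.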
3. Deep learning (LSTM) for cross-video segmentation is the future direction.

http://www.cc.gatech.edu/~fli/SegTrack2/Occlusion/index.html