Video Segment Proposals With Painless Occlusion Handling
Zhengyang Wu^1, Fuxin Li^1, Rahul Sukthankar^2, James M. Rehg^1

Motivation
What is the definition of an object in video? The definition is ambiguous: even human annotators differ among themselves, so a pool of segment proposals is required for video. Prior work merges superpixels/supervoxels; we instead track a pool of image segment proposals to generate video segment proposals, each with a learned appearance model.

Goal
Discover holistic video segment proposals that:
1. capture all definitions of objects in video automatically (unsupervised);
2. handle long-term complete occlusion efficiently;
3. simultaneously learn an appearance model for each tracked segment.

Least Square Tracker
Each track trains an appearance model W by regularized least squares:

    min_W ||XW - V||_F^2 + λ||W||_F^2

where X collects segment features (training examples) and V is the spatial overlap matrix (targets). A well-trained model (track) can generalize across the video: tested on new frames, it assigns high weights to segments of its own object and low weights to others (e.g., for track 2: x_1 with w_2 ≈ 0, x_2 with w_2 ≈ 1, x_3 with w_2 ≈ 0.55).

Merge Move and Backtracking
Problem: maintaining Least Square Objects (LSOs) is very time-consuming, and there are too many LSOs to maintain if we do not merge them. Two adjacent LSOs differ by only one frame - e.g. L_1 = (H_1, C_1) and L_2 = (H_2, C_2) can be merged into L*_1 = (H_1, C_1, C_2 + X_t^T V_t). Why not merge LSOs in every frame? Because K frames are required to determine that a track is reliable. Solution: backtrack to merge LSOs every K frames - e.g. at frame 20, merge L_10 into L*_9, then L*_9 into L*_8, ..., until L*_2 into L*_1.

Occlusion Handling
Problem: occlusions happen frequently and break tracks into parts. Intuition: occluded tracks (targets) have zero overlap with all segments, so their target columns are all zero and the corresponding columns of C require no calculation at all:

    C_t^v = C_{t-1}^v + X_t^T V_t   (visible track),
    C_t^o = C_{t-1}^o               (occluded track).

Experimental Setup
Comparison approaches:
[31] F. Li et al., Video segmentation by tracking many figure-ground segments.
[19] M. Grundmann et al., Efficient hierarchical graph-based video segmentation.
[15] F. Galasso et al., Spectral graph reduction for efficient image and streaming video segmentation.
Detailed performance and time-complexity metrics: Sv = averaged per video; So = averaged per ground-truth (GT) object; A=X means X annotators agree on this GT object; L=X means this GT object lasts for X GT frames.
Cross-video segmentation: 10 object classes, each with 4-13 videos.
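The least-squares bookkeeping can be made concrete with a minimal NumPy sketch (the class and method names below are illustrative, not from the authors' released code): the sufficient statistics H = Σ_t X_t^T X_t and C = Σ_t X_t^T V_t are accumulated online, and the model solves (H + λI)W = C only when predictions are needed.

```python
import numpy as np

class LSO:
    """Least Square Object: sufficient statistics (H, C) of a ridge regressor.

    Illustrative sketch only. H accumulates X^T X over frames, C accumulates
    X^T V; the model W solves (H + lam*I) W = C and predicts segment-to-track
    overlaps as X W.
    """

    def __init__(self, dim, n_tracks, lam=1.0):
        self.H = np.zeros((dim, dim))
        self.C = np.zeros((dim, n_tracks))
        self.lam = lam

    def update(self, X_t, V_t):
        # Online update with frame t's segment features and overlap targets.
        self.H += X_t.T @ X_t
        self.C += X_t.T @ V_t

    def solve(self):
        # W = (H + lam*I)^{-1} C, solved as a linear system for stability.
        d = self.H.shape[0]
        return np.linalg.solve(self.H + self.lam * np.eye(d), self.C)

    def predict(self, X):
        # Predicted spatial overlap of each segment (row of X) with each track.
        return X @ self.solve()
```

Because only (H, C) are stored, two LSOs can be merged by adding their statistics, which is what makes the merge move cheap.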
Least Square Object (LSO)
Let x_i be the i-th row of X (a segment's features, the training example) and v_i the i-th row of V (its spatial overlaps, the target). The least-squares solution satisfies

    (H + λI) W = C,   where   H = X^T X = Σ_i x_i^T x_i,   C = X^T V = Σ_i x_i^T v_i.

The pair L = (H, C) is a sufficient statistic of the trained appearance model; we call it a Least Square Object (LSO). It is updated online frame by frame:

    H_t = H_{t-1} + X_t^T X_t,   C_t = C_{t-1} + X_t^T V_t.

Each column w_j of the trained model W predicts the overlap x_i w_j between segment i (feature) and track j (target). Spurious tracks are eliminated to prevent them from contaminating the LSOs.

Occlusion Detection and Re-discovery
Write C_y^x for track x's column of the C matrix at frame y, and suppose track 2 is occluded from frame 14 and re-discovered at frame 26:

Frames 2-13 (both tracks normal): (C_{t-1}^1, C_{t-1}^2) + X_t^T V_t
Frame 14 (occlusion detected using frames 3-13): C_13^1 + X_14^T V_14,  C_13^2 + 0
Frames 15-25 (track 2 occluded): C_{t-1}^1 + X_t^T V_t,  C_13^2 + 0
Frame 26 (re-discovery): C_25^1 + X_26^T V_26,  C_13^2 + X_26^T V_26
Frame 27 onward: (C_{t-1}^1, C_{t-1}^2) + X_t^T V_t

The occluded column's update is exactly zero, so no calculation is needed while a track is occluded, and the track resumes seamlessly upon re-discovery.

Experiments on the VSB-100 Dataset
VSB-100 contains 100 videos with pixel-level annotation in every 20th frame, from 4 annotators. Overall, our result is the best among the competitors. Selected top-3 retrievals per training video and precision-recall curves are shown in the poster figures (figure panels omitted here).

Conclusion
1. Performance is already close to that of image segmentation.
2. The method can handle long-term, multiple, complete occlusions.
Code will be available.
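The frame-by-frame bookkeeping above can be sketched as a single update function (a minimal illustration under assumed names; in practice the `occluded` flags come from the occlusion detector run over a window of frames):

```python
import numpy as np

def update_C(C, X_t, V_t, occluded):
    """Per-frame update of the C matrix with frozen occluded columns.

    C        : (dim, n_tracks) accumulated X^T V statistic, updated in place
    X_t      : (n_segments, dim) segment features at frame t
    V_t      : (n_segments, n_tracks) segment-track overlaps at frame t
    occluded : boolean mask of length n_tracks; an occluded track has zero
               overlap with every segment, so its column of X_t^T V_t is
               zero and can simply be skipped ("painless").
    """
    visible = ~np.asarray(occluded)
    # Update only the visible columns; occluded columns stay unchanged,
    # ready to resume when the track is re-discovered.
    C[:, visible] += X_t.T @ V_t[:, visible]
    return C
```

Skipping the occluded columns changes nothing numerically (their targets are all zero) but avoids the wasted computation.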
3. Deep learning (LSTM) for cross-video segmentation is the future direction.

http://www.cc.gatech.edu/~fli/SegTrack2/Occlusion/index.html