Video Segment Proposals With Painless Occlusion Handling
Zhengyang Wu¹, Fuxin Li¹, Rahul Sukthankar²,
Learned Appearance Model
Video Segment Proposals
Definition of objects in video? The definition is ambiguous.
Even human annotators disagree among themselves.
Target 1
Track a pool of image segment proposals to generate video segment proposals
$L_1 = (H_1, C_1)$ and $L_2 = (H_2, C_2)$
$L_1^* = (H_1, C_1, C_2 + X_t^T V_t)$
Target 2
$x_2 w_2 \approx 1$
$x_1 w_2 \approx 0$
$x_3 w_2 \approx 0.55$
Problem:
Maintaining LSOs is very time-consuming
Too many LSOs to maintain if we don't merge them
$w_2$
Problem: Occlusions happen frequently and break tracks into parts
A well-trained model (track) can generalize across videos!
Backtrack to merge LSOs every K frames.
E.g., in frame 20, merge $L_{10}$ into $L_9^*$, then $L_9^*$ into $L_8^*$, and so on, until $L_2^*$ is merged into $L_1^*$.
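The backtracking merge can be sketched in NumPy. This is a minimal illustration, not the authors' implementation. It assumes that two adjacent LSOs differ by exactly one frame, and that the features `X_d` and overlap targets `V_d` of that frame are available for the younger LSO's tracks. The names `merge_lso`, `backtrack_merge` and `bridges` are hypothetical.

```python
import numpy as np


def merge_lso(older, younger, X_d, V_d):
    """Fold the tracks of a younger LSO into an older one.

    older, younger: (H, C) pairs; the older LSO has seen exactly one
    frame that the younger one has not.
    X_d: features of that extra frame, shape (n_segments, n_features).
    V_d: overlaps of that frame's segments with the younger LSO's tracks,
         shape (n_segments, n_younger_tracks). ASSUMPTION: these targets
         are available when the merge happens.
    """
    H_old, C_old = older
    _, C_young = younger
    # Add the missing frame's contribution, then append as new columns of C.
    return H_old, np.hstack([C_old, C_young + X_d.T @ V_d])


def backtrack_merge(lsos, bridges):
    """Chain merges from the youngest LSO back to the oldest.

    lsos: list of (H, C), oldest first, adjacent entries one frame apart.
    bridges[k]: (X_d, V_d) used when merging everything younger than
    lsos[k] into lsos[k].
    """
    merged = lsos[-1]
    for k in range(len(lsos) - 2, -1, -1):
        X_d, V_d = bridges[k]
        merged = merge_lso(lsos[k], merged, X_d, V_d)
    return merged
```

After the chain finishes, a single `H` is shared by all tracks and only `C` has grown by extra columns, so one linear solve serves every merged track.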
Test On
Goal: Discover holistic video segment proposals that
$x_2 w \approx 0.7$
Track Features/Image Segments
Top 1
Top 2
$W$
$x_5 w \approx 0.3$
Top 3
Top 4
Detailed Performance
Time Complexity
Sv: averaged per video. So: averaged per GT object.
A=X: X annotators agree on this GT object
L=X: this GT object lasts for X GT frames
Comparison Approaches:
[31]: F. Li et al. Video segmentation by tracking many figure-ground segments
[19]: M. Grundmann et al. Efficient hierarchical graph based video segmentation
[15]: F. Galasso et al. Spectral graph reduction for efficient image and streaming video segmentation
Why not merge LSOs in every frame?
Occluded tracks (targets) have zero overlap with all segments
Require K frames to determine a track is reliable
Top 5
$x_4 w \approx 0.4$
Occlusion Handling Intuition:
Least Square Tracker
$\min_W \|XW - V\|_F^2 + \lambda \|W\|_F^2$
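The least-squares tracker objective $\min_W \|XW - V\|_F^2 + \lambda \|W\|_F^2$ leads to the linear system $(H + \lambda I) W = C$ once its gradient is set to zero. A short derivation, using only the poster's own symbols:

```latex
\begin{aligned}
J(W) &= \|XW - V\|_F^2 + \lambda \|W\|_F^2 \\
\nabla_W J &= 2 X^T (XW - V) + 2 \lambda W = 0 \\
\Rightarrow\quad (X^T X + \lambda I)\, W &= X^T V \\
\text{with } H = X^T X,\ C = X^T V:\quad (H + \lambda I)\, W &= C
\end{aligned}
```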
V Mat
0
$C_t^v = C_{t-1}^v + X_t^T V_t$
$C_t^0 = C_{t-1}^0$
Cross Video Segmentation
10 object classes, each with 4-13 videos. Below are selected top-3 retrievals and the PR curve
Occlusion Detection and Re-discovery
Least Square Formulation:
$x_3 w \approx 0.5$
Overall Performance
Eliminate spurious tracks to prevent them from contaminating the LSOs
Prior Work
Merge Superpixels/Supervoxels
$x_1 w \approx 0.85$
Best Among Competitors
Solution:
W: trained model
1. captures all definitions of objects in video automatically (unsupervised)
2. handles long-term complete occlusion efficiently
3. simultaneously learns an appearance model of each tracked segment
Our Result
Can be merged to
$w_1$
A pool of segment proposals is required for Video
100 videos with pixel-level annotation every 20 frames, by 4 annotators
Two adjacent LSOs differ by only one frame
$x_3 w_1 \approx 1$
$x_2 w_1 \approx 0.55$
$x_1 w_1 \approx 0$
Experiments on VSB-100 Dataset
Merge Move and Backtracking
$x_i w_j$ predicts the overlap of segment i (feature) and track j (target)
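A small NumPy sketch shows how predicted overlaps $x_i w_j$ could be used to rank segment proposals for one track. The function name `rank_segments` and the toy matrices are illustrative assumptions, not values from the poster.

```python
import numpy as np


def rank_segments(X, W, track, top_k=3):
    """Rank segments by predicted overlap with one track.

    X: segment features, one row x_i per segment.
    W: trained model, one column w_j per track.
    Returns the indices of the top_k segments and their scores x_i w_j.
    """
    scores = X @ W[:, track]
    order = np.argsort(-scores)[:top_k]  # highest predicted overlap first
    return order, scores[order]


# Toy data: 3 segments, 2 features, 2 tracks (hypothetical numbers).
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
W = np.array([[0.0, 1.0], [1.0, 0.0]])
order, scores = rank_segments(X, W, track=0, top_k=2)
```

Here the third segment scores 1.0 and the second 0.5 for track 0, so they are returned in that order.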
James M. Rehg¹
IEEE 2015 Conference on
Computer Vision and
Pattern Recognition
Matrix Legend
C Mat
No Calculation!
Normal Columns
Occluded Columns
Track 1
Track 2
Image Segment Proposal
Solution:
$(H + \lambda I) W = C$
$H = X^T X = \sum_{i=1}^{n} x_i^T x_i$
$C = X^T V = \sum_{i=1}^{n} x_i^T v_i$
$X$: Features (Training Example)
$V$: Spatial Overlap Matrix (Target)
$W$: Trained Appearance Model
$x_i$: ith row of $X$
$v_i$: ith row of $V$
Least Square Object (LSO)
$L = (H, C)$, sufficient statistic
Online Update
$H_t = H_{t-1} + X_t^T X_t$
$C_t = C_{t-1} + X_t^T V_t$
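The Least Square Object and its online update can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the class name `LeastSquareObject`, the random toy data, and the value of the regularizer are all hypothetical. It checks that accumulating $H$ and $C$ frame by frame gives the same $W$ as solving on all frames at once.

```python
import numpy as np


class LeastSquareObject:
    """Sufficient statistics L = (H, C) of a ridge-regression tracker."""

    def __init__(self, n_features, n_tracks):
        self.H = np.zeros((n_features, n_features))
        self.C = np.zeros((n_features, n_tracks))

    def update(self, X_t, V_t):
        # Online update: H_t = H_{t-1} + X_t^T X_t, C_t = C_{t-1} + X_t^T V_t
        self.H += X_t.T @ X_t
        self.C += X_t.T @ V_t

    def solve(self, lam):
        # Solve (H + lam * I) W = C for the appearance model W.
        eye = np.eye(self.H.shape[0])
        return np.linalg.solve(self.H + lam * eye, self.C)


# Toy data: 5 frames, 6 segments per frame, 4 features, 2 tracks.
rng = np.random.default_rng(0)
frames = [(rng.normal(size=(6, 4)), rng.uniform(size=(6, 2))) for _ in range(5)]

lso = LeastSquareObject(n_features=4, n_tracks=2)
for X_t, V_t in frames:
    lso.update(X_t, V_t)
W_online = lso.solve(lam=0.1)

# Batch solution on the stacked frames, for comparison.
X_all = np.vstack([X for X, _ in frames])
V_all = np.vstack([V for _, V in frames])
W_batch = np.linalg.solve(X_all.T @ X_all + 0.1 * np.eye(4), X_all.T @ V_all)
```

Because only `H` and `C` are stored, the memory cost does not grow with the number of frames seen.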
PR Curve
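C Matrix
$C_y^x$: Track $x$, Frame $y$
Frame 1: $[C_1^1,\ C_1^2] + X_2^T V_2$
Frame 2 - 13 (Normal): $[C_{t-1}^1,\ C_{t-1}^2] + X_t^T V_t$
Frame 14 (Occlusion Detection using frame 3-13): Track 1: $C_{13}^1 + X_{14}^T V_{14}$; Track 2: $C_{13}^2 + 0$
Frame 15-25 (Occluded): Track 1: $C_{t-1}^1 + X_t^T V_t$; Track 2: $C_{13}^2 + 0$
Frame 26 (Re-discovery): Track 1: $C_{25}^1 + X_{26}^T V_{26}$; Track 2: $C_{13}^2 + X_{26}^T V_{26}$
Frame 27 …: $[C_{t-1}^1,\ C_{t-1}^2] + X_t^T V_t$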
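The per-frame update of the $C$ matrix under occlusion can be sketched as follows. The sketch assumes a track counts as occluded in a frame when its column of $V_t$ is all zeros. The poster's detection uses a window of frames (frames 3-13 in the example), which this single-frame check does not reproduce. The function name `update_C` is hypothetical.

```python
import numpy as np


def update_C(C_prev, X_t, V_t):
    """One frame of the C update; columns of occluded tracks are skipped.

    A track whose column of V_t is all zeros overlaps no segment in this
    frame, so its column of C is carried over unchanged.
    """
    visible = V_t.any(axis=0)
    C_new = C_prev.copy()
    C_new[:, visible] += X_t.T @ V_t[:, visible]
    return C_new, ~visible


# Toy frame: track 1 is visible, track 2 has zero overlap everywhere.
C_prev = np.ones((2, 2))
X_t = np.array([[1.0, 2.0], [3.0, 4.0]])
V_t = np.array([[0.5, 0.0], [0.2, 0.0]])
C_new, occluded = update_C(C_prev, X_t, V_t)
```

Skipping the zero columns gives the same result as the full update $C_{t-1} + X_t^T V_t$, since a zero target column contributes nothing. The occluded track keeps its last statistics until it overlaps a segment again.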
Training Video
Top 1
Top 2
Top 3
Conclusion
1. Performance already close to image segmentation
2. Can handle long-term multiple complete occlusions
3. Deep learning (LSTM) for cross video segmentation is the future direction
Code will be available
http://www.cc.gatech.edu/~fli/SegTrack2/Occlusion/index.html