Summarization of Egocentric Video: Object-Driven vs. Story-Driven
Presented by: Elad Osherov, Jan 2013
Today's talk
- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development
What is Egocentric Video Anyway?
http://xkcd.com/1235/
Motivation
Goal: generate a visual summary of an unedited egocentric video.
- Input: egocentric video of the camera wearer's day
- Output: storyboard (or skim video) summary
Potential Applications of Egocentric Video Summarization
- Memory aid
- Law enforcement
- Mobile robot discovery
Egocentric Video Properties
- Long, unedited video
- Constant head motion – motion blur
- Moving camera – unstable background
- Frequent changes in people and objects
- Hand occlusion
Today's talk
- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development
Related Work
- Object recognition in egocentric video
  [X. Ren and M. Philipose. Egocentric Recognition of Handled Objects: Benchmark and Analysis. CVPR 2009]
- Detection and recognition of first-person actions
  [H. Pirsiavash and D. Ramanan. Detecting Activities of Daily Living in First-Person Camera Views. CVPR 2012]
- Data summarization – today!
  [A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006]
Related Work
[A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006]
http://www.vision.huji.ac.il/video-synopsis/
A Few Words About the Authors
- Prof. Kristen Grauman – University of Texas at Austin (Department of CS)
- Prof. Zheng Lu – City University of Hong Kong (Department of CS)
- Dr. Yong Jae Lee – UC Berkeley (Departments of EE & CS)
- Prof. Joydeep Ghosh – University of Texas at Austin, Director of IDEAL (Intelligent Data Exploration and Analysis Lab)

Papers covered today:
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
[Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013]
Today's talk
- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development
Object-Driven Video Summarization
Goal: create a storyboard summary of a person's day, driven by the important people and objects. "Important" things are those the wearer interacts with significantly.
Several problems arise:
- "Important" is a subjective notion!
- What does "significant interaction" really mean?
- No priors on the people and objects that will appear
Algorithm Overview
Train a category-independent important person/object detector:
1. Train a regression model to predict region importance (train)
2. Segment the video into temporal events (test)
3. Group regions of the same object (test)
4. Generate a storyboard (test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Annotating Important Regions in Training Video
Data collection:
- 10 videos, each 3-5 hours long (37 hours in total), captured with a wearable Looxcie camera (www.looxcie.com)
- 4 subjects
- Crowd-sourced annotations using Amazon Mechanical Turk (www.mturk.com/mturk/)

An object's degree of importance depends heavily on what the camera wearer is doing before, while, and after the object or person appears. The object must be seen in the context of the camera wearer's activity to properly gauge its importance.
Annotating Important Regions in Training Video
Example annotations:
- Man wearing a blue shirt in a café
- Yellow notepad on a table
- Coffee mug that the camera wearer drinks from
- Smartphone the camera wearer holds

For about 3-5 hours of video, this yields roughly 700 object segmentations.
Training a Regression Model
A general-purpose, category-independent model predicts important regions in any egocentric video:
1. Segment each frame into regions.
2. For each region, compute a set of candidate features that could describe its importance (egocentric, object, and region features).
3. Train a regressor to predict region importance.
Egocentric Features
Interaction feature – distance to hand:
- Euclidean distance of the region's centroid to the closest detected hand
- Regions are classified as hand using color likelihoods and a naïve Bayes classifier trained on ground-truth hand annotations
Egocentric Features
Gaze feature – distance to frame center:
- A coarse estimate of how likely the region is to be the focus of attention
- Euclidean distance of the region's centroid to the frame center
(A small sketch of both distance features follows below.)
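A minimal sketch of the two distance-based egocentric features, assuming region centroids and per-frame hand detections are already available; the hand detector itself (color likelihoods plus a naïve Bayes classifier) is omitted, and the helper names are hypothetical.

```python
import numpy as np

def interaction_feature(region_centroid, hand_centroids):
    """Euclidean distance from the region's centroid to the closest detected hand."""
    if not hand_centroids:
        return np.inf  # no hand detected in this frame
    return min(np.linalg.norm(np.asarray(region_centroid) - np.asarray(h))
               for h in hand_centroids)

def gaze_feature(region_centroid, frame_shape):
    """Euclidean distance from the region's centroid to the frame center,
    a coarse proxy for how likely the region is being focused upon."""
    h, w = frame_shape[:2]
    center = np.array([w / 2.0, h / 2.0])
    return np.linalg.norm(np.asarray(region_centroid) - center)
```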
Egocentric Features
Frequency feature:
- Region matching: color dissimilarity between the region and each region in surrounding frames
- Point matching: match SIFT features between the region and each surrounding frame (see the sketch below)
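One rough way the SIFT-based frequency cue could look. This sketch assumes OpenCV with SIFT support (opencv-python >= 4.4), 8-bit grayscale patches, and a simple "fraction of surrounding frames with a distinctive match" score; the paper's exact matching and normalization are not reproduced here.

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def sift_frequency(region_patch, surrounding_frames, ratio=0.75):
    """Fraction of surrounding frames that match the region's SIFT features."""
    if not surrounding_frames:
        return 0.0
    _, des_r = sift.detectAndCompute(region_patch, None)
    if des_r is None:
        return 0.0
    hits = 0
    for frame in surrounding_frames:
        _, des_f = sift.detectAndCompute(frame, None)
        if des_f is None or len(des_f) < 2:
            continue
        # Lowe's ratio test keeps only distinctive matches
        matches = matcher.knnMatch(des_r, des_f, k=2)
        good = [m for m, n in matches if m.distance < ratio * n.distance]
        hits += len(good) > 0
    return hits / len(surrounding_frames)
```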
Object Features
Object-like appearance:
- A region-ranking function scores each region according to Gestalt cues
[J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. CVPR 2010]
Object Features
Object-like motion:
- Rank each region according to how its motion pattern differs from that of nearby regions
- High scores for regions that "stand out" from their surroundings during motion
[Key-Segments for Video Object Segmentation. Yong Jae Lee, Jaechul Kim, and Kristen Grauman. ICCV 2011]
Object Features
Likelihood of a person's face:
- Compute the maximum overlap score between the region r and any detected face q in the frame
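The slide does not spell out the overlap measure, so the sketch below assumes intersection-over-union between the region's bounding box and each face detection; the face_feature helper (a hypothetical name) returns the maximum over all faces in the frame.

```python
def overlap(r, q):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(r[2], q[2]) - max(r[0], q[0]))
    iy = max(0, min(r[3], q[3]) - max(r[1], q[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(r) + area(q) - inter
    return inter / union if union > 0 else 0.0

def face_feature(region_box, face_boxes):
    """Maximum overlap of the region with any detected face in the frame."""
    return max((overlap(region_box, q) for q in face_boxes), default=0.0)
```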
Train a Regressor to Predict Region Importance
- Region features: size, centroid, bounding-box centroid, bounding-box width, bounding-box height
- Solve using least squares (a minimal sketch follows below)
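A minimal sketch of the least-squares regressor, assuming a feature matrix X with one row of egocentric/object/region features per annotated region and a vector y of ground-truth importance scores from the MTurk annotations; any regularization or feature scaling the authors may use is omitted.

```python
import numpy as np

def fit_importance_regressor(X, y):
    """Least-squares linear regressor with a bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_importance(w, X):
    """Predicted importance score I(r) for each region (row of X)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w
```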
Algorithm Overview
Train a category-independent important person/object detector:
1. Train a regression model to predict region importance (train)
2. Segment the video into temporal events (test)
3. Group regions of the same object (test)
4. Generate a storyboard (test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Segmenting the Video into Temporal Events
- Compute a pairwise distance matrix over frames
- Group frames until the smallest maximum inter-frame distance is larger than two standard deviations beyond the mean (sketched below)
- Events allow the summary to include multiple instances of a person or object that is central in multiple contexts in the video
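This grouping rule reads like complete-linkage agglomerative clustering with a data-driven cutoff, so here is a sketch under that assumption; the frame descriptors and distance metric are placeholders, and the temporal-contiguity constraint that events require is omitted for brevity.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def segment_events(frame_features):
    """Return an event label per frame (frame_features: n_frames x n_dims)."""
    d = pdist(frame_features)           # condensed pairwise distance matrix
    cutoff = d.mean() + 2 * d.std()     # two STDs beyond the mean
    Z = linkage(d, method='complete')   # merge cost = max inter-frame distance
    # stop merging once the smallest maximum inter-frame distance exceeds cutoff
    return fcluster(Z, t=cutoff, criterion='distance')
```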
Algorithm Overview
Train a category-independent important person/object detector:
1. Train a regression model to predict region importance (train)
2. Segment the video into temporal events (test)
3. Group regions of the same object (test)
4. Generate a storyboard (test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Discovering an Event's Key People and Objects
1. Score each frame region I(r) using the trained regressor
2. Group instances of the same object/person together
3. Keep a pool of high-scoring clusters
4. Remove clusters with high affinity to a cluster of higher importance I(r)
5. For each remaining cluster, select the region with the highest importance as its representative
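A loose sketch of steps 3-5 above, with hypothetical thresholds (min_score, max_aff) and an abstract affinity function; the paper's actual grouping and pruning procedure is richer than this.

```python
import numpy as np

def select_representatives(scores, labels, affinity, min_score=0.5, max_aff=0.7):
    """scores: importance I(r) per region; labels: cluster id per region;
    affinity(c1, c2): similarity between two clusters (hypothetical)."""
    ids = sorted(set(labels))
    imp = {c: float(np.mean([s for s, l in zip(scores, labels) if l == c]))
           for c in ids}
    # step 3: pool of high-scoring clusters
    pool = [c for c in ids if imp[c] >= min_score]
    # step 4: drop clusters with high affinity to a more important cluster
    kept = [c for c in pool
            if not any(imp[o] > imp[c] and affinity(c, o) > max_aff for o in pool)]
    # step 5: highest-scoring region as each kept cluster's representative
    return {c: max((i for i, l in enumerate(labels) if l == c),
                   key=lambda i: scores[i])
            for c in kept}
```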
Generating a Storyboard Summary
Each event can contribute a different number of frames, depending on how many unique important things the method discovers.
Results
Important region prediction accuracy
Results
Which cues matter most for predicting importance?
- Top 28 features with the highest learned weights (see figure)
- Low weights are learned for the interaction-and-frequency pair, and for object-like regions that are merely frequent
Results
Egocentric video summarization accuracy
Results
User studies to evaluate summaries:
- The camera wearer answers two quality questions: were the important objects/people captured, and what is the overall summary quality
- The method gives better results in ~69% of the summaries
Today's talk
- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development
Story-Driven Video Summarization
A good summary captures the progress of the story!
- Segment the video temporally into subshots
- Select a chain of K subshots that maximizes both the weakest link's influence and object importance
- Each subshot "leads to" the next through some subset of influential objects
[Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013]
Document-Document Influence
[D. Shahaf and C. Guestrin. Connecting the Dots Between News Articles. KDD 2010]
Egocentric Subshot Detection
Define 3 generic ego-activities:
- Static
- In transit
- Head moving
Train classifiers to predict these activity types:
- Features based on blur and optical flow
- Classify using an SVM classifier (a minimal sketch follows below)
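A minimal sketch of the ego-activity classifier, assuming precomputed per-subshot blur and optical-flow statistics as features; the feature choices and SVM settings here are assumptions, not the paper's exact configuration.

```python
from sklearn.svm import SVC

LABELS = ['static', 'in_transit', 'head_moving']

def train_ego_activity_classifier(X, y):
    """X: (n_subshots, n_features) blur/optical-flow features;
    y: one of LABELS per subshot."""
    clf = SVC(kernel='rbf', C=1.0)
    clf.fit(X, y)
    return clf

# Example (assumed) feature vector per subshot: [mean blur, mean flow
# magnitude, flow-direction entropy, fraction of frames with dominant flow]
```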
Temporal Subshot Segmentation
- Tailored to egocentric video – detects ego-activities
- Provides an over-segmentation; a typical subshot lasts ~15 seconds
Subshot Selection Objective
Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots (the weakest-link part of this objective is sketched below).
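To make "optimal K-node chain" concrete, here is a dynamic-programming sketch for just the weakest-link criterion: given a matrix of pairwise influence scores, it finds the temporally ordered K-node chain whose minimum link is largest. The paper's full objective also weighs importance and diversity, which this sketch omits.

```python
import numpy as np

def best_chain(infl, K):
    """infl[i, j]: influence of subshot i on later subshot j (i < j).
    Returns the ordered K-node chain maximizing the weakest link."""
    n = infl.shape[0]
    # best[k, j]: weakest link of the best (k+1)-node chain ending at j
    best = np.full((K, n), -np.inf)
    back = np.zeros((K, n), dtype=int)
    best[0, :] = np.inf  # a single node has no links yet
    for k in range(1, K):
        for j in range(k, n):          # need k earlier nodes before j
            for i in range(k - 1, j):
                cand = min(best[k - 1, i], infl[i, j])
                if cand > best[k, j]:
                    best[k, j], back[k, j] = cand, i
    # backtrack from the best chain end
    j = int(np.argmax(best[K - 1]))
    chain = [j]
    for k in range(K - 1, 0, -1):
        j = int(back[k, j])
        chain.append(j)
    return chain[::-1]
```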
Story Progress Between Subshots
A good story is a coherent chain of subshots, where each subshot strongly influences the next.
Predicting Influence Between Subshots
[Figure: example influence weights (0.2, 0.1, 0.01, 0.003, ...) on links between subshots]
Predicting Influence Between Subshots
- Subshot j is treated as a sink node
- The score captures how reachable subshot j is from subshot i, via object o
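The slide only gestures at the computation, so the following is a loose illustration in the spirit of Shahaf & Guestrin's random-walk influence, not the paper's exact formulation: subshots and objects form a bipartite graph, subshot j is an absorbing sink, and the contribution of object o is estimated by how much dropping o lowers the arrival probability. The graph encoding, walk counts, and leave-one-out estimate are all assumptions for illustration.

```python
import random

def arrival_prob(edges, i, j, drop=None, walks=2000, max_steps=50):
    """edges: dict node -> neighbor list over a bipartite graph of
    subshot and object nodes. Estimates the probability that a random
    walk from i is absorbed at sink j, optionally with node `drop` removed."""
    hits = 0
    for _ in range(walks):
        node = i
        for _ in range(max_steps):
            nbrs = [n for n in edges[node] if n != drop]
            if not nbrs:
                break
            node = random.choice(nbrs)
            if node == j:  # absorbing sink
                hits += 1
                break
    return hits / walks

def influence_via(edges, i, j, o):
    """How much object node o contributes to reaching j from i."""
    return arrival_prob(edges, i, j) - arrival_prob(edges, i, j, drop=o)
```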
Subshot Selection Objective
Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots.
Predicting Diversity Among Transitions
- Compute GIST and color histograms for each frame in each subshot, and quantize them into 55 scene types
- Compute, over each pair of adjacent subshots in the chain:

D(s) = Σ_{j=1}^{K-1} (1 - exp(-χ²(s_j, s_{j+1})))
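A minimal sketch of this diversity term, assuming each subshot is summarized by a normalized histogram over the 55 quantized scene types; the χ² distance below is the standard symmetric form.

```python
import numpy as np

def chi2(p, q, eps=1e-10):
    """Symmetric chi-squared distance between two histograms."""
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

def diversity(chain_histograms):
    """Sum of (1 - exp(-chi2)) over adjacent subshots in the chain."""
    return sum(1.0 - np.exp(-chi2(a, b))
               for a, b in zip(chain_histograms[:-1], chain_histograms[1:]))
```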
Coherent Object Activation Patterns
- Story-driven selection vs. uniform sampling (see figure)
- Prefer activating few objects at once, with coherent (smooth) entrance/exit patterns
- Solve with linear programming and a priority queue
Today's talk
- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development
Results
Datasets:
- UTE: 4 videos, each 3-5 hours long, uncontrolled setting
- ADL: 20 videos, each 20-60 minutes, daily activities in a house
Results
Baselines:
1. Uniform sampling of K subshots
2. Shortest path – the K subshots with minimal bag-of-objects distance between each other
3. Object-driven (UTE set only)
Parameters:
- K = 4...8
- λs = 1, λi = λd = 0.5
- Simultaneously active objects: 80 (UTE), 15 (ADL)
Results
Test methodology:
- 34 human subjects, ages 18-60
- 12 hours of original video
- Each comparison done by 5 subjects
- 535 tasks in total, 45 hours of subject time
- Probably the most comprehensive egocentric summarization evaluation conducted to date!
Results
Blind taste test:
- Show a sped-up version of the original video
- Show the story-driven summary and one of the baselines
- Ask: which summary better shows the progress of the story? Judges were told to consider the relationships among sub-events, redundancy, and the representativeness of each sub-event
- In 51% of the comparisons, all 5 subjects voted story-driven
- In only 9% did story-driven win by just one vote
Results
Discovering influential objects:
- Ground truth: N = 42 objects annotated by 3 workers on MTurk
- Baseline: frequency of objects in the video
- The results show the method's advantage: the most influential objects need not be the most frequent!
Results
Where does the method fail?
- Where the story is uneventful
- Where there are multiple interwoven threads
Further Development
- Better use of machine-learning techniques instead of simple pairwise regression
- Extend subshot descriptions to detect actions
- Augment the summary with a location service such as GPS
- Improve the success ratio
Automatic storyboard maps
Pros and Cons
Pros:
- Well written
- Well referenced
- Novel solution
- Large and detailed human experiment
- Detailed website
Cons:
- Very complicated material
- No source code publicly available
- No real competition
- Computationally demanding