Summarization of Egocentric Video: Object-Driven vs. Story-Driven
Presented by: Elad Osherov, Jan 2013

Today's Talk
- Motivation
- Related work
- Object-driven summarization
- Story-driven summarization
- Results
- Future development

What Is Egocentric Video Anyway?
- Video recorded from a wearable, first-person camera
- http://xkcd.com/1235/

Motivation
- Goal: generate a visual summary of an unedited egocentric video
- Input: egocentric video of the camera wearer's day
- Output: storyboard (or skim video) summary

Potential Applications of Egocentric Video Summarization
- Memory aid
- Law enforcement
- Mobile robot discovery

Egocentric Video Properties
- Long, unedited video
- Constant head motion causes blur
- A moving camera means an unstable background
- Frequent changes in people and objects
- Hand occlusion

Related Work
- Object recognition in egocentric video [X. Ren and M. Philipose. Egocentric Recognition of Handled Objects: Benchmark and Analysis. CVPR 2009]
- Detection and recognition of first-person actions [H. Pirsiavash and D. Ramanan. Detecting Activities of Daily Living in First-Person Camera Views. CVPR 2012]
- Data summarization, the topic of today's talk [A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006] (http://www.vision.huji.ac.il/video-synopsis/)

A Few Words About the Authors
- Prof. Kristen Grauman, University of Texas at Austin (Department of CS)
- Prof. Zheng Lu, City University of Hong Kong (Department of CS)
- Dr. Yong Jae Lee, UC Berkeley (Departments of EE & CS)
- Prof. Joydeep Ghosh, University of Texas at Austin; director of IDEAL (Intelligent Data Exploration and Analysis Lab)

The two papers covered today:
- Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012
- Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013

Object-Driven Video Summarization
- Goal: create a storyboard summary of a person's day, driven by the important people and objects
- "Important" things are those the wearer interacts with significantly
- Several problems arise:
  - Importance is a subjective notion!
  - What does "significant interaction" really mean?
  - There are no priors on the people and objects involved

Algorithm Overview [Lee, Ghosh, and Grauman, CVPR 2012]
- Train: a category-independent important person/object detector, i.e., a regression model that predicts region importance
- Test: segment the video into temporal events
- Test: group regions belonging to the same object
- Test: generate a storyboard

Annotating Important Regions in Training Video
- Data collection: 10 videos, each 3-5 hours long, 37 hours in total, from 4 subjects (www.looxcie.com)
- Annotations are crowd-sourced on Mechanical Turk (www.mturk.com/mturk/)
- An object's degree of importance depends heavily on what the camera wearer is doing before, while, and after the object/person appears
- The object must be seen in the context of the camera wearer's activity to properly gauge its importance
- Example annotations: a man wearing a blue shirt in a café, a yellow notepad on a table, a coffee mug the camera wearer drinks from, a smartphone the camera wearer holds
- For about 3-5 hours of video they obtain roughly 700 object segmentations

Training a Regression Model
A general-purpose, category-independent model predicts important regions in any egocentric video:
1. Segment each frame into regions
2. For each region, compute a set of candidate features (egocentric, object, and region features) that could describe its importance
3. Train a regressor to predict region importance

Egocentric Features
- Interaction (distance to hand): Euclidean distance from the region's centroid to the closest detected hand; regions are classified as hand using color likelihoods and a naïve Bayes classifier trained on ground-truth hand annotations
- Gaze (distance to frame center): a coarse estimate of how likely the region is to be the focus of attention, measured as the Euclidean distance from the region's centroid to the frame center
- Frequency: region matching (color dissimilarity between the region and each region in surrounding frames) and point matching (SIFT features matched between the region and surrounding frames)

Object Features
- Object-like appearance: a region-ranking function scores each region according to Gestalt cues [J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. CVPR 2010]
- Object-like motion: rank each region by how much its motion pattern differs from that of nearby regions; high scores go to regions that "stand out" from their surroundings during motion [Y. J. Lee, J. Kim, and K. Grauman. Key-Segments for Video Object Segmentation. ICCV 2011]
- Face likelihood: the maximum overlap score between the region r and any detected face q in the frame

Region Features
- Size, centroid, bounding box centroid, bounding box width, and bounding box height

Training the Regressor
- With all features concatenated per region, solve for the importance predictor using least squares (a sketch follows below)
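A minimal sketch of this training step, assuming each annotated region has already been encoded as a feature vector concatenating the egocentric, object, and region cues above, with the crowd-sourced importance scores as regression targets. The small ridge term is my addition for numerical stability; the talk only says "least squares".

```python
import numpy as np

def train_importance_regressor(X, y, ridge=1e-3):
    """Least-squares fit of region importance.

    X : (n_regions, n_features) matrix of egocentric/object/region features
        from the annotated training videos
    y : (n_regions,) crowd-sourced importance scores
    ridge : small regularizer for stability (an assumption, not in the talk)
    """
    n_features = X.shape[1]
    # Closed-form solution of min_w ||Xw - y||^2 + ridge * ||w||^2
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_features), X.T @ y)

def predict_importance(w, x_region):
    """Predicted importance I(r) = w^T x(r) for a single test region."""
    return float(w @ x_region)
```

At test time every region in every frame receives a score I(r), which the later stages use both to pick key objects and to prune redundant clusters.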
Segmenting the Video into Temporal Events
- Compute a pairwise distance matrix over frames
- Group frames until the smallest maximum inter-frame distance grows larger than two standard deviations beyond the mean distance (see the sketch below)
- Events let the summary include multiple instances of a person or object that is central in several different contexts in the video
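A minimal sketch of this grouping, assuming each frame is described by a single feature vector (the talk only specifies a pairwise distance matrix, so the descriptor is illustrative). Complete linkage is the natural reading of the "smallest maximum inter-frame distance" criterion; note the sketch ignores temporal contiguity, which a real event segmentation would enforce.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def segment_events(frame_features):
    """Cluster frames into events, stopping once the cheapest merge
    (by maximum inter-frame distance) exceeds mean + 2*std of all
    pairwise frame distances.

    frame_features : (n_frames, d) array, one descriptor per frame
    Returns an event label per frame.
    """
    dists = pdist(frame_features)              # condensed pairwise distances
    cutoff = dists.mean() + 2.0 * dists.std()  # "two STDs beyond the mean"
    Z = linkage(dists, method='complete')      # merge by max inter-frame distance
    # Flat clusters whose internal max distance stays below the cutoff
    return fcluster(Z, t=cutoff, criterion='distance')
```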
Discovering an Event's Key People and Objects
- Score each frame region using the trained regressor
- Group instances of the same object or person together
- Keep a pool of high-scoring clusters, and remove any cluster with high affinity to a cluster of higher importance I(r)
- For each remaining cluster, select the region with the highest importance as its representative

Generating a Storyboard Summary
- Each event can display a different number of frames, depending on how many unique important things the method discovers

Results: Important Region Prediction Accuracy
(Figures: prediction accuracy of the learned importance model.)

Results: Which Cues Matter Most for Predicting Importance?
- Ranking the top 28 features with the highest learned weights
- The interaction and frequency pair scores low: an object-like region that appears frequently is not necessarily important

Results: Egocentric Video Summarization Accuracy
(Figure: summarization accuracy against baselines.)

Results: User Studies to Evaluate Summaries
- The camera wearer answers two quality questions: were the important objects/people captured, and what is the overall summary quality?
- The method gives better results in ~69% of the summaries

Story-Driven Video Summarization [Lu and Grauman, CVPR 2013]
- A good summary captures the progress of the story!
- Segment the video temporally into subshots
- Select the chain of K subshots that maximizes both the weakest link's influence and object importance
- Each subshot "leads to" the next through some subset of influential objects

Document-to-Document Influence
- The influence model is inspired by work on connecting news articles [D. Shahaf and C. Guestrin. Connecting the Dots Between News Articles. KDD 2010]

Egocentric Subshot Detection
- Define 3 generic ego-activities: static, in transit, and head moving
- Train classifiers to predict these activity types: features based on blur and optical flow, classified with an SVM

Temporal Subshot Segmentation
- Tailored to egocentric video: it detects ego-activities
- Provides an over-segmentation; a typical subshot lasts ~15 seconds

Subshot Selection Objective
- Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots

Story Progress Between Subshots
- A good story is a coherent chain of subshots, where each one strongly influences the next

Predicting Influence Between Subshots
(Figure: a graph over subshots with example influence weights, e.g., 0.2, 0.1, 0.01, 0.003.)
- A sink node captures how reachable subshot j is from subshot i via object o

Predicting Diversity Among Transitions
- Compute GIST and color histograms for each frame in each subshot, and quantize them into 55 scene types
- For each two adjacent subshots in the chain, compute:

  D(s) = 1/(K-1) * Σ_{j=1..K-1} (1 - exp(-χ²(s_j, s_{j+1})))

  where χ² compares the scene-type histograms of adjacent subshots s_j and s_{j+1} (a sketch follows below)
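A minimal sketch of this diversity term, assuming each subshot is represented by a normalized histogram over the 55 scene types and that χ² is the standard chi-square histogram distance; the exact distance and scaling in the paper may differ.

```python
import numpy as np

def chi2_dist(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def diversity(subshot_histograms):
    """D(s) over a selected K-node chain of subshots.

    subshot_histograms : list of K scene-type histograms, one per subshot
    Returns a value near 1 when adjacent subshots look different
    (diverse transitions) and near 0 when they look alike.
    """
    K = len(subshot_histograms)
    terms = [1.0 - np.exp(-chi2_dist(subshot_histograms[j],
                                     subshot_histograms[j + 1]))
             for j in range(K - 1)]
    return float(np.mean(terms))
```

The saturating 1 - exp(-x) form keeps any single dramatic scene change from dominating the objective.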
Coherent Object Activation Patterns
(Figure: object activation patterns, story-driven selection vs. uniform sampling.)
- Prefer activating few objects at once, with coherent (smooth) entrance and exit patterns
- Solve with linear programming and a priority queue

Results: Datasets
- UTE: 4 videos, each 3-5 hours long, uncontrolled setting
- ADL: 20 videos, each 20-60 minutes, daily activities in a house

Results: Baselines
1. Uniform sampling of K subshots
2. Shortest path: the K subshots with minimal bag-of-objects distance between each other
3. Object-driven summarization (UTE set only)

Parameters: K = 4...8; weights λs = 1, λi = λd = 0.5; maximum simultaneously active objects: 80 for UTE, 15 for ADL

Results: Test Methodology
- 34 human subjects, ages 18-60
- 12 hours of original video; each comparison judged by 5 subjects
- 535 tasks in total, 45 hours of subject time
- Probably the most comprehensive egocentric summarization test ever established!

Results: Blind Taste Test
- Subjects watch a sped-up version of the original video, then the story-driven summary and one of the baselines
- The question: which summary better shows the progress of the story? Pay attention to the relationships among sub-events, redundancy, and the representativeness of each sub-event
- In 51% of the comparisons all 5 subjects voted for story-driven; in only 9% did story-driven win by a single vote

Results: Discovering Influential Objects
- 3 workers on Mechanical Turk; N = 42 objects as ground truth
- Baseline: frequency of objects in the video
- The results show the method's advantage: the most influential objects need not be the most frequent!

Results: Where Does the Method Fail?
- Where the story is uneventful
- Where there are multiple interwoven threads

Further Development
- Better use of machine learning techniques instead of simple pairwise regression
- Extend subshot descriptions to detect actions
- Augment the summary with a location service such as GPS, enabling automatic storyboard maps
- Improve the success ratio

Pros and Cons
Pros:
- Well written and well referenced
- A novel solution
- A large and detailed human experiment
- A detailed website

Cons:
- Very complicated material
- No source code publicly available
- No real competition
- Computationally demanding