Video Mining Workshop 2002

Video Indexing and Summarization using Combinations of the MPEG-7 Motion Activity Descriptor with other MPEG-7 audio-visual descriptors
Ajay Divakaran
MERL - Mitsubishi Electric Research Labs
Murray Hill, NJ

Outline
• Introduction
  – MPEG-7 Standard
  – Motivation for proposed techniques
• Video Summarization using Motion Activity
• Audio Assisted Video Summarization
  – Principal Cast Detection with MPEG-7 Audio Features
  – Automatic generation of Sports Highlights
• Target Applications
  – Personal Video Recorder
• Demonstration
• Initial work on Video Mining
• Conclusion

Team
• Yours Truly
• Kadir A. Peker – Colleague and Ex-Doctoral Student
• Regunathan Radhakrishnan – Current Doctoral Student
• Romain Cabasson – Summer Intern
• Ziyou Xiong – Summer Intern and Current Collaborator
• Padma Akella – Initial Demo designer and developer
• Pradubkiat Bouklee – Initial Software developer

MPEG-7 Objectives
• To develop a standard to identify and describe multimedia content
• Formal name: Multimedia Content Description Interface
• Enable quick access to desired content, whether local or not

MPEG-7: Key Technologies and Scope
[Figure: description production (feature extraction); standard description (scope of MPEG-7); description consumption (search engine)]

MPEG-7 and other Standards
[Figure: chart of standards with axis labels Rate and Functionality, spanning "Emphasis on Subjective Representation" to "Emphasis on Semantic Conveyance". Labels: MPEG-2; Studio, DTV; MPEG-1; H.263; MPEG-4; SNHC; Object-Based; JPEG; JPEG-2000; MPEG-7; Descriptors; Hybrid Content; Interactive TV, Video Conferencing; Indexing; Retrieving; Browsing; Visualization; Abstract Representation; Virtual Reality]

MPEG-7 framework
• MPEG-7 standardizes:
  – Descriptors (Ds): representations of features
    • to describe various types of features of multimedia information
    • to define the syntax and the semantics of each feature representation
  – Description Schemes (DSs)
    • to specify pre-defined structures and semantics of descriptors and their relationships
  – Description Definition Language (DDL)
    • to allow the creation of new DSs and, possibly, Ds, and to allow the extension and modification of existing DSs
  – XML MPEG-7 Schema

MPEG-7 Motion Activity Descriptor
• Feature Extraction from Video
  – Uncompressed Domain
    • Color Histograms - Zhang et al
    • Motion Estimation - Kanade et al
  – Compressed Domain
    • DC Images - Yeo et al, Kobla et al
    • Motion Vector Based - Zhang et al
    • Bit Allocation - Feng et al, Divakaran et al

Motivation for Compressed Domain Extraction
• Compressed domain feature extraction is fast.
• Block-matched motion vectors are sufficient for gross description.
• Motion vector based calculation can be easily normalized w.r.t. encoding parameters.

Motivation for Descriptor
• Need to capture “pace” or Intensity of activity
• For example, draw distinction between
  – “High Action” segments such as chase scenes
  – “Low Action” segments such as talking heads
• Emphasize simple extraction and matching
• Use Gross Motion Characteristics, thus avoiding object segmentation, tracking etc.
• Compressed domain extraction is important

Proposed Motion Activity Descriptor
• Attributes of Motion Activity Descriptor
  – Intensity/Magnitude - 3 bits
  – Spatial Characteristics - 16 bits
  – Temporal Characteristics - 30 bits
  – Directional Characteristics - 3 bits

MPEG-7 Intensity of Motion Activity
• Expresses “pace” or Intensity of Action
• Uses scale of 1-5: very low - low - medium - high - very high
• Extracted by suitably quantizing variance of motion vector magnitude
• Motion Vectors extracted from compressed bitstream
• Successfully tested with subjectively constructed Ground Truth

Video Summarization using Motion Activity
• Video sequence V: {f1, f2, … fN}, a set of temporally ordered frames
• Any temporally ordered subset of V is a summary
• Previous work: Color dominant
  – Cluster frames based on image similarity
  – Select representative frames from clusters

Motion Activity as Summarizability
• Hypothesis:
  – Motion activity measures intensity of motion, hence it measures change in the video
  – Therefore it indicates Summarizability
• Test of the Hypothesis
  – Examine relationship between Fidelity of Summary and motion activity
  – Results show close correlation and motivate novel summarization strategy

Fidelity of a Summary
Let the set of key-frames be $S$ and the set of frames be $R$. Let the distance between a key-frame $S_k$ and a frame $R_i$ be $d(S_k, R_i)$.
Define $d_i$ for each frame $R_i$ as
\[ d_i = \min_{k} d(S_k, R_i), \quad k = 0, \dots, m \]
Then the Semi-Hausdorff distance between $S$ and $R$ is given by
\[ d_{sh}(S, R) = \max_{i} d_i, \quad i = 1, \dots, n \]

Test of Hypothesis
• Segment the test sequence into shots
• Use the first frame of each shot as its Key-Frame (KF)
• Compute the fidelity of each key-frame as described
• Compute the motion activity of each shot
• For each MPEG-7 motion activity threshold
  – Identify shots that have the same or lower motion activity
  – Find the percentage p of shots with unacceptable fidelity (>0.2)
• Plot p vs the MPEG-7 motion activity thresholds

Motion Activity as a Measure of Summarizability
[Figure: p vs MPEG-7 motion activity threshold]

Conclusions from Experiment
• The percentage of shots with unacceptable fidelity grows monotonically with motion activity
• In other words, as motion activity grows, the shots become increasingly difficult to summarize
• Hence, motion activity is a direct indicator of summarizability
• Question: Is the first frame the best choice as a key-frame?

Optimal Key-Frame Selection Using Motion Activity
• Summarizability is an indication of change in the shot
• The cumulative motion activity is therefore an indication of the cumulative change in the shot

Optimal Key-Frame Extraction Using Motion Activity
[Figure: Cumulative Motion Activity (0 to 1) vs Time (Frame Number); the Optimal Single Key-Frame is marked where the cumulative motion activity reaches 0.5]
• Simple generalization for N key-frames

Comparison with Opt. Fidelity KF
Mot. Activity   Δd_sh, First Frame   Δd_sh, proposed KF   Number of Shots
Very Low        0.0116               0.0080               25
Low             0.0197               0.0110               133
Medium          0.0406               0.0316               73
High            0.0950               0.0576               28
Very High
Overall avg.    0.0430               0.0216

Optimal Key-Frame Selection Based on Cumulative Motion Activity
Number of Key-Frames N=1:  A(S1)=0.5
Number of Key-Frames N=2:  A(S1)=0.25   A(S2)=0.75
Number of Key-Frames N=3:  A(S1)=0.167  A(S2)=0.5    A(S3)=0.83
Number of Key-Frames N=4:  A(S1)=0.125  A(S2)=0.375  A(S3)=0.625  A(S4)=0.875
A(Si) = Normalized Cumulative Motion Activity at the location of Frame Si

Audio Assisted Video Browsing: Motivation
• Baseline MHL visual summarization works well only when semantic segment boundaries are well defined
• Semantic segment boundaries cannot be located easily using visual features alone
• Audio is a rich source of content semantics
• Should use audio features to locate semantic segment boundaries

Past Work
• Principal Cast Identification using Audio – Wang et al
• Topic Detection using Speech Recognition – Hanjalic et al
• Semantic Scene Segmentation using Audio – Sundaram et al
• Past work has emphasized classification of audio into crisp categories
• We would like both a crisp categorization and a feature vector that allows softer classification
• Generalized Sound Recognition Framework – Casey et al
• Casey’s work provides a rich audio-semantic framework for our research

MPEG-7 Feature Extraction for Generalized Sound Recognition
[Figure: block diagram. Labels: Window; Audio Spectrum Envelope; Extraction: SVD / ICA; Stored Basis Functions; Power Envelope; Basis Projection]

Our approach to Principal Cast Detection
[Figure: block diagram. Labels: MPEG-7 Generalized Sound Recognition; State Duration Histograms; Our Enhancement; Principal Cast]

Proposed Audio-Assisted Video Browsing Framework
[Figure: block diagram. Labels: News or other Video; Audio; Video; Audio Feature Extraction, Classification and Segmentation; Detect Shots and Extract Motion Features of Shots]
1. First level of summarization achieved by playing a short portion of each audio segment.
2. Second level of summarization achieved by summarizing the collection of video shots contained in an audio segment.

Audio-Assisted Video Browsing Framework
[Figure: Video-Audio Stream; Audio Segment 1 (Play, Skip); Audio Segment 2 (Skip); Audio Segment 3 (Skip); Audio Based Skim; Chosen Audio Segment; Motion Activity based Visual Summary]

MHL application of Casey’s approach to News Video Browsing
• Classify the audio segments of the news video into speech and non-speech categories in first pass
• Classify the speech segments into male and female speech
• Using K-means clustering, find the “principal” speakers in each category
• The occurrence of each of the principal speakers provides a natural semantic boundary
• Apply baseline visual summarization technique to semantic segments obtained above
• There is thus a two-level summarization of the news video

Clustering Results for Male Principal Cast
Speaker Cluster   1    2    3    4    5    6    7
Male speaker1     11   8    2    0    19   5    4
Male speaker2     18   15   0    0    8    13   0
Male speaker3     0    2    15   9    0    0    0
Male speaker4     10   0    6    11   2    0    0
Male speaker5     6    4    0    0    7    6    3

Results and Challenges
• Moderate accuracy so far.
• Results are thus promising but not satisfactory
• Lack of noise robustness and content dependence of the training process represent major hurdles
• Currently working on eliminating such problems through extensive training
• Feature extraction too complex – currently investigating compressed domain audio feature extraction
• Also examining alternative architectures that preserve basic spirit of framework

Automatic Extraction of Sports Highlights
• Rapid Sports Highlights extraction is critical
• Past work has made use of color, camera motion etc.
• MPEG-7 Motion Activity Descriptor is simple
• Can use it to extract high action segments, for example
• Should be useful in highlight extraction

Essential Strategy
• Sports are governed by a set of rules
• Key events lead to surges and dips in motion activity (perceived motion)
• Thus, for a given sport, we can look for certain temporal patterns of motion activity that would indicate an interesting event
• In sports highlights, the emphasis is on key-events and not on key-frames

Motion Activity Curve
• Shot Detection not meaningful for our purpose
• Compute motion activity (average magnitude of motion vectors) for each P-frame
• Smooth the values using a 10-point moving-average filter followed by a median filter
• Quantize into binary levels of high and low motion using a threshold
• Low threshold for Golf, High for Soccer

Activity Curves for Golf
[Figure: motion activity curves for golf]

Activity Curve for Soccer
[Figure: motion activity curve for soccer]

Highlights Extraction: Golf
• Play consists of long stretches of low activity interspersed with bursts of interesting high activity
• Look for rising edges in the quantized motion activity curve
• Concatenate ten-second segments beginning at each of the points of interest marked above
• The concatenation forms the desired summary

Highlights Extraction: Soccer
• Play consists of long stretches of high activity
• Interesting events lead to non-trivial stops in play, leading to a short stretch of low motion activity (MA)
• Thus we look for falling edges followed by a non-trivially long stretch of low motion activity
• We are able to find the interesting events this way but have many false alarms
• With our interface, false alarms are easy to skip

Strengths and Limitations of Our Approach
• The extraction is rapid and can be done in real time
• We use an adaptively computed threshold that is suited to the content
• An interface such as ours helps skip false alarms easily
• There are too many false alarms

Current Approach to Extraction of Soccer Highlights
[Figure: block diagram combining motion activity and audio. Labels: Motion activity feature extraction; MA (5); Quantization: if > mean/2 then 1 else 0; Select: falling edge, > 0.4s, > 4s; Audio magnitude extraction; Volume contour (44KHz 1Hz); MM (12); Transform; Peak detection: localMax - localMin > (globalMax - globalMin)/3, window size 1 min; Mixing: falling edge, peak; Detecting patterns: < 10s, < 2s, and / or, then highlights or uninteresting]

Summary of Sports Highlights Generation
• Motion Activity provides a quick way to generate sports highlights
• We use a different strategy with each sport
• The simplicity of the technique allows real-time tuning of thresholds to modify highlights
• Interactive interfaces enable effective use

PVR: Personal Video Recorder
[Figure: block diagram. Labels: Local Storage; Feature Extraction & MPEG-7 Indexing; Video Codec; Browsing & Summarization; Enhanced User Interface]
• With Massive Amounts of Locally Stored Content, Need to Locate & Customize Content According to User

Blind Summarization – A Video Mining Approach to Video Summarization
Ajay Divakaran and Kadir A. Peker
Mitsubishi Electric Research Laboratories
Murray Hill, NJ

Content Mining
• What is Data Mining?
  – It is the discovery of patterns and relationships in data.
  – Makes heavy use of statistical learning techniques such as regression and classification
• Has been successfully applied to numerical data
• Application to multimedia content is the next logical step
• Most applicable to stored surveillance video and home video since patterns are not known a priori
• Should enable anomalous event detection leading to highlight generation
• Not applicable at first glance to consumer video

Content Mining vs. Typical Data Mining
• Commonalities
  – Large data sets.
    Video is well known to produce huge volumes of data
  – Amenable to statistical analysis
  – Many of the machine learning tools work well with both kinds of data, as can be seen in the literature and in our research as well
• Differences
  – Number of features not necessarily as large as conventional data mining data sets
  – Size of dataset not necessarily as large as conventional data mining data sets
  – Popular data mining techniques such as CART may not be directly applicable and may need modification
• In summary, new mining techniques that retain the basic philosophy while customizing the details will have to be developed

Summarization cast as a Content Mining Problem
• DVD “Auto-Summarization” mode inspires “blind Summarization”
• Content Summarization can be cast as follows:
  – Classify segments into common and uncommon events without necessarily knowing the domain
• Common patterns – what this video is about
• Rare patterns – possibly interesting events
• May help to categorize video, detect style...
• The Summary is then a combination of common and rare events
• Can hybridize with domain-dependent techniques

Data Mining Basics
• Associations
• Time series similarity
• Sequential patterns
• Clustering
• “How do regions A and B differ”, “Any anomaly in A”, “What goes with item x”
  – Marketing, molecular biology, etc.

Associations
• A set of items i1..im; a set of transactions, each containing a subset of the items; a database of transactions:
  – Rule X ⇒ Y (X, Y items)
  – Support s: s% of transactions have X, Y together
  – Confidence c: c% of the time buying X implies buying Y
  – Improvement: Ratio of P(X,Y) to P(X)*P(Y)
• Find all rules with support, confidence and improvement larger than specified thresholds.
• Continuous-valued extension exists

Some Basic Aspects
• Unsupervised learning
  – Similar to clustering vs. classification
• Estimation of joint probability density
  – Find values of (i1, i2, …, in) where P(i1, i2, …, in) is high

Current Direction
• As a starting point, try to discover the temporal patterns we used in detecting golf highlights
• Then generalize to patterns across multiple features
  – Associations between changes, e.g. activity level change, speaker change, scene change, etc.

Previously observed pattern
• Extended segments of very low activity followed by a jump in activity.
• Corresponds to a player preparing for a swing, then hitting the ball, and the camera following the ball.

Time sequence mining
• Find all similar sub-sequences in a given time sequence
  – E.g. motion activity of a video sequence
• Previous work is mostly on querying a given sub-sequence in a larger sequence

Mining for Temporal Patterns
• Given a sequence S(i) and window size w, construct the set of all subsequences of size w: S(1:w), S(2:w+1), …, S(N-w+1:N)
• Find the cross-distances between each pair and cluster
• Problem: How can we search for similar sub-sequences for different window sizes?
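The fixed-window search described under "Mining for Temporal Patterns" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the distance is assumed to be the sum of squared differences, and the clustering step is replaced here by a plain distance threshold to keep the example short.

```python
from typing import List, Sequence, Tuple


def subsequences(series: Sequence[float], w: int) -> List[List[float]]:
    """Return every contiguous window of length w (0-based start indices)."""
    if w <= 0 or w > len(series):
        raise ValueError("window size must be between 1 and len(series)")
    return [list(series[i:i + w]) for i in range(len(series) - w + 1)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_distances(series: Sequence[float], w: int) -> List[List[float]]:
    """Distance between every pair of length-w windows."""
    windows = subsequences(series, w)
    return [[squared_distance(p, q) for q in windows] for p in windows]


def similar_pairs(series: Sequence[float], w: int,
                  max_dist: float) -> List[Tuple[int, int, float]]:
    """Pairs of non-overlapping windows whose distance is at most max_dist.

    Thresholding stands in for the clustering step (an assumption).
    """
    dist = cross_distances(series, w)
    pairs = []
    for i in range(len(dist)):
        for j in range(i + w, len(dist)):  # skip overlapping windows
            if dist[i][j] <= max_dist:
                pairs.append((i, j, dist[i][j]))
    return pairs


# Toy motion-activity sequence containing one repeated pattern.
activity = [0, 0, 1, 5, 0, 0, 1, 5, 2]
pairs = similar_pairs(activity, w=4, max_dist=0.0)
```

With this toy sequence, the only exact repeat of length 4 is the window starting at index 0 and the one starting at index 4. Every window size needs its own pass over all pairs, which is the limitation raised in the last bullet of the slide.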

Point Distance Matrix
• Let the distance between two sub-sequences of size w be:
\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} (x_{i+k} - x_{j+k})^2 \]
• The distance between two points is:
\[ D_1(x_i, x_j) = (x_i - x_j)^2 \]
• Then
\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} D_1(x_{i+k}, x_{j+k}) \]

Point Distance Matrix
\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} D_1(x_{i+k}, x_{j+k}) \]
[Figure: point distance matrix; the span x_i to x_{i+w} on one axis and x_j to x_{j+w} on the other marks the diagonal segment that is summed]

Advantages of Using Point Distance Matrix
• Search for diagonal lines of low point-distance
• Not limited to a given window size; look for the longest possible diagonal line of low point-distance values
• By allowing non-diagonal lines and curves, we can utilize “Time Warping”
  – Matching of sub-sequences of different lengths

Multi-resolution Pattern Discovery
• Multi-resolution analysis:
  – Smooth and sub-sample time series (conventional multiscale, e.g. wavelets)
  – Analysis with various window sizes, matching across different window sizes (our method automatically handles this)

Illustration: Segmenting Haiden Video
[Figure: repeating temporal patterns]

Other Issues
• Clustering segments after finding similarities
• Extend to other features, multiple dimensions
  – Currently using motion activity only
  – Extend to multi-dimensional feature vectors (e.g. color histogram)
  – Extend to multiple features, multiple modalities (e.g. video + audio)
• Using a normalized Euclidean distance measure
  – Normalization based on local variance of data

Block-diagram of time-series mining
[Figure: block diagram. Labels: Compute point cross-distances; Point cross-distance matrix; Find curve segments in the matrix; Similar sub-segments, distances; Clustering and labeling; Labeled patterns, Summary wrt one feature; Mining using feature 1 ... Mining using feature N; Labeling wrt feature 1 ... Labeling wrt feature N; Mine for associations, higher level patterns; Higher-level patterns, Summary]

Target Applications
• Surveillance Video
  – Can detect unusual events through video mining in stored video
• Home Video
  – Can use event detection and other pattern discovery to manage home video
• Entertainment Quality Video
  – Blind Summarization
  – Genre-independent yet event-aware processing
• Content Management for Large Video Databases
  – All of the above at a very large scale

Future Extension - Model Based Matching
• Use more sophisticated statistical techniques to fuse label streams

Conclusion
• System Features
  – Unique, simple and flexible summarization
  – Integrated Player-Browser
    • Enables rapid and convenient browsing
• Video Summarization using
  – Motion Activity as Summarizability
  – Audio-based principal cast detection
  – Audio-visual feature based sports highlights extraction
• Further Possibilities
  – Refine Audio-assisted browsing
  – Incorporate other visual features
  – Video Mining
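
Supplementary sketch: key-frame selection from cumulative motion activity
The key-frame positions listed for N = 1 to 4 (A(S1)=0.5; 0.25 and 0.75; 0.167, 0.5 and 0.83; 0.125, 0.375, 0.625 and 0.875) are all consistent with targets of (2k - 1)/(2N) for k = 1..N. That closed form is an inference from the listed values, not something the slides state, and the function names below are hypothetical. Under that assumption, a minimal sketch picks the first frame at which the normalized cumulative motion activity reaches each target:

```python
from typing import List, Sequence


def keyframe_targets(n_keyframes: int) -> List[float]:
    """Assumed targets (2k - 1) / (2N), k = 1..N, on the normalized
    cumulative motion activity axis."""
    if n_keyframes < 1:
        raise ValueError("need at least one key-frame")
    return [(2 * k - 1) / (2 * n_keyframes) for k in range(1, n_keyframes + 1)]


def select_keyframes(activity: Sequence[float], n_keyframes: int) -> List[int]:
    """Return 0-based frame indices where the normalized cumulative
    motion activity first reaches each target."""
    total = sum(activity)
    if total <= 0:
        raise ValueError("shot has no motion activity to accumulate")
    targets = keyframe_targets(n_keyframes)
    chosen: List[int] = []
    cumulative = 0.0
    t = 0  # index of the next target to reach
    for idx, value in enumerate(activity):
        cumulative += value
        while t < len(targets) and cumulative / total >= targets[t]:
            chosen.append(idx)
            t += 1
    return chosen


# Toy shot: quiet first half, active second half.
shot_activity = [1, 1, 1, 1, 4, 4, 4, 4]
one_keyframe = select_keyframes(shot_activity, 1)
two_keyframes = select_keyframes(shot_activity, 2)
```

In the toy shot the selected frames fall in the active second half rather than at the temporal midpoint, which is the intended effect of spacing key-frames by cumulative activity instead of by time.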