An Overview of MPEG-4 Object-Based Encoding Algorithms

Video Mining Workshop 2002
Video Indexing and Summarization using
Combinations of the MPEG-7 Motion Activity
Descriptor with other MPEG-7 audio-visual
descriptors
Ajay Divakaran
MERL - Mitsubishi Electric Research Labs
Murray Hill, NJ
MERL - Mitsubishi Electric Research Laboratories
Outline
• Introduction
– MPEG-7 Standard
– Motivation for proposed techniques
• Video Summarization using Motion Activity
• Audio Assisted Video Summarization
– Principal Cast Detection with MPEG-7 Audio Features
– Automatic generation of Sports Highlights
• Target Applications
– Personal Video Recorder
• Demonstration
• Initial work on Video Mining
• Conclusion
Team
• Yours Truly
• Kadir A. Peker – Colleague and Ex-Doctoral Student
• Regunathan Radhakrishnan – Current Doctoral Student
• Romain Cabasson – Summer Intern
• Ziyou Xiong – Summer Intern and Current Collaborator
• Padma Akella – Initial Demo designer and developer
• Pradubkiat Bouklee – Initial Software developer
MPEG-7 Objectives
• To develop a standard to identify and describe the
multimedia content
• Formal name: Multimedia Content Description
Interface
• Enable quick access to desired content whether
local or not
MPEG-7: Key Technologies and Scope
[Figure: block diagram with the labels: description production (feature extraction), standard description (scope of MPEG-7), description consumption (search engine)]
MPEG-7 and other Standards
[Figure: standards arranged by functionality, ranging from an emphasis on subjective representation (rate) to an emphasis on semantic conveyance. Labels: MPEG-1; MPEG-2 (studio, DTV); H.263; JPEG; JPEG-2000; MPEG-4 (SNHC, object-based, hybrid content); MPEG-7 (descriptors). Application labels: interactive TV, video conferencing, virtual reality, indexing, retrieving, browsing, visualization, abstract representation]
MPEG-7 framework
• MPEG-7 standardizes:
– Descriptors (Ds): representations of features
• to describe various types of features of multimedia information
• to define the syntax and the semantics of each feature representation
– Description Schemes (DSs)
• to specify pre-defined structures and semantics of descriptors and their
relationship
– Description Definition Language (DDL)
• to allow the creation of new DSs and, possibly, Ds, and to allow the extension
and modification of existing DSs – XML MPEG-7 Schema
MPEG-7 Motion Activity Descriptor
• Feature Extraction from Video
– Uncompressed Domain
• Color Histograms - Zhang et al
• Motion Estimation - Kanade et al
– Compressed Domain
• DC Images - Yeo et al, Kobla et al
• Motion Vector Based - Zhang et al
• Bit Allocation - Feng et al, Divakaran et al
Motivation for Compressed Domain Extraction
• Compressed domain feature extraction is fast.
• Block-matched motion vectors are sufficient for gross
description.
• Motion vector based calculation can be easily
normalized w.r.t. encoding parameters.
Motivation for Descriptor
• Need to capture “pace” or Intensity of activity
• For example, draw distinction between
– “High Action” segments such as chase scenes.
– “Low Action” segments such as talking heads
• Emphasize simple extraction and matching
• Use Gross Motion Characteristics thus avoiding object
segmentation, tracking etc.
• Compressed domain extraction is important
Proposed Motion Activity Descriptor
• Attributes of Motion Activity Descriptor
– Intensity/Magnitude - 3 bits
– Spatial Characteristics - 16 bits
– Temporal Characteristics - 30 bits
– Directional Characteristics - 3 bits
MPEG-7 Intensity of Motion Activity
• Expresses “pace” or Intensity of Action
• Uses scale of 1-5, very low - low - medium - high - very
high
• Extracted by suitably quantizing variance of motion
vector magnitude
• Motion Vectors extracted from compressed bitstream
• Successfully tested with subjectively constructed Ground
Truth
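The extraction step above can be sketched as follows. This is a minimal illustration, assuming the motion vectors of one frame are already available as (dx, dy) pairs; the function name and the four threshold values are hypothetical placeholders, not the values used in the standard or in the talk.

```python
import math

# Hypothetical thresholds on the variance of motion vector magnitude.
# Placeholder values for illustration only.
VARIANCE_THRESHOLDS = (16.0, 121.0, 289.0, 1024.0)

def motion_intensity(motion_vectors, thresholds=VARIANCE_THRESHOLDS):
    """Map one frame's motion vectors [(dx, dy), ...] to a level 1..5.

    1 = very low, 2 = low, 3 = medium, 4 = high, 5 = very high.
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    if not magnitudes:
        return 1  # no motion vectors: treat as very low activity
    mean = sum(magnitudes) / len(magnitudes)
    variance = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    # The level is 1 plus the number of thresholds the variance exceeds.
    return 1 + sum(variance > t for t in thresholds)
```

A frame whose vectors all have the same magnitude has zero variance and maps to level 1, regardless of how large that common magnitude is.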
Video Summarization using Motion Activity
• Video sequence V:{f1, f2, … fN} set of temporally
ordered frames
• Any temporally ordered subset of V is a summary
• Previous work: Color dominant
– Cluster frames based on image similarity
– Select representative frames from clusters
Motion Activity as Summarizability
• Hypothesis:
– Motion activity measures intensity of motion
– hence it measures change in the video
– Therefore it indicates Summarizability
• Test of the Hypothesis
– Examine relationship between Fidelity of Summary and motion activity
– Results show close correlation and motivate novel summarization
strategy
Fidelity of a Summary
Let the set of key-frames be S, and the set of frames be R.
Let the distance between two frames Si and Ri be d(Si,Ri). Define di for each frame Ri as
\[ d_i = \min_{k = 0..m} d(S_k, R_i) \]
Then the Semi-Hausdorff distance between S and R is given by
\[ d_{sh}(S, R) = \max_{i = 1..n} d_i \]
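The definition above translates directly into code. The sketch below assumes frames are represented by normalized histograms and uses an L1 histogram difference as the frame distance d; both choices are assumptions made for illustration, since the slide does not specify d.

```python
def l1_histogram_distance(h1, h2):
    """Assumed frame distance: L1 difference between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def semi_hausdorff(key_frames, frames, dist=l1_histogram_distance):
    """d_sh(S, R): the largest distance from any frame in R to its
    nearest key-frame in S. Smaller values mean a more faithful summary."""
    return max(min(dist(s, r) for s in key_frames) for r in frames)
```

For example, with three frames whose histograms drift from [1, 0] to [0, 1], choosing the middle frame as the key-frame yields a smaller d_sh than choosing the first frame.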
Test of Hypothesis
• Segment the test sequence into shots
• Use the first frame of each shot as its Key-Frame (KF)
• Compute the fidelity of each key-frame as described
• Compute the motion activity of each shot
• For each MPEG-7 motion activity threshold
– Identify shots that have the same or lower motion activity
– Find the percentage p of shots with unacceptable fidelity (>0.2)
• Plot p vs the MPEG-7 motion activity thresholds
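The thresholding loop of this experiment can be sketched as below. The input format (one pair of motion activity level and key-frame distance per shot) and the function name are assumptions for illustration; the 0.2 limit is the one quoted on the slide.

```python
def unacceptable_percentage_by_threshold(shots, levels=(1, 2, 3, 4, 5),
                                         max_distance=0.2):
    """shots: list of (activity_level, key_frame_distance), one per shot.

    For each activity threshold, return the percentage p of shots at or
    below that threshold whose key-frame distance exceeds max_distance.
    """
    result = {}
    for level in levels:
        selected = [d for activity, d in shots if activity <= level]
        if not selected:
            result[level] = None  # no shots at or below this level
            continue
        bad = sum(d > max_distance for d in selected)
        result[level] = 100.0 * bad / len(selected)
    return result
```

Plotting the returned values against the levels gives the curve of p versus motion activity threshold.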
Motion Activity as a Measure of Summarizability
Conclusions from Experiment
• The percentage of shots with unacceptable fidelity grows
monotonically with motion activity
• In other words, as motion activity grows, the shots
become increasingly difficult to summarize
• Hence, motion activity is a direct indicator of
summarizability
• Question: Is the first frame the best choice as a keyframe?
Optimal Key-Frame Selection Using
Motion Activity
• Summarizability is an indication of change in the shot
• The cumulative motion activity is therefore an indication
of the cumulative change in the shot
Optimal Key-Frame Extraction Using Motion
Activity
[Figure: normalized cumulative motion activity (0 to 1) versus time (frame number); the optimal single key-frame is marked where the cumulative motion activity reaches 0.5]
Simple generalization for N key-frames
Comparison with Opt. Fidelity KF
Mot. Activity   Ddsh First Frame   Ddsh Proposed KF   Number of Shots
Very Low        0.0116             0.0080             25
Low             0.0197             0.0110             133
Medium          0.0406             0.0316             73
High            0.0950             0.0576             28
Very High
Overall avg.    0.0430             0.0216
Optimal Key-Frame Selection Based on
Cumulative Motion Activity
Number of Key-Frames N=1: A(S1)=0.5
Number of Key-Frames N=2: A(S1)=0.25, A(S2)=0.75
Number of Key-Frames N=3: A(S1)=0.167, A(S2)=0.5, A(S3)=0.83
Number of Key-Frames N=4: A(S1)=0.125, A(S2)=0.375, A(S3)=0.625, A(S4)=0.875
A(Si) = Normalized Cumulative Motion Activity at the location of Frame Si
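The listed values are consistent with placing key-frame i where the normalized cumulative motion activity reaches (2i - 1) / (2N). That closed form is inferred from the numbers above rather than stated on the slide. A minimal sketch under that assumption, with a hypothetical function name and a per-frame activity list as input:

```python
def select_key_frames(activity, n_key_frames):
    """Return frame indices at which the normalized cumulative motion
    activity first reaches (2i - 1) / (2N), for i = 1..N."""
    total = sum(activity)
    if total <= 0:
        raise ValueError("activity must have a positive sum")
    targets = [(2 * i - 1) / (2 * n_key_frames)
               for i in range(1, n_key_frames + 1)]
    key_frames = []
    cumulative = 0.0
    next_target = 0
    for index, value in enumerate(activity):
        cumulative += value
        # One frame may satisfy several targets if activity is very bursty.
        while (next_target < len(targets)
               and cumulative / total >= targets[next_target]):
            key_frames.append(index)
            next_target += 1
    return key_frames
```

With constant activity the key-frames are spread uniformly in time; with bursty activity they concentrate where the shot changes most, and the same frame can be returned more than once.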
Audio Assisted Video Browsing: Motivation
• Baseline MHL visual summarization works well only
when semantic segment boundaries are well defined
• Semantic segment boundaries cannot be located easily
using visual features alone
• Audio is a rich source of content semantics
• Should use audio features to locate semantic segment
boundaries
Past Work
• Principal Cast Identification using Audio – Wang et al
• Topic Detection using Speech Recog. – Hanjalic et al
• Semantic Scene Segmentation using Audio – Sundaram
et al
• Past work has emphasized classification of audio into
crisp categories
• We would like both a crisp categorization and a feature
vector that allows softer classification
• Generalized Sound Recognition Framework – Casey et
al
• Casey’s work provides a rich audio-semantic framework
for our research
MPEG-7 Feature Extraction for Generalized
Sound Recognition
[Figure: block diagram with the labels: window, audio spectrum envelope, extraction (SVD / ICA), stored basis functions, power envelope, basis projection]
Our approach to Principal Cast Detection
[Figure: block diagram with the labels: MPEG-7 generalized sound recognition, state duration histograms, our enhancement, principal cast]
Proposed Audio-Assisted Video Browsing
Framework
[Figure: news or other video is split into an audio path (audio feature extraction, classification and segmentation) and a video path (detect shots and extract motion features of shots)]
1. First level of summarization achieved by playing a short portion of each audio segment.
2. Second level of summarization achieved by summarizing the collection of video shots contained in an audio segment.
Audio-Assisted Video Browsing Framework
[Figure: a video-audio stream divided into audio segments 1, 2 and 3, each with play and skip controls (audio based skim); the chosen audio segment leads to a motion activity based visual summary]
MHL application of Casey’s approach to
News Video Browsing
• Classify the audio segments of the news video into
speech and non-speech categories in first pass
• Classify the speech segments into male and female
speech
• Using K-means clustering find the “principal” speakers in
each category
• The occurrence of each of the principal speakers
provides a natural semantic boundary
• Apply baseline visual summarization technique to
semantic segments obtained above
• There is thus a two-level summarization of the news
video
Clustering Results for Male Principal Cast
Speaker Cluster:    1    2    3    4    5    6    7
Male speaker1      11    8    2    0   19    5    4
Male speaker2      18   15    0    0    8   13    0
Male speaker3       0    2   15    9    0    0    0
Male speaker4      10    0    6   11    2    0    0
Male speaker5       6    4    0    0    7    6    3
Results and Challenges
• Moderate accuracy so far.
• Results are thus promising but not satisfactory
• Lack of noise robustness and content dependence of the
training process are major hurdles
• Currently working on eliminating such problems through
extensive training
• Feature extraction too complex – currently investigating
compressed domain audio feature extraction
• Also examining alternative architectures that preserve
basic spirit of framework
Automatic Extraction of Sports Highlights
• Rapid Sports Highlights extraction is critical
• Past work has made use of color, camera motion etc.
• MPEG-7 Motion Activity Descriptor is simple
• Can use it to extract high action segments for example
• Should be useful in highlight extraction
Essential Strategy
• Sports are governed by a set of rules
• Key events lead to surges and dips in motion activity
(perceived motion)
• Thus, for a given sport, we can look for certain temporal
patterns of motion activity that would indicate an
interesting event
• In sports highlights, the emphasis is on key-events and
not on key-frames
Motion Activity Curve
• Shot Detection not meaningful for our purpose
• Compute motion activity (avg. mag. of MVs) for each P-frame
• Smooth the values using a 10 point MA filter followed by
a median filter
• Quantize into binary levels of high and low motion using
threshold
• Low threshold for Golf, High for Soccer
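The smoothing and quantization steps can be sketched as follows. The 10-point moving average is taken from the slide; the median filter width of 5, the causal window, and the function names are assumptions, since the slide does not give them.

```python
from statistics import median

def moving_average(values, width=10):
    """Causal moving average; shorter windows are used at the start."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def median_filter(values, width=5):
    """Centered median filter; the width is an assumed value."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def binarize_activity(values, threshold):
    """Smooth per-P-frame motion activity, then quantize to 0 (low) or 1 (high)."""
    smoothed = median_filter(moving_average(values))
    return [1 if v > threshold else 0 for v in smoothed]
```

The threshold is left as a parameter because it depends on the sport: low for golf, high for soccer.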
Activity Curves for Golf
Activity Curve for Soccer
Highlights Extraction: Golf
• Play consists of long stretches of low activity
interspersed with bursts of interesting high activity
• Look for rising edges in the quantized motion activity
curve
• Concatenate ten second segments beginning at each of
the points of interest marked above
• The concatenation forms the desired summary
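The golf strategy reduces to finding rising edges in the binary activity curve and cutting a fixed-length segment at each one. A minimal sketch, assuming the curve is a list of 0/1 values sampled at a known rate; the function names and the samples_per_second parameter are hypothetical.

```python
def rising_edges(binary):
    """Indices where the quantized activity goes from low (0) to high (1)."""
    return [i for i in range(1, len(binary))
            if binary[i - 1] == 0 and binary[i] == 1]

def golf_highlights(binary, samples_per_second, segment_seconds=10.0):
    """Return (start, end) times in seconds of the ten second segments
    that begin at each rising edge."""
    segments = []
    for i in rising_edges(binary):
        start = i / samples_per_second
        segments.append((start, start + segment_seconds))
    return segments
```

Segments that begin less than ten seconds apart overlap; a player could merge them before concatenating, which this sketch does not do.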
Highlights Extraction: Soccer
• Play consists of long stretches of high activity
• Interesting events lead to non-trivial stops in play leading
to a short stretch of low MA
• Thus we look for falling edges followed by a non-trivially
long stretch of low motion activity
• We are able to find the interesting events this way but
have many false alarms
• With our interface false alarms are easy to skip
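For soccer, the pattern is a falling edge followed by a sufficiently long low-activity run. The sketch below assumes a binary activity curve sampled at a known rate; the 4-second minimum duration, the function name, and the parameter names are illustrative assumptions, not values confirmed by the slides.

```python
def soccer_candidates(binary, samples_per_second, min_low_seconds=4.0):
    """Return times (in seconds) of falling edges that are followed by
    at least min_low_seconds of continuous low activity."""
    min_len = int(min_low_seconds * samples_per_second)
    events = []
    n = len(binary)
    i = 1
    while i < n:
        if binary[i - 1] == 1 and binary[i] == 0:
            # Measure the length of the low-activity run starting at i.
            j = i
            while j < n and binary[j] == 0:
                j += 1
            if j - i >= min_len:
                events.append(i / samples_per_second)
            i = j
        else:
            i += 1
    return events
```

Raising min_low_seconds trades missed events for fewer false alarms, which is the tuning knob the slides refer to.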
Strengths and Limitations of Our Approach
• The extraction is rapid and can be done in real time
• We use an adaptively computed threshold that is suited
to the content
• An interface such as ours helps skip false alarms easily
• There are too many false alarms
Current Approach to Extraction of Soccer
Highlights
[Figure: block diagram of soccer highlight detection. Labels on the video path: motion activity feature extraction; MA (5); MM (12); quantization: if > mean/2 then 1 else 0; select: falling edge (> 0.4 s, > 4 s). Labels on the audio path: audio magnitude extraction; transform: volume contour (44 kHz to 1 Hz); peak detection: localMax - localMin > (globalMax - globalMin)/3, window size 1 mn. Final stages: mixing; detecting patterns: falling edge and/or peak (< 10 s, < 2 s), then highlights or uninteresting]
Summary of Sports Highlights Generation
• Motion Activity provides a quick way to generate sports
highlights
• We use a different strategy with each sport
• The simplicity of the technique allows real-time tuning of
thresholds to modify highlights
• Interactive interfaces enable effective use
PVR: Personal Video Recorder
[Figure: PVR block diagram with the labels: local storage, feature extraction & MPEG-7 indexing, video codec, browsing & summarization, enhanced user interface]
With Massive Amounts of Locally Stored Content,
Need to Locate & Customize Content According to User
Blind Summarization – A Video Mining
Approach to Video Summarization
Ajay Divakaran and Kadir A. Peker
Mitsubishi Electric Research Laboratories
Murray Hill, NJ
Content Mining
• What is Data Mining?
– It is the discovery of patterns and relationships in data.
– Makes heavy use of statistical learning techniques such as regression
and classification
• Has been successfully applied to numerical data
• Application to multimedia content is the next logical step
• Most applicable to stored surveillance video and home
video since patterns are not known a priori
• Should enable anomalous event detection leading to
highlight generation
• Not applicable at first glance to consumer video
Content Mining vs. Typical Data Mining
• Commonalities
– Large data sets. Video is well known to produce huge volumes of data
– Amenable to statistical analysis – Many of the machine learning tools work well
with both kinds of data as can be seen in the literature and our research as well
• Differences
– Number of features not necessarily as large as conventional data mining data
sets
– Size of dataset not necessarily as large as conventional data mining data sets
– Popular data mining techniques such as CART may not be directly applicable
and may need modification
• In summary, new mining techniques that retain the basic philosophy
while customizing the details will have to be developed
Summarization cast as a Content Mining
Problem
• DVD “Auto-Summarization” mode inspires “blind Summarization”
• Content Summarization can be cast as follows:
– Classify segments into common and uncommon events without
necessarily knowing the domain
• Common patterns – what this video is about
• Rare patterns – possibly interesting events
• May help to categorize video, detect style...
• The Summary is then a combination of common and rare events
• Can hybridize with domain-dependent techniques
Data Mining Basics
• Associations
• Time series similarity
• Sequential patterns
• Clustering
• “How do regions A and B differ”, “Any anomaly in A”, “What goes with item x”
– Marketing, molecular biology, etc.
Associations
• A set of items i1..im; a set of transactions containing
subset of items; a database of transactions:
– Rule X → Y (X, Y items):
– Support s: s% of transactions have X,Y together
– Confidence c: c% of the time buying X implies buying Y
– Improvement: Ratio of P(X,Y) to P(X)*P(Y)
• Find all rules with support, confidence and improvement
larger than specified thresholds.
• Continuous-valued extension exists
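The three rule measures can be computed directly from transaction counts. A minimal sketch, assuming each transaction is a set of items; the function name is hypothetical, and the measures are returned as fractions rather than percentages.

```python
def rule_metrics(transactions, x, y):
    """Return (support, confidence, improvement) for the rule x -> y.

    support     = P(X, Y)
    confidence  = P(Y | X) = P(X, Y) / P(X)
    improvement = P(X, Y) / (P(X) * P(Y))
    """
    n = len(transactions)
    n_x = sum(x in t for t in transactions)
    n_y = sum(y in t for t in transactions)
    n_xy = sum(x in t and y in t for t in transactions)
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    improvement = support / ((n_x / n) * (n_y / n)) if n_x and n_y else 0.0
    return support, confidence, improvement
```

An improvement above 1 means X and Y occur together more often than they would if they were independent.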
Some Basic Aspects
• Unsupervised learning
– Similar to clustering vs. classification
• Estimation of joint probability density
– Find values of (i1,i2,…,in) where P(i1, i2,…,in) is high
Current Direction
• As a starting point, try to discover the temporal patterns
we used in detecting golf highlights
• Then generalize to patterns across multiple features
– Associations between changes, e.g. activity level change, speaker
change, scene change, etc.
Previously observed pattern: Extended segments of very low activity followed by a jump in activity.
Corresponds to a player preparing for a swing, then hitting the ball and the camera following the ball.
Time sequence mining
• Find all similar sub-sequences in a given time sequence
– E.g. motion activity of a video sequence
• Previous work mostly query of a given sub-sequence in
a larger sequence
Mining for Temporal Patterns
• Given a sequence S(i) and window size w, construct the
set of all subsequences of size w: S(1:w), S(2:w+1), …,
S(N-w+1:N)
• Find the cross-distances between each pair and cluster
• Problem: How can we search for similar sub-sequences
for different window sizes?
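The fixed-window formulation can be sketched as a brute-force computation. The squared Euclidean distance is an assumption here, as are the function names; the clustering step is omitted.

```python
def subsequences(series, w):
    """All length-w windows S(1:w), S(2:w+1), ..., S(N-w+1:N), 0-indexed."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

def cross_distances(series, w):
    """Matrix of squared Euclidean distances between every pair of windows."""
    subs = subsequences(series, w)
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in subs]
            for p in subs]
```

Every choice of w needs a full recomputation in this form, which is the limitation the last bullet points at.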
Point Distance Matrix
• Let the distance between two sub-sequences of size w be:
\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} (x_{i+k} - x_{j+k})^2 \]
• The distance between two points is:
\[ D_1(x_i, x_j) = (x_i - x_j)^2 \]
• Then
\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} D_1(x_{i+k}, x_{j+k}) \]
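The decomposition above means the point distance matrix only needs to be computed once, after which the distance for any window size is a sum along a diagonal. A minimal sketch with hypothetical function names:

```python
def point_distance_matrix(series):
    """D_1 for every pair of points: d1[i][j] = (x_i - x_j)^2."""
    return [[(a - b) ** 2 for b in series] for a in series]

def window_distance(d1, i, j, w):
    """D_w(x_i, x_j) as the sum of w consecutive entries along the
    diagonal of the point distance matrix that starts at (i, j)."""
    return sum(d1[i + k][j + k] for k in range(w))
```

Changing w changes only how many diagonal entries are summed; the matrix itself is reused.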
Point Distance Matrix
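\[ D_w(x_i, x_j) = \sum_{k=0}^{w-1} D_1(x_{i+k}, x_{j+k}) \]
[Figure: point distance matrix with the sub-sequences x_i..x_{i+w} and x_j..x_{j+w} marked along its two axes]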
Advantages of Using Point Distance Matrix
• Search for diagonal lines of low point-distance
• Not limited to a given window size, look for the longest
possible diagonal line of low point-distance values
• By allowing non diagonal lines and curves, we can utilize
“Time Warping”
– Matching of sub-sequences of different lengths
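The diagonal search can be sketched as a scan over every diagonal above the main one, tracking the longest run of entries at or below a threshold. The threshold, the min_offset parameter that excludes trivial self-matches, and the function name are assumptions; time warping (non-diagonal paths) is not covered.

```python
def longest_low_diagonal(d1, threshold, min_offset=1):
    """Return (length, i, j): the longest run along a diagonal of the point
    distance matrix d1 whose entries are all <= threshold. The matching
    sub-sequences start at indices i and j, with j - i >= min_offset."""
    n = len(d1)
    best = (0, None, None)
    for offset in range(min_offset, n):
        run = 0
        for i in range(n - offset):
            if d1[i][i + offset] <= threshold:
                run += 1
                if run > best[0]:
                    start = i - run + 1
                    best = (run, start, start + offset)
            else:
                run = 0
    return best
```

No window size is fixed in advance: the length of the returned run is the size of the repeated pattern.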
Multi-resolution Pattern Discovery
• Multi-resolution analysis:
– Smooth and sub-sample time series (conventional multiscale, e.g.
wavelets)
– Analysis with various window sizes, matching across different window
sizes (our method automatically handles this)
Illustration: Segmenting Haiden Video
[Figure: repeating temporal patterns]
Other Issues
• Clustering segments after finding similarities
• Extend to other features, multiple dimensions
– Currently using motion activity only
– Extend to multi-dimensional feature vectors (e.g. color histogram)
– Extend to multiple features, multiple modalities (e.g. video + audio)
• Using a normalized Euclidean distance measure
– Normalization based on local variance of data
Block-diagram of time-series mining
[Figure: block diagram. Mining using feature 1 through mining using feature N each consist of: compute point cross-distances; point cross-distance matrix; find curve segments in the matrix; similar subsegments, distances; clustering and labeling; labeled patterns, summary wrt one feature. The labelings wrt feature 1 through feature N are then mined for associations and higher-level patterns, producing higher-level patterns and a summary]
Target Applications
• Surveillance Video
– Can detect unusual events through video mining in stored video
• Home Video
– Can use event detection and other pattern discovery to manage home
video
• Entertainment Quality Video
– Blind Summarization
– Genre Independent yet event-aware processing
• Content Management for Large Video Databases
– All of the above at a very large scale
Future Extension - Model Based Matching
• Use more sophisticated statistical techniques to fuse
label streams
Conclusion
• System Features
– Unique, simple and flexible summarization
– Integrated Player-Browser
• Enable rapid and convenient browsing
• Video Summarization using
– Motion Activity as Summarizability
– Audio-based principal cast detection
– Audio-visual feature based sports highlights extraction
• Further Possibilities
– Refine Audio-assisted browsing
– Incorporate other visual features
– Video Mining