Discovering Recurrent Events in
Multi-channel Data Streams
using Unsupervised Methods
Milind R. Naphade, Chung-Sheng Li
Pervasive Media Management Group
IBM T. J. Watson Research Center
Hawthorne, NY
Contact: naphade@us.ibm.com

Thomas S. Huang
Image Formation & Processing Group
Beckman Institute for Advanced Science & Technology
University of Illinois at Urbana-Champaign, Urbana, IL
Contact: huang@ifp.uiuc.edu
Naphade Li & Huang, NGDM 02
1
Organization
 Mining in Multimodal Data Streams
 Detecting Structure/Recurring Events
 Ergodic + Non-ergodic HMMs
 Experiments with Different Domains
 Concluding Remarks
Multimedia Semantics
 The Semantics of Content
   Objects, Sites, and Events of Interest in the Video (ICIP 02)
 The Semantics of Context
   Scenes
   Context Changes
 The Semantics of Structure/Recurrence
   Recurring Temporal Patterns
   Structural Syntax
State of the Art
• Content Analysis:
  • Image/Video Classification: Naphade (UIUC), Vailaya (Michigan State),
    Iyengar & Vasconcelos (MIT), Smith (IBM).
  • Semantic Audiovisual Analysis: Naphade (UIUC), Chang (Columbia).
• Learning and Multimedia:
  • Statistical Media Learning: Naphade (UIUC), Forsyth (Berkeley),
    Fisher & Jebara (MIT), V. Iyengar (IBM).
  • Learning in Image Retrieval: Chang et al. (UCSB), Zhang et al.
    (Microsoft Research), Naphade et al. (UIUC), Viola et al. (MIT, MERL).
  • Linking Clusters in Media Features: Barnard & Forsyth (Berkeley),
    Slaney (IBM).
• Vision and Speech:
  • Computer Vision in Media Analysis: Bolle (IBM), Malik (Berkeley).
  • Auditory Scene Analysis & Discriminant ASR Models: Ellis (MIT),
    Nadas et al. (IBM), Gopalakrishnan et al. (IBM), Woodland et al.
    (Cambridge), Naphade et al. (UIUC), Wang et al. (NYU), Kuo et al. (USC).
Media Learning: A Perspective
[Diagram: supervised techniques (SVM-, NN-, GMM-, and HMM-based
classification, Multijects, Multinet, supervised segmentation, ASR, CASA)
plotted against axes of Supervision and Semantics, with the Future of
Multimodal Mining marked]
• More Supervision → More Semantics
• Semi-Autonomous Learning → Clever techniques for supervision that reduce
the amount of user input
Extracting Semantics: What Options?
Pipeline: Signals → Features → Semantics

Past: Manual. Most accurate, but most time-consuming, expensive, and static.
Today: Semi-automatic. Possible now; making it adaptive is challenging.
Goal: Fully automated, autonomous, and user-friendly.

Future: For this to be possible and useful, we need Autonomous Learning.
Challenge: In this realm, use “intelligence” and “learning” to move from
left to right without compromising on performance.
Challenges of Multimedia Learning
Problem: Tremendous variability and uncertainty.
Approach: The framework must take uncertainty into account.

Problem: Small number of training examples (relative to feature
dimensionality).
Approach: Exhaustive training techniques such as those for ASR are not
possible.

Problem: Complex distributions, highly non-linear decision boundaries,
high-dimensional feature spaces.
Approach: Employ feature selection and dimensionality reduction; linear
classifiers are not sufficient.

Problem: Manual annotation is time-consuming and expensive; a human barrier.
Approach: Learning needs to be user-centric.

Problem: Dependence on a host of scientific disciplines for extracting good
features, none of which has been perfected.
Approach: Must work around imperfect segmentation and single-channel
auditory non-separability.

Problem: Multiple channels with possible relationships that are unknown.
Approach: Need to fuse information.

• Challenging problems not easily addressed by traditional approaches.
Media Learning: Proposed Architecture
[Architecture diagram: a Multimedia Repository feeds audio and visual
feature extraction; audio, visual, and speech models (SVM, GMM, HMM) are
combined by graphical models for decision fusion; annotation is supported
by active learning (active sample selection), multiple-instance learning,
and granularity/resolution learning, backed by a Knowledge Repository;
outputs drive segmentation, retrieval/summarization, and feedback]
Discovering Structures and Recurring Patterns
Detecting the Semantics of Structure
 Examples
   News: The Anchor Person
   Sports, e.g. Baseball: Home Run, Pitch, Strike-Out
   Talk Shows: Monologue, Laughter, Applause, Music
   Movies, e.g. Action Movies: Explosions, Gunshots
 Challenges
   Mapping features to semantics
   Evaluating a finite set of predefined hypotheses
   Granularity: structure exists at different granularities
   Multimodal fusion
Related Literature
 Early use of HMMs for capturing stationarity and transitions, and its
application to clustering: A. B. Poritz; Levinson et al.
 Scene Segmentation (using HMMs): Wolf; Ferman & Tekalp; Kender & Yeo;
Liu, Huang & Wang; Sundaram & Chang; Divakaran & Chang.
 Multimodal scene similarity: Nakamura & Kanade; Nam, Cetin & Tewfik;
Naphade, Wang & Huang; Srinivasan, Ponceleon, Amir & Petkovic; Adams et al.
Ergodic HMMs
Poritz showed how an ergodic model could capture repetitive
patterns in speech signals through unsupervised clustering.
[State diagram: fully connected states 1, 2, 3]
A Possible State Sequence: 1 1 2 3 1
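The ergodic structure above can be sketched as a fully populated transition matrix; the probabilities below are illustrative, not taken from the talk:

```python
import numpy as np

# A 3-state ergodic HMM: every state can transition to every other state,
# so the transition matrix has no structural zeros.
A = np.array([
    [0.6, 0.2, 0.2],   # from state 1
    [0.3, 0.5, 0.2],   # from state 2
    [0.3, 0.3, 0.4],   # from state 3
])

def sample_states(A, n_steps, start=0, seed=0):
    """Sample a state sequence from transition matrix A."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(n_steps - 1):
        states.append(int(rng.choice(len(A), p=A[states[-1]])))
    return states

seq = sample_states(A, 20)
# In an ergodic chain every state recurs, which is what lets the model
# capture repetitive patterns in the stream.
```

Because all entries of A are positive, any state sequence (such as 1 1 2 3 1 above) has nonzero probability.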
Non-Ergodic HMMs
 Transitions from any state to any other state are not
permitted, unlike in the ergodic case.
[State diagram: left-to-right states 1, 2, 3]
A Possible State Sequence: 1 1 2 3
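The forbidden transitions of the non-ergodic case correspond to structural zeros in the transition matrix; a sketch with illustrative numbers:

```python
import numpy as np

# A 3-state non-ergodic (left-to-right) HMM: backward moves are forbidden,
# so the lower triangle of the transition matrix is zero.
A_lr = np.array([
    [0.7, 0.2, 0.1],   # state 1: stay, or move forward
    [0.0, 0.6, 0.4],   # state 2: cannot return to state 1
    [0.0, 0.0, 1.0],   # state 3: absorbing end state
])

def is_possible(A, states):
    """True if the state sequence has nonzero probability under A."""
    return all(A[a, b] > 0 for a, b in zip(states, states[1:]))

# 1 1 2 3 (zero-indexed: 0 0 1 2) is allowed; 1 3 2 is not.
```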
Capturing Short-Term Stationarity and
Long-Term Structure
[Diagram: three non-ergodic branches (states 1 → 2 → 3), connected through
common states D into a hierarchical ergodic structure]
• Each branch: non-ergodic
• All branches embedded in a hierarchical ergodic structure
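A minimal sketch of such a hierarchy, assuming three left-to-right branches whose exit states play the role of the connecting states D (all numbers illustrative):

```python
import numpy as np

# Three non-ergodic branches of three states each, embedded in an ergodic
# super-structure: inside a branch only stay/forward moves are allowed,
# while the last state of each branch may jump to the start of any branch.
n_branches, n_states = 3, 3
N = n_branches * n_states
A = np.zeros((N, N))
for b in range(n_branches):
    base = b * n_states
    for s in range(n_states - 1):
        A[base + s, base + s] = 0.6          # stay (short-term stationarity)
        A[base + s, base + s + 1] = 0.4      # move forward within the branch
    exit_state = base + n_states - 1
    A[exit_state, exit_state] = 0.4          # linger at the branch end
    for b2 in range(n_branches):             # ergodic jump to any branch start
        A[exit_state, b2 * n_states] = 0.6 / n_branches
```

Within-branch transitions give short-term stationarity; the exit-to-entry jumps give the long-term (recurrent) structure.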
Experimental Setup
 Domains
   Action Videos (20 clips from “Specialist”)
   Late Night Shows (20 min of Dave Letterman)
 Features
   Visual (30 frames/sec)
     Color (HSV histogram)
     Structure (Edge Direction histogram)
   Audio (30 audio frames/sec to sync with video)
     32 Mel-Frequency Cepstral Coefficients (10 ms overlap)
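As a rough sketch of the per-frame color feature, a normalized hue histogram can be computed as below; the channel weighting and bin count used in the experiments are not specified here, so both are assumptions:

```python
import numpy as np

def hue_histogram(hsv_frame, bins=16):
    """Normalized histogram of the hue channel (values assumed in [0, 1))."""
    hue = hsv_frame[..., 0].ravel()
    hist, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

# Stand-in for one HSV video frame (height x width x 3), random for the demo.
rng = np.random.default_rng(0)
frame = rng.random((120, 160, 3))
feat = hue_histogram(frame)   # one fixed-length feature vector per frame
```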
Results: Recurring Patterns in Video
Movie: Specialist
Discovered Recurring Pattern: Explosion
Results: Recurring Patterns in Video
Late Night Show with Dave Letterman
Discovered Patterns: Applause, Laughter, Speech, Music
[Figure: example segments of the discovered Applause and Laughter patterns]
Observations
 Completely UNSUPERVISED.
 In the case of recurring temporal event patterns, the scheme can
discover them if a sufficient number of these patterns occur in the set.
 In the case of repetitive anchoring events, such as applause in comedy
shows, the scheme can discover these events.
 Segmentation and pattern discovery are very helpful in annotation: e.g.,
to manually annotate Dave Letterman’s jokes, just look before the applause.
 Anyone who has done manual audio annotation knows how useful it is to
get the right segment boundaries, especially at the micro and macro levels.
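The annotation saving can be illustrated with a toy propagation step, assuming the clustering has already assigned a cluster id to each segment (all data below are made up):

```python
# Once every segment carries a cluster id, annotating ONE example per
# cluster labels the whole stream: look at a handful of clusters instead
# of every segment.
segment_clusters = [0, 1, 0, 2, 1, 0, 2]                      # id per segment
cluster_labels = {0: "speech", 1: "applause", 2: "laughter"}  # one label each

# Propagate: three manual labels annotate all seven segments.
segment_labels = [cluster_labels[c] for c in segment_clusters]
```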
Summary
 Problem:
   Automatic discovery of recurring temporal patterns without supervision.
 Approach:
   Clustering: unsupervised temporal clustering using a hierarchical
ergodic model with non-ergodic temporal pattern models.
   Interaction: the user then needs to analyze only the extracted
recurring set to quickly propagate annotation.
 Results:
   Automatic extraction of recurring patterns (laughter, explosion,
monologue, etc.) and regular structure.
   Near-complete elimination of manual annotation: annotating clusters
takes orders of magnitude less effort than annotating content.
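The temporal clustering named under Approach ultimately reads off a most likely state path for the observations; a self-contained Viterbi sketch on a toy two-state discrete model (parameters illustrative; in the actual system the model would first be trained without supervision):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for discrete observations obs."""
    T, N = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)   # scores[i, j]: move i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])       # sticky states: short-term runs
B = np.array([[0.9, 0.1], [0.1, 0.9]])       # state i prefers symbol i
obs = [0, 0, 0, 1, 1, 1, 0, 0]
states = viterbi(pi, A, B, obs)              # segments the stream by state
```

Runs of the same state form the segments that the user then annotates per cluster.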
Future Directions
 Experiment with different non-ergodic branches as well as with
cross-branch transitions.
 Use this to bootstrap training of semantic events that can be detected
using HMMs/DBNs (ICIP 98, NIPS 2000).
 Explore visual features extracted regionally to model a richer class of
recurring patterns.
 Experiment with the sports domain (possible interaction with Prof. Chang
and his group).