Combined Gesture-Speech Analysis and Synthesis

M. Emre Sargın, Ferda Ofli, Yelena Yasinnik, Oya Aran, Alexey Karpov, Stephen Wilson, Engin Erzin, Yücel Yemez, A. Murat Tekalp
Outline
• Project Objective
• Technical Details
– Preparation of Gesture-Speech Database
– Determination of Gestural-Auditory Events
– Detection of Gestural-Auditory Events
– Gesture-Speech Correlation Analysis
– Synthesis of Gestures Accompanying Speech
• Resources
• Concluding Remarks and Future Work
• Demonstration
Project Objective
• The production of speech and gesture is interactive throughout
the entire communication process.
• Human-Computer Interaction systems should likewise be interactive: in an edutainment application, for example, an animated person's speech should be aided and complemented by its gestures.
• Two main goals of this project:
– Analysis and modeling of the correlation between speech and gestures.
– Synthesis of correlated, natural gestures accompanying speech.
Technical Details
• Preparation of Gesture-Speech Database
• Determination of Gestural-Auditory Events
• Detection of Gestural-Auditory Events
• Gesture-Speech Correlation Analysis
• Synthesis of Gestures Accompanying Speech
Preparation of Database
• The gestures and speech of one specific subject (Can-Ann) were investigated.
• A 25-minute video of a native English speaker giving directions: 25 fps, 38,249 frames.
Determination of Gestural-Auditory Events
• The database is manually examined to find specific, repetitive gestural and auditory events.
• Note that the events found for one specific subject are personal and can vary from culture to culture.
– During refusal phrases:
• Turkish Style → Upward Movement of Head
• European Style → Left-Right Movement of Head
– Can-Ann does not use these gestural events at all.
• Auditory Events:
– Semantic Information (Keywords): “Left”, “Right” and “Straight”.
– Prosodic Information: “Accent”.
• Gestural Events:
– Head Movements: “Down”, “Tilt”.
– Hand Movements: “Left”, “Right”, “Straight”.
Correlation Results
[Pie charts: directional word-gesture alignment (Match, Close, Confused, Wrong) and pitch-accent alignment (Gesture-marked, Phrase Initial, No Gesture)]
Detection of Gesture Elements
• In this project, we consider arm and head gestures.
• Gesture features are selected as:
– Head Gesture Features: global motion parameters calculated within the head region.
– Hand Gesture Features: hand center-of-mass position and its velocity (see the sketch after this list).
• Main tasks included in detection of gesture elements:
– Tracking of head region.
• Optical Flow Based
– Tracking of hand region.
• Kalman Filter Based
• Particle Filter Based
– Extraction of gesture features.
– Recognition and labeling of gestures.
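As an illustration of the hand-feature and Kalman-tracking steps above, here is a minimal sketch in Python/OpenCV: a per-frame hand centroid (assumed to come from some detector, e.g. skin-color segmentation, which is not shown) is smoothed by a constant-velocity Kalman filter whose state already contains the position and velocity used as gesture features. Names and noise settings are illustrative assumptions, not the project's actual code.

```python
import cv2
import numpy as np

# Constant-velocity Kalman filter: state = [x, y, vx, vy], measurement = [x, y]
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2      # assumed noise levels
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def hand_centroid(mask):
    """Center of mass of a binary hand mask (assumed hand-detector output)."""
    m = cv2.moments(mask, binaryImage=True)
    return (m["m10"] / m["m00"], m["m01"] / m["m00"]) if m["m00"] > 0 else None

def track(masks):
    """Return per-frame [x, y, vx, vy] hand features, Kalman-smoothed."""
    features = []
    for mask in masks:
        kf.predict()                                   # propagate state one frame
        c = hand_centroid(mask)
        if c is not None:                              # fuse measurement when hand found
            kf.correct(np.array([[c[0]], [c[1]]], np.float32))
        s = kf.statePost.ravel()
        features.append([float(s[0]), float(s[1]), float(s[2]), float(s[3])])
    return np.array(features)
```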
Detection of Auditory Elements
• In this project, we consider semantic and prosodic events.
• Main tasks included in detection of auditory elements:
– Extraction of Speech Features (see the sketch after this list):
• MFCC
• Pitch
• Intensity
– Keyword Spotting
• HMM Based
• Dynamic Time Warping Based
– Accent Detection
• HMM Based
• Sliding Window Based
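As a sketch of the speech feature extraction step listed above, the snippet below computes MFCC, pitch, and intensity tracks with the librosa library; the project itself relied on other tools (e.g. HTK and Praat), and the frame/hop sizes and pitch range here are illustrative assumptions.

```python
import librosa

def speech_features(wav_path, sr=16000, frame=0.025, hop=0.010):
    """Extract MFCC, pitch, and intensity tracks on a common hop grid."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop_len = int(frame * sr), int(hop * sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_len)        # (13, T)
    f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr,
                     frame_length=1024, hop_length=hop_len)             # pitch in Hz
    rms = librosa.feature.rms(y=y, frame_length=n_fft,
                              hop_length=hop_len)[0]                    # intensity proxy
    n = min(mfcc.shape[1], len(f0), len(rms))                           # align track lengths
    return mfcc[:, :n], f0[:n], rms[:n]
```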
Keyword Spotting (HMM Based): Training
- A speaker-dependent speech recognition system is trained on speech with keyword labels and, at test time, produces keyword labels for unknown speech.
- The Hidden Markov Toolkit (HTK) was used as the base technology for developing the keyword spotter.
- 20 minutes of speech were labelled manually and used for training.
- Each keyword was pronounced in the training speech at least 30 times.
- Recognition grammar: left | right | straight | silence | garbage.
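The project's spotter was built with HTK and the grammar above; purely to illustrate the underlying idea, the sketch below scores pre-segmented MFCC chunks against one Gaussian HMM per grammar word using the hmmlearn library. This isolated-segment scoring is a simplification of grammar-based continuous decoding, and all names and parameters are assumptions.

```python
import numpy as np
from hmmlearn import hmm

WORDS = ["left", "right", "straight", "silence", "garbage"]

def train_models(examples, n_states=5):
    """examples: dict mapping each word to a list of (T, 13) MFCC arrays."""
    models = {}
    for word in WORDS:
        X = np.vstack(examples[word])
        lengths = [len(x) for x in examples[word]]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                      # Baum-Welch re-estimation
        models[word] = m
    return models

def spot(models, segment):
    """Label one MFCC segment with the word model of highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(segment))
```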
Keyword Spotting (HMM Based): Testing
- 5.5 minutes of speech were used for testing.
- The test fragment contains approximately 600 words, of which 35 are keywords.
- First experiments: the keyword spotter was able to find almost all keywords in the test speech, but it produced many false alarms.
[Plot: keyword spotting rate and false alarm rate (%) versus the number of Gaussian mixtures per HMM state (2 to 20)]
Keyword Spotting (Dynamic Time Warping)
[Diagram: a keyword template slides along the speech signal and a discrete deviation score is computed at each position]
• MFCC parameters are used for parameterization.
• Dynamic time warping is used to find an optimal match between two given sequences (e.g., time series); see the sketch after the results below.
• Results: 33 recognized keywords, 2 missed words, 22 false alarms.
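A minimal sketch of the sliding-template idea above: the MFCC template of a keyword is compared against overlapping windows of the test speech with dynamic time warping (here via librosa.sequence.dtw); the cost threshold and window overlap are illustrative assumptions rather than the project's settings.

```python
import librosa

def dtw_spot(template, speech, threshold=250.0):
    """Slide an MFCC keyword template (d, T) over test-speech MFCCs (d, N)
    and return window start frames whose DTW alignment cost is below threshold."""
    T, N = template.shape[1], speech.shape[1]
    hits = []
    for start in range(0, N - T, max(T // 2, 1)):          # 50% window overlap
        window = speech[:, start:start + T]
        D, _ = librosa.sequence.dtw(X=template, Y=window, metric="euclidean")
        if D[-1, -1] < threshold:                           # accumulated deviation
            hits.append(start)
    return hits
```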
Accent Detection (Sliding Window Based)
• Parameters are calculated over a sliding window:
– Pitch contour
– Number of local minima and maxima in the pitch contour
– Intensity
• Windows with high intensity values are selected.
• Median filtering is used to remove short windows.
• The candidate accent windows are grouped using connected component analysis.
• Candidate accent regions that contain too few or too many local minima and maxima are eliminated.
• The remaining candidate regions are selected as accents (see the sketch below).
• The proposed method detects 68% of accents with a 25% false alarm rate.
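A minimal sketch of the sliding-window accent detector described above, operating on frame-level pitch and intensity tracks; all thresholds and window sizes are assumed values chosen for illustration.

```python
import numpy as np
from scipy.signal import medfilt, argrelextrema
from scipy.ndimage import label

def detect_accents(pitch, intensity, intensity_thr=0.6, min_ext=1, max_ext=6):
    """pitch and intensity are frame-level tracks of equal length."""
    # 1. Select frames whose normalized intensity is high.
    loud = (intensity / intensity.max()) > intensity_thr
    # 2. Median filtering removes short, spurious runs of selected frames.
    loud = medfilt(loud.astype(int), kernel_size=5).astype(bool)
    # 3. Connected-component analysis groups frames into candidate regions.
    regions, n = label(loud)
    accents = []
    for r in range(1, n + 1):
        idx = np.where(regions == r)[0]
        seg = pitch[idx]
        # 4. Count local pitch extrema; drop regions with too few or too many.
        n_ext = (len(argrelextrema(seg, np.greater)[0]) +
                 len(argrelextrema(seg, np.less)[0]))
        if min_ext <= n_ext <= max_ext:
            accents.append((int(idx[0]), int(idx[-1])))     # accent start/end frames
    return accents
```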
Synthesis of Gestures Accompanying Speech
• Based on the methodology used in the correlation analysis, given a speech signal:
– Features will be extracted.
– The most probable speech label will be assigned to the speech patterns.
– The gesture pattern most correlated with the speech pattern will be used to animate a stick model of a person (an HMM-sampling sketch follows the figure below).
Hand Gesture Models
[Figure: original hand trajectories vs. trajectories generated from the HMM]
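In the spirit of the figure above, a minimal sketch of sampling a new hand trajectory from a gesture HMM with the hmmlearn library; the model structure, state count, and two-dimensional position features are assumptions for illustration, not the project's actual models.

```python
import numpy as np
from hmmlearn import hmm

def train_gesture_model(trajectories, n_states=8):
    """trajectories: list of (T, 2) arrays of hand x/y positions for one gesture class."""
    X = np.vstack(trajectories)
    lengths = [len(t) for t in trajectories]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="full", n_iter=30)
    model.fit(X, lengths)
    return model

def synthesize(model, n_frames=50):
    """Sample a new trajectory from the trained model to drive the stick figure."""
    traj, _ = model.sample(n_frames)          # (n_frames, 2) generated positions
    return traj
```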
Resources
• Database Preparation and Labeling:
– VirtualDub
– Anvil
– Praat
• Image Processing and Feature Extraction:
– Matlab Image Processing Toolbox
– OpenCV Image Processing Library
• Gesture-Speech Correlation Analysis:
– HTK HMM Toolbox
– Torch Machine Learning Library
Concluding Remarks and Future Work
• Database will be extended with new subjects.
• Algorithms and methods will be tested using new databases.
• HMM-based accent detector will be implemented.
• Keyword and event sets will be extended.
• Database scenarios will be extended.
Demonstration I
Demonstration II