Modeling Human Communication Dynamics

Louis-Philippe Morency
USC Multimodal Communication and Machine Learning Lab [MultiComp Lab]
PhD students: Derya Ozkan and Sunghyun Park
Master's student: Moitreya Chatterjee
Post-doctoral researcher: Stefan Scherer
Research programmer: Giota Stratou
Project manager: Alesia Egan
Multimodal Perception of User State: Applications
Medical – Disorders: Depression, PTSD, Distress
 Distress Indicators (with MIT and CogitoHealth)
 Suicide prevention (with Cincinnati Hospital)
Education – Social: Engagement, Dominance, Empathy
 Group learning analytics (with Stanford and UCSD)
 Virtual Learning Peer (with CMU)
Business – Affect: Frustration, Agreement, Sentiment
 Negotiation outcomes (with USC business school)
 YouTube: Opinion mining (with UNT and UT Dallas)
Multimodal Perception of Distress Indicators
Multimodal Communicative Behaviors
Verbal
 Lexicon: words
 Syntax: part-of-speech, dependencies
 Pragmatics: discourse acts
Visual
 Gestures: head, eye and arm gestures
 Body language: body posture, proxemics
 Eye contact: head gaze, eye gaze
 Facial expressions: FACS action units, smile, frowning
Auditory
 Prosody: intonation, voice quality
 Vocal expressions: laughter, moans
From Audio-Visual Signals to Perceived User State
The verbal, visual and auditory behaviors above are mapped to perceived user states:
 Disorders: Depression, PTSD, Distress
 Social: Engagement, Dominance, Empathy
 Affect: Frustration, Agreement, Sentiment
From Audio-Visual Signals to Perceived User State
Audio signals (e.g., voice pitch) and visual signals (e.g., head pose) are mapped to perceived user states such as distress, engagement and sentiment.
From Audio-Visual Signals to Perceived User State
Audio signals (e.g., voice pitch)
• Pitch, energy and speaking rate
• Automatic voice quality analysis (see the sketch below)
Visual signals (e.g., head pose)
• 3D head pose estimation [FG 2008, best paper award]
• Real-time facial feature tracker [CVPR 2012]
• Low-cost depth sensor and articulated body tracking
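As a rough illustration of the audio descriptors listed above, here is a minimal sketch of per-frame energy and pitch extraction using short-time RMS and an autocorrelation pitch estimate (plain NumPy on a synthetic signal; this is not the MultiSense audio pipeline, and the frame size, sample rate and F0 range are illustrative assumptions):

import numpy as np

def frame_signal(signal, frame_len=1024, hop=512):
    # Cut the signal into overlapping frames.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])

def energy_and_pitch(signal, sr=16000, fmin=75.0, fmax=400.0):
    frames = frame_signal(signal)
    energy = np.sqrt((frames ** 2).mean(axis=1))             # RMS energy per frame
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)        # search lags within the F0 range
    pitches = []
    for frame in frames:
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))  # strongest periodicity
        pitches.append(sr / lag)
    return energy, np.array(pitches)

# Example on a synthetic 150 Hz tone.
sr = 16000
t = np.arange(sr) / sr
energy, pitch = energy_and_pitch(np.sin(2 * np.pi * 150 * t), sr=sr)
print(pitch[:5])   # values close to 150 Hz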
Human Communication Dynamics
Computational Behavior Indicators
 Human in the loop: identify behaviors useful to the human task
 Behavior quantification: quantify changes in human behaviors
Visualization of Computational Behavior Indicators
 Transparent comparisons of observed behavioral indicators
 Analogous to medical lab result sheets, with Low / Normal / High ranges (see the sketch below)
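A minimal sketch of the lab-sheet-style comparison described above, assuming a hypothetical indicator (average smile intensity) and a made-up normative range; the system's actual indicators and reference values are not specified here:

# Flag an observed behavioral indicator as Low / Normal / High against a reference range.
def flag_indicator(value, normal_low, normal_high):
    if value < normal_low:
        return "Low"
    if value > normal_high:
        return "High"
    return "Normal"

# e.g., average smile intensity over a session vs. a hypothetical normative range
observed_smile_intensity = 0.12
print(flag_indicator(observed_smile_intensity, normal_low=0.25, normal_high=0.75))  # -> "Low"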
Detection and Computational Analysis of Psychological Signals (DCAPS)
 Albert (Skip) Rizzo: Clinical Expert
 Jonathan Gratch: Evaluation
 Arno Hartholt: Integration
 Louis-Philippe Morency: Multimodal Perception
 Stacy Marsella: Animation
 Visualization & Audio Analysis
 David Traum: Dialogue
 + 27 researchers, programmers, artists and clinicians
Psychological Distress: Datasets
Aims:
 Study behaviors associated with psychological distress
 Depression (PHQ-9)
 Trait Anxiety (STAI)
 PTSD (PCL-C/PCL-M)
 Examine how these cues may differ
 Across different interaction settings
 Face-to-face: expected to evoke strongest indicators
 Computer-mediated (TeleCoach): intended use case
 Human-computer (SimSensei): intended use case
 Across different populations
 General Los Angeles population (Craigslist)
 Recent veterans (US Vets)
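As a concrete example of one of these self-report measures, here is a minimal sketch of standard PHQ-9 scoring (nine items rated 0-3, summed to a 0-27 total with the usual severity bands); this is textbook scoring for illustration, not the study's clinical protocol:

# PHQ-9: nine items, each rated 0-3, total score 0-27.
# Standard severity bands: 0-4 minimal, 5-9 mild, 10-14 moderate,
# 15-19 moderately severe, 20-27 severe.
def phq9_severity(item_scores):
    assert len(item_scores) == 9 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    for cutoff, label in [(4, "minimal"), (9, "mild"), (14, "moderate"), (19, "moderately severe")]:
        if total <= cutoff:
            return total, label
    return total, "severe"

print(phq9_severity([1, 2, 1, 0, 2, 1, 1, 0, 1]))  # -> (9, 'mild')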
Multimodal Perception of Distress Indicators
Psychological Distress Indicators
[IEEE FG 2013 – Best paper award]
[Bar charts comparing distress vs. no-distress groups on: joy and sad facial expressions, smile intensity, vertical eye gaze, hand self-adaptors, legs fidgeting, voice energy std., and voice quality (NAQ).]
Conditions analyzed: distress, anxiety, depression, PTSD.
Effect of Gender on Distress Indicators
[ACII 2013]
Conditions analyzed: distress, anxiety, depression, PTSD.
Suicide Prevention [ICASSP 2013]
 Nonverbal indicators of suicidal ideation
 Dataset: 30 suicidal adolescents / 30 non-suicidal adolescents
 Suicidal teenagers use breathier voice tones
[Bar charts comparing non-suicidal vs. suicidal adolescents on Open Quotient (OQ), OQ standard deviation (OQ std.), and Normalized Amplitude Quotient standard deviation (NAQ std.); ** marks statistically significant differences. Source: CDC]
MultiSense: Multimodal Perception Library
Architecture (tools/modules, each exported as a .dll):
 Providers: Audio Capture, Webcam Capture, Mouse, Kinect (depth / intensity / IR image / skeleton)
 Transformers: face trackers (GAVAM, CLM/CLM-Z, Okao, Shore), real-time hCRF, EmoVoice
 Consumers: Image Painter, Signal Painter
 Messaging: ActiveMQ / VHMessenger
 Output: PML (Perception Markup Language)
Sensing layer (PML example):
<person id="subjectA">
  <sensingLayer>
    <headPose>
      <position z="223" y="345" x="193" />
      <rotation rotZ="15" rotY="35" rotX="10" />
      <confidence>0.34</confidence>
    </headPose>
    ...
  </sensingLayer>
</person>
Behavior layer (PML example):
<person id="subjectB">
  <behaviorLayer>
    <behavior>
      <type>attention</type>
      <level>high</level>
      <value>0.6</value>
      <confidence>0.46</confidence>
    </behavior>
    ...
  </behaviorLayer>
</person>
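A minimal sketch of how a downstream consumer might read a PML sensing-layer message, assuming the element and attribute names shown above; the parsing code itself is illustrative and not part of MultiSense:

import xml.etree.ElementTree as ET

# Example PML sensing-layer message, following the fragment above.
PML_SENSING = """
<person id="subjectA">
  <sensingLayer>
    <headPose>
      <position z="223" y="345" x="193" />
      <rotation rotZ="15" rotY="35" rotX="10" />
      <confidence>0.34</confidence>
    </headPose>
  </sensingLayer>
</person>
"""

def parse_head_pose(pml_string):
    """Extract head position, rotation and confidence from a PML sensing-layer message."""
    person = ET.fromstring(pml_string)
    head = person.find("./sensingLayer/headPose")
    position = {axis: float(head.find("position").get(axis)) for axis in ("x", "y", "z")}
    rotation = {axis: float(head.find("rotation").get(axis)) for axis in ("rotX", "rotY", "rotZ")}
    confidence = float(head.findtext("confidence"))
    return position, rotation, confidence

if __name__ == "__main__":
    print(parse_head_pose(PML_SENSING))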
Temporal Probabilistic Learning
• Audio • Visual • Verbal
Generative models: Naïve Bayes Classifier and, with temporal dynamics, the Hidden Markov Model
Discriminative models: Maximum Entropy Model & Support Vector Machine and, with temporal dynamics, the Conditional Random Field
I. Temporal dynamics: HMM and CRF link the label sequence y1…yn over time instead of classifying each frame independently.
Caption: yi = labels (e.g., head-nod, other-gesture); xi = observations (e.g., yaw, roll, pitch)
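To make the temporal-dynamics idea concrete, here is a minimal Viterbi-decoding sketch for a two-state HMM (head-nod vs. other-gesture) over a single head-rotation feature; all parameters are made up for illustration rather than learned from the datasets discussed here:

import numpy as np

# Hypothetical 2-state HMM: state 0 = "other-gesture", state 1 = "head-nod".
log_pi = np.log([0.7, 0.3])                      # initial state probabilities
log_A = np.log([[0.9, 0.1],                      # state transition matrix
                [0.2, 0.8]])
means, stds = np.array([0.0, 8.0]), np.array([2.0, 3.0])  # Gaussian emissions over pitch rotation speed

def log_gaussian(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def viterbi(observations):
    """Most likely label sequence given per-frame head-rotation observations."""
    T, S = len(observations), len(log_pi)
    delta = np.zeros((T, S))                     # best log-probability ending in each state
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_gaussian(observations[0], means, stds)
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # previous state -> current state
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores.max(axis=0) + log_gaussian(observations[t], means, stds)
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Example: head pitch rotation speed per frame (synthetic).
obs = np.array([0.5, 0.2, 7.5, 9.0, 8.2, 0.3, 0.1])
print(viterbi(obs))   # e.g., [0, 0, 1, 1, 1, 0, 0]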
Temporal Probabilistic Learning
Hidden Conditional Random Field [CVPR 2006, PAMI 2007] and Latent-Dynamic Conditional Random Field [CVPR 2007]
• Audio • Visual • Verbal
I. Temporal dynamics; II. Hidden substructure: a layer of hidden states h1…hn mediates between the observations x1…xn and the label y (HCRF) or label sequence y1…yn (LDCRF).

Model         Accuracy
HMM           84.2%
CRF           86.0%
HCRF (w=0)    91.6%
HCRF (w=1)    93.9%

[ROC curves (true positive rate vs. false positive rate) comparing LDCRF, HMM, SVM and CRF.]
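A minimal sketch of the hidden-substructure idea behind the HCRF: the score of a sequence label y sums (via log-sum-exp) over all hidden-state sequences. The parameters are random placeholders and this is not the hCRF library's API; a real model would learn them by maximizing log p(y|x):

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N_LABELS, N_HIDDEN, N_FEATS = 2, 4, 3
W_obs = rng.normal(size=(N_HIDDEN, N_FEATS))      # hidden state <- observation features
W_trans = rng.normal(size=(N_HIDDEN, N_HIDDEN))   # hidden-state transitions
W_label = rng.normal(size=(N_LABELS, N_HIDDEN))   # compatibility of hidden states with label y

def sequence_log_scores(x):
    """Unnormalized log-score of each label y for an observation sequence x (T x N_FEATS)."""
    scores = np.zeros(N_LABELS)
    for y in range(N_LABELS):
        # Forward recursion over hidden states, conditioned on the label y.
        alpha = W_obs @ x[0] + W_label[y]
        for t in range(1, len(x)):
            alpha = logsumexp(alpha[:, None] + W_trans, axis=0) + W_obs @ x[t] + W_label[y]
        scores[y] = logsumexp(alpha)              # sum over all hidden-state sequences
    return scores

def predict(x):
    scores = sequence_log_scores(x)
    log_probs = scores - logsumexp(scores)        # p(y | x) over the labels
    return int(np.argmax(log_probs)), np.exp(log_probs)

x = rng.normal(size=(6, N_FEATS))                 # 6 frames of (e.g.) yaw, roll, pitch
print(predict(x))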
Latent-Dynamic Conditional Neural Field [FG 2013, submitted]
• Audio • Visual • Verbal
I. Temporal dynamics; II. Hidden substructure; III. Nonlinear input fusion: a layer of gate units g1…gn nonlinearly transforms the input features x1…xn before the latent-dynamic layer h1…hn and labels y1…yn.
[Bar charts: accuracy (%) of LDCRF vs. LDCNF on the Rapport and Taskar datasets.]
[Audio-visual sub-challenge, AVEC dataset (95 videos): accuracy (%) for audio only, visual only, early fusion, late fusion and LDCNF.]
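To illustrate the early- vs. late-fusion baselines in the comparison above, here is a minimal sketch on synthetic audio and visual features (scikit-learn logistic regression as a stand-in classifier; these are not the AVEC features or the LDCNF model):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
audio = rng.normal(size=(n, 10)) + labels[:, None] * 0.5    # e.g., prosodic statistics (synthetic)
visual = rng.normal(size=(n, 15)) + labels[:, None] * 0.3   # e.g., facial action unit statistics (synthetic)
train, test = slice(0, 150), slice(150, n)

# Early fusion: concatenate the modality features, then train one classifier.
early = LogisticRegression(max_iter=1000).fit(
    np.hstack([audio[train], visual[train]]), labels[train])
early_pred = early.predict(np.hstack([audio[test], visual[test]]))

# Late fusion: train one classifier per modality, then average their probabilities.
clf_a = LogisticRegression(max_iter=1000).fit(audio[train], labels[train])
clf_v = LogisticRegression(max_iter=1000).fit(visual[train], labels[train])
late_scores = (clf_a.predict_proba(audio[test]) + clf_v.predict_proba(visual[test])) / 2
late_pred = late_scores.argmax(axis=1)

print("early fusion accuracy:", (early_pred == labels[test]).mean())
print("late fusion accuracy:", (late_pred == labels[test]).mean())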
Multi-View Hidden Conditional Random Field [CVPR 2012]
• Audio • Visual • Verbal
I. Temporal dynamics; II. Hidden substructure; III. Nonlinear input fusion; IV. Multi-stream models: each view (e.g., audio and visual) keeps its own chain of hidden states h1…hn over its own observations x1…xn, coupled through the label y (Multi-View HCRF) or label sequence y1…yn (Multi-View LDCRF).
[Bar chart: accuracy on the Canal 9 debate dataset for HMM, CRF, HCRF and MV-HCRF.]
Multi-View Hidden Conditional Random Field [ICMI 2012]
• Audio • Visual • Verbal
I. Temporal dynamics; II. Hidden substructure; III. Nonlinear input fusion; IV. Multi-stream models; V. Multimodal synchrony: the coupling between the audio and visual streams is strengthened with kernel canonical correlation analysis (MV-HCRF + KCCA).
[Bar chart: accuracy on the Canal 9 debate dataset for HMM, CRF, HCRF, MV-HCRF and MV-HCRF + KCCA.]
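A minimal sketch of the synchrony idea using linear CCA on synthetic audio and visual streams; the model above uses the kernelized variant (KCCA), but the principle is the same: find projections of the two streams that are maximally correlated over time:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
T = 300
shared = rng.normal(size=(T, 2))                  # latent source driving both streams
audio = np.hstack([shared, rng.normal(size=(T, 6))]) @ rng.normal(size=(8, 8))
visual = np.hstack([shared, rng.normal(size=(T, 10))]) @ rng.normal(size=(12, 12))

cca = CCA(n_components=2)
audio_proj, visual_proj = cca.fit_transform(audio, visual)

# Canonical correlations: correlation between the paired projections of each component.
corrs = [np.corrcoef(audio_proj[:, k], visual_proj[:, k])[0, 1] for k in range(2)]
print("canonical correlations:", np.round(corrs, 3))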
Multimodal Machine Learning
• Audio • Visual • Verbal
I. Temporal dynamics in the label sequence
II. Hidden substructure in the label sequence
III. Nonlinear modeling of instantaneous input features
IV. Multi-stream modeling of hidden substructure
V. Synchrony in multimodal input streams
VI. Multi-label structure and correlation
VII. Symbol and signal integration
VIII. Uncertainty in behavior labels
HCRF Library: ICT open-source machine learning library
http://hcrf.sf.net
SimSensei Virtual Human Interaction Loop
 Perception (MultiSense): senses the user's social, affective and physical cues in the physical world
 Understanding
 Cognition: decision-making and dialogue manager (Flores)
 Action: dialog and emotion
 Generation (Cerebella + SmartBody): renders the virtual human's social, affective and physical cues in the virtual world
MultiSense + SimSensei: Video Demonstration
Collaborations and Available Technologies
 MultiSense: standardized perception framework
 Real-time facial tracking, articulated body tracking and auditory analysis
 Modular and multi-threaded architecture for easy extension
 http://multicomp.ict.usc.edu/
 hCRF library: machine learning library
 Matlab and C++ implementations of LDCRF, HCRF and CRF
 2559 downloads during the last year
 http://hcrf.sf.net/
 GAVAM + CLM-Z: real-time nonverbal behavior recognition
 Real-time head position and orientation estimation
 66-point facial feature tracking with automatic initialization
 http://multicomp.ict.usc.edu/
Thank you!