Modeling Human Communication Dynamics
Louis-Philippe Morency
USC Multimodal Communication and Machine Learning Lab (MultiComp Lab)
PhD students: Derya Ozkan and Sunghyun Park. Master's student: Moitreya Chatterjee. Post-doctoral researcher: Stefan Scherer. Research programmer: Giota Stratou. Project manager: Alesia Egan.

Multimodal Perception of User State: Applications
• Medical: distress indicators (with MIT and Cogito Health); suicide prevention (with Cincinnati Hospital)
• Education: group learning analytics (with Stanford and UCSD); virtual learning peer (with CMU)
• Business: negotiation outcomes (with the USC business school); YouTube opinion mining (with UNT and UT Dallas)
Targeted user states: disorders (depression, PTSD, distress), social constructs (engagement, dominance, empathy) and affect (frustration, agreement, sentiment).

Multimodal Perception of Distress Indicators

Multimodal Communicative Behaviors
• Verbal: lexicon (words); syntax (part-of-speech, dependencies); pragmatics (discourse acts)
• Visual: gestures (head, eye and arm gestures); body language (body posture, proxemics); eye contact (head gaze, eye gaze); facial expressions (FACS action units, smiling, frowning)
• Auditory: prosody (intonation, voice quality); vocal expressions (laughter, moans)

From Audio-Visual Signals to Perceived User State
Audio signals (e.g., voice pitch), visual signals (e.g., head pose) and verbal content are mapped to perceived user states such as distress, engagement and sentiment. Sensing technologies include:
• Low-cost depth sensor and articulated body tracking [FG 2008, best paper award] [CVPR 2012]
• 3D head pose estimation and real-time facial feature tracking
• Pitch, energy and speaking rate
• Automatic voice quality analysis

Human Communication Dynamics: Computational Behavior Indicators
• Human in the loop: identify behaviors useful to the human task
• Behavior quantification: quantify changes in human behaviors
• Visualization of computational behavior indicators: transparent comparisons of observed behavioral indicators, analogous to medical lab result sheets (Low / Normal / High); a short illustrative sketch appears below, after the datasets overview

Detection and Computational Analysis of Psychological Signals (DCAPS)
Albert (Skip) Rizzo: clinical expert. Jonathan Gratch: evaluation. Arno Hartholt: integration. Louis-Philippe Morency: multimodal perception. Stacy Marsella: animation, visualization and audio analysis. David Traum: dialogue. Plus 27 researchers, programmers, artists and clinicians.

Psychological Distress: Datasets
Aims: study behaviors associated with psychological distress, measured as depression (PHQ-9), trait anxiety (STAI) and PTSD (PCL-C/PCL-M), and examine how these cues may differ:
• Across interaction settings: face-to-face (expected to evoke the strongest indicators), computer-mediated (TeleCoach, intended use case) and human-computer (SimSensei, intended use case)
• Across populations: general Los Angeles population (Craigslist) and recent veterans (US Vets)
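The "lab result sheet" banding mentioned above (Low / Normal / High) can be made concrete with a small sketch: an observed behavioral indicator is compared against a reference population and placed into a band. The smile-intensity feature, the reference values and the one-standard-deviation cutoff are illustrative assumptions, not part of the actual DCAPS visualization tool.

```python
import numpy as np

def band_indicator(observed, reference, z_cut=1.0):
    """Place one observed behavior value into a Low/Normal/High band
    relative to a reference population.

    observed  : scalar measurement for the current participant
    reference : 1-D array of the same measurement over the reference group
    z_cut     : z-score magnitude separating Normal from Low/High
                (1.0 is an arbitrary illustrative choice)
    """
    mu, sigma = np.mean(reference), np.std(reference)
    z = (observed - mu) / sigma
    if z < -z_cut:
        return "Low", z
    if z > z_cut:
        return "High", z
    return "Normal", z

# Hypothetical example: average smile intensity over an interview,
# compared against a small reference sample.
reference_smile = np.array([0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58])
band, z = band_indicator(0.21, reference_smile)
print(f"smile intensity: {band} (z = {z:+.2f})")
```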
Multimodal Perception of Distress Indicators

Psychological Distress Indicators [IEEE FG 2013, best paper award]
Comparisons of distress vs. no-distress groups, for overall distress, anxiety, depression and PTSD, across behavioral indicators including joy and sadness facial expressions, hand self-adaptors, leg fidgeting, vertical eye gaze, voice energy standard deviation, smile intensity and voice quality (NAQ).

Effect of Gender on Distress Indicators [ACII 2013]
Gender-specific analyses of the same indicators for distress, anxiety, depression and PTSD.

Suicide Prevention [ICASSP 2013]
Nonverbal indicators of suicidal ideation. Dataset: 30 suicidal and 30 non-suicidal adolescents. Suicidal teenagers use more breathy voice qualities: significant group differences in open quotient (OQ), open quotient standard deviation (OQ std.) and normalized amplitude quotient standard deviation (NAQ std.).

MultiSense: Multimodal Perception Library
• Providers: audio capture, webcam capture, mouse, Kinect (depth, intensity, IR image, skeleton)
• Transformers: face trackers (GAVAM, CLM / CLM-Z, Okao, Shore), real-time hCRF, ActiveMQ / VHMessenger, EmoVoice
• Consumers: image painter, signal painter
• Each module is exported as a .dll

PML: Perception Markup Language
Sensing layer:
  <person id="subjectA">
    <sensingLayer>
      <headPose>
        <position z="223" y="345" x="193" />
        <rotation rotZ="15" rotY="35" rotX="10" />
        <confidence>0.34</confidence>
      </headPose>
      ...
    </sensingLayer>
  </person>
Behavior layer:
  <person id="subjectB">
    <behaviorLayer>
      <behavior>
        <type>attention</type>
        <level>high</level>
        <value>0.6</value>
        <confidence>0.46</confidence>
      </behavior>
      ...
    </behaviorLayer>
  </person>

Temporal Probabilistic Learning
I. Temporal dynamics. Graphical-model comparison of the naïve Bayes classifier and the maximum entropy model / support vector machine (no temporal links) against the hidden Markov model (generative) and the conditional random field (discriminative), which chain the per-frame predictions. Labels y_i are gesture classes (e.g., head-nod vs. other-gesture); observations x_i are per-frame signals (e.g., yaw, roll, pitch).

II. Hidden substructure. The hidden conditional random field (HCRF) [CVPR 2006, PAMI 2007] and the latent-dynamic conditional random field (LDCRF) [CVPR 2007] add latent states h_i between observations and labels. Recognition accuracy:

  Model        Accuracy
  HMM          84.2%
  CRF          86.0%
  HCRF (w=0)   91.6%
  HCRF (w=1)   93.9%

ROC curves (true positive rate vs. false positive rate) compare the LDCRF against HMM, SVM and CRF baselines.
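The chain-structured models above (HMM, CRF, HCRF, LDCRF) share the same forward recursion over the label or latent-state sequence. Below is a minimal NumPy sketch of that recursion for a plain linear-chain CRF, using per-frame gesture labels and observations as in the caption above. It is an illustration only, not the hCRF library API (Matlab and C++ implementations are available at http://hcrf.sf.net), and all dimensions and weights are made up.

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_likelihood(X, y, W, A):
    """Log-likelihood of a label sequence y under a linear-chain CRF.

    X : (T, D) per-frame observation features (e.g., head yaw/roll/pitch)
    y : (T,)   integer labels (e.g., 0 = head-nod, 1 = other-gesture)
    W : (D, K) emission weights, A : (K, K) transition weights
    """
    unary = X @ W                                  # (T, K) per-frame label scores

    # Score of the given label sequence (emissions + transitions).
    score = unary[np.arange(len(y)), y].sum()
    score += A[y[:-1], y[1:]].sum()

    # Forward recursion in log space for the partition function Z.
    alpha = unary[0]
    for t in range(1, len(y)):
        alpha = logsumexp(alpha[:, None] + A, axis=0) + unary[t]
    log_Z = logsumexp(alpha)
    return score - log_Z

# Toy example with random features and weights (illustrative only).
rng = np.random.default_rng(0)
T, D, K = 12, 3, 2                                 # frames, feature dim, labels
X = rng.normal(size=(T, D))
y = rng.integers(0, K, size=T)
W = rng.normal(scale=0.1, size=(D, K))
A = rng.normal(scale=0.1, size=(K, K))
print("log p(y|x) =", crf_log_likelihood(X, y, W, A))
```

HCRF and LDCRF reuse this same recursion, but sum over latent states h_t in addition to, or in place of, the labels.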
Latent-Dynamic Conditional Neural Field (LDCNF) [FG 2013, submitted]
III. Nonlinear input fusion. The LDCNF inserts a layer of neural gates g_i between the observations x_i and the latent states h_i of the LDCRF. Accuracy comparisons: LDCRF vs. LDCNF on the Rapport and Taskar datasets, and audio only, visual only, early fusion, late fusion and LDCNF on the AVEC audio-visual sub-challenge dataset (95 videos).

Multi-View Hidden Conditional Random Field [CVPR 2012]
IV. Multi-stream models. The multi-view HCRF and multi-view LDCRF keep a separate chain of latent states per modality (view), with links across views. Accuracy on the Canal 9 debate dataset is compared for HMM, CRF, HCRF and MV-HCRF.

Multi-View Hidden Conditional Random Field [ICMI 2012]
V. Multimodal synchrony. Kernel canonical correlation analysis (KCCA) between the views adds a synchrony constraint; accuracy on the Canal 9 debate dataset is compared for HMM, CRF, HCRF, MV-HCRF and MV-HCRF + KCCA. (A minimal linear-CCA sketch of the synchrony idea appears at the end of this document.)

Multimodal Machine Learning (audio, visual and verbal)
I. Temporal dynamics in the label sequence
II. Hidden substructure in the label sequence
III. Nonlinear modeling of instantaneous input features
IV. Multi-stream modeling of hidden substructure
V. Synchrony in multimodal input streams
VI. Multi-label structure and correlation
VII. Symbol and signal integration
VIII. Uncertainty in behavior labels

HCRF Library: ICT open-source machine learning library, http://hcrf.sf.net

SimSensei Virtual Human Interaction Loop
Physical, social and affective cues from the physical world are perceived (MultiSense) and understood, then passed to cognition (decision-making and the Flores dialogue manager); the resulting dialogue actions and emotions drive generation (Cerebella + SmartBody) of physical and social cues in the virtual world.

MultiSense + SimSensei: Video Demonstration

Collaborations and Available Technologies
• MultiSense: standardized perception framework. Real-time facial tracking, articulated body tracking and auditory analysis; modular and multi-threaded architecture for easy extension. http://multicomp.ict.usc.edu/
• hCRF library: machine learning library. Matlab and C++ implementations of LDCRF, HCRF and CRF; 2,559 downloads during the last year. http://hcrf.sf.net/
• GAVAM + CLM-Z: real-time nonverbal behavior recognition. Real-time head position and orientation estimation; 66-point facial feature tracking with automatic initialization. http://multicomp.ict.usc.edu/

Thank you!
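As a rough illustration of the synchrony idea behind MV-HCRF + KCCA (item V above), the sketch below computes plain linear canonical correlations between time-aligned audio and visual feature streams; the kernel variant used in the ICMI 2012 work replaces the linear projections with kernel feature spaces. Feature names, dimensions and the toy data are assumptions for illustration only.

```python
import numpy as np

def canonical_correlations(A, V, reg=1e-6):
    """Canonical correlations between two synchronized feature streams.

    A : (T, Da) audio features per frame (e.g., pitch, energy)
    V : (T, Dv) visual features per frame (e.g., head pose angles)
    reg : small ridge term keeping the covariance matrices invertible
    """
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    T = A.shape[0]
    Caa = A.T @ A / T + reg * np.eye(A.shape[1])
    Cvv = V.T @ V / T + reg * np.eye(V.shape[1])
    Cav = A.T @ V / T

    # Whiten each view; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    def inv_sqrt(C):
        w, U = np.linalg.eigh(C)
        return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    M = inv_sqrt(Caa) @ Cav @ inv_sqrt(Cvv)
    return np.linalg.svd(M, compute_uv=False)

# Toy streams: the first visual dimension loosely follows the first audio one.
rng = np.random.default_rng(1)
T = 200
audio = rng.normal(size=(T, 2))          # e.g., pitch, energy
visual = rng.normal(size=(T, 3))         # e.g., yaw, roll, pitch
visual[:, 0] += 0.8 * audio[:, 0]        # injected audio-visual synchrony
print("canonical correlations:", canonical_correlations(audio, visual))
```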