Modelling and Analyzing
Multimodal Dyadic Interactions
Using Social Networks
Sergio Escalera, Petia Radeva, Jordi Vitrià,
Xavier Barò and Bogdan Raducanu
Audio – Visual cues extraction and fusion
Social Network extraction and analysis
Experimental Results
Conclusions and future work
1. Introduction
Social interactions play a very important role in
people’s daily lives.
Present trend: analysis of human behavior based on
electronic communications: SMS, e-mails, chat
New trend: analysis of human behavior based on
nonverbal communication: social signals
Quantification of social signals represents a powerful cue to
characterize human behavior: facial expression, hand and body
gestures, focus of attention, voice prosody, etc.
Social Network Analysis (SNA) has been developed as a
tool to model social interactions in terms of a graph-based structure:
- ‘Nodes’ represent the ‘actors’: persons, communities, institutions, etc.
- ‘Links’ represent a specific type of interdependency: friendship, familiarity,
business transactions, etc.
A common way to characterize the information ‘encoded’ in
a social network is to compute several centrality measures.
Our contribution:
- In this work, we propose an integrated framework for the
extraction and analysis of a social network from multimodal (A/V)
dyadic interactions*
- Its advantage is that it relies on entirely
non-intrusive technology
- First: we perform speech segmentation through an
audio/visual fusion scheme
- In the audio domain, speech is detected through clustering of audio features
- In the visual domain, speech is detected through differential-based feature
extraction from the segmented mouth region
- The fusion scheme is based on stacked sequential learning
*We used a set of videos belonging to the New York Times’ Blogging Heads opinion blog. The videos depict two
persons talking about different subjects in front of a webcam
- Second: To quantify the dyadic interaction, we used the
‘Influence Model’, whose states encode previously
integrated audio-visual data
- Third: The Social Network is extracted based on the
estimated influences* and its properties are
characterized based on several centrality measures.
Block-diagram representation of our integrated framework
* The use of the term ‘influence’ is inspired by the previous work of Choudhury:
T. Choudhury, 2003. “Sensing and Modelling Human Networks”, Ph.D. Thesis, MIT Media Lab
2. Audio – Visual cues extraction and fusion
• Audio cue
– Description
• 12 first MFCC coefficients
• Signal energy
• Temporal cepstral derivatives (Δ and Δ2 )
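The temporal derivatives can be illustrated with a minimal sketch. It assumes the common regression-based delta formula over a window of ±n frames with edge frames repeated; the slides do not state which formula or window size was used, so `delta`, `add_deltas` and `n=2` are illustrative choices only.

```python
def delta(frames, n=2):
    """Regression-based temporal derivative of a list of feature vectors.

    Assumed formula: d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    for k = 1..n, repeating the first/last frame at the sequence edges.
    """
    T = len(frames)
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                prev = frames[max(t - k, 0)][d]
                nxt = frames[min(t + k, T - 1)][d]
                acc += k * (nxt - prev)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(frames, n=2):
    """Append first- and second-order deltas to every frame."""
    d1 = delta(frames, n)
    d2 = delta(d1, n)  # second-order delta is the delta of the deltas
    return [f + a + b for f, a, b in zip(frames, d1, d2)]
```

With 13 static features per frame (12 MFCC plus energy), each extended vector holds 39 values.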
– Diarization process
• Segmentation
– Coarse segmentation according to the Generalized Likelihood
Ratio between consecutive windows
• Clustering
– Agglomerative hierarchical clustering with a BIC stopping criterion
• Segment boundaries are adjusted at the end
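The BIC stopping criterion can be sketched as follows. This is a simplified one-dimensional illustration that models each segment as a single Gaussian; the actual system works on multi-dimensional MFCC vectors, and the names `delta_bic`, `should_merge` and the penalty weight `lam` are assumptions made for the example.

```python
import math


def variance(xs, floor=1e-12):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    m = sum(xs) / len(xs)
    return max(sum((x - m) ** 2 for x in xs) / len(xs), floor)


def delta_bic(seg_a, seg_b, lam=1.0):
    """BIC difference between modelling two segments separately or jointly.

    Positive values favour two separate Gaussians (different speakers);
    negative values favour a single Gaussian (the segments can be merged).
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    gain = 0.5 * (n * math.log(variance(seg_a + seg_b))
                  - n_a * math.log(variance(seg_a))
                  - n_b * math.log(variance(seg_b)))
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = 0.5 * lam * 2 * math.log(n)
    return gain - penalty


def should_merge(seg_a, seg_b, lam=1.0):
    """Agglomerative clustering stops once no pair satisfies this test."""
    return delta_bic(seg_a, seg_b, lam) < 0
```

In this formulation clustering keeps merging the pair with the lowest value and stops when every remaining pair scores above zero.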
• Visual cue
– Description:
• Face segmentation based on Viola-Jones detector
• Mouth region segmentation
• Vector of HOG descriptors for the mouth region
– Classification:
• Non-Speech class modelling
• One-class Dynamic Time warping based on the
following dynamic programming equation
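As an illustration, the standard DTW recurrence D(i, j) = d(i, j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)) can be sketched as below. The Euclidean frame distance, the template set and the threshold in `is_non_speech` are assumptions for the example; the exact variant used in this work may differ.

```python
import math


def euclidean(u, v):
    """Distance between two frame descriptors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_cost(seq_a, seq_b, dist=euclidean):
    """Accumulated cost of the best warping path between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            # Standard recurrence: extend the cheapest of the three predecessors.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def is_non_speech(window, templates, threshold):
    """One-class decision: close enough to any non-speech template."""
    return min(dtw_cost(window, t) for t in templates) <= threshold
```

Only the non-speech class is modelled here, so any window whose cost exceeds the threshold is treated as speech.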
• Fusion scheme
– Stacked sequential learning (suitable for
problems characterized by long runs of
identical labels)
• Fusion of audio-visual modalities
• Determining temporal relations of both feature sets
for learning a two-stage classifier (based on AdaBoost)
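The two-stage idea can be sketched through its key step: extending each frame's features with the first-stage predictions of its temporal neighbours, so that the second-stage classifier sees context. The window half-width `w`, the padding value and the function name are assumptions; training of the two classifiers (AdaBoost in this work) is omitted.

```python
def extend_with_context(features, first_stage_preds, w=2, pad=0.0):
    """Build second-stage inputs for stacked sequential learning.

    Each frame t receives its own features plus the first-stage predictions
    for frames t-w .. t+w; positions outside the sequence are filled with pad.
    """
    T = len(features)
    extended = []
    for t in range(T):
        context = [first_stage_preds[k] if 0 <= k < T else pad
                   for k in range(t - w, t + w + 1)]
        extended.append(list(features[t]) + context)
    return extended
```

Because neighbouring predictions enter the second stage, long runs of identical labels reinforce each other and isolated errors tend to be smoothed out.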
3. Social Network extraction and analysis
- The Influence Model (IM) is a tool introduced for the
quantification of interacting processes using a coupled
Hidden Markov Model (HMM)
- In the case of social interaction, the states of the IM encode
automatically extracted audio-visual features, while its
parameters represent the ‘influences’
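A minimal sketch of how influences act in the model, assuming the usual formulation in which the next state of chain i is a convex combination of pairwise transitions: P(S_t^i | S_{t-1}^1..N) = sum_j alpha[i][j] * P(S_t^i | S_{t-1}^j). The data layout (`alpha`, `trans`) and all numeric values in the usage example are hypothetical.

```python
def next_state_distribution(i, prev_states, alpha, trans):
    """Distribution over the next state of chain i.

    alpha[i][j]      : influence of chain j on chain i (each row sums to 1)
    trans[i][j][a][b]: P(chain i moves to state b | chain j was in state a)
    prev_states[j]   : previous state index of chain j
    """
    n_states = len(trans[i][0][0])
    dist = [0.0] * n_states
    for j, s_prev in enumerate(prev_states):
        for s in range(n_states):
            dist[s] += alpha[i][j] * trans[i][j][s_prev][s]
    return dist


# Usage with two speakers and two states (0 = speaking, 1 = silent).
alpha = [[0.7, 0.3], [0.4, 0.6]]
self_t = [[0.9, 0.1], [0.2, 0.8]]
cross_t = [[0.5, 0.5], [0.5, 0.5]]
trans = [[self_t, cross_t], [cross_t, self_t]]
example = next_state_distribution(0, [0, 1], alpha, trans)
```

In this layout the off-diagonal entries of `alpha` are the quantities that would label the links of the social network.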
Influence Model Architecture
- The construction of the Social Network is based on the
‘influence’ values
- A directed link between two nodes A and B (designated
by A → B) implies that ‘A has influence over B’
- The SNA is based on several centrality measures:
- degree centrality (indegree and outdegree)
- Refers to the number of direct connections with other persons
- closeness centrality
- Refers to the ease with which two persons can communicate
- betweenness centrality
- Refers to the relevance of a person to act as a ‘bridge’ between two sub-groups
of the network
- eigenvector centrality
- Refers to the importance of a person in the network
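Two of these measures can be sketched on a small directed graph. This is an illustrative pure-Python version: closeness is computed here on the undirected version of the graph and normalised by the number of reachable nodes, which is one common convention and an assumption rather than the definition used in this work. Betweenness and eigenvector centrality are left out for brevity.

```python
from collections import deque


def degree_centrality(nodes, edges):
    """Count incoming and outgoing links; an edge (a, b) means a -> b."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for a, b in edges:
        outdeg[a] += 1
        indeg[b] += 1
    return indeg, outdeg


def closeness_centrality(nodes, edges):
    """Reachable nodes divided by the sum of shortest-path distances."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)  # link direction is ignored for distances
    result = {}
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result
```

For a chain A → B → C, node B has indegree 1, outdegree 1 and the highest closeness, matching its position in the middle of the network.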
4. Experimental results
- We collected a subset of videos from the New York Times’
Blogging Heads opinion blog
- We used 17 videos from 15 persons
- Videos depict two persons having a conversation in front of
their webcam on different topics (politics, economy,…)
- The conversations are informal, and
frequent interruptions can occur
Snapshot from a video
- Audio features
- The audio stream has been analyzed using sliding windows of 25 ms with
an overlapping factor of 50%.
- Each window is characterized by 13 features (12 MFCC + E),
complemented with Δ and Δ2
- The shortest length of a valid audio segment was set to 2.5 ms
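The windowing can be made concrete with a short sketch: a 25 ms window with 50% overlap advances by 12.5 ms per step. The function name and the choice to drop a trailing partial window are assumptions for the example.

```python
def frame_starts_ms(duration_ms, window_ms=25.0, overlap=0.5):
    """Start times (ms) of all full analysis windows in a signal."""
    hop = window_ms * (1.0 - overlap)  # 12.5 ms for 25 ms and 50% overlap
    starts = []
    t = 0.0
    while t + window_ms <= duration_ms:
        starts.append(t)
        t += hop
    return starts
```

Under these assumptions a 100 ms excerpt yields 7 full windows, starting at 0, 12.5, ..., 75 ms.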
- Video features
- 32 oriented features (corresponding to the mouth region) have been
extracted using the HOG descriptor
- the length of the DTW sequences has been set to 18 frames (which
corresponds to 1.5 s)
- Fusion process
- stacked sequential learning was used to fuse the audio-visual features
- AdaBoost was chosen as the classifier
Visual and audio-visual speaker segmentation accuracy
The extracted social network showing the participants’
labels and influence directions
Centrality measures table
5. Conclusions and future work
- We presented an integrated framework for automatic
extraction and analysis of a social network from implicit input (multimodal dyadic interactions), based on the
integration of audio/visual features.
- In the future, we plan to extend the current work to
study the problem of social interactions at a larger scale and in
different scenarios
- Starting from the premise that people's lives are more
structured than it might seem a priori, we plan to study long-term interactions between persons, with the aim to discover
underlying behavioral patterns present in our day-to-day