Automatic Classification of Married Couples’ Behavior using Audio Features
Matthew Black, Athanasios Katsamanis, Chi-Chun Lee,
Adam C. Lammert, Brian R. Baucom, Andrew Christensen,
Panayiotis G. Georgiou, and Shrikanth Narayanan
September 29, 2010
Overview of Study
[Diagram: married couples discussing a problem in their relationship; the available data (e.g., audio, video, text) is hand-coded by trained human evaluators and, in parallel, processed with signal processing (e.g., feature extraction) and computational modeling (e.g., machine learning); both paths yield judgments (e.g., how much is one spouse blaming the other spouse?), with feedback between them]
Motivation
• Psychology research depends on perceptual judgments
• Interaction recorded for offline hand coding
• Rely on a variety of established coding standards [Margolin et al. 1998]
• Manual coding process expensive and time consuming
• Creating coding manual
• Training coders
• Coder reliability
• Technology can help code audio-visual data
• Certain measurements are difficult for humans to make
• Computers can extract these low-level descriptors (LLDs) [Schuller et al. 2007]
• Consistent way to quantify human behavior from objective signals
Corpus
• Real couples in 10-minute problem-solving dyadic interactions
• Longitudinal study at UCLA and U. of Washington [Christensen et al. 2004]
• 134 distressed couples received couples therapy for 1 year
• 574 sessions (96 hours)
• Split-screen video (704x480 pixels, 30 fps)
• Single channel of far-field audio
• Data originally only intended for manual coding
• Recording conditions not ideal
• Video angle, microphone placement, and background noise varied
• Access to word transcriptions with speaker explicitly labeled
• No indications of timing or speech overlap regions
Sample Transcript
Husband: what did I tell you about you can spend uh everything that we uh earn
Wife: then why did you ask
Husband: and spend more and get us into debt
Wife: yeah why did you ask see my question is
Husband: mm hmmm
Wife: if if you told me this and I agree I would keep the books and all my expenses and everything
Manual Coding
• Each spouse evaluated by 3-4 trained coders
  • 33 session-level codes (all on 1 to 9 scale)
  • Utterance- and turn-level ratings were not obtained
  • Social Support Interaction Rating System
  • Couples Interaction Rating System
  • All evaluators underwent a training period to standardize the coding process
• We analyzed 6 codes for this study
  • Level of acceptance (“acc”)
  • Level of blame (“bla”)
  • Global positive affect (“pos”)
  • Global negative affect (“neg”)
  • Level of sadness (“sad”)
  • Use of humor (“hum”)
Speaker Segmentation
• Segment the sessions into meaningful regions
• Exploit the known lexical content (transcriptions with speaker labels)
• Recursive automatic speech-text alignment technique [Moreno 1998]
[Diagram legend] AM = Acoustic Model, LM = Language Model, Dict = Dictionary, MFCC = Mel-Frequency Cepstral Coefficients, ASR = Automatic Speech Recognition, HYP = ASR Hypothesized Transcript
• Session split into regions: wife/husband/unknown
• Aligned >60% of sessions’ words for 293/574 sessions
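As a rough illustration (not the authors’ implementation), the sketch below shows how aligned, speaker-labeled word timings might be collapsed into wife/husband/unknown regions; the (start, end, speaker) word format and the max_gap threshold are assumptions.

```python
# Sketch: merge word-level alignment output into speaker regions.
# aligned_words: hypothetical list of (start_sec, end_sec, speaker) tuples,
# where speaker comes from the transcript labels; unaligned stretches become "unknown".

def build_speaker_regions(aligned_words, session_end, max_gap=0.5):
    """Collapse aligned words into contiguous wife/husband/unknown regions."""
    regions = []          # list of (start, end, label)
    cursor = 0.0
    for start, end, speaker in sorted(aligned_words):
        if start - cursor > max_gap:
            regions.append((cursor, start, "unknown"))    # unaligned gap
        if regions and regions[-1][2] == speaker and start - regions[-1][1] <= max_gap:
            regions[-1] = (regions[-1][0], end, speaker)  # extend current region
        else:
            regions.append((start, end, speaker))
        cursor = end
    if session_end - cursor > max_gap:
        regions.append((cursor, session_end, "unknown"))  # trailing unaligned region
    return regions
```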
Goals
• Goals
  • Separate extreme cases of session-level perceptual judgments using speaker-dependent features derived from (noisy) audio signal
  • Extraction and analysis of relevant acoustic (prosodic/spectral) features
[Figure: wife and husband 1–9 rating scales for the codes acceptance, blame, positive, negative, sadness, humor]
• Relevance
  • Automatic coding of real data using objective features
  • Extraction of high-level speaker information from complex interactions
Feature Extraction (1/3)
• Explore the use of 10 acoustic low-level descriptors (LLDs)
• Broadly useful in psychology and engineering research
1) Speaking rate
• Extracted for each aligned word [words/sec, letters/sec]
2) Voice Activity Detection (VAD) [Ghosh et al. 2010]
• Trained on 30-second clip from held-out session
• Separated speech from non-speech regions
• Extracted 2 session-level first-order Markov chain features
Pr(xi = speech | xi-1 = speech)
Pr(xi = non-speech | xi-1 = speech)
• Extracted durations of each speech and non-speech segment to use as LLD
for later feature extraction
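A minimal sketch (not the authors’ code), assuming a binary 0/1 frame-level VAD output and a 10 ms frame shift, of how the two session-level Markov-chain features and the speech/non-speech segment durations could be computed:

```python
import numpy as np

def vad_markov_features(vad):
    """Session-level first-order Markov features from a binary VAD sequence.

    vad: 1-D array of 0/1 frame decisions (1 = speech); assumed input format.
    Returns Pr(speech | previous frame speech) and Pr(non-speech | previous frame speech).
    """
    vad = np.asarray(vad, dtype=int)
    prev, curr = vad[:-1], vad[1:]
    n_prev_speech = np.sum(prev == 1)
    p_speech_given_speech = np.sum((prev == 1) & (curr == 1)) / n_prev_speech
    return p_speech_given_speech, 1.0 - p_speech_given_speech

def segment_durations(vad, frame_shift=0.01):
    """Durations (seconds) of contiguous speech and non-speech segments."""
    vad = np.asarray(vad, dtype=int)
    change = np.flatnonzero(np.diff(vad)) + 1          # frame indices where the label flips
    bounds = np.concatenate(([0], change, [len(vad)]))
    lengths = np.diff(bounds) * frame_shift
    labels = vad[bounds[:-1]]
    return lengths[labels == 1], lengths[labels == 0]  # speech, non-speech durations
```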
Feature Extraction (2/3)
• Extracted over each voiced region (every 10ms with 25ms window)
3) Pitch
4) Root-Mean-Square (RMS) Energy
5) Harmonics-to-Noise Ratio (HNR)
6) “Voice Quality” (zero-crossing rate of autocorrelation function)
7) 13 MFCCs
8) 26 Mel-frequency band (MFB) magnitudes
9) Magnitude of the spectral centroid
10) Magnitude of the spectral flux
• LLDs 4-10 extracted with openSMILE [Eyben et al. 2009]
• Pitch extracted with Praat [Boersma 2001]
• Median filtered (N=5) and linearly interpolated
• No interpolation across speaker-change points (using automatic alignment)
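A hedged sketch of the pitch post-processing described above: median filtering (N=5) over voiced frames and linear interpolation of unvoiced frames that never crosses a speaker-change point. The input conventions (0 at unvoiced frames, a per-frame segment label from the automatic alignment) are assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_pitch(f0, segment_ids):
    """Median-filter (N=5) and linearly interpolate a pitch track per speaker segment."""
    f0 = np.asarray(f0, dtype=float)
    f0 = np.where(f0 > 0, f0, np.nan)          # assume 0 marks unvoiced frames
    segment_ids = np.asarray(segment_ids)
    out = f0.copy()
    for seg in np.unique(segment_ids):
        idx = np.flatnonzero(segment_ids == seg)
        track = f0[idx].copy()
        voiced = ~np.isnan(track)
        if voiced.sum() == 0:
            continue
        if voiced.sum() >= 5:                  # median filter over voiced frames only
            track[voiced] = medfilt(track[voiced], kernel_size=5)
        # Interpolate unvoiced frames, confined to this segment so that
        # interpolation never crosses a speaker-change point.
        track[~voiced] = np.interp(np.flatnonzero(~voiced),
                                   np.flatnonzero(voiced), track[voiced])
        out[idx] = track
    return out
```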
Pitch Example & Normalization
• Normalized the pitch stream 2 ways:
  • Mean pitch value, µF0, computed across the session using the automatic alignment
  • Unknown regions treated as coming from one “unknown” speaker
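The slide does not spell out both normalization schemes; the sketch below assumes one plausible pair, dividing by the session-level mean µF0 and dividing by a per-speaker mean, with unaligned frames pooled under a single “unknown” speaker.

```python
import numpy as np

def normalize_pitch(f0, speaker_ids):
    """Two assumed pitch normalizations: by session mean and by per-speaker mean."""
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)      # "wife", "husband", or "unknown" per frame
    voiced = f0 > 0
    session_norm = f0 / f0[voiced].mean()      # divide by session-level µF0
    speaker_norm = f0.copy()
    for spk in np.unique(speaker_ids):
        idx = (speaker_ids == spk) & voiced
        if idx.any():
            speaker_norm[idx] = f0[idx] / f0[idx].mean()
    return session_norm, speaker_norm
```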
Feature Extraction (3/3)
• Each session split into 3 “domains”
• Wife (when wife was speaker)
• Husband (when husband was speaker)
• Speaker-independent (full session)
• Extracted 13 functionals across each domain for each LLD
• Mean, standard deviation, skewness, kurtosis, range, minimum, minimum
location, maximum, maximum location, lower quartile, median, upper
quartile, interquartile range
• Final set of 2007 features
• To capture global acoustic properties of the spouses and interaction
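A minimal sketch of the 13 functionals applied to one LLD stream within one domain; expressing the minimum/maximum locations as relative positions is an assumption.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def functionals(x):
    """The 13 domain-level functionals computed over one LLD stream."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(),
        "std": x.std(),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),
        "range": x.max() - x.min(),
        "min": x.min(),
        "min_location": np.argmin(x) / n,   # relative position (assumption)
        "max": x.max(),
        "max_location": np.argmax(x) / n,   # relative position (assumption)
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "iqr": q3 - q1,
    }
```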
Classification Experiment
• Binary classification task
– Only analyzed sessions whose mean evaluator scores fell in the top/bottom 20% of the code range
– Goal: separate the 2 extremes automatically
– Leave-one-couple-out cross-validation
– Trained wife and husband models separately [Christensen 1990]
– Error metric: % of misclassified sessions
• Classifier: Fisher’s linear discriminant analysis (LDA)
– Forward feature selection to choose which features to train the LDA
– Empirically better than other common classifiers (SVM, logistic
regression)
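A rough scikit-learn sketch of the evaluation setup (leave-one-couple-out cross-validation with forward feature selection wrapped around Fisher’s LDA), not the authors’ code; the feature cap and the selector’s internal cross-validated scoring are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline

def evaluate_code(X, y, couple_ids, n_features=4):
    """Leave-one-couple-out % error for LDA with forward feature selection.

    X: (sessions x features) matrix, y: binary extreme-group labels,
    couple_ids: couple identifier per session; n_features is an assumed cap
    (the talk reports ~3.4 features selected on average).
    """
    errors = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=couple_ids):
        selector = SequentialFeatureSelector(
            LinearDiscriminantAnalysis(),
            n_features_to_select=n_features, direction="forward")
        model = make_pipeline(selector, LinearDiscriminantAnalysis())
        model.fit(X[train], y[train])
        errors.append(1.0 - model.score(X[test], y[test]))
    return 100.0 * np.mean(errors)   # % of misclassified sessions
```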
Classification Results (1/2)
[Bar chart: % misclassified (0–50%) by the wife and husband models for each code: Acc, Bla, Pos, Neg, Sad, Hum, and the average (AVG)]
• Separated extreme behaviors better than chance (50% error) for
3 of the 6 codes
– Acceptance, global positive affect, global negative affect
– Global speaker-dependent cues captured evaluators’ perception well
Classification Results (2/2)
[Bar chart: % misclassified (0–50%) by the wife and husband models for each code: Acc, Bla, Pos, Neg, Sad, Hum, and the average (AVG), alongside the 1–9 rating scales for each code]
• Other codes (blame, sadness, humor)
  – Need to be modeled with more dynamic methods
  – Depend more crucially on non-acoustic cues
  – Inherently less separable
Results: Feature Selection
• LDA classifier feature selection
– Mean = 3.4 features, Std. Dev. = 0.81 features
– Domain breakdown
– Wife models: Independent=56%, Wife=32%, Husband=12%
– Husband models: Independent=41%, Husband=37%, Wife=22%
– Feature breakdown
– Pitch=30%, MFCC=26%, MFB=26%, VAD=12%, Speaking Rate=4%, Other=2%
• Also explored using features from a single domain
  – As expected, performance decreases in mismatched conditions
  – Good performance using speaker-independent features: the session is inherently interactive (need to use more dynamic methods)
Conclusions
• Work represents initial analysis of a novel and challenging corpus
consisting of real couples interacting about problems in their
relationship
• Showed we could train binary classifiers using only audio features
that separated spouses’ behavior significantly better than chance
for 3 of the 6 codes we analyzed
• Provides a partial explanation of coders’ subjective judgments in terms of objective acoustic signals
Future Work
• More automation
– Front-end: speaker segmentation using only audio signal
• Multimodal fusion
– Acoustic features + Lexical features
• Saliency detection
– Using supervised methods (manual coding at a finer temporal scale)
– Using unsupervised “signal-driven” methods
References
P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
A. Christensen, D.C. Atkins, S. Berns, J. Wheeler, D. H. Baucom, and L.E. Simpson. “Traditional versus integrative behavioral couple
therapy for significantly and chronically distressed married couples.” J. of Consulting and Clinical Psychology, 72:176-191, 2004.
F. Eyben, M. Wöllmer, and B. Schuller, “openEAR – Introducing the Munich open-source emotion and affect recognition toolkit,” in Proc.
IEEE ACII, 2009.
P. K. Ghosh, A. Tsiartas, and S. S. Narayanan, “Robust voice activity detection using long-term signal variability,” IEEE Trans. Audio, Speech,
and Language Processing, 2010, accepted.
C. Heavey, D. Gill, and A. Christensen. Couples interaction rating system 2 (CIRS2). University of California, Los Angeles, 2002.
J. Jones and A. Christensen. Couples interaction study: Social support interaction rating system. University of California, Los Angeles,
1998.
G. Margolin, P.H. Oliver, E.B. Gordis, H.G. O'Hearn, A.M. Medina, C.M. Ghosh, and L. Morland. “The nuts and bolts of behavioral
observation of marital and family interaction.” Clinical Child and Family Psychology Review, 1(4):195-213, 1998.
P.J. Moreno, C. Joerg, J.-M. van Thong, and O. Glickman. “A recursive algorithm for the forced alignment of very long audio segments.” In
Proc. ICSLP, 1998.
V. Rozgić, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou, and S. Narayanan, “A new multichannel multimodal dyadic interaction
database,” in Proc. Interspeech, 2010.
B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, and L. Kessous. “The relevance of feature
type for automatic classification of emotional user states: Low level descriptors and functionals.” In Proc. Interspeech, 2007.
A. Vinciarelli, M. Pantic, and H. Bourlard. “Social signal processing: Survey of an emerging domain.” Image and Vision Computing,
27:1743-1759, 2009.
Thank you!
Questions?