Recommendations Based on Speech Classification

(and examples of what recommender
systems can learn from signal processing)
Christian Müller
German Research Center for Artificial Intelligence
International Computer Science Institute, Berkeley, CA
Overview
 Speech as a source of information for non-intrusive user modeling
Speech/signal processing
 Vocal aging -> features for speaker age recognition
 GMM/SVM supervector approach for acoustic speech features
 Detection task and pseudo-NIST evaluation procedure
 Rank and polynomial rank normalization
Take-away messages
 Knowledge-driven feature selection
 Classification methods for independent “bag of observations” features
 Valid application-independent evaluation
 Feature space warping normalization
 Conclusions
Speech as a Source for Non-Intrusive UM
[Diagram: speaker classification derives information about the user from speech and feeds the user model of an adaptive speech dialog system; two paths are labeled A and B. Example system utterance: “Now it’s time to get to gate 38.”]
Sources of information about the user:
 inference from sensors (not intrusive); speech = sensor
 explicit statement (intrusive)
The adaptive speech dialog system:
 adapts its dialog behavior (e.g. detailed map with shops vs. arrows)
 provides recommendations (e.g. a different route to the gate)
Speaker Classification Systems
[Diagram: audio segment (telephone quality) -> system -> speaker characteristics]
 Cognitive Load: Best Research Paper Award, UM 2001
 Age and Gender: Voice Award 2007; Telekom live operation 2009
 Language: 14 languages + dialects; NIST evaluation 2007
 Identity: Project with BKA 2009; NIST* Evaluation 2008
 Acoustic Events: Project with VW 2008; Interspeech 2008
Recommendations Based on Speech Classification
[Matrix: speaker characteristics (rows) against what can be recommended (columns)]
Rows: age, gender, emotions, language, dialect, accent, identity, acoustic events
Columns: products, media, services, actions, strategies
Product Recommendations Based
on Age and Gender
Michael Feld and Christian Müller. Speaker Classification for Mobile Devices.
In Proceedings of the 2nd IEEE International Interdisciplinary Conference on
Portable Information Devices (Portable 2008). 2008
 How can you find features for building your models by explicitly studying the underlying phenomena?
 Proposing knowledge-driven feature selection on the example of features for speaker age recognition
Speaker Classification as an Interdisciplinary Area of Research
[Diagram: Speaker Classification at the intersection of Speech Technology / Artificial Intelligence, Phonetics, Voice Pathology and Software Technology]
 Which are the manifestations of age (and gender) in the speaker’s voice and speaking style?
 How can the age (and gender) of a speaker be recognized automatically?
 Which are the requirements of a speaker age (and gender) classification system and how can they be solved on the implementation layer?
Impact of Aging on Human Speech Production
Speech breathing
 thorax: stiffer
 lungs: lighter, less elastic, lower position
 effects: lower expiratory volume, more speech pauses, lower amplitude
Laryngeal area
 larynx: calcification and ossification
 vocal folds: loss of tissue, stiffening
 effects: rise of fundamental frequency (in men), reduced voice quality
Supralaryngeal area
 facial bones and muscles: degeneration, reduced elasticity
 effects: imprecise articulation, for example vowel centralization
Neurological effects
 loss of tissue in the cortex
 reduced performance of the neuronal transmitters
 effects: reduced articulation rate, defective coordination between the articulators, vowel centralization
Development of F0 in Men / Women
[Plot after Linville (2001): F0 in Hz (axis ticks 90 to 170) against age in years (20 to 90), with curves for men and women; annotations: “only non-smokers”, “smokers and nonsmokers”]
Age Classes
            age             Female   Male
Children    <= 13 years     CF       CM
Youth       14 - 19 years   YF       YM
Adults      20 - 64 years   AF       AM
Seniors     >= 65 years     SF       SM
Features
fundamental frequency (pitch)
 mean: pitch_mean
 standard deviation: pitch_stddev
 min, max and difference: pitch_min / pitch_max / pitch_diff
voice quality
 shimmer: shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp
 jitter: jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp
 harmonics-to-noise-ratio: harm_mean / harm_stddev
articulation rate: ar_rate
speech pauses: pause_num / pause_dur
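As a rough illustration of how a few of the features listed above could be derived, the sketch below computes the pitch statistics and a local perturbation measure. It assumes that `jitt_l` and `shim_l` stand for local jitter and local shimmer (mean absolute difference between consecutive glottal periods or peak amplitudes, divided by their mean). The slides do not define them, so the formulas, the function names and the input values are assumptions.

```python
from statistics import mean, pstdev


def pitch_features(f0_hz):
    """Summary statistics over the voiced-frame F0 values (in Hz)."""
    return {
        "pitch_mean": mean(f0_hz),
        "pitch_stddev": pstdev(f0_hz),  # population std dev (assumption)
        "pitch_min": min(f0_hz),
        "pitch_max": max(f0_hz),
        "pitch_diff": max(f0_hz) - min(f0_hz),
    }


def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean.

    Applied to glottal period lengths this gives a local jitter measure;
    applied to peak amplitudes it gives a local shimmer measure.
    """
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical measurements for one short utterance.
f0 = [100.0, 110.0, 120.0]
periods = [0.010, 0.011, 0.010]   # seconds
amplitudes = [0.80, 0.76, 0.84]   # arbitrary units

features = pitch_features(f0)
features["jitt_l"] = local_perturbation(periods)
features["shim_l"] = local_perturbation(amplitudes)
print(features)
```

All of these are statistics over the whole utterance, so every utterance yields a vector of the same length regardless of its duration.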
Features
voice
 fundamental frequency (pitch): mean, standard deviation, min, max and difference
 voice quality: shimmer, jitter, harmonics-to-noise-ratio
speaking style
 articulation rate
 speech pauses
Example Results
[Plots: per-class values (CF, CM, YF, YM, AF, AM, SF, SM) for jitter, fundamental frequency (F0) and speech pauses; class groupings shown: C_YF, AF, SF, YM_AM_SM]
high jitter value = low voice quality
Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006
Hierarchical Feature Model
[Diagram: feature levels ordered from high-level features (learned characteristics) to low-level features (physical characteristics)]
 semantics: ?
 dialog: [turn-pattern sketch for speakers A and B]
 idiolect: <s> how shall I say this <c> <s> yeah I know...
 phonetics: /S/ /oU/ /m/ /i:/ /D/ /&/ /m/ / / /n/ /i:/ ...
 prosody
 spectrum
 How can your features be modeled assuming that they
   are multi-dimensional
   represent repeating observations of the same kind
   can be assumed to be independent (“bag” of observations)
 Proposing the GMM/SVM Supervector Approach on the example of frame-by-frame acoustic features
General Classification Scheme
[Diagram: Preprocessing (e.g. channel compensation) -> Feature Extraction -> Classification -> Fusion, with Top-Down Knowledge as an additional input; classifier examples shown: multilayer perceptron networks, support-vector machines; one element is marked “not addressed in this talk”]
Modeling Acoustics and Prosodics
[Feature hierarchy diagram: semantics, dialog, idiolect, phonetics, prosody, spectrum; annotation: “no ASR”]
Generative Approach: Gaussian Mixture Model (GMM)
training
 frames of speech labeled “emergency vehicle” -> feature extraction -> “emergency vehicle” model (probability density)
test
 frame of speech (class unknown) -> feature extraction -> “emergency vehicle” model -> avg likelihood over all frames for class “emergency vehicle”
Generative Approach: Gaussian Mixture Model (GMM)
test
 frame of speech (class unknown) -> feature extraction -> “emergency vehicle” model and background model -> avg. log likelihood ratio over all frames for class “emergency vehicle”
A Mixture of Gaussians
 Means, variances, and mixture weights are optimized in training
 Black line = mixture of 3 Gaussians
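The GMM scoring described on the preceding slides can be sketched as follows: a mixture density is a weighted sum of Gaussians, and a test segment is scored by the average log likelihood ratio between the class model and the background model over all frames. The sketch uses one-dimensional features for brevity, and all model parameters and frame values are hypothetical; training of the parameters is not shown.

```python
import math


def gmm_density(x, weights, means, variances):
    """Density of a one-dimensional Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )


def avg_log_likelihood_ratio(frames, class_gmm, background_gmm):
    """Average over all frames of log p(x | class) - log p(x | background)."""
    ratios = [
        math.log(gmm_density(x, *class_gmm))
        - math.log(gmm_density(x, *background_gmm))
        for x in frames
    ]
    return sum(ratios) / len(ratios)


# Hypothetical models, each given as (weights, means, variances).
class_gmm = ([0.3, 0.5, 0.2], [1.0, 2.0, 3.5], [0.2, 0.3, 0.5])
background_gmm = ([0.5, 0.5], [-1.0, 2.0], [4.0, 4.0])

# Frames close to the class model score positive, distant frames negative.
score_near = avg_log_likelihood_ratio([1.8, 2.1, 2.4], class_gmm, background_gmm)
score_far = avg_log_likelihood_ratio([-3.0, -2.5, -4.0], class_gmm, background_gmm)
print(score_near, score_far)
```

Because the score is an average over frames, segments of different lengths produce comparable scores.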
Discriminative Method: Support Vector Machine (SVM)
training
 segments labeled “em. vehic.” (1) and “not em. vehic.” (-1) -> feature extraction -> “em. vehic.” model
 Features are transformed into a higher-dimensional space where the problem is linear
 Discriminating hyperplane is learned by maximizing the margin between the two classes
 Trade-off between training error and width of margin
 Model is stored in the form of “support vectors” (data points on the margin)
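To make the trade-off between training error and margin width concrete, here is a minimal sketch of a linear soft-margin SVM trained by batch subgradient descent on the hinge loss. It is a simplification under stated assumptions: it uses a linear kernel (no transformation into a higher-dimensional space), stores the weight vector directly instead of support vectors, and the training data, learning rate and number of epochs are made up for illustration.

```python
import math


def train_linear_svm(xs, ys, c=1.0, epochs=200, lr=0.01):
    """Minimise 0.5 * ||w||^2 + c * sum(hinge losses) by subgradient descent.

    A large c penalises training errors more, a small c favours a wide margin.
    """
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= c * y * x[i]
                grad_b -= c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_score(w, b, x):
    """Signed distance to the hyperplane (positive = target class)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical two-dimensional training vectors: 1 = target, -1 = non-target.
xs = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.0], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]]
ys = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(xs, ys)
print([round(svm_score(w, b, x), 3) for x in xs])
```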
Discriminative Method: Support Vector Machine (SVM)
test
 segment (class unknown) -> feature extraction -> score (distance to hyperplane)
 Discriminative methods have been shown to be superior to generative methods for similar tasks
 Feature vectors have to be of the same length (sensitive to variable segment lengths)
 Solutions:
   feature statistics calculated over the entire utterance
   fixed portion of the segment
   sequential kernels
GMM/SVM Supervector Approach
[Diagram: feature extraction -> Gaussian means (MAP adapted)]
 Combines discriminative power of SVMs with length independence of GMMs
 Very successful with similar tasks such as speaker recognition
 GMM is trained using MAP adaptation
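The supervector idea can be sketched as follows: the means of a background GMM are MAP-adapted to the frames of one segment and then stacked into a single vector, which has the same length for every segment and can therefore be fed to an SVM. The sketch assumes the common relevance-factor form of mean-only MAP adaptation with diagonal covariances; the relevance factor of 16, the background model and the frames are hypothetical, and the slides do not specify these details.

```python
import math


def diag_gaussian(x, mean, var):
    """Density of a Gaussian with diagonal covariance at vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_p)


def supervector(frames, weights, means, variances, relevance=16.0):
    """MAP-adapt the component means to the frames and stack them."""
    n_comp, dim = len(weights), len(means[0])
    counts = [0.0] * n_comp                      # soft frame counts
    sums = [[0.0] * dim for _ in range(n_comp)]  # posterior-weighted sums
    for x in frames:
        likes = [w * diag_gaussian(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:  # frame too far from every component
            continue
        for k in range(n_comp):
            post = likes[k] / total
            counts[k] += post
            for d in range(dim):
                sums[k][d] += post * x[d]
    stacked = []
    for k in range(n_comp):
        alpha = counts[k] / (counts[k] + relevance)  # data vs. prior weight
        for d in range(dim):
            data_mean = sums[k][d] / counts[k] if counts[k] > 0.0 else means[k][d]
            stacked.append(alpha * data_mean + (1.0 - alpha) * means[k][d])
    return stacked


# Hypothetical background model: 2 components, 2-dimensional features.
ubm = ([0.5, 0.5], [[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
sv_short = supervector([[1.0, 1.0]] * 3, *ubm)   # 3 frames
sv_long = supervector([[4.0, 4.5]] * 50, *ubm)   # 50 frames
print(len(sv_short), len(sv_long))
```

With few frames the adapted means stay close to the background model; with many frames they move towards the data.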
Evaluation Results
[Bar chart: DCF (axis 0 to 25) for GMM-UBM and GMM-SVM on the entire set and under matched and unmatched conditions; bar values shown: 23.41, 19.55, 14.58, 10.22, 8.09, 3.45]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
 How can you evaluate your multi-class models independently from the given application?
 How can you establish an appropriate evaluation procedure in order to obtain valid results?
 Proposing the detection task and the “pseudo NIST” evaluation procedure on the example of acoustic event detection and speaker age recognition.
Background
 With multi-class recognition problems, many test/analysis methods are very application specific.
   e.g. confusion matrices.
   we want a method that allows results to be generalized across a large set of applications.
 With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences.
   we want a fair “one shot” evaluation.
The Detection Task
[Diagram: “emergency vehicle ?” -> system -> “yes, 1.324326”]
 Given
   a speech segment (s)
   and an acoustic event to be detected (target event, E_T)
 the task is to decide whether E_T is present in s (yes or no)
 the system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence.
Terminology
 Segment class
 e.g. segment event, segment age-class.
 ground truth (not known).
 Target
 the hypothesized class.
 Trial
 a combination of segment and target.
Evaluation
[Diagram: one test segment is presented to the system with each target in turn (emergency vehicle ?, music ?, talking ?, laughing ?, phone ?, no event ?); for every trial the system returns a yes/no decision and a score, e.g. 1.32432, -0.3212, 1.8463, -2.5773, 0.00132, 2.20122]
 The system performance is evaluated by presenting it with a set of trials.
 Each test segment is used for multiple trials.
 The absence of all targets is explicitly included.
Type of Errors
 “MISS”: segment “em. vehic.”, target “em. vehic.” ? -> system answers no
 “FALSE ALARM”: segment “em. vehic.”, target “phone” ? -> system answers yes
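The two error types can be counted directly from a list of trials. The sketch below represents each trial as a hypothetical tuple of (segment class, target, score) and assumes that the system answers yes whenever the score reaches a threshold; the scores and the threshold of 0 are made up for illustration.

```python
def error_rates(trials, threshold=0.0):
    """Return (P_miss, P_fa) when the system answers yes for score >= threshold.

    A trial is a tuple (segment_class, target, score). It is a target trial
    if the hypothesized target equals the true class of the segment.
    """
    target_scores = [s for seg, tgt, s in trials if seg == tgt]
    nontarget_scores = [s for seg, tgt, s in trials if seg != tgt]
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(nontarget_scores)


trials = [
    ("em. vehic.", "em. vehic.", 1.3),   # correct detection
    ("em. vehic.", "phone", 0.4),        # false alarm at threshold 0
    ("em. vehic.", "music", -0.9),       # correct rejection
    ("phone", "phone", -0.2),            # miss at threshold 0
    ("phone", "em. vehic.", -1.7),       # correct rejection
    ("phone", "music", -0.6),            # correct rejection
]
p_miss, p_fa = error_rates(trials)
print(p_miss, p_fa)
```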
Decision-Error Tradeoff
[Plot: misses against false alarms, with the “equal error rate” point marked]
 Selecting an operating point (decision threshold) along the dotted line trades misses off against false alarms.
 Optimal operating point is application dependent.
 Low false alarm rates are desirable for most applications.
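The trade-off can be traced by sweeping the decision threshold over the observed scores. The sketch below is a simplified illustration with hypothetical scores: it evaluates the miss and false alarm rates at every observed score and reports the equal error rate as the point where the two rates are closest, without any interpolation between operating points.

```python
def det_points(target_scores, nontarget_scores):
    """(threshold, P_miss, P_fa) for every observed score used as threshold."""
    points = []
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
        p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        points.append((threshold, p_miss, p_fa))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the operating point where P_miss and P_fa are closest."""
    _, p_miss, p_fa = min(
        det_points(target_scores, nontarget_scores),
        key=lambda point: abs(point[1] - point[2]),
    )
    return (p_miss + p_fa) / 2.0


# Hypothetical scores of target and non-target trials.
target_scores = [2.0, 1.5, 0.5, -0.2]
nontarget_scores = [1.0, -0.5, -1.0, -2.0]

for threshold, p_miss, p_fa in det_points(target_scores, nontarget_scores):
    print(threshold, p_miss, p_fa)
eer = equal_error_rate(target_scores, nontarget_scores)
print("EER:", eer)
```

Raising the threshold lowers the false alarm rate at the price of more misses, which is the movement along the curve described above.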
Decision Cost Function
$$C(E_T, E_N) = C_{Miss} \cdot P_{Target} \cdot P_{Miss}(E_T) + C_{FA} \cdot (1 - P_{Target}) \cdot P_{FA}(E_T, E_N)$$
where $E_T$ and $E_N$ are the target and non-target events, and $C_{Miss}$, $C_{FA}$ and $P_{Target}$ are application model parameters.
The application parameters for EER are: $C_{Miss} = C_{FA} = 1$ and $P_{Target} = 0.5$
 Weighted sum of misses and false alarms using variable costs and priors.
 Application model parameters are selected according to the application.
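A direct transcription of the cost function into code may help to see how the application parameters shift the weighting. The function name and the example error rates are hypothetical; the default parameters are the ones given above for the EER case, and the second call uses made-up parameters for an application with rare targets and expensive misses.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of miss and false alarm probabilities for one (E_T, E_N) pair."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Hypothetical error rates of one target / non-target pair.
p_miss, p_fa = 0.1, 0.2

cost_default = detection_cost(p_miss, p_fa)
cost_rare_target = detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01)
print(cost_default, cost_rare_target)
```

With the default parameters the cost is simply the mean of the two error rates.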
Example DET-Plot
[Plot: miss probability against false alarm probability]
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, “Speech-overlapped Acoustic Event Detection for Automotive Applications,” in Proceedings of the Interspeech 2008, Brisbane, Australia, 2008.
Example Cost Chart
COSTS: (At, An); columns = target class At, rows = non-target class An
An              C      YF     YM     AF     AM     SF     SM
C               --     0.220  0.092  0.145  0.083  0.133  0.069
YF              0.166  --     0.081  0.201  0.080  0.198  0.070
YM              0.076  0.084  --     0.130  0.203  0.108  0.188
AF              0.088  0.161  0.110  --     0.095  0.219  0.082
AM              0.064  0.083  0.254  0.139  --     0.105  0.228
SF              0.096  0.150  0.100  0.249  0.091  --     0.095
SM              0.065  0.085  0.238  0.117  0.246  0.118  --
Avg Cost (At)   0.092  0.130  0.146  0.164  0.133  0.147  0.122
Avg Cost        0.133
Acoustic GMM/SVM Supervector system on 7-class age task
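The summary rows of the chart can be reproduced from the individual costs: the average cost of a target class is the mean of its costs over all non-target classes, and the overall average is the mean over the targets. The sketch below copies the values from the chart and assumes, consistent with the printed averages, that columns hold the target class and rows the non-target class; the variable and function names are hypothetical.

```python
classes = ["C", "YF", "YM", "AF", "AM", "SF", "SM"]

# costs[non_target][target]; None marks the undefined diagonal.
costs = [
    [None, 0.220, 0.092, 0.145, 0.083, 0.133, 0.069],
    [0.166, None, 0.081, 0.201, 0.080, 0.198, 0.070],
    [0.076, 0.084, None, 0.130, 0.203, 0.108, 0.188],
    [0.088, 0.161, 0.110, None, 0.095, 0.219, 0.082],
    [0.064, 0.083, 0.254, 0.139, None, 0.105, 0.228],
    [0.096, 0.150, 0.100, 0.249, 0.091, None, 0.095],
    [0.065, 0.085, 0.238, 0.117, 0.246, 0.118, None],
]


def average_costs(costs):
    """Mean cost per target (column) and the mean over all targets."""
    per_target = []
    for t in range(len(costs)):
        column = [row[t] for row in costs if row[t] is not None]
        per_target.append(sum(column) / len(column))
    return per_target, sum(per_target) / len(per_target)


per_target, overall = average_costs(costs)
for name, cost in zip(classes, per_target):
    print(name, cost)
print("Avg Cost", overall)
```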
Pseudo NIST Evaluation Procedure
 ERL provided development and evaluation data as representative as possible for the application.
 Three months before the evaluation, ICSI was provided with the development data.
 At a pre-determined date, the blind evaluation data was provided to ICSI for processing.
 The system's output was submitted to ERL in NIST format.
 ERL downloaded the scoring software from NIST’s website and made the necessary modifications due to the changes in the labels.
 ERL ran the software on the submitted system output.
 The results were then disclosed to ICSI along with the keys (truth) for further analysis.
 --> Fair “one-shot” evaluation, no parameter tuning on the evaluation set.
 How can you normalize your features in order to obtain a uniform scale and a uniform distribution?
 Proposing rank normalization and polynomial rank normalization
Background
 Fundamental frequency (pitch): 75-200 Hz
 Jitter: 0.001324 PPQ
 --> implicit feature weighting
Mean/Variance Normalization
$$a_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}$$
 uniform scale
 non-uniform distribution
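The rescaling given by the formula on this slide can be sketched in a few lines. The example values are hypothetical; the second list shows why the scale becomes uniform while the distribution does not: a single large value pushes all other values into a narrow band near 0.

```python
def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pitch values in Hz: same scale as any other normalized feature.
pitch = min_max_normalize([75.0, 100.0, 200.0])

# Hypothetical skewed feature: one outlier compresses the remaining values.
skewed = min_max_normalize([1.0, 2.0, 3.0, 4.0, 100.0])
print(pitch, skewed)
```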
Rank Normalization
[Figure: feature values observed in the background model are sorted (0.01, 0.06, 0.13, 0.29) and replaced by their ranks (0.25, 0.5, 0.75, 1); a feature value of 0 stays 0]
 create ordered list of values using bg data
 rank = position in list / number of values
 no occurrence mapped to 0
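The three steps above can be sketched as a small lookup table built from background data. The sketch makes several assumptions that the slides leave open: feature values are non-negative, a value of 0 stands for "no occurrence", values above the largest background value are mapped to 1, and unseen values in between are linearly interpolated. The function names are hypothetical.

```python
from bisect import bisect_right


def build_rank_table(background_values):
    """List of (value, rank) pairs with rank = position in sorted list / count."""
    ordered = sorted(background_values)
    n = len(ordered)
    table = [(0.0, 0.0)]  # assumption: 0 means "no occurrence" and maps to 0
    for v in sorted(set(ordered)):
        if v > 0.0:
            table.append((v, bisect_right(ordered, v) / n))
    return table


def rank_normalize(value, table):
    """Look up the rank of value, interpolating linearly between table entries."""
    xs = [x for x, _ in table]
    if value <= xs[0]:
        return table[0][1]
    if value >= xs[-1]:
        return table[-1][1]
    hi = bisect_right(xs, value)
    (x0, r0), (x1, r1) = table[hi - 1], table[hi]
    return r0 + (r1 - r0) * (value - x0) / (x1 - x0)


# Background values taken from the example on the slide.
table = build_rank_table([0.29, 0.01, 0.13, 0.06])
print(table)
print(rank_normalize(0.06, table), rank_normalize(0.095, table))
```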
Rank Normalization
 (+) uniform distribution
 (-) large three-dimensional lookup tables
 (-) linear interpolation for unseen values
 larger values ? smaller values ?
Polynomial Rank Normalization
 use ranks to train a polynomial
 apply polynomial instead of look-up tables
 better interpolation
 no need to store look-up tables
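One way to read the idea is sketched below: the (value, rank) pairs obtained from the background data are used as training points for a least-squares polynomial, and at run time only the polynomial coefficients are evaluated. The polynomial degree, the clamping of the result to the range 0 to 1, the background values and all function names are assumptions; the slides do not state how the polynomial is trained.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first."""
    size = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y * x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, size))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate_polynomial(coeffs, x):
    """Evaluate the polynomial at x using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def polynomial_rank_normalize(value, coeffs):
    """Approximate rank of value, clamped to the range [0, 1] (assumption)."""
    return min(1.0, max(0.0, evaluate_polynomial(coeffs, value)))


# Hypothetical background values and their ranks (position / number of values).
background = sorted([0.01, 0.06, 0.13, 0.29, 0.02, 0.09, 0.20, 0.04])
ranks = [(i + 1) / len(background) for i in range(len(background))]
coeffs = fit_polynomial(background, ranks, degree=3)
print([round(polynomial_rank_normalize(v, coeffs), 3) for v in background])
```

Only `degree + 1` coefficients have to be stored per feature, and unseen values are handled by the polynomial itself.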
Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition. Stellenbosch, South Africa, 2008
Conclusions
Speech as a source of information for non-intrusive user modeling
Speech/signal processing
 Vocal aging -> features for speaker age recognition
 GMM/SVM supervector approach for acoustic speech features
 Detection task and pseudo-NIST evaluation procedure
 Rank and polynomial rank normalization
Take-away messages
 Knowledge-driven feature selection
 Classification methods for independent “bag of observations” features
 Valid application-independent evaluation
 Feature space warping normalization
Thank you!