>> Ivan Tashev: Good morning, everyone. We're glad to have you
here. Today, we have Kun Han, Ph.D. student at Ohio State
University, and he is going to present his internship project,
Emotion Detection From Speech Signals, with emphasis on gaming
scenarios. Without further ado, Kun, you have the floor.
>> Kun Han: Thank you very much. My name is Kun Han. I'm a
Ph.D. student at the Ohio State University. It's my honor to be
here for the summer internship. Today, I'm going to present my
internship project, Emotion Detection From Speech Signals.
First, I would like to thank all the people who gave me help. I
want to thank the Conversational Systems Research Center,
especially my mentor, Ivan, who hosted my internship for the
whole summer. Thank you very much. I would also like to thank
the Xbox team; they funded my project and my internship. Thank
you. Also, I need to thank Dr. Dong Yu. He gave me a lot of
help on my project and we had some very helpful discussions,
especially on deep neural networks.
Okay. Let's move on to the talk. This is the outline. First, I
will give a brief introduction to the project, then I will
discuss some of the state of the art from previous studies, then
I will discuss the details of the approach and show some
experimental results. The last part is the conclusion and some
future work.
Okay. So what is emotion recognition? Here, we focus on emotion
recognition from speech, which means we want to extract the
emotional state of a speaker from the speech. Basically, there
are two types of emotion recognition. The first one is
utterance-level recognition. It means we just detect the
emotional state from each sentence. Typically, the sentence is
not very long, maybe less than 20 seconds, so we assume that the
emotional state in one sentence is constant; it doesn't change.
Each sentence just has one emotion label.
The other is dialogue-level recognition. That is more like
monitoring: when people are talking about some topic, the
emotional state can change over time. In this scenario, people
can use temporal dynamics to capture the emotional state. In
our work, we only focus on the first one, the utterance level.
There are some applications of emotion recognition. Since
[indiscernible], we have already made big progress on artificial
intelligence, but [indiscernible] we are still very far from a
natural way to communicate with machines, because the machine
doesn't know the emotional state of the speaker. So we can use
emotion recognition to improve the experience of the users,
especially in the human computer interface. In the gaming
scenario, we can use the emotion recognition result to improve
the gaming experience, and the same goes for voice search and
eLearning. Another important application is monitoring and
control. In this scenario, some people work under stress, for
example in an aircraft cockpit, and it is very important to know
the emotional state of the pilot. Okay.
So emotion recognition is actually still kind of a new area.
There are a lot of problems we need to solve, and I list some of
them here. Maybe there are also some other problems, but these
three may be the most important. The first one is the feature
set. It's not like speech recognition, where people know MFCC
or PLP, features that are very effective. For emotion
recognition, we don't know which features are effective for our
task. So right now, what people are doing is just trying
different features, [indiscernible] them to the classifier, and
getting the results. But we don't know if a feature is
effective or not, which is [indiscernible] classification
result.
Another is representation. You can imagine that the emotional
state is kind of like the [indiscernible], so there are always
some overlaps between different emotions. You cannot find a
clear boundary between different emotions. We cannot say, okay,
this is happiness, this is excitement, and there is the
boundary. We cannot do that. Overlap always exists, and I will
cover this later in the presentation.
The third one is that emotion is actually highly dependent on
the speaker and the culture. Different people have different
styles of expressing emotion, and most emotion recognition
systems probably don't consider this problem.
Okay. Now we can discuss some existing studies. The first one
is emotion labeling. The simplest way to label the emotion is
to use categories: you just pick some basic emotions and label
each utterance with one of them, like happiness, sadness, anger,
neutral, and so on. But as I said, we cannot find a clear
boundary between different emotions, so some other studies use a
dimensional representation. It kind of shows each emotion as a
coordinate on an emotional plane. There are two dimensions: one
is valence, which means the emotion is positive or negative, and
the other is arousal, which means the emotion can be active or
passive.
For example, anger is a negative emotion, but it's very active.
Boredom is also negative, but it is passive; you don't want to
say anything. So you can label each emotion on this plane, and
it's just a coordinate. Some other studies use a third
dimension, like dominance or tension, but these two, arousal and
valence, are very commonly used.
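As a small illustration of this dimensional labeling, here is a
hypothetical mapping in Python; the coordinates are illustrative
assumptions, not values from any corpus.

    # Illustrative (valence, arousal) coordinates on the emotional plane.
    # Valence: negative (-1) to positive (+1);
    # arousal: passive (-1) to active (+1).
    emotion_plane = {
        "anger":     (-0.7,  0.8),  # negative and very active
        "boredom":   (-0.5, -0.6),  # negative but passive
        "happiness": ( 0.8,  0.5),
        "sadness":   (-0.6, -0.4),
    }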
Our work is essentially based on category labeling, but we also
give a score vector over the basic emotions. The dimensional
[indiscernible] is a principled way to represent emotion, but
it's not easy to explain when you give a label; it's just a
coordinate. It's [indiscernible]. It's not straightforward.
Also, the feature set in previous studies includes local
features and global features. The local features are extracted
from each frame, like the pitch, magnitude, and also some
spectrum-based features like MFCC, LPC and so on. There are
also some voice quality features, like harmonic-to-noise ratio,
jitter, shimmer. Okay. So these are frame features, but we need
to make a decision for each utterance, so we also need to
extract global features based on these local features. The
global feature is a combination of the local features: just
collect all the local features from one utterance, and then the
typical way is to compute statistics, like the mean, standard
deviation, maximum, minimum, to get a global [indiscernible]
feature for the whole utterance.
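A minimal sketch of this global-feature idea, assuming NumPy and
an illustrative local-feature layout (the exact feature set and
statistics vary between studies):

    import numpy as np

    def utterance_statistics(frame_features):
        """frame_features: (num_frames, num_local_features) array of
        per-frame features, e.g. pitch, MFCCs, jitter/shimmer (assumed).
        Returns one fixed-length global feature vector per utterance."""
        stats = [
            frame_features.mean(axis=0),
            frame_features.std(axis=0),
            frame_features.max(axis=0),
            frame_features.min(axis=0),
        ]
        return np.concatenate(stats)

    # Example: 500 frames x 16 local features -> one 64-dimensional vector.
    global_feature = utterance_statistics(np.random.randn(500, 16))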
Then, for different features, we use different classifiers. For
the local features, traditionally people use a Gaussian mixture
model for each emotion and compare the likelihoods. Some other
people treat this very similarly to speaker ID: just use the
GMM-UBM, construct the supervector, and then use an SVM to do
the classification. Also, you can use an HMM to capture the
temporal dynamics. And recent work uses LDA, where each
utterance corresponds to a document and each frame corresponds
to a word, and uses LDA to do the classification.
Most of the recent work actually likes to use the global
features, just [indiscernible] statistics across the whole
utterance for the classification. The SVM is the most commonly
used classifier, and some people also use the K-nearest neighbor
and the decision tree. The most recent study uses a deep belief
network, but they still use the statistical features to do the
feature extraction and then use an SVM on top of that to
estimate the emotion status.
These are some databases commonly used in emotion recognition.
Different databases have different sources. Some databases use
acted utterances and some use actual recordings; that is, not
acted, they just recorded some emotional speech. For example,
this one is a recording of a talk show, and this one asks some
kids to talk to a robot, so it's not acted emotion. We use this
one, IEMOCAP. This one is acted: they ask some actors to
portray emotions. This database has audio and visual signals;
we only use audio here. The labeling is both categorical and
dimensional. This database is very large and very rich, and
that's why we chose it; the other databases are relatively
small.
Okay. Let's go to our approach. This is an overview of the
system. Given an utterance, the first thing we do is cut the
utterance into different segments, and then we extract features
from each segment, so we get segment-level features. Then we
throw these segments to a deep neural network to estimate the
emotion state of each segment, okay. With this output, we get
the segment-level output, which can be the probability of each
emotion state. This is the segment-level result, and for one
utterance, we collect all the segment-level results and compute
the utterance-level feature, and we use [indiscernible]
classifier to do the classification and make the utterance-level
decision. So this is --
>>: [inaudible].
>> Kun Han: Yeah, I will talk about that. Yeah. Okay. So when
we extract the [indiscernible] features, the first thing is
framing, and then we convert the signal from the time domain to
the frequency domain, with a window of 25 milliseconds and a
ten millisecond step size. The features here are pitch-based
features, including the pitch value and the harmonic
[indiscernible], and also MFCC features. We also use delta
features across time. Then we [indiscernible] segment. Because
the context information is very important, we include the frames
before the current frame and the frames after the current frame,
so the segment is 25 frames in total, and we concatenate the
features from each frame to get the segment-level features.
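A rough sketch of this segment construction, assuming NumPy; the
per-frame feature dimension and the context size (12 frames on
each side, 25 frames total) are illustrative assumptions:

    import numpy as np

    def build_segments(frame_feats, context=12):
        """frame_feats: (num_frames, feat_dim) per-frame features,
        e.g. pitch-based features plus MFCCs and their deltas.
        Returns (num_segments, (2*context + 1) * feat_dim) segments."""
        num_frames, feat_dim = frame_feats.shape
        segments = []
        for t in range(context, num_frames - context):
            window = frame_feats[t - context:t + context + 1]  # 25 frames
            segments.append(window.reshape(-1))                # concatenate
        return np.array(segments)

    # Example: 300 frames of 39-dim features -> 25 * 39 = 975-dim segments.
    segments = build_segments(np.random.randn(300, 39))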
>>: The feature set per frame is pretty much the standard
[indiscernible] features?
>> Kun Han: [indiscernible]. Okay. So now we have the
segment-level features XT. When we train the neural network, we
need to give the training label. Since we only have a label for
every utterance, here essentially we give all the segments from
one utterance the same label, and the label comes from the
utterance. Also, we don't use all the segments in the training
data, because the utterance also contains some silence, and we
don't want to use that. Also, some speech has very weak energy;
it may not contain much emotional information, so we also throw
that away. We just pick the top ten percent of segments with
the highest energy for the training and classification.
Okay. And the output of the deep neural network is a
probability vector over the emotions for each segment. The deep
neural network configuration is three hidden layers; we also
tried one, two, three, and four, and three gave the best
performance and [indiscernible], so we don't need to go to four
or five. The units are rectified linear neurons, the objective
function is cross entropy, and we use mini-batch stochastic
gradient descent for training. So this is very standard. And
the deep neural network output will be here: if we have K
different emotions, it will output the probability of each
emotion, okay?
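A minimal sketch of this segment-level network, written here in
PyTorch as one possible implementation; the hidden-layer sizes
and input dimension are assumptions, but the structure follows
the talk: three rectified-linear hidden layers, a K-way output,
cross-entropy loss, and mini-batch stochastic gradient descent.

    import torch
    import torch.nn as nn

    K = 5                  # emotion classes
    INPUT_DIM = 25 * 39    # assumed: 25 concatenated 39-dim frames

    model = nn.Sequential(
        nn.Linear(INPUT_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, K),     # logits; softmax applied at prediction
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def train_step(x, y):
        """x: (batch, INPUT_DIM) segments, y: (batch,) emotion labels."""
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        return loss.item()

    # Segment-level emotion probabilities at test time:
    # probs = torch.softmax(model(x), dim=1)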
Okay. This is an example. The blue line here represents the
probability of -- there are five emotions: excitement,
frustration, happiness, neutral, and sadness, okay? Basically,
for most of the segments, excitement gives the highest
probability, and for some segments it does not, but overall it
dominates this utterance. So, yeah, [indiscernible] this
sentence is excitement. But not all sentences have this good
performance. Some sentences [indiscernible] are very noisy, so
we need to use another classifier to [indiscernible] the
utterance-level decision.
Okay. So when we have the segment-level output, we want to get
the utterance-level decision. For the utterance-level
classification, the inputs will be utterance-level features.
First we get the segment-level output for each segment, and
then, over the set of segments, we pick the maximum, minimum,
and mean for each emotion. That is one type of feature. Then
we count the number of segments with high probability; that
means we want to know how many segments support this emotion, so
that will be another feature. Okay. We combine them together
as the utterance-level feature, and then the output of the
utterance-level classifier will be the emotion score vector for
the whole utterance. For this classification, we try different
classifiers. Of course, we use the SVM, which is a very popular
classifier. We also try another classifier called the extreme
learning machine, and we will compare them.
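A small sketch of this utterance-level feature, assuming NumPy;
the probability threshold used for counting supporting segments
is an assumption (the talk does not give the exact value):

    import numpy as np

    def utterance_feature(seg_probs, threshold=0.2):
        """seg_probs: (num_segments, K) segment-level emotion
        probabilities from the DNN. Returns a 4*K vector: max, min and
        mean per emotion, plus the fraction of segments whose
        probability for that emotion exceeds the threshold."""
        parts = [
            seg_probs.max(axis=0),
            seg_probs.min(axis=0),
            seg_probs.mean(axis=0),
            (seg_probs > threshold).mean(axis=0),  # "support" per emotion
        ]
        return np.concatenate(parts)

    # This vector is fed to the utterance-level classifier (SVM or ELM).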
Okay. A short discussion on the extreme learning machine. The
ELM is actually a single-hidden-layer neural network, but with a
special training strategy. There is just one hidden layer. The
weights from the input to the hidden layer are just randomly
assigned; the weights are random. From the hidden layer to the
output layer, we use the minimal least square error to train the
weights. I think the [indiscernible] here is that people
typically use a number of hidden units that is much, much larger
than the number of input units, so that is a random projection.
From the input to the hidden layer, when we have a lot of hidden
units, we can get a good representation of the training data.
But also, since the weights are random, this hidden
representation is not highly dependent on the training data, so
it probably gives us good generalization performance. So --
>>: How many features [indiscernible]?
>> Kun Han: The input layer is essentially just 20, 20 input
units.
>>: [indiscernible].
>> Kun Han: Yeah, so we have five emotions, and this has three,
so three times five is 15. And there are five --
>>: [indiscernible].
>> Kun Han: I try different configurations, around, like, 100
[indiscernible].
>>: That many.
>> Kun Han: Yeah, I also tried more [indiscernible] units; the
performance is similar. Okay. So what we need to train is just
the hidden layer to the output layer. We use the minimal least
square error, and it turns out we just [indiscernible] to solve
this equation, [indiscernible]. So it's a very fast, very
efficient way to do the training. And essentially, it gives us
good performance. These are the advantages of the ELM: there is
no gradient descent, so it's very fast, and it gives good
generalization. We also compared it with the support vector
machine; essentially it gets better performance than the SVM,
and it's much more efficient, around ten times faster. Also,
the ELM has a kernel version. We also use the kernel version
and we [indiscernible] comparison better.
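A minimal sketch of basic ELM training as described above,
assuming NumPy; the sigmoid nonlinearity and the hidden-layer
size are illustrative choices, and the kernel variant is not
shown:

    import numpy as np

    def train_elm(X, Y, num_hidden=100, seed=0):
        """X: (n_samples, n_inputs) utterance-level features,
        Y: (n_samples, K) one-hot emotion targets."""
        rng = np.random.default_rng(seed)
        W_in = rng.standard_normal((X.shape[1], num_hidden))  # random, fixed
        H = 1.0 / (1.0 + np.exp(-X @ W_in))                   # hidden layer
        # Hidden-to-output weights: least-squares (pseudo-inverse) solution.
        W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)
        return W_in, W_out

    def predict_elm(X, W_in, W_out):
        H = 1.0 / (1.0 + np.exp(-X @ W_in))
        return np.argmax(H @ W_out, axis=1)  # emotion with the highest score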
And for the performance measurement, we use two types of
measurements. The first one is called weighted accuracy. The
weighted accuracy is essentially just the standard
classification accuracy: the number of correctly labeled
utterances divided by the number of all utterances. The
unweighted accuracy is, for each class, we compute the accuracy
within this class, and then we take the average of these
accuracies. Essentially, this measurement kind of requires you
to get balanced accuracy across each of the emotion classes.
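In code, the two measurements look roughly like this (a sketch,
assuming integer class labels):

    import numpy as np

    def weighted_accuracy(y_true, y_pred):
        """Standard overall accuracy: correct utterances / all utterances."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return float((y_true == y_pred).mean())

    def unweighted_accuracy(y_true, y_pred):
        """Average of per-class accuracies; every emotion counts equally."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = [(y_pred[y_true == c] == c).mean()
                     for c in np.unique(y_true)]
        return float(np.mean(per_class))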
Okay. Now the experimental results. This is the dataset we are
using, the IEMOCAP database. It's an acted, multimodal,
multi-speaker database, including video, speech, motion, text
transcription, and so on. Each utterance is annotated by three
human annotators, and they use both category and dimension
labels. The categories would be like eight to nine different
categories.
Since there are three annotators, when we build our corpus, we
just select a sentence if at least two annotators give the same
label to this sentence; then we will [indiscernible].
Apparently, if the three annotators give different labels, we
don't use that, because we don't know the ground truth.
Okay. So in our corpus, we use happiness, excitement, sadness,
frustration, and neutral. These five emotions are common in the
gaming scenario. For training, there are eight speakers, four
male, four female. The test set is two speakers who are not
seen in the training set, so it's speaker independent. And we
don't use the visual signal or the text transcription, just the
speech signal.
I also want to mention that we don't use speaker normalization.
A lot of studies use speaker normalization; they normalize the
features for every speaker. It means it assumes that you know
the speaker ID, because when you normalize them in the test
phase, you need to know the normalization factor for each
speaker, but we don't use that. Some studies show that speaker
normalization gives a very large improvement, but we just use
this training scenario.
Okay. Now the gaming scenario. Right now, we have the clean
speech, and we recreate the speech in the gaming scenario. In
the gaming scenario, we have three [indiscernible]. The speech
is the speaker saying something, and we want to label the
emotion for this person. We also have five loudspeakers that
are playing a movie, a game, some music. And we have some
background noise, like the air conditioner, some room noise.
All of these sound sources are captured by the Kinect; the
Kinect has four microphones. This mixture also has the room
reverberation, because in this room there is reverberation, so
the [indiscernible] to the microphone has some reverberation,
and so do the loudspeakers. So they are [indiscernible] with
the room impulse responses to get the mixture, and then we use
the Kinect audio pipeline to attenuate the noise. Then we get
the processed signal, and this signal will be used in our task.
Okay?
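A rough sketch of how such a mixture could be simulated,
assuming SciPy; the impulse responses, gains, and signals are
placeholders, and the Kinect capture and audio pipeline
themselves are not reproduced here:

    import numpy as np
    from scipy.signal import fftconvolve

    def make_mixture(speech, speech_rir, loudspeaker, ls_rir, noise,
                     ls_gain=0.5, noise_gain=0.1):
        """All inputs are 1-D sample arrays at the same sampling rate.
        The speech and loudspeaker tracks are convolved with room
        impulse responses for their positions, then summed with the
        background noise."""
        rev_speech = fftconvolve(speech, speech_rir)
        rev_ls = fftconvolve(loudspeaker, ls_rir)
        n = min(len(rev_speech), len(rev_ls), len(noise))
        return rev_speech[:n] + ls_gain * rev_ls[:n] + noise_gain * noise[:n]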
This is the configuration to create the gaming scenario corpus.
The loudspeaker track includes ten different sound sources: five
games, five movies. In this room, we use 12 different
positions; different positions give different room impulse
responses, from one meter to four meters, in the center, left,
right. We just randomly pick a position, mix with the
loudspeakers and also the background noise, and create the
mixture. Also, the levels I randomly choose from this
[indiscernible], and I can play it. So this is the clean
speech.
>>: I'm so excited, I'm like a kid. I can't believe I got out
of the house with my fly zipped up.
>> Kun Han: This is the Kinect mix.
>>: [indiscernible].
>> Kun Han: This is the processed speech.
>>: I'm so excited, I'm like a kid. I can't believe I got out
of the house with my fly zipped up.
>> Kun Han: So there is some distortion, but most of the noise
is removed. We will use the clean speech and the processed
speech. Can anyone tell me what is the emotion of that
utterance? I'll give you five options: excitement, frustration,
happiness, neutral, and sadness. I can play it again if you
want.
>>: I'm so excited, I'm like a kid. I can't believe I got out
of the house with my fly zipped up.
>>: He laughed at the end.
>>: There are five options.
>>: He said excitement.
>>: Excitement.
>> Kun Han: I want three answers. So --
>>: Excitement.
>>: It's excitement.
>> Kun Han: Okay. Excitement, okay. And anyone for
frustration? No frustration. Happiness? Happiness? Okay.
Neutral? One neutral. Sadness? Okay. I think most of you guys
are good evaluators, because the answer is: we have three
annotators, two gave excitement, one gave happiness. So --
>>: That's pretty much -- all of us labeled it properly.
>> Kun Han: Yeah.
>>: When they annotate, do they see the video?
>> Kun Han: Yeah, they see the video. So there is some
difference: if you only listen to the speech versus when you
also watch the video, maybe you give a different label. It is
possible, yeah.
>>: It's interesting, even the words he says give you a hint, I
think. It's not just the level or the way they're saying the
words. It's the actual words.
>> Kun Han: And also, essentially, video and speech have
different effects for different emotions. For some emotions,
you probably get the information from the video, but for some
emotions, maybe you kind of capture it from --
>>: [indiscernible].
>> Kun Han: What?
>>: Annotators, do they have [indiscernible] language to
annotate? [indiscernible].
>> Kun Han: [indiscernible].
>>: So [indiscernible] use the meaning of the phrase.
>> Kun Han: Yes, we use all the [indiscernible] we can get.
>>: [indiscernible].
>>: [indiscernible].
>> Kun Han: Sorry?
>>: In the same sound, it says it's sad if it picks up.
>>: That is a good [indiscernible].
>>: If you laugh at the end of the sentence, I am sad.
>>: I'm not sad.
>>: [indiscernible].
>> Kun Han: Okay. So we'll compare our approach with two
existing algorithms. The first one is local features with an
HMM. The HMM just takes the frame-level features, and for each
emotion [indiscernible], and then uses the maximum likelihood to
pick the emotion. So we train one HMM for each emotion, with
four fully connected states per emotion and GMMs to represent
the observation probability, and determine the emotion by the
maximum likelihood. We also compare with global features plus
SVM. There is a toolkit called OpenEAR; this is a very popular
toolkit for emotion recognition, and it creates a very, very
large feature set: MFCC, pitch, LPC, zero crossing, and so on,
around eight to ten different acoustic features. For each
feature, they apply statistical functions, the mean, variance,
skewness, kurtosis, maximum, minimum and so on, a lot of
statistics, so the feature set is 988 dimensional. Then they
apply the SVM for the classification.
So we compare these five: the HMM, OpenEAR with the SVM, and our
approach, the deep neural network with SVM for the
[indiscernible]-level classification and the deep neural network
with the extreme learning machine. This one is the DNN with ELM
using the kernel version.
This is the result. It's the weighted accuracy, with the clean
speech result and the result in the gaming scenario. This side
is the clean speech. Basically, the HMM gets the lowest
performance, and OpenEAR is a little bit better. The DNN-based
systems are significantly better than these two. And you can
also compare the SVM with the ELM.
>>: I'm sorry, what is the [indiscernible]?
>> Kun Han: Yeah, it's just the classification accuracy. We
have five emotions. We just count the number of utterances with
the emotion correctly labeled, divided by the number of
utterances. Just the standard accuracy, yeah.
>>: And what would be perfect accuracy? What number would it
be, one?
>> Kun Han: Yeah, one.
>>: Yeah. [indiscernible].
>> Kun Han: Yeah, this is not a very high number. The highest
number on this corpus is like 60 to 70, but they use speech plus
visual plus speaker normalization. All the information together
gets a number like that. And also, they use a four-way
classification; here we use a five-way classification. Okay.
>>: Quick question. So when you calculate the accuracy, in that
one example you showed us where two people tagged it as, I
think, excitement and one tagged it as happiness, does that mean
that it's a failure if you don't get both excitement and
happiness? Or is that --
>> Kun Han: For that one, the label in the training is
excitement.
>>: Because two people --
>> Kun Han: Yeah, two people.
>>: So if you classified in your algorithm happiness, that would
count as a failure?
>> Kun Han: Yes.
>>: Like you made your points around video plus the words spoken
plus the audio really needing to work in concert to get really
good accuracy. What is the actual -- is there like a ground
truth of what a human could possibly do if you just heard audio?
Because that would be the best your algorithm could probably
ever hope to achieve.
>> Kun Han: Yeah, if you just take the audio, I mean, if you ask
an annotator, the label can be very different. It can be
different, yeah. But this [indiscernible] provides the label
based on the recording, the audio, and it actually includes the
text transcription, whatever. We believe that is the true
emotion for this utterance. But, of course, this is not really
true ground truth, because there is a labeling [indiscernible]
problem for emotion. You always ask people to label the
utterance, but people always have different feelings about the
emotions. Yeah, but we can only base the training or test on
the labels provided by the corpus.
>>: Okay.
>> Kun Han: Okay. And essentially, if you compare the clean
speech and the gaming scenario, there is some decrease, on
average around five percent, but it's not very bad. It's around
a five percent drop.
And this is the unweighted accuracy. Essentially, you need to
get a kind of balanced result for each class. You can find the
same trend here, and this is the clean one. The HMM is even
worse, the SVM is better, the DNN-based systems are much better,
and the ELM is better than the SVM. But overall, the ELM and
the ELM with kernel are pretty similar; they're comparable.
Okay. This is the confusion matrix. This is the gaming
scenario, and this is the clean scenario. This column is the
true label, the label from the training set, and this is the
label from our approach. You can see that for the gaming
scenario, on average, the per-class accuracies do not differ too
much. They differ, but not as much as for the clean speech.
And there's a very interesting thing: if you compare the gaming
and the clean results, for excitement, frustration, and neutral,
you get very, very similar performance, 0.5, 0.6, 0.36, very
similar. But happiness and sadness are very different. With
clean speech, you get very good performance for sadness but very
poor for happiness, but gaming is pretty much the opposite: the
happiness is good and the sadness is lower.
>>: So now [indiscernible].
>>: [indiscernible].
>>: You should think about speech and [indiscernible].
>> Kun Han: [indiscernible].
>>: Actually, it's better [indiscernible].
>> Kun Han: Okay. So let's go to the conclusion and discuss
future work.
So basically, we designed an emotion recognition algorithm. We
used a simplified feature set, used a deep neural network for
the segment-level classification, and used the ELM for the
utterance-level classification. Our new approach outperforms
the state-of-the-art previous studies by around 13 percent
relative. In the gaming environment, we already see some
negative effect on emotion detection, around a five percent
drop. Still, the new algorithm outperforms the state of the art
by around 13 percent, relatively.
Okay. And I also want to mention some directions for future
work. The first one is to go multimodal, and there are also
some more technical ones. Okay. So the [indiscernible] right
now is that the corpus provides the speech, the audio and visual
signals, and some provide the text transcription. Previous
studies already show that when we include the other information,
like the visual signal, we can improve the emotion recognition
result significantly. Since it's available in the corpus, we
can use this multimodal information.
The [indiscernible], of course, includes things like the gesture
dynamics, and the facial expression is also very important for
recognizing emotion. And with a speech recognizer, if we know
what they are saying, maybe we can also -- this would be another
[indiscernible] for doing emotion recognition.
Okay. Then some technical future work. In our approach, we use
the ten percent of segments with the highest energy as training
and test samples, because we believe these strong segments
contain more emotional information and are informative for our
task. But the question is: can we determine these informative
segments directly from learning, instead of just taking the top
ten percent? The idea is that we can choose the best training
samples using the last trained model. When we get these best
training samples, we throw them to the next model for training.
Then the next model gets better training samples, and we just
throw away the non-informative segments. In principle, this new
model should give us sharper probabilities, because we have
better training samples, so the trained model should give better
performance.
This is the idea: we first take the whole training set and train
a DNN, and with that DNN we get the second training set, chosen
from the first one, and then we keep training like this for a
few times. With these trained DNNs, in the test phase, we just
throw the [indiscernible] to all of them and do a combination.
Maybe we can get better performance from some [indiscernible]
results.
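A sketch of this hierarchical training loop, assuming NumPy;
train_dnn and predict stand in for the segment-level DNN
described earlier, and the fraction of segments kept per round
is an assumption:

    import numpy as np

    def hierarchical_training(X, y, train_dnn, predict,
                              rounds=3, keep=0.8):
        """X: (n_segments, dim) features, y: (n_segments,) labels.
        Each round keeps only the segments the current model scores
        most confidently for their own label, then retrains."""
        models = []
        for _ in range(rounds):
            model = train_dnn(X, y)
            models.append(model)
            probs = predict(model, X)             # (n_segments, K)
            conf = probs[np.arange(len(y)), y]    # prob of the true label
            idx = np.argsort(conf)[-int(keep * len(y)):]
            X, y = X[idx], y[idx]                 # best training samples
        return models  # combine their outputs at test time, e.g. averaging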
So with this hierarchical training versus the ordinary training,
for the unweighted accuracy, we get around two percent better
performance; the weighted accuracy is pretty similar. This
improvement is not very large, but I still believe there is some
work we can do, like how to choose the best examples and how to
combine the different DNN models together to get good
utterance-level performance.
And also, we should incorporate some temporal dynamics.
Previously, the HMM was trained in an unsupervised manner; we
don't have labels for the frames, we just train one HMM for each
emotion. But here, since we have the DNN, it can give a label
for each segment, and then, with that initial label, we can use
supervised [indiscernible] to train the [indiscernible], which
in principle should give better performance [indiscernible].
Also, this one is something I actually did some work on at the
beginning, and it's kind of interesting. Right now, the problem
with emotion recognition is that we pick hand-crafted features,
like MFCC, PLP, whatever, and put them together as a feature to
train the system. But essentially, from the [indiscernible]
point of view, it is possible to use the spectral features,
because the spectrum doesn't lose any information; everything is
in the spectrogram. This is also motivated by recent progress
in speech recognition, [indiscernible] to train speech
recognizers using filter bank features. So maybe for emotion,
we can still train directly on the spectrogram and let the
learning machine learn all the features and then do the
training. But in my experiments, the results were not very
good; the frame-level accuracy is lower than MFCC plus pitch,
around four percent lower.
But it's worth trying. Maybe we can try different parameters,
like the window length or something, and also maybe more data
will benefit the DNN training.
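A minimal sketch of the kind of spectral input this would start
from, assuming NumPy; the window length and hop are the sort of
parameters mentioned above and are illustrative here:

    import numpy as np

    def log_spectrogram(signal, win_len=400, hop=160):
        """signal: 1-D samples (e.g. 25 ms window / 10 ms hop at 16 kHz).
        Returns (num_frames, win_len // 2 + 1) log-magnitude spectra
        that a learning machine could consume directly instead of
        hand-crafted features."""
        window = np.hanning(win_len)
        frames = [signal[i:i + win_len] * window
                  for i in range(0, len(signal) - win_len + 1, hop)]
        spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
        return np.log(spec + 1e-8)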
Okay. That's it. Yeah. So these are some important references,
and, yeah, that's pretty much my presentation. And I want to
share this one. I have been in the U.S. for a few years, and I
would say this summer is the most wonderful summer I have had in
the U.S. This picture was taken a few weeks ago; I spent two
days climbing Mt. Adams. It's 12,000 feet, and it's very hard,
very cold, but it was a very, very nice experience. There is no
sound and no voice here, but if you just look at the picture, if
you look at my face, my emotion is very, very excited.
[laughter]
>> Kun Han: Thank you.
>> Ivan Tashev: Thank you, Kun. Questions, please? No
questions?
>>: Have you tested it on real players besides [indiscernible]?
>> Kun Han: [indiscernible]?
>>: So in the presentation, you said you tested on an acted
corpus. Have you tested [indiscernible]?
>> Kun Han: [indiscernible], but I tested on data from the same
corpus, the actual data. There's some drop -- I don't remember
the number, but it's not very much. There is some drop, yeah.
But if you really want to use it with actual data, you may need
to train on the actual data. But here we --
>>: [indiscernible] provided by your team [indiscernible].
>>: I was wondering whether you or somebody did the same work
with an additional [indiscernible] unknown. [indiscernible] is
not unknown. Unknown is, for example, all those samples where
the [indiscernible] are different. So you [indiscernible]
because many, many [indiscernible].
>> Kun Han: Yeah, you are right. So if -- yeah, you can believe
that if [indiscernible] give different labels, then maybe this
emotion is very difficult to describe. But it's still some
emotion. Yeah, that's right. Yeah, but in our experiment, I
[indiscernible].
>>: [indiscernible] to drive.
>> Kun Han: Yeah, that's very interesting. I haven't thought
about that. But yeah, it's some emotion, but we don't know what
it is, yeah.
>>: In your test with the sound in the background, is it just
the algorithm that changes? I'm just thinking about people
sometimes speaking loudly, like they speak up louder when they
think that somebody can't hear them. So if I have something
running in my background and I try to talk to the Kinect, I
might sound more excited or something because I'm trying to
project. Did that kind of thing come out, or did it not really
change? Like you could still detect the emotion even if
somebody was trying to be more emphatic?
>> Kun Han: You mean with the background, the background maybe
masked what you are saying?
>>: The background -- because of the background noise, they
might be changing the way they're talking to the machine,
because they're trying to talk over it, or trying to talk over
the sound in the background.
>>: So this is going back to the [indiscernible], and we didn't
account for that. So technically, yes, people speak differently
when there's a loud sound; they try to get more energy through.
For this corpus, it is just purely synthetic. You give the
[indiscernible]; we have the noise, but it's the same noise
regardless of the loud speaker. And that may affect the
[indiscernible] of the classification.
>>: You mentioned at the beginning that different cultures
express themselves in different ways. Your data corpus, though,
had German and United States averaged together. Did you see,
like, where you were substantially better with Germans versus --
>> Kun Han: This is a very interesting question, because I
haven't done this experiment, but some [indiscernible] have
already put different languages together, like German, English,
Spanish, something together, and the performance is still very
good. But if you test on a different culture, like Chinese or
Japanese, then the [indiscernible] are very different. That is
because English, Spanish, and German are kind of very similar to
each other, and in terms of emotion, they maybe have a similar
way of expressing it. But if the languages are very different,
then the system may not work for the other language.
>>: So you didn't get deep into that, then, where you would make
a recommendation, oh, for every language or for every country
language pairing that you should have a different training set?
>> Kun Han: Yeah, of course, for different languages, if you
train on a particular language, you will get good performance.
But for me, I think you can put similar languages together and
train the system on them, and train another model for all the
different languages together. That would be good, I think.
>>: This might be an interesting place to go deeper in your next
steps, like to really dig into the differences regionally,
different locales.
>> Kun Han: Yeah, because emotion is not very -- well, not
highly dependent on the language. Even if you don't understand
the language, you can still say that he's happy or sad or
something. Yeah.
>>: So it doesn't have language dependent features; it's more a
cultural difference. Let's say [indiscernible] speaking
Norwegian, it will be -- you'd better use the Italian language
setups, because that case typically is more excited, with more
emotions in the speech. In the opposite case, certain
[indiscernible] people speaking Italian, you may not see the
kind of different [indiscernible] difficulties [indiscernible]
emotion. And those are all examples from Europe. If you start
to go across oceans and continents, it's getting even more
[indiscernible]. So it's more cultural than [indiscernible]
dependent.
>>: I see, okay.
>>: And eventually, it's possible to find, I'd say, some large
general training data and to do some adaptation towards
[indiscernible] more emotion to find some correlated system.
But for now, what the [indiscernible] is you have to have a
label detected for each [indiscernible]. Spanish in Spain may
be different than Spanish in Mexico. More questions? So let's
thank Kun.