>> Ivan Tashev: Good morning, everyone. We're glad to have you here. Today we have Kun Han, a Ph.D. student at Ohio State University, and he is going to present his internship project, Emotion Detection From Speech Signals, with stress on gaming scenarios. Without further ado, Kun, you have the floor.

>> Kun Han: Thank you very much. My name is Kun Han. I'm a Ph.D. student at the Ohio State University. It's my honor to be here for the summer internship. Today I'm going to present my internship project, Emotion Detection From Speech Signals.

First, I would like to thank all the people who gave me help. I want to thank the Conversational Systems Research Center, and especially my mentor, Ivan, who hosted my internship for the whole summer. Thank you very much. I would also like to thank the Xbox team; they funded my project and my internship. Thank you. And I need to thank Dr. Dong Yu. He gave me a lot of help on my project, and we had some very helpful discussions, especially on deep neural networks.

Okay, let's move on to the talk. This is the outline. First I will give a brief introduction to the project, then I will discuss some of the state of the art from previous studies, then I will discuss some details of the approach and show some experimental results. The last part is the conclusion and some future work.

So what is emotion recognition? Here we focus on emotion recognition from speech: we want to extract the emotional state of a speaker from the speech signal. Basically, there are two types of emotion recognition. The first is utterance-level recognition, where we detect the emotional status of each sentence. Typically the sentence is not very long, maybe less than 20 seconds, so we assume that the emotional state within one sentence is constant; it doesn't change, and each sentence has just one emotion label. The other is dialogue-level recognition. That is more like monitoring: when people are talking about some topic, the emotional state can change over time, and in that scenario people can use temporal dynamics to capture the emotional state. In our work, we focus only on the first one, the utterance level.

There are some applications of emotion recognition. Since [indiscernible], we have already made big progress on artificial intelligence, but [indiscernible] we are still very far from a natural way to communicate with machines, because the machine doesn't know the emotional state of the speaker. So we can use emotion recognition to improve the experience of users, especially in the human-computer interface. In the gaming scenario, we can use the emotion recognition result to improve the gaming experience, and the same holds for voice search and eLearning. Another important application is monitoring and control. In this scenario, some people work under stress, for example in an aircraft cockpit, and it is very important to know the emotional state of the pilot.

Okay. Emotion recognition is actually still a fairly new area, and there are a lot of problems we need to solve. I list some of them here; there may be other problems, but these three may be the most important. The first one is the feature set. Unlike speech recognition, where people know that MFCC or PLP features are very effective, in emotion recognition we don't know which features are effective for the task. So right now, what people are doing is just trying different features, feeding them [indiscernible] to the classifier, and looking at the results.
But we don't know whether a feature is effective or not; all we see is [indiscernible] the classification result. The second problem is representation. You can imagine that emotional states are kind of [indiscernible]: there are always overlaps between different emotions, so you cannot find a clear boundary between them. We cannot say, okay, this is happiness, this is excitement, and here is the boundary. We cannot do that. The overlap always exists, and I will come back to this during the presentation. The third problem is that emotion is actually highly dependent on the speaker and the culture. Different people have different styles of expressing emotion, and most emotion recognition systems probably do not consider this problem.

Okay, now let's discuss some existing studies. The first topic is emotion labeling. The simplest way to label emotion is to use categories: just pick some basic emotions and label each utterance with one of them, like happiness, sadness, anger, or neutral. But as I said, we cannot find a clear boundary between different emotions, so some other studies use a dimensional representation, which shows each emotion as a coordinate on an emotional plane. There are two dimensions: one is valence, which means the emotion is positive or negative, and the other is arousal, which means the emotion can be active or passive. For example, anger is a negative emotion, but it is very active; boredom is also negative, but it is passive, so you don't want to say anything. You can label each emotion on this plane, so the label is just a coordinate. Some other studies use a third dimension, like dominance or tension, but these two, arousal and valence, are the most commonly used. Our work is essentially based on categorical labeling, but we also give a score vector over the basic emotions. The dimensional representation [indiscernible] is a principled way to represent emotion, but it is not easy to explain when you give a label; it's just a coordinate, which is not straightforward.

Also, the feature sets in previous studies include local features and global features. The local features are extracted from each frame, like the pitch and magnitude, some spectrum-based features like MFCC and LPC, and some voice-quality features like the harmonics-to-noise ratio, jitter, and shimmer. These are frame-level features, but we need to make a decision for each utterance, so we also need to extract global features based on the local features. A global feature is a combination of the local features: we collect all the local features from one utterance, and the typical way is to compute statistics, like the mean, standard deviation, maximum, and minimum, to get one global feature vector for the whole utterance (a small sketch of this appears below).

Then different features go with different classifiers. For local features, the traditional approach is to use a Gaussian mixture model for each emotion and compare the likelihoods. Some other people treat this very much like speaker identification: they use a GMM-UBM to construct a supervector and then use an SVM to do the classification. You can also use an HMM to capture the temporal dynamics. A recent work uses LDA, where each utterance corresponds to a document and each frame corresponds to a word, and uses LDA to do the classification.
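As a rough illustration of the global, statistics-based features described above, here is a minimal sketch. It assumes the frame-level (local) features are already available as a NumPy array; the exact feature list and set of statistics vary between studies, and the dimensions used here are only an example.

```python
import numpy as np

def global_features(frame_feats):
    """Collapse frame-level (local) features into one utterance-level
    (global) vector by applying simple statistics per feature dimension.

    frame_feats: array of shape (num_frames, num_feature_dims),
                 e.g. pitch, MFCC, LPC, jitter, shimmer per frame.
    """
    stats = [
        frame_feats.mean(axis=0),   # mean
        frame_feats.std(axis=0),    # standard deviation
        frame_feats.max(axis=0),    # maximum
        frame_feats.min(axis=0),    # minimum
    ]
    return np.concatenate(stats)    # one vector per utterance

# Hypothetical usage: 300 frames, 26 local features per frame
utterance = np.random.randn(300, 26)
print(global_features(utterance).shape)  # (104,)
```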
Most recent work actually likes to use global features, just [indiscernible] statistics across the whole utterance, for the classification. The SVM is the most commonly used classifier, and some people also use k-nearest neighbors and decision trees. The most recent studies use a deep belief network, but they still use the statistical features for feature extraction and then an SVM on top of that to estimate the emotional status.

There are some databases commonly used in emotion recognition. Different databases have different sources: some use acted emotions and some use natural recordings, that is, not acted, just recorded emotional speech. For example, this one is a recording of a talk show, and this one asks some kids to talk to a robot, so it's not acted emotion. We use this one, IEMOCAP. It is acted: actors were asked to portray certain emotions. This database has audio and visual signals; we only use audio here. The labeling is both categorical and dimensional. This database is very large and very rich, which is why we chose it; the other databases are relatively small.

Okay, let's go to our approach. This is an overview of the system. Given an utterance, the first thing we do is cut the utterance into different segments, and then we extract features from each segment, so we get segment-level features. Then we feed these segments to a deep neural network to estimate the emotional state of each segment. From its output we get the segment-level result, which can be the probability of each emotional state. For one utterance, we collect all the segment-level results, compute the utterance-level features, and use an [indiscernible] classifier to make the utterance-level decision. So this is

>>: [inaudible].

>> Kun Han: Yeah, I will talk about that. Yeah. Okay. When we extract the segment-level features, the first thing is framing, and then we convert the signal from the time domain to the frequency domain, with a window of 25 milliseconds and a 10-millisecond step size. The features here are pitch-based features, including the pitch value and the harmonics [indiscernible], and also MFCC features. We also use delta features across time. Then we [indiscernible] segments. Because the context information is very important, we include the frames before the current frame and the frames after the current frame, so the total is 25 frames. That is the segment, and we concatenate the features from each frame to get the segment-level features.

>>: The feature set per frame is pretty much the standard [indiscernible] features?

>> Kun Han: [indiscernible]. Okay. So now we have the segment-level features x_t. When we train the neural network, we need to give the training labels. Since we only have a label for every utterance, we essentially give all the segments from one utterance the same label, the label of the utterance. Also, we don't use all the segments in the training data, because an utterance also contains some silence, which we don't want to use, and in some speech the energy is very weak, so it may not contain much emotional information; we throw that away as well. We just pick the top ten percent of segments with the highest energy for training and classification. Okay.
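A minimal sketch of the segment-level feature extraction just described, under my own assumptions about the details: 13 MFCCs plus a YIN pitch track in an assumed 80-400 Hz range, deltas over both, a 25 ms / 10 ms framing at an assumed 16 kHz sample rate, a 25-frame context window, and RMS energy for the top-ten-percent selection. The actual pipeline in the talk may differ in these specifics.

```python
import numpy as np
import librosa

SR = 16000       # assumed sample rate
N_FFT = 400      # 25 ms window at 16 kHz
HOP = 160        # 10 ms step
CONTEXT = 12     # frames on each side -> 25-frame segments

def segment_features(wav_path):
    """Frame-level MFCC + pitch (+ deltas), stacked over a 25-frame
    context window; keep only the highest-energy segments."""
    y, _ = librosa.load(wav_path, sr=SR)

    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_fft=N_FFT, hop_length=HOP)
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=SR,
                     frame_length=N_FFT, hop_length=HOP)[np.newaxis, :]
    rms = librosa.feature.rms(y=y, frame_length=N_FFT, hop_length=HOP)

    n = min(mfcc.shape[1], f0.shape[1], rms.shape[1])
    frames = np.vstack([mfcc[:, :n], f0[:, :n]])                  # (14, n)
    frames = np.vstack([frames, librosa.feature.delta(frames)])   # add deltas

    # Concatenate +/- CONTEXT frames around each center frame.
    segs, energy = [], []
    for t in range(CONTEXT, n - CONTEXT):
        segs.append(frames[:, t - CONTEXT:t + CONTEXT + 1].ravel())
        energy.append(rms[0, t])
    segs, energy = np.array(segs), np.array(energy)

    # Keep only the top 10% highest-energy segments.
    keep = energy >= np.quantile(energy, 0.9)
    return segs[keep]
```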
The output of the deep neural network is a probability vector over the emotions for each segment. The deep neural network configuration uses three hidden layers. We also tried one, two, three, and four, and three gave the best performance [indiscernible], so we don't need to go to four or five. The units are rectified linear neurons, the objective function is cross-entropy, and we train with mini-batch stochastic gradient descent, so this is very standard. If we have K different emotions, the network outputs the probability of each emotion, okay?

Okay, this is an example. The blue line here represents the probability; there are five emotions: excitement, frustration, happiness, neutral, and sadness, okay? Basically, for most of the segments excitement has the highest probability, and for some segments it does not, but overall it dominates this utterance. So, yeah, [indiscernible] this sentence is excitement. But not all sentences behave this well; for some sentences the [indiscernible] is very noisy, so we need another classifier to make the utterance-level decision.

Okay. So once we have the segment-level outputs, we want to get the utterance-level decision. For the utterance-level classification, the input is an utterance-level feature. First we get the segment-level output for each segment, and over the set of segments we take the maximum, minimum, and mean for each emotion; that is one type of feature. We also count the number of segments with high probability for each emotion, that is, how many segments support this emotion; that is another feature. We combine them together as the utterance-level feature, and the output of the utterance-level classifier is the emotion score vector for the whole utterance. For this classification we tried different classifiers. Of course we used the SVM, which is a very popular classifier, and we also tried another classifier called the extreme learning machine, and we will compare them.

Okay, a short discussion on the extreme learning machine. The ELM is actually a single-hidden-layer neural network, but with a special training strategy. There is just one hidden layer, and the weights from the input to the hidden layer are just randomly assigned; those weights are random. From the hidden layer to the output layer, we use the minimum least-squares error to train the weights. The way people typically use it, the number of hidden units is much, much larger than the number of input units, so it is a random projection: from the input to the hidden layer, when we have a lot of hidden units, we can get a good representation of the training data. But also, since the weights are random, this hidden representation is not highly dependent on the training data, so it probably gives us good generalization performance. So

>>: How many features do you feed [indiscernible]?

>> Kun Han: The input layer is essentially just 20, 20 input units.

>>: [indiscernible].

>> Kun Han: Yeah, so we have five emotions, and for each there are three statistics, so three times five is 15. And there are five

>>: [indiscernible].

>> Kun Han: I tried different configurations, around, like, 100 [indiscernible].

>>: That many?

>> Kun Han: Yeah, I also tried more [indiscernible] units; the performance is similar. Okay.
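Here is a minimal sketch of the utterance-level feature construction and an ELM classifier as described above. It is written under my own assumptions: the "number of segments with high probability" is approximated as the fraction of segments above a threshold (the threshold value is mine), the hidden activation is ReLU, and a small ridge term is added to the least-squares solve for numerical stability; none of these specifics are confirmed by the talk.

```python
import numpy as np

EMOTIONS = ["excitement", "frustration", "happiness", "neutral", "sadness"]

def utterance_features(seg_probs, high_thresh=0.2):
    """20-dim utterance feature from segment-level DNN outputs:
    max, min, mean per emotion (3 x 5) plus, per emotion, the fraction
    of segments whose probability exceeds a threshold (5).

    seg_probs: array (num_segments, 5) of per-segment emotion probabilities.
    """
    feats = [seg_probs.max(axis=0),
             seg_probs.min(axis=0),
             seg_probs.mean(axis=0),
             (seg_probs > high_thresh).mean(axis=0)]
    return np.concatenate(feats)                       # shape (20,)

class ELM:
    """Single-hidden-layer network: random input weights, output weights
    fit in closed form by (regularized) least squares."""
    def __init__(self, n_hidden=100, reg=1e-3, seed=0):
        self.n_hidden, self.reg = n_hidden, reg
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.maximum(X @ self.W + self.b, 0.0)    # random projection + ReLU

    def fit(self, X, Y):                               # Y: one-hot emotion targets
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)
        # beta = (H^T H + reg * I)^-1 H^T Y  (ridge-regularized least squares)
        self.beta = np.linalg.solve(H.T @ H + self.reg * np.eye(self.n_hidden),
                                    H.T @ Y)
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta           # emotion score vector
        return np.array(EMOTIONS)[scores.argmax(axis=1)]
```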
So what we need to train is just the weights from the hidden layer to the output layer. We minimize the least-squares error, and it turns out we only need a [indiscernible] to solve this equation, [indiscernible], so it's a very fast, very efficient way to do the training, and essentially it gives us good performance. These are the advantages of the ELM: there is no gradient descent, so it's very fast, and it gives good generalization. We also compared it with the support vector machine; essentially it gets better performance than the SVM, and it's much more efficient, around ten times faster. The ELM also has a kernel version; we use the kernel version as well and compare it [indiscernible].

For the performance measurement, we use two types of measures. The first one is called weighted accuracy; it is essentially the standard classification accuracy, the number of correctly labeled utterances divided by the number of all utterances. The unweighted accuracy, essentially, is computed by taking the accuracy within each class and then averaging over the classes, so this measurement kind of requires you to get balanced accuracy across the emotion classes.

Okay, now the experimental results. This is the dataset we are using, the IEMOCAP database. It's an acted, multimodal, multi-speaker database including video, speech, motion capture, text transcriptions, and so on. Each utterance is annotated by three human annotators, who use both categorical and dimensional labels; there are around eight to nine different categories. Since there are three annotators, when we built our corpus we selected a sentence only if at least two annotators gave it the same label. If the three annotators gave three different labels, we don't use it, because we don't know the ground truth.

Okay. In our corpus we use happiness, excitement, sadness, frustration, and neutral; these five emotions commonly appear in the gaming scenario. For training we use eight speakers, four male and four female, and for testing two speakers who are not seen in the training set, so it's speaker independent. We don't use the visual signal or the text transcription, just the speech signal. I also want to mention that we don't use speaker normalization. A lot of studies use speaker normalization: they normalize the features for every speaker. That assumes you know the speaker ID, because when you normalize in the test phase you need to know the normalization factor for each speaker, but we don't use that. Some studies show that speaker normalization gives a much larger improvement, but we just use this more realistic training scenario.

Okay, the gaming scenario. Right now we have the clean speech, and then we create speech in the gaming scenario. In the gaming scenario we have three [indiscernible]: the speech, where the speaker says something and we want to label the emotion for this person; five loudspeakers that are playing a movie, a game, music, and so on; and some background noise like the air conditioner and room noise. All of these sound sources are captured by the Kinect, which has four microphones. This mixture also includes room reverberation, because in this room there is reverberation.
So the [indiscernible] signals reaching the microphones include reverberation and also the loudspeakers, so they are convolved [indiscernible] with room impulse responses to get the mixture, and then we use the Kinect audio pipeline to attenuate the noise. Then we get the processed signal, and this signal is used in our task, okay? This is the configuration for creating the gaming-scenario corpus. The loudspeaker tracks include ten different sound sources, five from games and five from movies. In the room we use twelve different positions; different positions give different room impulse responses, from one meter to four meters, in the center, left, and right. We randomly pick a position, mix with the loudspeakers and also the background noise to create the mixture, and the levels are also randomly chosen. So this is the [indiscernible], and I can play it. This is the clean speech.

>>: I'm so excited, I'm like a kid. I can't believe I got out of the house with my fly zipped up.

>> Kun Han: This is the Kinect mix.

>>: [indiscernible].

>> Kun Han: This is the processed speech.

>>: I'm so excited, I'm like a kid. I can't believe I got out of the house with my fly zipped up.

>> Kun Han: So there is some distortion, but most of the noise is removed. We will use the clean speech and the processed speech. Can anyone tell me what the emotion of that utterance is? I'll give you five options: excitement, frustration, happiness, neutral, and sadness. I can play it again if you want.

>>: I'm so excited, I'm like a kid. I can't believe I got out of the house with my fly zipped up.

>>: He laughed at the end.

>>: There are five options.

>>: He said excitement.

>>: Excitement.

>> Kun Han: I want three answers.

>>: Excitement. It's excitement.

>> Kun Han: Okay, excitement, okay. And anyone for frustration? No frustration. Happiness? Okay. Neutral? One neutral. Sadness? Okay. I think most of you guys are good evaluators, because the answer is: we have three annotators, two gave excitement, one gave happiness. So

>>: That's pretty much all of us; we labeled it properly.

>> Kun Han: Yeah.

>>: When they annotate, do they see the video?

>> Kun Han: Yeah, they see the video. There can be some difference: if you only listen to the speech versus when you also watch the video, maybe you give a different label. It is possible, yeah.

>>: It's interesting, even the words he says give you a hint, I think. It's not just the way they're saying the words; it's the actual words.

>> Kun Han: And also, essentially, video and speech have different effects for different emotions. For some emotions you probably get the information from the video, but for some emotions maybe you capture it from

>>: [indiscernible].

>> Kun Han: What?

>>: The annotators, do they have [indiscernible] language to annotate? [indiscernible].

>> Kun Han: [indiscernible].

>>: So [indiscernible] use the meaning of the phrase.

>> Kun Han: Yes, we use all the [indiscernible] we can get.

>>: [indiscernible].

>>: [indiscernible].

>> Kun Han: Sorry?

>>: In the same sound, it says it's sad if it picks up.

>>: That is a good [indiscernible].

>>: If you laugh at the end of the sentence, I am sad.

>>: I'm not sad.

>>: [indiscernible].

>> Kun Han: Okay, so we'll compare our approach with two existing algorithms. The first one is local features with an HMM. The HMM just takes the frame-level features, and for each emotion [indiscernible], and then uses maximum likelihood to pick the emotion.
So we train one HMM for each emotion, with four fully connected states per emotion and GMMs to represent the observation probabilities, and determine the emotion by the maximum likelihood. We also compare with global features plus an SVM. There is a toolkit called OpenEAR; this is a very popular toolkit for emotion recognition, and it creates a very, very large feature set: MFCC, pitch, LPC, zero-crossing rate, and so on, around eight to ten different acoustic features, and for each feature it applies statistical functionals, the mean, variance, skewness, kurtosis, maximum, minimum, and so on, a lot of statistics, so the feature set has 988 dimensions. Then an SVM is applied for the classification.

So we compare these five: the HMM, OpenEAR with the SVM, and our approach, the deep neural network with the SVM for the utterance-level classification, the deep neural network with the extreme learning machine, and this one, the DNN with the ELM using the kernel version. This is the result, in weighted accuracy, for the clean speech and for the gaming scenario. This side is the clean speech. Basically, the HMM gets the lowest performance, OpenEAR is a little bit better, and the DNN-based systems are significantly better than these two. You can also compare the SVM with the ELM.

>>: I'm sorry, what is the [indiscernible]?

>> Kun Han: Yeah, it's just the classification accuracy. We have five emotions; we just count the number of correctly labeled utterances divided by the total number of utterances. Just the standard accuracy, yeah.

>>: And what would be perfect accuracy? What number would it be, one?

>> Kun Han: Yeah, one. Yeah.

>>: [indiscernible].

>> Kun Han: Yeah, this is not a very high number. The highest number on this corpus is like 60 to 70, but they use speech plus visual plus speaker normalization; all the information together gets a number like that. And also, they use four-way classification; here we use five-way classification. Okay.

>>: Quick question. When you calculate the accuracy, take that one example you showed us where two people tagged it as, I think, excitement and one tagged it as happiness: does that mean it's a failure if you don't get both excitement and happiness? Or is that

>> Kun Han: For that one, the label in the training is excitement.

>>: Because two people

>> Kun Han: Yeah, two people.

>>: So if your algorithm classified it as happiness, that would count as a failure?

>> Kun Han: Yes.

>>: Like you made your point, video plus the words spoken plus the audio really need to work in concert to get really good accuracy. What is the actual, is there like a ground truth for what a human could possibly do if you just heard the audio? Because that would be the best your algorithm could probably ever hope to achieve.

>> Kun Han: Yeah, if you just take the audio, I mean, if you ask an annotator to label it, the labels can be very different. It can be different, yeah. But this [indiscernible] provides the labels based on the recording, the audio, and it actually includes the text transcription and so on. That is kind of what we believe is the true emotion for this utterance. But of course, this is not a really true ground truth, because there is a labeling [indiscernible] problem for emotion: you always ask people to label the utterance, but people always have different feelings about the emotions. Yeah, but we can only base the training and test on the labels provided by the corpus.

>>: Okay.
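For clarity, here is a minimal sketch of the two accuracy measures used in these results, as defined earlier in the talk; the function and variable names are mine, and the tiny example below is hypothetical.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Standard classification accuracy: correct utterances / all utterances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every emotion class counts
    equally regardless of how many utterances it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Hypothetical imbalanced test set: predicting "neutral" for everything
truth = ["neutral"] * 8 + ["sadness"] * 2
guess = ["neutral"] * 10
print(weighted_accuracy(truth, guess))    # 0.8
print(unweighted_accuracy(truth, guess))  # 0.5
```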
>> Kun Han: Okay. And essentially, if you compare the clean speech and the gaming scenario, there is some decrease, on average around five percent, but it's not very bad; it's around a five percent drop. This is the unweighted accuracy; essentially you need to get a kind of balanced result across the classes. You can find the same trend here, and this is the clean one. The HMM is even worse, the SVM is better, the DNN-based systems are much better, and the ELM is better than the SVM. But overall, the ELM and the ELM with the kernel are pretty similar; they are comparable.

Okay, this is the confusion matrix. This is the gaming scenario, and this is the clean scenario. This column is the true label, the label from the data, and this is the label from our approach. You can see that for the gaming scenario, on average, the per-class accuracies do not differ too much; they differ, but not as much as for the clean speech. And there is a very interesting thing: if you compare the gaming and the clean results, for excitement, frustration, and neutral you get very, very similar performance, 0.5, 0.6, 0.36, very similar. But happiness and sadness are very different. With clean speech you get very clear performance for sadness but very poor performance for happiness, while for gaming it is pretty much the opposite: happiness is good and sadness is lower.

>>: So now [indiscernible].

>>: [indiscernible].

>>: You should think about speech and [indiscernible].

>> Kun Han: [indiscernible].

>>: Actually, it's better [indiscernible].

>> Kun Han: Okay. So let's go to the conclusion and discuss future work. Basically, we designed an emotion recognition algorithm with a simplified feature set, used a deep neural network for the segment-level classification, and used the ELM for the utterance-level classification. Our new approach outperforms the state of the art from previous studies by around 13 percent relative. In the gaming environment, we do see some negative effect on emotion detection, around a five percent drop; still, the new algorithm outperforms the state of the art by around 13 percent relative. Okay.

I also want to mention some directions for future work. The first one is to go multimodal, and there are also some more technical ones. Okay. The corpora [indiscernible] right now provide speech, the audio and visual signals, and some provide the text transcription. Previous studies have already shown that when we include other information, like the visual signal, we can improve the emotion recognition result significantly. This multimodal information is available in the corpus, so we can use it. The [indiscernible] measures of course include gesture dynamics, and the facial expression is also very important for recognizing emotion. And with a speech recognizer, if we know what they are saying, maybe that can be another [indiscernible] cue for emotion recognition.

Okay. Then for the more technical future work: in our approach we use the ten percent of segments with the highest energy as training and test samples, because we believe these strong segments contain more emotional information that is informative for our task. But why this ten percent? The question is whether we can determine the informative segments directly from learning. The idea is that we can choose the best training samples using the last trained model.
When we get these best training samples, we feed them to the next model for training. Then the next model gets better training samples, and so on; we just throw away the non-informative segments. In principle, this new model should give sharper probabilities, because with better training samples the trained model should give better performance. This is the idea: we first take the whole training set and train a DNN, and with this DNN we select the second training set, which is chosen from the first one, and then we keep training like this a few times. With these trained DNNs, in the test phase we just feed the [indiscernible] to all of them and combine the results; maybe we can get better performance from the [indiscernible] results. With this hierarchical training versus the ordinary training, for the unweighted accuracy we get about two percent better performance; the weighted accuracy is pretty similar. The improvement is not very large, but I still believe there is some work we can do, like how to choose the best examples and how to combine the different DNN models together to get good utterance-level performance.

We should also incorporate some temporal dynamics. Previously, the HMM was trained in an unsupervised manner; we don't have labels, we just train an HMM for each emotion. But here, when we have the DNN, it can give a label for each segment, and with those initial labels we can use supervised [indiscernible] to train the [indiscernible], which in principle should give better performance [indiscernible].

Also, this one is something I actually did some work on at the beginning, and it's quite interesting. Right now, the problem with emotion recognition is that we pick hand-crafted features, like MFCC, PLP, whatever, and put them together as a feature to train the system. But essentially, from the [indiscernible] point of view, it should be possible to use the spectral features directly, because the spectrogram doesn't lose any information; everything is in the spectrogram. Also, motivated by recent progress in speech recognition [indiscernible] training recognizers on filter bank features, maybe in emotion recognition we can also train directly on the spectrogram and let the learning machine learn all the features and then do the training. But in my experiments, the results were not very good: the frame-level accuracy is lower than MFCC plus pitch, around four percent lower. But it's worth trying; maybe we can try different parameters like the window length or something, and maybe more data will benefit the DNN training.

Okay, that's it. Yeah. These are some important references, and that's pretty much my presentation. I also want to share this. I have been in the U.S. for a few years, and I would say this summer is the most wonderful summer I have had in the U.S. This picture was taken a few weeks ago; I spent two days climbing Mt. Adams. It's 12,000 feet, and it was very hard and very cold, but it was a very, very nice experience. There is no sound and no voice here, but if you just look at the picture, if you look at my face, my emotion is very, very excited.

[laughter]

>> Kun Han: Thank you.

>> Ivan Tashev: Thank you, Kun. Questions, please? No questions?

>>: Have you tested it on real players besides [indiscernible]?

>> Kun Han: [indiscernible]?

>>: So in the presentation, you said you tested on the acted corpus. Have you tested [indiscernible]?
>> Kun Han: [indiscernible], but I tested on data from the same corpus that is actual, not acted, data. There is some drop, I don't remember the number, but it is not very much worse. There is some drop, yeah. But if you really want to use it with actual data, you may need to train on actual data. But here we

>>: [indiscernible] provided by your team [indiscernible].

>>: I was wondering whether you or somebody did the same work with an additional [indiscernible] unknown. [indiscernible] is not unknown. Unknown is, for example, all those samples where the [indiscernible] are different. So you [indiscernible] because many, many [indiscernible].

>> Kun Han: Yeah, you are right. Yeah, you can believe that if the [indiscernible] give different labels, then maybe that emotion is very difficult to describe, but it's still some real emotion. Yeah, that's right. But in our experiments, I [indiscernible].

>>: [indiscernible] to drive.

>> Kun Han: Yeah, that's very interesting. I haven't thought about that. But yeah, it's some emotion, but we don't know what it is, yeah.

>>: In your test with the sound in the background, is it just the algorithm that changes? I'm just thinking about how people sometimes speak loudly, like they speak up louder when they think somebody can't hear them. So if I have something running in my background and I try to talk to the Kinect, I might sound more excited or something because I'm trying to project. Did that kind of thing come out, or did it not really change? Like, could you still detect the emotion even if somebody was trying to be more emphatic?

>> Kun Han: You mean the background, the background maybe masked what you are saying?

>>: Because of the background noise, they might be changing the way they're talking to the machine, because they're trying to talk over it, or trying to talk over the sound in the background.

>>: So this goes back to the [indiscernible], and we didn't account for that. Technically, yes, people speak differently when there's a loud sound; they try to get more energy through. For this corpus, it is just purely synthetic: you take the [indiscernible], we have the noise, but it's the same speech regardless of the loudspeaker, and that may affect the [indiscernible] of the classification.

>>: You mentioned at the beginning that different cultures express themselves in different ways. Your data corpus, though, had German and United States averaged together. Did you see, like, where you were substantially better with Germans versus

>> Kun Han: This is a very interesting question, because I haven't done that experiment, but some [indiscernible] have already put different languages together, like German, English, Spanish, something like that, and the performance is still very good. But if you test on a different culture, like Chinese or Japanese, then the [indiscernible] are very different. It means that English, Spanish, and German are kind of similar to each other, and in terms of emotion they maybe have a similar way of expressing it. But if the languages are very different, then the system may not work for the other language.

>>: So you didn't get deep into that, then, where you would make a recommendation: oh, for every language, or for every country-language pairing, you should have a different training set?

>> Kun Han: Yeah, of course, for a different language, if you train on that particular language, you will get good performance.
But for me, I think you could put the similar languages together and train the emotion system on them, and train another model on all the very different languages together. That would be good, I think.

>>: This might be an interesting place to go deeper in your next steps, like to really dig into the differences regionally, in different locales.

>> Kun Han: Yeah, because emotion is not very, not highly dependent on the language. Even if you don't understand the language, you can still say that he's happy or sad or something. Yeah.

>>: So it doesn't have language-dependent features; it's more a cultural difference. Let's say [indiscernible] speaking Norwegian, you had better use the Italian language setups, because that case fortunately is more excited, with more emotions in the speech. In the opposite case, certain [indiscernible] people speaking Italian, you may not see the kind of different [indiscernible] difficulties [indiscernible] emotion. And those are all examples from Europe. If you start to go across oceans and continents, it's getting even more [indiscernible]. So it's more cultural than [indiscernible] dependent.

>>: I see, okay.

>>: And eventually, it's possible to find some, I'd say, large general training data and to do some adaptation towards [indiscernible] more emotion, to find some correlated system. But for now, what the [indiscernible] is, you have to have labeled data collected for each [indiscernible]. Spanish in Spain may be different than Spanish in Mexico. More questions? So let's thank Kun.