>> Mary Czerwinski: Okay, folks, I think we should get started. I'm very happy today to introduce you to Na Yang. She's from the University of Rochester, and she works in a very interesting complement of areas including wireless sensing, speech, HCI, and most recently affective computing. So I think she's going to tell us a lot about that today, and I just want to welcome you.

>> Na Yang: Yeah, thanks Mary. Thanks for having me. It's my pleasure to come here today, and thank you very much for attending my talk. Today my talk is about emotion sensing. Emotion is really a primary form of communication between humans, and that motivates a lot of HCI projects like behavior sensing or designing context-aware systems. So today I'll talk about how we can use speech for emotion sensing, and how we can combat noise if we want to apply that in the mobile scenario.

So let's talk about emotion sensing. There are a lot of applications that can be enabled. For example, for fun, you could do this in a gaming scenario, or with this robot recently released by SoftBank called Pepper, which is meant to be a member of the family that can talk to people. Also for health: we may want to monitor the emotional state of people under pressure, like drivers or representatives working in call centers. One of the applications I think can be used for children who suffer from autism, so that others can better communicate with them by visualizing their emotional state. Also in behavior studies, and that is actually what originated our research at the University of Rochester: our engineering team is collaborating with psychologists to deal with some problems that teenagers and their family members have, so there is very interesting interdisciplinary research going on. And, of course, mobile. With the increasing number of applications on mobile platforms, a lot of context can be provided through the mobile platform, like how fast we type or how many errors we made, but speech is still the primary form of communication, and we'll talk about how we can sense emotions through speech.

At Microsoft Research, it's a great opportunity that so many researchers are actively working in this field, for example sensing affect and visualizing it through sensors or some affective fabric. Also, Evans has one project working on trying to detect emotion using speech signals and neural networks and applying that in the noisy gaming scenario. There is also this robot project, where researchers have been trying to design systems that can communicate with people in a more natural way. And there is other research going on using social media, mobile sensing, and other contexts for emotion sensing. In academia this is still a very hot and active field where a lot of researchers are putting their effort; these are just a few examples that motivated my research.

So with that said, the research I focus on is emotion sensing using mobile platforms. The reason is that the mobile phone is a very personal item that people carry all the time, so it might be easier to capture the user's real emotions. Also, if we can train the system using the user's own voice, we can get a system with higher accuracy by using speaker-dependent training. Also, as came up during a chat with Paul the other day, it's rather difficult to detect passive and negative emotions, especially when the user has carried that emotion for a long time.
Mobile phones, on the other hand, are very good at long-term monitoring. And, of course, mobile can provide other contexts. But among all these contexts, speech is easy to capture, which helps save battery life, and if we only detect emotions through speech features, not through what people actually say, we can better preserve the user's privacy. But if we apply speech-based emotion sensing on the mobile platform, noise is a factor that cannot be ignored. So that forms the outline: in the next 40 minutes I will first talk about how we designed a noise-resilient speech feature extraction method, and especially, since pitch is a very important feature, how we designed noise-resilient pitch detection; then how we designed the system to classify emotions based on the extracted speech features; and finally how we applied the system on a mobile platform.

Okay. So here is just a notion of what pitch is: it's defined as the highness or lowness of a tone as perceived by the ear. So if we play this sample.

>> 309. 309.

>> Na Yang: Do you know which of those two has the higher pitch frequency?

>> [indiscernible].

>> 309.

>> Na Yang: 309. Are you sure? How many think that's A? Okay. And the rest are B? Okay. Actually it's B. Because 309 -- so we really need the computer to help us do this pitch detection. That's a very simple illustration, but pitch can be used in a variety of ways. For example, we can detect pitch and do emotion sensing, the first one, and also [indiscernible] so we can do speech recognition, and in music we can do automatic music transcription or music information retrieval.

So this plot shows us what pitch looks like in the frequency domain. The airstream is generated from our lungs, like the power supply, and pitch is generated by the vibration of the vocal cords, so the resulting signal shows up in the frequency domain as these spectral peaks. The first dominant peak here is the pitch value we want to detect, and all the peaks that follow are called harmonics. After the signal goes through the vocal tract, which acts like a resonator, it is shaped with ups and downs, so the spectrum on the top shows what the speech signal that comes out of our mouths looks like in the frequency domain. It's not necessarily the case that the pitch has the highest amplitude, and that adds difficulty to detecting pitch. Compared with that signal, this signal is what we generated with 0 dB [indiscernible] noise added, so we cannot tell which peaks come from the noise and which come from the speech. Let's bring back the clean speech signal. We can see the pitch located at 192 hertz, if you map it to the horizontal frequency axis, and it's followed by the first harmonic, the second, the third and the fourth, but the peaks corresponding to the noise can also be very high in terms of amplitude.

So the key notion of the algorithm we proposed, called BaNa, is that we only use the peak frequencies, not the amplitudes. Our algorithm is based on calculating the frequency ratios between the selected peaks. The first step is to select the five peaks with the lowest frequencies; in this example, these one, two, three, four, five peaks would be selected. This looks a little bit complicated, but I will explain it in more detail. These one, two, three, four and five peaks are summarized in this table, and we calculate the frequency ratio between each pair of them.
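As a rough illustration of the peak-selection and ratio step just described, here is a minimal Python sketch. It is not the released BaNa code; the frame windowing, the FFT, the 50 to 3000 Hz search band, and the choice to rank peaks purely by frequency are assumptions made for this example.

```python
import numpy as np
from scipy.signal import find_peaks

def lowest_peak_ratios(frame, fs, num_peaks=5, f_lo=50.0, f_hi=3000.0):
    """Frequencies of the lowest-frequency spectral peaks in one frame,
    plus the frequency ratio of every pair of them (higher over lower)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    # only search for peaks inside a plausible pitch/harmonic band
    band = np.where((freqs >= f_lo) & (freqs <= f_hi))[0]
    peaks, _ = find_peaks(spectrum[band])
    peak_freqs = freqs[band[peaks][:num_peaks]]   # the five lowest-frequency peaks

    # frequency ratio between each pair of selected peaks
    ratios = [(peak_freqs[j] / peak_freqs[i], peak_freqs[i], peak_freqs[j])
              for i in range(len(peak_freqs))
              for j in range(i + 1, len(peak_freqs))]
    return peak_freqs, ratios
```

Each ratio is then compared against the table of ideal harmonic ratios that the talk describes next, which is how the pitch candidates are formed.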
So, for example, 2.04 is calculated here. Then we map all these values to this table. This table shows the ideal harmonic ratios, the expected harmonic ratios. For example, the first harmonic should be located at twice the pitch value. Ideally, for human speech, harmonics are placed at integer multiples of the pitch value, so for example 100, 200, 300. But at this point we cannot tell whether the five selected peaks correspond to noise or to speech, or which order of harmonic they belong to. So we calculate the frequency ratios and get these pitch candidates. And we combine two additional pitch candidates. One is from the cepstrum method, because while the other candidates focus on the low-frequency values, with the cepstrum pitch value we get a better view of the global information in the frequency domain. The other additional pitch candidate is the lowest peak frequency itself, because, for those of you who are experts in the signal processing domain, for some speech signals there is only one dominant peak in the spectrum, and with only one peak we cannot calculate a ratio between two. By combining all the pitch candidates calculated here and above, we put them all into a table and count how many close candidates there are, and we use that count as a confidence score. For example, we can see this candidate, 198, has the highest number of close candidates. The higher the confidence score, the more likely that is the real pitch value. The final step of the BaNa algorithm is to build a cost function over the pitch candidates of neighboring frames. For the first term, if the difference between two candidates is small, which is more likely for human speech, we give that a lower cost. The second term is the confidence score: the higher the confidence score, the lower the cost. We go through all the frames and determine the final pitch from all these candidates, and that forms our BaNa algorithm.

To evaluate the BaNa algorithm, we introduce one error metric called the Gross Pitch Error rate. If the detected value deviates more than 10% from the real pitch value, we say that pitch is incorrectly detected, and the Gross Pitch Error rate is the percentage of such frames, so the higher that value, the worse. For the ground truth, for example, I can play this audio file again.

>> 309.

>> Na Yang: So that is the sample from the quiz earlier. We take the pitch values detected by three algorithms that perform very well and average their values to use as the ground truth. For the noise, we add noise to the clean speech. For example, 108 in this sample.

>> 108.

>> Na Yang: And with white noise added.

>> 108.

>> Na Yang: So we validate our system with different types of noise and at different noise levels.

>> 108.

>> Na Yang: That's the cleanest one.

>> 108. 108.

>> Na Yang: And that's 0 dB, the noisiest one. And we compare our BaNa algorithm with both classic and very recent algorithms. The algorithms listed in blue are the ones I will compare with later. Because this type of noise is the most common, we vary the noise level from 0 dB SNR, which is the worst, and you can see the Gross Pitch Error rate is highest there, up to the cleanest one at 20 dB SNR. And our BaNa algorithm performed better than all the previous algorithms.
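A minimal sketch of the Gross Pitch Error rate described above: the fraction of voiced frames whose detected pitch deviates from the ground truth by more than 10%. The per-frame pitch tracks and the convention that unvoiced frames are marked with 0 Hz are assumptions for this example; the 10% tolerance follows the talk.

```python
import numpy as np

def gross_pitch_error(detected, ground_truth, tol=0.10):
    """Fraction of voiced frames with more than `tol` relative pitch error."""
    detected = np.asarray(detected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    voiced = ground_truth > 0                 # score only frames that carry a pitch
    rel_err = np.abs(detected[voiced] - ground_truth[voiced]) / ground_truth[voiced]
    return float(np.mean(rel_err > tol))

# toy check: one of the four voiced frames is off by more than 10%, so GPE = 0.25
print(gross_pitch_error([100, 210, 150, 0, 300], [100, 200, 151, 0, 340]))
```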
And this just shows the result on one dataset we validated on. In our research we validated on three datasets in total. I presented several samples here.

>> [indiscernible].

>> Na Yang: That's 20 samples for this one, but by arranging all the different types of noise and all the different noise levels, it's still a large dataset. And we have others, called CSTR and KEELE, which you may be quite familiar with; that's more than 100 clean samples, and if we multiply that by five different SNR values and by eight different noise types, that's a huge dataset.

>>: Are they all samples with noise added, or do any come with environmental noise?

>> Na Yang: With a type of noise added.

>>: They all start off with clean samples.

>> Na Yang: Yes.

>>: Are the lengths typical of the examples you showed?

>> Na Yang: Some samples are around ten seconds, and some are two seconds, three seconds.

>>: Are they all prompted speech, or are they conversational speech?

>> Na Yang: They're all prompted speech. That's mostly in order to get the ground truth pitch value. There are some datasets with speech recorded in real noisy scenarios, but for those it's hard to get the ground truth. PEFAC and [indiscernible] are the two most powerful algorithms we compare against. Looking at the performance at 0 dB for the different types of noise, we can see BaNa gets the best performance for four out of the eight noise types. The BaNa algorithm is also open source; we made it available on our group's website, and we also developed an application that can visualize your pitch value, so if you're interested you're more than welcome to download it.

So the second topic is how we can use these extracted pitch features to do emotion classification. Emotions can be classified on the arousal level, active or passive, or on the valence level, positive or negative. Some people also use categories; for example, all these blocks here show different emotion categories. In our research we only pick the six basic emotions, because that makes it easier to compare with others' work that also uses the same six emotions, and because these six emotions are widely used in psychology studies. There are many emotional speech datasets out there. Some are acted, some are natural conversation, but for natural conversation we have to have human coders label the emotions, so we just use acted ones, where actors or actresses are invited to perform different types of emotions. Some datasets use audio only, some use audio plus visual. Of course, it would be more complicated for the system if we combined other modalities, so we just use audio. Also, one of the challenges previous researchers pointed out is that if you use both audio and visual, some speakers tend to express their emotion only by facial expression, so their audio alone doesn't tell us much about the emotions. So we just use audio. We only focus on English, so we chose the LDC dataset; there are other datasets out there for different languages. The LDC dataset just contains numbers and dates, so the speech content has neutral meaning but is expressed in different ways, for example by this male speaker with different emotions.

>> 108.

>> Na Yang: What's happening?

>> August 18th!

>> Na Yang: You can hear the difference.

>> December 1st. March 21st, 2001. December 12th.

>> Na Yang: That's a very good dataset. We use it especially because a lot of the literature uses it, so it's a good way to compare our results with other benchmarks.
Also, through a collaboration between the University of Rochester and the University of Georgia, we collected over 10,000 samples from undergraduates, expressed in different ways. But of course they're not professionals, so if you listen to their recordings.

>> 506. 502. October 12th. October 1st. 4,001. 203.

>> Na Yang: From my point of view, those sound quite similar, especially since the recordings were made in a relatively noisy environment. So --

>>: Sounded like that was [indiscernible] speech. What conditions were they -- were they recording over the phone or --

>> Na Yang: Yeah, I think.

>>: In a lab.

>> Na Yang: I think the quality is not that good.

>>: But they did it over a phone.

>> Na Yang: Yes.

>>: It's eight kilohertz.

>> Na Yang: Yeah. So I'll show the results on these two datasets.

>>: Do you have numbers on how well people do at classifying this?

>> Na Yang: They just did the recording, but we don't have human coders to actually label whether each sample really matched that emotion. Yeah. So these are the speech features we used, both in the frequency domain and in energy. They are very basic features that are widely used, and we also include the difference of pitch and the difference of energy, because that gives us a better picture of how the pitch, the tone, and the speech change over time. For all these features we extract five statistical values, plus the speaking rate, so in total our feature set has 121 features. The classifier we use is the support vector machine. The reason is that, compared with unsupervised learning methods, we can take advantage of the emotion labels; also, we use the RBF kernel, which can better deal with linearly inseparable data, and we have the C parameter, which can be tuned to prevent overfitting. Of course, there are other methods out there, like the deep networks used by Persy Evans, an intern, which would normally sit on top of a system like ours. Now, if we listen to this sample, here comes the second quiz: what's the emotion of this speech sample?

>> June 28th. May 29th. September 3rd. 505. 312.

>> Na Yang: Yep.

>> [indiscernible].

>> Na Yang: Yep.

>> October 12th! The second one. 8,005! 906! 203!

>> Na Yang: So how many think that's -- okay. And but I saw some didn't -- okay. So here comes the divergence. For some samples we can clearly see what the speaker wants to express, but other samples are relatively ambiguous. So our approach uses this observation to just throw away those ambiguous samples.

>>: Is the sample itself ambiguous, or for some of them is it harder to determine the difference between, say, angry and disgusted?

>> Na Yang: Yeah, some of them are thrown away.

>>: Are you throwing them away based on people labeling them ambiguous, or based on the algorithm coming up with ambiguous results?

>> Na Yang: By the system, by the confidence of the system's output. That sounds like a really simple idea, but we want to see how it can help us improve the accuracy of the system. Yeah. So ours is quite simple: we use support vector machines with one-against-all classifiers for the different emotions, for example happy or not. We train the system using part of the LDC dataset and do cross validation to test on new samples, or we use the speaker-independent method, testing the performance of the system on a new speaker.
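To make the feature set described above more concrete, here is a reduced Python sketch that builds an utterance-level vector from statistics of the pitch and energy contours and their frame-to-frame differences, plus a speaking-rate term. It covers only a small subset of the 121 features; the particular five statistics (mean, standard deviation, min, max, range) and the voiced-frames-per-second proxy for speaking rate are assumptions for illustration, not the exact definitions used in the talk.

```python
import numpy as np

def five_stats(x):
    """Assumed statistics: mean, standard deviation, min, max, range."""
    x = np.asarray(x, dtype=float)
    return [x.mean(), x.std(), x.min(), x.max(), x.max() - x.min()]

def utterance_features(pitch_track, energy_track, frame_rate_hz=100.0):
    feats = []
    for track in (pitch_track, energy_track):
        track = np.asarray(track, dtype=float)
        feats += five_stats(track)            # statistics of the contour itself
        feats += five_stats(np.diff(track))   # statistics of its frame-to-frame change
    # crude speaking-rate proxy: voiced frames (pitch > 0) per second
    voiced = np.asarray(pitch_track, dtype=float) > 0
    feats.append(voiced.sum() / (len(pitch_track) / frame_rate_hz))
    return np.array(feats)
```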
Then the fusion step, as you can see here, just compares the one-against-all output with the highest confidence score against a threshold gamma. Gamma can be controlled by the user: if the confidence value is greater than gamma, then we are confident enough to classify that sample into a certain emotion category; otherwise we just reject that sample. Our system also uses three additional enhancement strategies, speaker normalization, oversampling the training set, and feature selection, to further improve the accuracy. This plot shows the accuracy versus the rejection rate, where the rejection rate is the percentage of samples we throw away. We can see that as we throw away more samples, higher accuracy can be gained, with the curve growing from around 80% to around 92% if we throw away half of the samples. That 80% with nothing thrown away should be compared with one over six, because we classify one emotion out of six emotion categories, which is around 17 percent, so we can see it's a huge improvement. This is based on cross validation on the LDC dataset: the red curve is for the gender-neutral case, the blue curve is gender-dependent male, and the black curve is gender-dependent female. We can see that if we train the system using only female speech and test it on females, we get better accuracy. But we are still trading higher system accuracy against throwing away some samples, and that relies on the assumption that for some speech-based systems or applications, we can throw away a few samples if we are not confident about them.

>>: Seems like there's an important thing to add, maybe what Steve was bringing up earlier. A, it probably matters whether the classifier's notion of confidence corresponds to human judges' notion of confidence, because if those are different it might put you in a funny place: you might be rejecting things that humans don't find ambiguous. You want to make sure those are aligned; that's one thing. Another thing would be to ask what that notion of confidence looks like when you're looking at, say, natural speech, because I would expect that acted samples tend to be quite extreme.

>>: That's why we use that.

>>: Natural speech, it would be interesting to see.

>> Na Yang: That actually brings us to our current ongoing research. To answer your first question, we are comparing the performance of our system with human coders, because we are uploading all the samples to Amazon Mechanical Turk and asking human coders to code them, and it will be interesting to compare how the system does. To answer your second question, right now we're just working with the acted dataset because it's widely used, so I can show the performance comparison with other papers later. And that brings us to our future work, because we are collaborating with some psychologists who are doing interesting user studies using real conversations between family members, with conflicts involving teenagers as an interesting case. So, yeah, that should be interesting.

>>: I'm curious about the accuracy. Is it based on the samples that are left after rejection, or based on all samples?

>> Na Yang: That's a very good question. We just ignore the samples that are thrown away and compute it based only on the samples we are confident to classify into an emotion.

>>: You estimate the accuracy based on what's left.

>> Na Yang: Yes.

>>: That explains the monotonic curve?

>> Na Yang: Yes, thanks for adding that.
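A minimal sketch of the one-against-all classification with confidence-based rejection described above, using scikit-learn's RBF-kernel SVC as a stand-in for the support vector machines in the talk. The emotion label list, the use of predict_proba as the confidence score, and the default gamma value are assumptions for this example.

```python
import numpy as np
from sklearn.svm import SVC

# assumed label set; the talk's exact six categories are not spelled out here
EMOTIONS = ["happy", "sad", "angry", "disgusted", "fearful", "neutral"]

def train_one_vs_all(X, y):
    """One binary RBF-kernel SVM per emotion ('happy or not', and so on)."""
    X, y = np.asarray(X), np.asarray(y)
    return {emo: SVC(kernel="rbf", C=1.0, probability=True).fit(X, (y == emo).astype(int))
            for emo in EMOTIONS}

def classify_with_rejection(models, x, gamma=0.6):
    """Take the most confident classifier; reject the sample if it is below gamma."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    scores = {emo: model.predict_proba(x)[0, 1] for emo, model in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= gamma else None   # None means rejected as ambiguous
```

Raising gamma rejects more samples and, as the accuracy-versus-rejection curve in the talk shows, tends to raise the accuracy on the samples that are kept.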
>>: In the previous slide you mentioned speaker normalization. How do you do that?

>> Na Yang: Speaker normalization? Yes. Each speaker expresses the different emotions, and we have around 100 samples for each speaker. For each of the features we compute that speaker's mean and standard deviation and do the z-score normalization.

>>: [indiscernible].

>> Na Yang: Yeah, so that's the challenge: if a new speaker comes, we don't know the normal speaking range for that speaker. I think that's quite a widely shared challenge right now.

>>: Thank you.

>> Na Yang: Thank you. So that's why our numbers are based on accuracy with some data thrown away. And here comes the interesting corpus I talked about, the UGA dataset. We can see the performance drops a lot, from around 80% with no data rejected to around 45%. We also compared our results with other work, and because the error measurement metrics are different in different papers, the numbers are not directly comparable. Our system is listed above, the numbers from the EmotionSense paper are listed below, and the numbers shown in bold indicate which system is doing better. This is based on speaker-independent training, where the voice of the new speaker is not used for training the system, so we can see the performance drops by around 40 percent.

>>: With some level of data rejection?

>> Na Yang: No, no data rejection in this case. It's the starting point of the curve.

>>: It seems like in your classifier-level accuracy, your numbers are in the nineties.

>> Na Yang: That's a very good question. Classifier level means we just measure the accuracy of each one-against-all classifier.

>>: Right. But in your previous figures, when you had zero percent data rejection, it started off in the 70s or 80s, it seemed.

>> Na Yang: At the classifier level we just measure the performance of a single classifier, whether the sample is that emotion or not, so it's just a binary classification, and we are comparing with a 50/50 baseline. Decision level means after the fusion; here is the decision level, the final outcome.

>>: Follow-up: can you go several slides forward, to the second database results?

>> Na Yang: Here?

>>: Yes. So you get lower recognition, obviously, because the left is [indiscernible] and it's higher in emotional contrast. Is this correct classification rate weighted or not weighted?

>> Na Yang: Not weighted.

>>: What is the proportion of different emotions in your dataset?

>> Na Yang: Pretty much an equal amount of the different emotions in the dataset.

>>: So it's almost balanced.

>> Na Yang: Yeah; neutral is relatively small.

>>: It's pretty much weighted then?

>> Na Yang: Yeah. Yeah. Right. Thanks for adding that. So we have a term here called the decision-level classification rate: we just divide the number of samples, across all emotions, that are correctly classified by the total number of samples we tested. Yeah. So thanks.

>>: And that's the number of samples recognized correctly? As a percentage?

>> Na Yang: Percentage, yes. So this plot shows how noise influences our results. We compare against the red line here, which is trained and tested on clean speech. If we test on noisy speech we can see the performance drops a lot, and if we train on clean speech and test on noisy speech the performance drops a lot. The two curves in between are when we train on the same type of noise that we test on, and that doesn't drop the performance too much. So the conclusion is that if we are going to test on noisy speech, we would rather also train on noisy speech.
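Going back to the speaker normalization answered earlier in this exchange, here is a minimal sketch of per-speaker z-score normalization: for each speaker, every feature dimension gets that speaker's mean subtracted and is divided by that speaker's standard deviation. The array shapes and the small epsilon guard are assumptions for this example.

```python
import numpy as np

def speaker_zscore(features, speaker_ids):
    """Z-score each feature dimension within each speaker's own samples."""
    features = np.asarray(features, dtype=float)      # shape: (num_samples, num_features)
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        rows = speaker_ids == spk
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0) + 1e-8      # guard against zero variance
        normalized[rows] = (features[rows] - mu) / sigma
    return normalized
```

This is also where the challenge mentioned in the answer shows up: for a brand-new speaker there are no samples yet from which to estimate that speaker's mean and standard deviation.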
For the mobile implementation: during my previous internship here at Microsoft Research three years ago, I was honored to collaborate with my mentor [indiscernible] in the connections group, and we developed this prototype of a mobile emotion sensor. It's quite a simple one; it can only classify the user as happy or not happy. We capture the user's voice in real time, and we train the system offline, so we can train on the LDC dataset; then we apply the trained model to the extracted speech features and get the predicted emotion. This work was presented at the Microsoft Research TechFest in 2012, and we won the award for the Microsoft [indiscernible]; that's my mentor holding it up.

Here comes the GUI time. That was just a very simple prototype for the binary decision, and this one implements the entire system; of course, that's on my laptop. We can visualize the extracted speech features and we can select which model to use. Okay, we're here. With this GUI we can, for example, record, or load a sample from the LDC or UGA dataset. Just a random one.

>>: 4,012!

>> Na Yang: So the real emotion is anger, expressed by a female speaker. If we set the threshold gamma to zero, we don't throw away the sample. We can extract the features; it takes some time. Sorry, I didn't drag the window here. So these are the speech features extracted. And we choose to tell the system to use the model trained on the LDC dataset. Okay, drag here. You can see the result shown on the four quadrants; that's an angry emotion. I can show this GUI in more detail after the talk, for example with happy or other emotions.

To summarize, my research is about how to combat noise in a mobile gaming or mobile voice-based application scenario. That motivated us to design a noise-resilient speech feature extraction method, the BaNa algorithm, and we applied it together with state-of-the-art speech feature extraction to classify emotions on a standard dataset and on data from real users. Finally, I showed a very simple prototype of how this can be used on mobile platforms. The app is called [indiscernible], and these are the papers; I also have a background in sensor networks and wireless communications, and these two journal papers are still in review. For future work, besides the pitch detection algorithm, we are still working on how to improve other speech feature extraction methods, like MFCC and other features, to make those algorithms work better in noisy environments. We also want to continue implementing the entire system on mobile platforms. And as mentioned, we are continuing two projects, one on Mechanical Turk and the other on the real data gathered through the collaboration with the psychology department. My research is really about the first two stages, sensing and interpreting; the whole of affective computing covers a really broad range of research and applications. We can also work on visualization, designing interesting ways to visualize the sensed emotions, and we can take interventions whenever necessary to help people, for example people with depression, by monitoring them over the long term and providing interventions.
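As a toy illustration of the offline-training and on-device-classification split used in the prototype described above, here is a sketch under assumptions: scikit-learn and joblib stand in for whatever the actual prototype used, the file name is made up, and the labels mirror the happy-or-not binary decision only loosely.

```python
import joblib
import numpy as np
from sklearn.svm import SVC

MODEL_PATH = "happy_or_not.joblib"   # hypothetical file name

# offline (desktop) side: train on labeled features, e.g. from acted data, and export
def train_and_export(X_train, y_train, path=MODEL_PATH):
    model = SVC(kernel="rbf", probability=True).fit(X_train, y_train)  # y: 1 = happy, 0 = not
    joblib.dump(model, path)

# device side: load the trained model and classify the features of a captured utterance
def predict_on_device(feature_vector, path=MODEL_PATH):
    model = joblib.load(path)
    label = model.predict(np.asarray(feature_vector, dtype=float).reshape(1, -1))[0]
    return "happy" if label == 1 else "not happy"
```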
But let's step back a little while we look ahead. At the end of my talk I want to share some notes, "reasons to be cheerful, part four," from this blog article. Sometimes we want to step back and rethink whether our emotion sensing work is really tailored to a person's needs, or whether the user just feels compelled to do the suggested activities; whether we merely prompt the user to be cheerful or actually improve their emotional state. This author did quite an interesting survey of the apps currently on the market that try to cheer you up. By trying all these different kinds of apps, she asks whether the user just has fun playing with the app, or whether it can really be tailored to that person's needs and really improve their emotional state. I think that's a very good guideline for my research going forward.

Okay. So, some possibilities for applying this emotion sensing system to Microsoft products: it can be applied across different devices and across different services, and any device with a mic can use this system. One interesting case is the smart watch; emotional [indiscernible] was proposed by Mary, because, yeah, that's also an emerging market right now. The currently available products, for example the smart band or smart watch, just focus on fitness, for example monitoring your heart rate. Imagine if your watch could listen to your voice and really understand your feelings; that could help improve your emotional fitness. So that's a very interesting concept I wanted to bring up here. For different services, this speech-based emotion sensing could be used in gaming scenarios like Kinect, in call centers to monitor representatives working under pressure, in voice search, for example on the Bing platform, and in targeted ads. And of course Cortana: if you imagine your digital assistant can understand your feelings, that can provide a scenario similar to the movie Her, where it not only communicates with you and tells you what mail is there, but can also comfort you when you are down. So that should be a very interesting scenario. And Skype Translator. This is a demo of Skype Translator from the Code Conference: a person speaking English is communicating with another person speaking German; we can see the big smile on the screen, so you can tell the emotion. But what if you can only hear the other person's voice, and it's in a totally different language you know nothing about? If we can detect the emotions and tell the user on this side, they can have a much better way of communicating. And that sums up my talk. I want to thank all my collaborators, my advisor Wendi Heinzelman and Melissa Sturge-Apple in the psychology department. Our work has been featured in several media outlets, and I want to thank you all; I welcome any questions you may have.

[applause].

>> Mary Czerwinski: We've got a lot already.

>>: Is emotion interpreted the same way by people in different cultures?

>> Na Yang: That's one footnote I put on the Skype Translator slide and forgot to mention: emotion differs from culture to culture, especially between different countries. So it might be very useful if we can convey emotions through the system to help you understand what the other person is feeling.

>>: If that's true, then when you did the Mechanical Turk studies, did you try to pick Turkers of the same culture as the samples?

>> Na Yang: Yeah, yes, that's exactly what we are doing now.
We're making sure, yeah.

>>: Doesn't necessarily mean that they're from the same --

>> Na Yang: We can -- yeah, I know that. [laughter].

>>: You guys [indiscernible] this emotion -- let's take a step back and say we have a perfect emotion detector; we know the emotion for sure. You mentioned that one of the applications is helping the person improve their emotional fitness. What else can be -- basically, what other applications could we have with a perfect emotion detection system? I want to see that.

>> Na Yang: That's a really interesting question. For humans, if we talk face to face, it's quite simple; within several seconds we can catch the other person's emotions. But we think a system is needed to do it in an unobtrusive, objective way. So, for example, just as I mentioned, it can help children who suffer from autism to better communicate their emotions, so there are real needs. Also, in the gaming scenario, if a person's emotion can be reflected in his or her character, that should be really interesting. But of course that all assumes the emotion sensing system is perfect, because if it makes a wrong decision, that can sometimes be annoying.

>>: You compared the strongly acted speech versus natural speech: the trained actors did the emotions more strongly, and with the college students, you're saying, it wasn't as clean. For a person just speaking naturally, is there more ambiguity there?

>> Na Yang: Actually, we did a very early study trying to analyze some real data collected in a lab environment; it was a naturalistic way of collecting some communication between a child and their parent. A student in our psychology department actually labeled the emotions, and most of the samples were labeled as neutral rather than happy or upset, and that's just on a single scale you can move up or down. I think most of the speech is neutral or ambiguous, with very subtly expressed emotions that are hard to sense, and I think that's why we want to throw such samples away automatically.

>>: Seems like if you had a user talking to Cortana, most of their speech wouldn't be useful; is that what you're saying?

>> Na Yang: That's one of the challenges: most speech is neutral. Yes. Yes. We want to do long-term monitoring.

>>: Is there a notion of an emotion for a group of people? Like if you were recording at a party, or in the case where you're adding noise to a clean sample, what effect does the babble itself have on the emotions? Like happy babble versus sad babble, say you recorded it at an Alcoholics Anonymous meeting. Might that alter the interpretation toward a different type of emotion?

>> Na Yang: Whenever I play [indiscernible] on my Kinect, my home becomes the entertainment center, and I think the gaming scenario most often involves multiple people. Evelyn has done some research on validating the system with multiple speakers, with some background music or [indiscernible] in the room, and also noise from the air conditioner. There are all kinds of noise and reverberation in the room, so that's still an open question. It would have to be handled by source separation for multiple people, and a lot of papers are working on that, so it involves multiple challenges. Right now we're just focusing on a single speaker.

>>: So you focused on speech here. Is there a sense in the community, among people who work on multiple modalities,
about whether there is more of the signal in the visual channel or the speech channel? And are they orthogonal; do you expect the samples that are hard to be hard in the same way, so that you don't get much more by doing both?

>> Na Yang: Yeah, I think we can improve the accuracy by combining other modalities. We have to determine how to combine the decisions from different modalities if they do not agree, and I think we can also sample the speech or video several times and see what the majority detected emotion is. So there are different options there. But I think for a sophisticated, more realistic system we have to combine other modalities, of course. Yes?

>>: I was wondering, what are your thoughts on whether what you have now could extend beyond the six basic emotions? Where do you see that working and not working, and why?

>> Na Yang: The six basic emotions are spaced quite far from each other, and that makes it easier to differentiate them. But, for example, anger and disgust, two of the six emotions we're currently studying, are already relatively close to each other. And some emotional states are actually combinations of several emotions. Some other researchers also give a confidence score for each of the six emotions, so maybe a speech sample is 70 percent happy, or combined with some subtle emotions like relaxed or hesitant. So that's an interesting question, because in a real scenario emotion is so complicated that you can't just pick one emotion. Yeah, I hope that answers your question.

>>: When the psychology student was labeling the natural speech from the parent/child interaction, were they listening to the direct audio of the English, or was it garbled? Like, were they able to hear the words the person was saying as well?

>> Na Yang: We usually use a flash card, like the LDC.

>>: No, when you said there's a student in the psychology department labeling the natural interactions between a parent and teenager, were they actually listening?

>> Na Yang: Just to the audio.

>>: Then the problem is you don't know how much is in the content versus the speech. I wonder if you could actually garble it, put in some garbled version that retains the pitch, for instance, and the MFCCs, but where the speech content is gone. A much harder test for them, of course, but then they have to label based on what they're hearing, and maybe what they're seeing, but not the words.

>> Na Yang: Yep.

>> Mary Czerwinski: All right. Let's thank our speaker again.

[applause]

>> Na Yang: Thank you very much for coming.