>> Arjmand Samuel: It is my pleasure to introduce Ny Yang today. She has been interning with us here at Microsoft Research Connections this summer. She comes from the University of Rochester in upstate New York, and her research interests are in signal processing, mobile communication, and sensor networks. This summer she has developed a unique sensor for the Windows Phone 7 mobile phone. Over to you.

>> Ny Yang: Thank you. Hi, everyone. My name is Ny Yang. It's my great pleasure to have all of you here today at my presentation. The phone application I developed during the summer is called Listen-n-Feel. It's a mobile emotion sensor which can detect emotion based on how we speak to the phone, using signal processing and cloud computing. My mentor at MSR is Arjmand Samuel.

Okay, here is one last chance to introduce myself. I come from the University of Rochester, which is a very cold place, as you can see from the upper left picture; there is a lot of snow during the winter. My advisor at school is Wendy Handelman. In this picture you can see a beautiful view from the top of my campus. In my spare time I like playing [inaudible]. If you have any questions or concerns, you can contact me through my e-mail address, and I welcome any questions during my presentation.

Here is the outline. I will explain how we came up with the idea for this emotion sensor and how we implemented it on the cloud, and I will sum up with challenges and future work. And, of course, I brought the phone here so you can see the demo and try it yourself.

Okay. Phones are now used everywhere and anytime, and they carry a lot of sensors which make the phone a very convenient tool and also a friendly companion to us. There are the microphone and the cameras; on the left side is the accelerometer, so if you move the phone it will detect the movement; there is the GPS, which will tell you your location; and there is the gyrometer, so if you turn the phone around it can also feel the difference in position. Besides those, there is another kind of sensor which can help monitor our physical condition, like the new prototype developed by RIM called the BlackBerry Empathy, which detects the user's heart rate, blood pressure, and other physical conditions using a very magic little ring. But that raises a question: what about the user's emotion? Physical condition is important, but sometimes emotion is more important, because it can tell us about our inner world.

So let's begin from the very beginning: what is emotion? By definition, emotion is the psychophysiological experience of an individual's state of mind as it interacts with biochemical factors, which are internal, and environmental factors, which are external. The reason we want to deploy the sensor on the phone is that we want to detect emotion in a mobile fashion, anywhere and anytime. Also, people may not be very willing to reveal their real emotion to the outside world or to another person, but they may feel comfortable talking to their own phone. We would like to make use of this: just let users talk to their own phone, and the phone will know their real emotion. That is much easier. So here are some use cases.
The emotion sensor on the phone can be used in Citizen Science projects. Some of you may not have heard of that, but Citizen Science is a very popular movement happening now, in which ordinary citizens participate to help psychologists and scientists in other research fields with simple science projects by collecting data. It is a crowd-enabled approach. By using our emotion sensor on the phone, we can help scientists in psychology or sociology to, for example, gather data and test which city has the highest happiness index, or which factors influence people's emotions. For example, people may feel much happier at home than at work. That is a very interesting point to study.

Also, healthcare. It is hard to know people's emotions, and a patient with a mental illness may not be willing to tell their counselor or doctor about their real mood. So doctors could use this emotion sensor on the phone to do that monitoring all the time.

Also, social networks. Imagine that you could "like" a photo, a message, or the status of your friends using your real voice. If you speak to the phone, like, "whoa, that's amazing," it would translate that reaction automatically and post it to all your friends on Facebook or Twitter. That would be a really cool thing. And because social networks give us an opportunity to get a lot of data from a large population, we can also use them to analyze the emotion of a certain population.

Also, I know some of the interns here already got your Kinect and Xbox. Imagine if you could play an Xbox game in which your emotion influences the character you are playing; that would be extremely cool. By using your voice, the Kinect or the Xbox would listen to your mood. So the sensor can be on the phone, on the Kinect, or on other devices.

And finally, you can customize your phone. Imagine your phone changing its sounds or its theme color based on your current mood, which will make the phone more lively.

Here is what is out there today. In the Android or iPhone markets there are already some emotion detection applications, but they are not used in the scenarios I described before, and neither do they have a very strong scientific background. That is the main difference. Also, on the Kinect there is no emotion sensor, only some very simple speech recognition to which you can give very simple commands. So that is the difference that shows how we stand out.

Okay, here comes the meat. The phone application looks like this, and I will show it later. When you press a button, your voice, the whole audio file, is transmitted and uploaded to a server on the cloud. The server does some signal processing to extract the speech features of that voice, so it does not try to understand what you are saying, but how you say that word, utterance, or sentence. Those features are used as test data and fed into the machine learning system. On the other side, the server also works offline: we extract the same speech features from a large database with over 1,000 data samples, speeches with known emotion.
We then use a machine learning algorithm called logistic regression to train the system and get the weights. With the input data and the weights, we can output the predicted emotion. I will talk later about which database we use, how we do the signal processing, and how we do the machine learning.

Okay, here is the database. I will show you some samples. The speakers, actors and actresses, just speak sentences with neutral meaning: utterances, numbers, or dates. For example: [video] June 20th. June 20th. October 5th. October 5th. 508. 508.

>> Ny Yang: Another one. [video] 810! 810! 502! 502! 502!

>> Ny Yang: The emotions on the left side are happy emotions, and the emotions on the right side are used as sad emotions, for example despair. [video] November 9th. November 9th. November 9th.

>> Ny Yang: Now hear the difference. [video] 4,006. 4,006.

>> Ny Yang: They are performed by actresses or -- [video] August 2nd. June 28th. June 28th. [end of demonstration]

>> Ny Yang: That gives you some sense of the training data we used. We have all the features listed here, 12 features in total, in the frequency domain and the energy domain. We use these features because, for example, the pitch changes a lot when we speak, so we take the whole vector of pitch values and then calculate the mean, maximum, minimum, standard deviation, median, and range, all these statistic values. These 72 metrics, 12 features times 6 statistics, are used as our risk factors, or features.

Okay, here comes some signal processing. The waveform is sampled, and each sample whose amplitude is above a certain threshold is marked with a one; those are the sample markers listed here. Then, for a group of samples, we set a frame with a length of 1/80 of a second, and we do the signal processing, for example pitch detection and energy calculation, within that frame; that gives us the frame markers. You can see there is a little gap here, because the amplitude is very low there, so the exact pitch value would not be accurate; it does not make sense to calculate the pitch there, it would just ruin the result. So we only do the signal processing in the frames whose frame marker equals 1. This slide shows the noise reduction: before the noise reduction you can see some background noise, but afterwards the signal waveform is pretty smooth.

Okay, here comes the first feature, pitch. Pitch is the relative highness or lowness of a tone as perceived by the ear. For example, women speakers tend to have a higher pitch than men. The pitch calculation algorithm we use works in the time domain and is called autocorrelation. For these utterances you can see that the pitch varies a lot; for each frame here, the amplitude of the frame marker shows the pitch value, whether it is high or low.

The second feature is energy. Within each frame, we take the sum of the squared amplitudes of the marked samples, and that is used as the energy value of that frame. You can also see it from the amplitude of the waveform. Energy is always above 0; it is an absolute value.
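To make the frame-based processing just described concrete, here is a minimal sketch in Python of the sample-marker, per-frame energy, and autocorrelation pitch computation. It assumes a mono waveform as a NumPy array; the amplitude threshold, frame length, and pitch search range are illustrative values, not the ones used in Listen-n-Feel.

```python
# Minimal sketch of the frame-based feature extraction described above.
# Assumes a mono waveform `x` (NumPy float array) and sample rate `sr`;
# the threshold, frame length, and pitch range are illustrative only.
import numpy as np

def frame_features(x, sr, frame_len=0.03, amp_threshold=0.02,
                   f0_min=75.0, f0_max=500.0):
    n = int(frame_len * sr)                       # samples per frame
    features = []
    for start in range(0, len(x) - n, n):
        frame = x[start:start + n]
        sample_marker = np.abs(frame) > amp_threshold
        # Frame marker: skip low-amplitude frames where pitch is unreliable.
        if sample_marker.mean() < 0.5:
            continue
        energy = float(np.sum(frame[sample_marker] ** 2))
        # Pitch via autocorrelation: pick the lag with the strongest peak
        # inside a plausible fundamental-frequency range.
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        lag = lo + int(np.argmax(ac[lo:hi]))
        features.append((sr / lag, energy))       # (pitch in Hz, energy)
    return np.array(features)
```

The mean, maximum, minimum, standard deviation, median, and range of these per-frame values would then form part of the 72-value feature vector described in the talk.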
Okay, the last feature is formant. Formants are generated by the resonance of the vocal tract, and this is related to how human speech is generated. In the frequency domain, the speech spectrum is shaped not by the individual pulses but by a gain envelope with ups and downs, and that envelope is determined by the formants. From the envelope we get the peaks, which are defined as the formant frequencies, and we also use the 3-dB bandwidth. The algorithm we use to get the formants is called Linear Predictive Coding. From the samples here you can see the ups-and-downs shape of that envelope reflected in the sample markers. What we actually observe is the final speech signal in the frequency domain, so we have to go the other way around: the gain envelope is not measured directly but derived from the amplitudes in the frequency domain.

Okay, so those are all the features I talked about. This slide shows the machine learning algorithm, called logistic regression. I know some of you are in the machine learning area; this is a very simple algorithm. There is a set of risk factors, which are the features we use, and they are weighted by the betas, which are called the regression coefficients. Together they form the input value z. The output is calculated using the logistic function shown above, which gives a value between 0 and 1, the probability of occurrence of an event. If a weight is very high or very low, that shows how strongly or weakly that feature influences the final outcome. For example, from the simulations we did, we can see that energy and pitch range influence the outcome a lot. The precision of our algorithm has been tested through a method called cross-validation, and the accuracy is around 71 percent. That 71 percent is calculated as the number of samples predicted correctly divided by the total number of samples.

Okay, here comes the demo. Here is the bottom bar with the record, play, and stop buttons and the feel button, which is a little heart. Oh, switch to the projector. Thank you. So here is the phone. Let me make sure it's -- okay. You can see here the little heart, the Listen-n-Feel application. On the start screen you can record your voice. Let me give you an example first. Keep quiet. "I'm really happy today." There, I just pressed stop. Then we press the feel button, and the audio file is uploaded to the server. It's processing now. You can see: you have a good mood. Okay, let me try again. "Oh no." That one is sad; hopefully it gives the correct answer. Oh. Okay, I guess I am just too happy today because of my presentation. Anyway, if you want to try this demo, you can try it yourself.

>>: Would you like to say something?

>>: I'd like to try it. Should I do it over there so people can see it?

>> Ny Yang: All right.

>>: I'll warn you, my voice is a little scratchy today. "So I'm feeling a little under the weather today." That might have been too happy.

>> Ny Yang: Press the feel button. "Hey, I'm not very happy today."

>>: Let me try to be happy. "Today is a great day."

>> Ny Yang: With the noise...

>>: Oh, that didn't work out well. Let me try it one more time. "I'm having an excellent day today."

>> Ny Yang: We'll see. [laughter]

>>: It might just be the signal.

>> Ny Yang: Maybe the pitch of your voice. Okay, great. Thank you.

>>: We have proof.

>> Ny Yang: So, anyone want to try that again?

>>: Yeah.
>> Ny Yang: Okay, yeah, or you can play with it after the presentation. Can we switch back to the laptop? Okay, cool. So that is how the app works; we just demoed it.

One of the most challenging parts I faced was how to set up the server on the cloud. We use Windows Communication Foundation for the communication between the phone and the server, and I had to get all the configuration right, for example the firewall, the proxy, and Internet Information Services, which hosts the service. There are also a lot of access control issues when you deal with a server: you need permission to access a file or change a file. Putting all the components together was another challenging part, because, as I showed in the architecture, there is a lot involved: the signal processing, designing the phone application, setting up the server, and learning and implementing the machine learning algorithms. All of this has to come together to make the whole system work. I think that is the beauty of being an engineer.

Okay, some future work. First, we need to collect more data like this and test with users, and then we should compare performance with other platforms, for example the iPhone or Android. We also plan to publish this as a conference paper. Second, we need to improve the precision in different ways, for example by improving the feature extraction algorithms for the existing features. During my internship I also talked with some researchers in the communications field, and they gave some very good recommendations, for example to use mel-frequency cepstral coefficients, so we can try that feature as well. We can also try other machine learning algorithms, and we can use the user data we collect as new training data for the system; we expect a better result when we have more training data.

Third, we need to refine the app. For example, only one user can access the app now, but if we publish it on the marketplace we should expect many users to access the application simultaneously, so the server needs to queue all these requests and handle them. Another thing you can see is that the application takes some time to get feedback from the server. That is because we upload the whole audio wave file to the server. If we did some of the signal processing locally on the phone and uploaded only the feature data to the server, that would reduce the traffic and give faster feedback. We should also add a user privacy agreement before the user starts using the app, so the user knows that their voice will be transmitted to the cloud.

Also, if I am lucky enough to do another internship here next summer, maybe I can make this app really benefit society, for example by using it in a Citizen Science project or a healthcare project. I think that would be more meaningful. And finally, the app currently tries to detect people's emotions, but imagine doing it the other way around: take a very plain sentence and color it with different emotions. That would be even more interesting. For example, if you call a call center and a virtual agent answers your phone, it may feel a little strange, but if we could color the conversation with different emotions, you would imagine you were talking with a real person.
So that would give us a better user experience.

Okay, finally, the internship wrap-up. Being at MSR is really a great opportunity, because you get to really build something, not just sit in front of your computer: you get to play with other things like the server and the apps, and in the end you can make everything work, which is really wonderful. Actually, the first recording I did on the phone, when the app finally worked, showed my real emotion: really happy. So it is not just that the app worked, everything worked, which was a fantastic feeling.

Because my internship was in the Connections group, I want to sincerely thank all of you for your help during my internship; any advice or information you provided is greatly appreciated. And here is the logo, I do not know how to say it, the spirit of the Connections group is [inaudible], and I learned a lot about it and also about MSR. I think MSR is really a super research engine. I met a lot of great researchers and great intern friends, and I really want to come back to this cool place next summer, not because of the weather, but because of the great research. I think that is the last slide, the references, and thanks for watching. [applause] I would like to take any questions you may have.

>>: I have two comments. First, I think you have to distinguish between mood, emotion, and expression: vocal expression, facial expression. Expressions are things that you can objectively measure. The competing systems that exist are very careful: they say they are detecting facial expressions, because that is something external. What you are doing, I believe, is vocal expression, not emotion, because emotion is internal; you would have to verify whether the person is faking it or not. And you are mixing mood and emotion: emotion is transient, short term, while mood is long term, and there is a complicated psychological difference there. The second thing: I believe your machine learning is currently speaker-independent; you use the same training data and try to detect the vocal expression of everybody. Your accuracy is a bit better than random guessing; I am guessing random is 50 percent and you are achieving 70 percent. I think that is actually pretty impressive, because people can be so different in their vocal expression. So I wonder, if you personalized the detection, whether it would be much more accurate.

>> Ny Yang: We can refine the app to record from the user themselves, very personalized --

>>: You can adapt the model you have in the cloud.

>> Ny Yang: We can train with the user's own voice.

>>: Actually, you can do it when you deploy it in the marketplace: you can ask people to confirm whether the recognition is right or not, and use that to tag the data you collect. Good work. A lot of work.

>> Ny Yang: Yes?

>>: I have a question. I think that is a great idea, and I am particularly interested in your idea about call centers, because we have all called in and been put on hold and all that. But what if you reversed it, so the computer could actually know what kind of mood you are in, as opposed to giving emotions to the computer? Then the computer could say things like "I understand you're frustrated at this point," deflating the tension in a computer sort of way.

>> Ny Yang: Or move the more angry customers to the top of the queue.

>>: So you can understand --

>> Ny Yang: Or pretend to, anyway. Good point, thank you. Yes?

>>: I didn't mean to cut somebody off.
I know you showed a couple of potential applications, but do you have any other plans for making it more ubiquitous, more a part of the phone? I guess maybe that was part of the theme, but did you have any other --

>> Ny Yang: Yeah, I want to make it actually embedded in the phone, not a standalone app. For example, it could constantly monitor or sample your daily conversations to track how your emotion is changing. So, yeah. Yes, in the back.

>>: You indicated at the beginning, when you showed the samples, that you took a variety of samples with different emotional labels and characterized them as positive or negative. I was wondering whether you have thought about other ways of labeling the speech. You have done a high-level labeling, but a label like pride, for instance, is maybe subjective and loaded with other connotations, and there is other work on emotion which categorizes things in terms of valence and arousal, which is perhaps a different way of looking at it.

>> Ny Yang: Another way to classify emotions is to use a quadrant. Emotions can be differentiated into negative or positive: sad is a negative emotion, while happy, interest, or pride are positive emotions. The other direction is active versus passive: for example, bored or sad are passive, while happy or angry are active emotions. So we do have that full quadrant, but this is just an early prototype, so we only want to differentiate happy and sad.

>>: But certain things, like in the affect model, there are two dimensions: one is valence, one is active versus passive.

>> Ny Yang: Yeah, yeah, it has all the emotions there. There are 14 emotions in total, including the neutral emotion.

>>: Actually, I have another comment. You define the problem as a classification problem: you try to classify happy versus sad. But most of the time we are emotionless when we talk to people, so I think the problem is better defined as a detection problem: you detect unhappiness or happiness in the voice. I have this observation from one of the papers; the paper claimed 70 percent accuracy in recognizing emotions, but it actually showed that 75 percent of the time people are emotionless. That basically means that if you always guess emotionless, you achieve about 70 percent accuracy. That is why defining it as a detection problem is more reasonable.

>> Ny Yang: For example, there should be a confidence index to say the person is very angry, very happy, or very sad. So if the person --

>>: All this emotion.

>> Ny Yang: Okay. And another important point is that emotion can sometimes be mixed with neutral --

>>: Right. Then the evaluation metrics become different, but right now you are using a classification matrix to evaluate.

>> Ny Yang: The features.

>>: And then it becomes a true positive, false positive, false negative detection problem.

>> Ny Yang: Yeah, I think it may be more accurate if we can combine all these methods. For example, we could also try to interpret the conversation, maybe find some keywords that indicate whether people are in an active or passive emotion, and we could also combine facial expressions or gestures. So maybe Kinect will be the future.

>> Arjmand Samuel: Any other questions? Let's give a round of applause. [applause]
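As a supplement to the logistic-regression step discussed in the talk and in the Q&A, here is a minimal sketch in Python of how a trained model could map the 72-value feature vector to a happy/sad probability. The weights, bias, and the 0.5 decision threshold are placeholders for illustration, not the values trained in the project.

```python
# Minimal sketch of the logistic-regression prediction described in the talk.
# The weights (betas) and bias are placeholders; in the project they come
# from offline training on the labeled speech database.
import numpy as np

def predict_happy(features, weights, bias):
    """features: the 72 speech statistics (pitch, energy, formant stats, ...).
    Returns P(happy), a probability between 0 and 1."""
    z = bias + np.dot(weights, features)   # linear combination beta . x
    return 1.0 / (1.0 + np.exp(-z))        # logistic function

def accuracy(probabilities, labels):
    """Cross-validation style accuracy: correct predictions / total samples."""
    return np.mean((probabilities >= 0.5) == labels)

# Usage (hypothetical values):
# p = predict_happy(feature_vector, trained_weights, trained_bias)
# mood = "happy" if p >= 0.5 else "sad"
```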