>> Arjmand Samuel: It is my pleasure to introduce Ny Yang today. She's been
interning with us here at Microsoft Research Connections this summer. She's
from the University of Rochester in Upstate New York.
Her research interests are in signal processing, mobile communication and
sensor networks. And this summer she's developed a unique sensor for the
Windows Phone 7 mobile phone. Over to you.
>> Ny Yang: Thank you. Hi, everyone. My name is Ny Yang. It's my great
pleasure to have all of you here today at my presentation.
The phone application I developed during the summer is called Listen-n-Feel.
It's a mobile emotion sensor, which can detect emotion based on how we
speak to the phone. It uses signal processing and cloud computing.
And my mentor at MSR is Arjmand Samuel. Okay. Here is a quick introduction.
I come from the University of Rochester, which is a very cold place, as you
can see from the upper left picture. There's a lot of snow during the winter.
And my advisor at school is Wendy Handelman. In this picture you can see a
beautiful view from the top of my campus.
In my spare time I like playing [inaudible]. So if you have any questions or
any concerns, you can contact me through my e-mail address. And I welcome any
questions during my presentation.
Here's the outline. I will explain how we came up with the idea to develop this
emotion sensor and how we implemented it on the cloud. And I will sum up with
challenges and future work. And, of course, I brought the phone here, so you can
see the demo and try it yourself.
Okay. So nowadays phones are used everywhere and anytime, and there are a lot
of sensors which make the phone a very convenient tool and also a friendly
companion to us. We can see there are microphones and cameras, and on the left
side that's the accelerometer, so if you move the phone, it will detect the
movement. There's also the GPS, which will tell you your location, and the
gyroscope, so if you move your phone or turn it around, it can also feel the
difference in position.
Besides those sensors, there is another kind of sensor which can help monitor
our physical condition, like the new prototype developed by RIM called the
BlackBerry Empathy, which detects the user's heart rate, blood pressure and
other physical conditions using a very magic little ring here. But that brings
up a question: What about the user's emotion? Physical condition is important,
but sometimes emotion is even more important, because it can tell us about our
inner world.
So let's begin from the very beginning. What is emotion? By definition,
emotion is the psychophysiological experience of an individual's state of mind
as it interacts with biochemical influences, which are the internal factors,
and environmental influences, which are the external factors. The reason why
we want to deploy the sensor on the phone is that we want to detect emotion in
a mobile fashion, anywhere and anytime.
Also, people may not be very willing to tell their real emotion to the outside
world or to another person.
But they may feel comfortable talking to their own phone. So we would like to
make use of this point: just let users talk to their own phone, and then the
phone will know their real emotion. It will be much easier.
So here are some use cases. You can see that the emotion sensor on the phone
can be used in Citizen Science projects. Some of you here may not have heard
of that, but Citizen Science is a very popular movement happening now, in
which ordinary citizens participate to help scientists, psychologists or
researchers in other fields do simple science projects by collecting data. So
it's a crowd-enabled approach.
By using our emotion sensor on the phone, we can help scientists in
psychology or sociology to, for example, gather data and test which city has
the highest happiness index or which factors influence people's emotions.
For example, people may feel much happier at home versus at work. So that's a
very interesting point to study. Also, healthcare. It's hard to know people's
emotions. A patient with a mental illness may not be willing to tell their
counselor or doctor about their real mood. So doctors can use this emotion
sensor on the phone to do the monitoring all the time.
Also, for social networks. Imagine that you could like a photo or a message or
the status of your friends by using your real voice. So if you speak to the
phone like, whoa, that's amazing, then it will translate this reaction
automatically into a post to all your friends on Facebook or Twitter. That
would be a really cool thing.
Also, because social networks give us an opportunity to get a lot of data from
a large population, we can make use of this to analyze the emotion of a
certain population. And I know some of the interns here already got your
Kinect and Xbox. So imagine if you could play an Xbox game in which you can
impose your emotion on the character you are playing; that would be extremely
cool. By using your voice, the Kinect or the Xbox machine will sense your
mood by listening to your voice.
So it can be on the phone or on the Kinect or other sensors. And finally, you
can customize your phone. Imagine your phone changing its sounds or its theme
color based on your current mood.
That would make the phone more lively. Here's some related work. In the
Android or iPhone market, there are some emotion detection applications out
there, but they're not used in the scenarios I described before, and neither
do they have a very strong scientific background. So that's the main
difference. And also the Kinect: there's no emotion sensor on Kinect, but
there is some very simple speech recognition with which you can give very
simple commands.
So that's how we stand out. Okay. Here comes the meat. So the phone
application looks like this, and I will show it later. When you press a
button, your voice, the whole audio file, will be uploaded to a server on the
cloud.
The server does some signal processing to extract the speech features of that
voice. So it does not try to understand what you are saying but how you say
that word or utterance or sentence. Those features are used as test data to
input to the learning machine. On the other side, the server also works
offline: we extract the same speech features from a large database with over
1,000 data samples of speech with known emotions.
We do the same thing, extracting the speech features from all these data
samples, and then use a machine learning algorithm called logistic regression
to train the system and obtain the weights.
So with the input data and the weights as the input, we can output the
predicted emotion. And I'll talk later about what database we use, how we do
the signal processing and also how we do the machine learning.
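The overall flow just described can be sketched roughly as follows. This is a
toy stand-in, not the actual Listen-n-Feel code: the feature extractor and the
weights here are made up (the real server extracts 12 speech features and its
weights come from training on the labeled database); only the
record-upload-extract-predict shape matches the talk.

```python
import math

def extract_features(samples):
    """Toy stand-in for the server-side signal processing."""
    n = len(samples)
    mean = sum(samples) / n
    energy = sum(s * s for s in samples) / n
    peak = max(abs(s) for s in samples)
    return [mean, energy, peak]

def predict_emotion(features, weights, bias):
    """Logistic regression: P(happy) = sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))
    return ("happy" if p >= 0.5 else "sad"), p

# the phone would upload recorded audio; here a tiny fake waveform
label, p = predict_emotion(extract_features([0.1, -0.2, 0.4]),
                           weights=[0.0, 2.0, 1.0], bias=-0.5)
```

In the real system the extraction and prediction both run on the cloud server;
the phone only records and uploads.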
Okay. Here is the database. I will show you some samples. So the speakers,
actors or actresses, just speak sentences or utterances with a neutral
meaning, a number or a date. For example:
[video]
June 20th. June 20th. October 5th. October 5th. 508. 508.
>> Ny Yang: Another iteration.
[video]
810! 810! 502! 502! 502!
>> Ny Yang: So the emotions on the left side are happy emotions, and the
emotions on the right side are sad emotions, for example, despair.
[video]
November 9th. November 9th. November 9th.
>> Ny Yang: Now hear the difference.
[Video]
4,006. 4,006.
>> Ny Yang: They're performed by actors or actresses --
[video]
August 2nd. June 28th. June 28th.
[end of demonstration]
>> Ny Yang: So that gives you some sense of the training data we used.
We have all the features listed here, 12 features in total, in the frequency
domain and also in the energy domain. For each feature -- for example, pitch
-- the value changes a lot while we speak, so we get a whole vector of pitch
values.
From that vector we calculate the mean, maximum, minimum, standard deviation,
median and range, all these statistic values. So this 72-value vector is used
as our feature vector. Okay. Here comes some signal processing.
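As a quick illustration of the statistics step just described -- each
per-frame feature track is summarized by six statistics, so 12 features give a
12 x 6 = 72-value vector -- something like this (the track names and values
below are made-up examples, not the app's data):

```python
import statistics

def summarize(track):
    """Mean, max, min, standard deviation, median and range of a track."""
    return [
        statistics.mean(track),
        max(track),
        min(track),
        statistics.pstdev(track),
        statistics.median(track),
        max(track) - min(track),
    ]

tracks = {
    "pitch_hz": [210.0, 230.0, 190.0, 250.0],  # per voiced frame
    "energy":   [0.8, 1.2, 0.5, 1.0],
    # ... 10 more tracks in the real system, for 12 x 6 = 72 values
}
feature_vector = [v for track in tracks.values() for v in summarize(track)]
stats = summarize(tracks["pitch_hz"])
```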
So given the wave form -- because it's sampled -- for each sample whose
amplitude is above a certain threshold, we mark it as one. Those are the
sample markers listed here. Then, for a bunch of samples, we set a frame, and
if the samples in the frame are mostly marked, we set the frame marker. We do
the signal processing, for example pitch detection and energy calculation,
within each frame.
You can see there's a little gap here, because the amplitude there is very
low. So it may not be very accurate to estimate the pitch there; it doesn't
make sense to compute the pitch, and it would just ruin the result. So we only
do the signal processing in the frames with frame marker equal to 1.
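A hedged sketch of this sample/frame marking step: samples whose amplitude
exceeds a threshold get a sample marker of 1, and a frame most of whose
samples are marked gets a frame marker of 1, so pitch and energy are only
computed on voiced frames. The threshold, frame length and "mostly marked"
fraction below are my assumptions; the talk doesn't give exact values.

```python
def frame_markers(samples, threshold, frame_len, min_fraction=0.5):
    # sample marker: 1 where the amplitude is above the threshold
    sample_marks = [1 if abs(s) > threshold else 0 for s in samples]
    marks = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = sample_marks[start:start + frame_len]
        # frame marker: 1 if enough samples in the frame are marked
        marks.append(1 if sum(frame) >= min_fraction * frame_len else 0)
    return marks

# loud - quiet gap - loud, like the slide's example
signal = [0.9, -0.8, 0.7, 0.05, 0.01, -0.02, 0.8, -0.9, 0.6]
markers = frame_markers(signal, threshold=0.1, frame_len=3)
```

The middle frame gets marker 0, so its (unreliable) pitch would be skipped.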
And this slide shows the noise reduction. Before the noise reduction we can
see there's some background noise, but after it the signal wave form is pretty
smooth. Okay. Here comes the first feature, pitch. Pitch is the relative
highness or lowness of a tone as perceived by the ear. For example, women
speakers tend to have a higher pitch than men. The pitch calculation algorithm
we use works in the time domain and is called autocorrelation.
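A minimal time-domain autocorrelation pitch estimator, the method just named:
the lag (within a plausible voice range) that maximizes the autocorrelation
gives the fundamental period. The search range and the test tone are
illustrative choices, not the app's exact parameters.

```python
import math

def pitch_autocorr(frame, sample_rate, fmin=50.0, fmax=500.0):
    lo = int(sample_rate / fmax)                 # shortest candidate period
    hi = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        # autocorrelation at this lag
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return sample_rate / best_lag

# a 200 Hz sine sampled at 8 kHz should come out near 200 Hz
sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
pitch = pitch_autocorr(tone, sr)
```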
For example, for these utterances we can see the pitch varies a lot. For each
frame here, the amplitude of the frame marker shows the pitch value, whether
it's high or low. The second feature is energy. Within each frame we take the
summation of the squared amplitudes of the marked samples, which is used as
the energy value of that frame.
You can also see it from the amplitude of the wave form. The energy is always
above 0; it's an absolute value.
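The short-time energy computation just described can be sketched as follows:
within each frame, sum the squared sample amplitudes. The frame length here is
an arbitrary illustration.

```python
def frame_energies(samples, frame_len):
    # energy of each frame = sum of squared amplitudes in that frame
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

quiet = [0.01, -0.02, 0.01, 0.02]
loud = [0.5, -0.6, 0.4, -0.5]
energies = frame_energies(quiet + loud, frame_len=4)
```

The loud frame's energy dwarfs the quiet frame's, and both are non-negative,
as the talk notes.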
Okay. The last one is formant. Formants are generated by the vocal tract
resonances. I'll show later how human speech is generated. Formants appear in
the speech spectrum, in the frequency domain: not the individual pulses, but
the envelope with its ups and downs is shaped by the formants. From this
envelope we can find the peaks, which define the formant frequencies, and we
also use the 3-dB bandwidth. The algorithm we use to get the formants is
called Linear Predictive Coding.
From the samples here you can see the spectral envelope shaped with ups and
downs by the formants, and the peaks in those frames. What we observe is a
signal like this in the frequency domain, but we need to get the envelope
from the signal; it is not measured directly but derived from the amplitudes
in the frequency domain.
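A hedged sketch of LPC-based formant estimation, the method named above: fit
an all-pole model via the autocorrelation method and Levinson-Durbin, then
read resonance frequencies from the pole angles. The model order and the
synthetic test signal are my own choices, not the app's settings.

```python
import numpy as np

def lpc(frame, order):
    x = np.asarray(frame, dtype=float)
    # autocorrelation r[0..order]
    r = [float(np.dot(x[:len(x) - k], x[k:])) for k in range(order + 1)]
    a, e = [1.0], r[0]
    for i in range(1, order + 1):            # Levinson-Durbin recursion
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        e *= 1 - k * k
    return np.array(a)                        # A(z) = 1 + a1 z^-1 + ...

def formant_freqs(frame, sample_rate, order=2):
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0]         # one of each conjugate pair
    return sorted(np.angle(roots) * sample_rate / (2 * np.pi))

# synthetic test: impulse response of a single resonance at 500 Hz
sr, f0, r_pole = 8000, 500.0, 0.97
theta = 2 * np.pi * f0 / sr
y = [0.0, 0.0]
for n in range(400):
    y.append(2 * r_pole * np.cos(theta) * y[-1] - r_pole ** 2 * y[-2]
             + (1.0 if n == 0 else 0.0))
f_est = formant_freqs(y[2:], sr)[0]
```

A real formant tracker would use a higher order (roughly 2 poles per expected
formant plus a couple extra) and filter candidates by bandwidth, as the talk's
3-dB bandwidth feature suggests.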
Okay. So here are all the features that I talked about. And this slide shows
the machine learning algorithm called logistic regression. I know some of you
are in the machine learning area; this is a very simple machine learning
algorithm because it's linear. There's a bunch of regressors, which are the
features we use. They are weighted by beta, which is called the regression
coefficient, and together they form the input value Z. The output can then be
calculated using the regression function shown above. This function outputs a
value between 0 and 1, which gives us the probability of occurrence of an
event. And whether a weight is very high or very low shows how strongly or
weakly that feature influences the final outcome. For example, from the
simulations we did, we can see that energy and pitch range influence the
outcome a lot. The precision of our algorithm has been tested through a method
called cross-validation, and the accuracy is around 71 percent. The 71 percent
is calculated as the number of samples predicted correctly divided by the
total number of samples.
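The logistic-regression step can be sketched as below: z is the weighted sum
of the feature vector, and the sigmoid squashes it into a probability. The two
toy features (stand-ins for energy and pitch range, which the talk says carry
large weights) and the labels are made up; the real system trains on the
1,000-sample labeled speech database.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train(xs, ys, lr=0.5, epochs=500):
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            g = predict(w, b, x) - y          # gradient of the log loss
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

xs = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]  # [energy, pitch range]
ys = [1, 1, 0, 0]                                      # 1 = happy, 0 = sad
w, b = train(xs, ys)
accuracy = sum((predict(w, b, x) >= 0.5) == y for x, y in zip(xs, ys)) / len(xs)
```

Note this is training accuracy on a toy set; the talk's 71 percent figure
comes from cross-validation, which instead scores held-out samples.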
Okay. Here comes the demo. At the bottom there's a bar with the record button,
play, stop and the Feel button, which is a little heart. Oh, switch to the
projector. Thank you. So here is the phone. Let me make sure it's -- okay. So
we can see here the little heart, this Listen-n-Feel application. On the start
screen you can record your voice. Let me give you an example first. Keep
quiet.
I'm really happy today. There, I just pressed stop. Then we can press the
Feel button, and the audio file will be uploaded to the server. It's
processing now. You can see: you have a good mood. Okay. Let me try again.
Oh no. Which is sad; hopefully it will give the correct answer. Oh. Okay. I
am too happy today for my presentation. So anyway, if you want to try this
demo, you can try it yourself.
>>: Would you like to say something?
>>: I'd like to try it. Should I do it off there so people can see it?
>> Ny Yang: All right.
>>: I'll warn you, my voice is a little scratchy today. So I'm feeling a little under
the weather today. That might have been too happy.
>> Ny Yang: Press the feel button.
Hey, I'm not very happy today.
>>: Let me try to be happy. Today is a great day.
>> Ny Yang: With the noise.
>>: Oh, that didn't work out well. Let me try it one more time. I'm having an
excellent day today.
>> Ny Yang: We'll see. [laughter].
>>: It might just be the signal.
>> Ny Yang: Maybe the pitch of your mouth. Okay. Great. Thank you.
>>: We have proof.
>> Ny Yang: So anyone want to try that again?
>>: Yeah.
>> Ny Yang: Okay, yeah, or you can play with it after the presentation. Can
we switch back to the laptop? Okay. Cool. So that's how this app works; we
just demoed it. Some of the challenges -- the most challenging part I faced
was how to set up the server on the cloud.
We used Windows Communication Foundation to do the communication between the
phone and the server. Another challenge was how to set all the configuration
correctly, for example the firewall, the proxy, and Internet Information
Services, which hosts the service. There are also a lot of access control
issues when you deal with a server; you need to get permission to access a
file or change a file.
Putting all the components together is another challenging part, because, as
I showed in the architecture, there are a lot of pieces: the signal
processing, how to design the phone application, how to set up the server,
and how to learn and implement the machine learning algorithms. All this
stuff has to come together to make the whole system work.
So I think that's the beauty of being an engineer. Okay. Some future work. In
the future we need to collect more data like this to test on users, and then
we should compare performance with other platforms, for example, iPhone or
Android. Also in the future we'll publish this as a conference paper.
Also, we need to improve the precision in different ways. For example, we can
improve the feature extraction algorithm based on the existing features.
During my internship I also talked with some researchers in the speech field,
and they gave some very good recommendations, for example, to use
mel-frequency cepstral coefficients. So we can try that feature as well.
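The mel-frequency cepstral coefficients just suggested are built on the mel
scale, a perceptual frequency warping. A commonly used conversion formula
(one of several variants in the literature):

```python
import math

def hz_to_mel(f_hz):
    # mel scale: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # inverse of the conversion above
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# with this variant, 1000 Hz lands near 1000 mel; equal mel steps pack
# more resolution into low frequencies, much as the ear does
mel_1k = hz_to_mel(1000.0)
```

A full MFCC pipeline would then apply a mel-spaced filterbank to the power
spectrum, take logs, and apply a DCT; this snippet only shows the warping.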
We can also try other machine learning algorithms. And one more thing: we can
use the user data we collect as new training data to input to the training
system.
So we expect we can get a better result when we have more training data. The
third item is to refine the app. For example, only one user can access the app
now, but if we publish it on the marketplace, we should expect that many users
will access the application simultaneously. So the server needs to queue all
these requests and handle them. Another thing you can see in the application
is that it takes some time to get feedback from the server. That's because we
need to upload the whole audio wave file to the server. If we could do some
signal processing locally on the phone and just upload the feature data to the
server, that would reduce the traffic and give us a better feedback speed. We
could also add a user privacy agreement just before the app starts, so the
user will know that their voice will be transmitted to the cloud.
Also, if I'm lucky enough to do another internship here next summer, maybe I
can make this app really benefit society. For example, we could use it in a
Citizen Science project or a healthcare project. I think that would be more
meaningful. And, finally, the app is trying to detect people's emotions, but
imagine that we could go the other way around.
If there's a very plain sentence, we might want to color it with different
emotions. That would be even more interesting. For example, if you call a call
center and there's a virtual person answering your phone, it may feel a little
bit strange. But if we can color this conversation with different emotions,
you can imagine it as if you were talking with a real person. That would give
us a better user experience.
Okay. Finally, the internship wrap-up. Being at MSR is really a great
opportunity, because you can really do something -- not just sitting in front
of your computer, but playing with other things like the server and the apps.
And finally you can just make everything work, which is really wonderful.
Actually, the first recording I did on the phone when the app started working
showed a real emotion, really happy. So it's not just the app; everything
worked. It really was a fantastic experience.
My internship was in the Connections group, and I want to sincerely thank you
all for your help during my internship. Any advice or information you have
provided is greatly appreciated.
And here's the logo -- I don't know how to say it. The spirit of the
Connections group is [inaudible], and I learned a lot about it and also about
MSR. I think it's really a super research engine. I met a lot of great
researchers and made great intern friends. So I really want to come back to
this cool place next summer, not just because of the weather, but also the
great research. So that's it, and thanks for watching.
[applause]
I'd like to take any questions you may have.
>>: I have two comments. I think you have to distinguish between a mood, an
emotion and an expression -- vocal expression, facial expression. Those are
things that you can objectively measure. So the competitors that exist out
there are very careful; they say we're detecting facial expressions, because
that's something external. What you're doing, I believe, is vocal expression,
not emotion, because emotion is internal. You'd have to verify whether this
guy is faking it or not. And then you're mixing mood and emotion. Emotion is
kind of transient, short term; mood is long term. It's a very complicated
psychological difference.
The second thing is, I believe right now your machine learning is
speaker-independent. You use the same training data and then try to detect
the emotion -- vocal expression -- of everybody. And your accuracy is quite a
bit better than random guessing, which would be 50 percent; you're achieving
70 percent. So I think that's actually pretty impressive, because people can
be so different in their vocal expression. So I wonder, if you personalized
the detection, it would probably be much more accurate.
>> Ny Yang: We can refine the app to actually record from the users
themselves, very personalized --
>>: You can adapt the model you have in the cloud.
>> Ny Yang: We can train it with the user's voice.
>>: Actually, you can do it when you're deploying in the marketplace: you can
ask people to confirm whether the recognition is right or not, and use that
to basically tag the data collected.
Good work. A lot of work.
>> Ny Yang: Yes?
>>: I have a question. I think that's a great idea. And I'm interested
particularly in your idea about call centers, because we've all called in and
been on hold and all that stuff. But what if you reversed it, so the computer
could actually know what kind of mood you're in, as opposed to giving
emotions to the computer? Then the computer could say things like, I
understand you're frustrated at this point, sort of deflating tension in a
computer way.
>> Ny Yang: Or move the angrier customers to the top of the queue.
>>: So you can understand --
>> Ny Yang: Pretend to, anyway. Good point, thank you. Yes?
>>: I didn't mean to cut somebody off. I know you showed a couple of
potential applications of it, but do you have any other plans for making it
more ubiquitous, sort of part of the phone? I guess maybe that was part of
the theme, but did you have any other --
>> Ny Yang: Yeah, I want to make it actually embedded in the phone, not a
standalone app. For example, it could constantly monitor or sample
conversations to track how your emotion is changing. So, yeah. Yeah. In the
back.
>>: You kind of indicated at the beginning, when you showed the samples, that
you took a variety of different kinds of samples with different emotional
labels and characterized them as positive or negative.
I was just wondering -- so you've done some high-level labeling, but have you
thought about ways you could label the speech without these kinds of labels?
Something like pride, for instance, is maybe subjective and loaded with other
connotations, but there's other work in emotion which categorizes things in
terms of valence and arousal, which is perhaps a different way of looking at
the --
>> Ny Yang: Another way to classify emotions is to use a quadrant. So
emotions can be differentiated into negative emotions, like anger and
sadness, and positive emotions, like happiness, interest or pride. The other
dimension is active or passive. For example, bored or sad are passive, but
happy or angry are active emotions. So we do have that full quadrant. But
this is just a prototype, so we just want to differentiate happy and sad.
>>: But certain things, like the affect model -- there are two dimensions:
one is valence, one is arousal.
>> Ny Yang: Yeah, yeah, it's with all the emotions there. There are 14
emotions in total, including the neutral emotion.
>>: So actually I have another comment. You define the problem as a
classification problem: you try to classify happy versus sad. But most of the
time we're emotionless when we talk to people. I think the problem can be
better defined as a detection problem: you detect unhappiness or happiness in
the voice. I actually have this observation from one of the papers: a paper
claimed to have 70 percent accuracy in recognizing emotions, but 75 percent
of the time people are emotionless. Basically that means if you just always
guess emotionless, you achieve 70 percent accuracy. That's why defining it as
a detection problem is more reasonable.
>> Ny Yang: For example, there could be a confidence index to say the person
is very angry or very happy or very sad. So if the person --
>>: All this emotion.
>> Ny Yang: Okay. And another important point is that emotion can sometimes
be mixed with neutral --
>>: Right. Then the evaluation metrics become different. But right now you're
trying to use a classification matrix to evaluate.
>> Ny Yang: The features.
>>: And then it becomes a true positive, false positive, false negative
detection problem.
>> Ny Yang: Yeah, I think it might be more accurate if we can combine all
these methods together. For example, we could also try to interpret the
conversation for some keywords to indicate whether people are in an active or
passive emotion, and we could also combine facial expressions or gestures.
Yeah.
So maybe Kinect will be the future.
>> Arjmand Samuel: Any other questions? Let's give a round of applause.
[applause]