>> Mary Czerwinski: Okay. Folks. I think we should get started. I'm very
happy today to introduce you to Na Yang. She's from the University of
Rochester, and she works in a very interesting combination of areas including wireless sensing, speech, HCI and most recently affective computing. So I
think she's going to tell us a lot about that today. And I just want to
welcome you.
>> Na Yang: Yeah, thanks Mary. Thanks for having me. It's my pleasure to come here today, and thank you very much for attending my talk. So today my talk is about emotion sensing. Emotion is really a primary form of communication between humans, and that motivates a lot of HCI projects like behavior sensing or designing context-aware systems. So today I'll talk about how we can use speech for emotion sensing, and how we can combat noise if we want to apply that in the mobile scenario.
So we'll talk about emotion sensing. There are a lot of applications that can be enabled. For example, for fun, if you want to do that in a gaming scenario, or this robot released by SoftBank called Pepper, which is meant to be a member of the family that can talk to people. Also for health: we want to monitor the emotional state of people under pressure, like drivers or representatives working in call centers. One of the applications I think can also be used for children who suffer from autism, so we can better communicate with them by visualizing their emotional state. Also in behavior studies, and that's actually what originated our research at the University of Rochester: our engineering team is collaborating with psychologists to deal with some problems that teenagers and their family members have, so there's very interesting interdisciplinary research going on. And, of course, mobile. With increasing applications on mobile platforms, a lot of context can be provided through the mobile platform, like how fast we type or how many errors we made, but of course speech is still the primary form, and we'll talk about
how we can sense emotions through speech. At Microsoft Research it's a great opportunity that so many researchers are actively working in this field, for example sensing and visualizing affect through sensors or affective fabric. Also Evans has one project working on trying to detect emotion using speech signals and neural networks, applied in the noisy gaming scenario. Also the Directions Robot: researchers have been trying to design systems that can communicate with people in a more natural way. There is also other research going on using social media and mobile sensing as additional context for emotion sensing. In academia this is still a very hot and active field that a lot of researchers put their effort into; these are just a few examples that motivated my research.
That said, the research I focus on is emotion sensing using mobile platforms. The reason is that a mobile phone is a very personal item that people carry all the time, so it might be easier to capture the user's real emotions. And also, if we can train the system using the user's own voice, we can get higher accuracy by using speaker-dependent training. Also, as came up in a chat with Paul the other day, it's rather difficult to detect passive and negative emotions, especially when the user has carried that emotion for a long time; mobile phones, on the other hand, are very good at long-term monitoring. And, of course, mobile can provide other contexts, but of all these contexts speech is easy to capture, can better save battery life, and if we just detect emotions through speech features, not the content of what people say, we can better preserve the user's privacy. But if we apply speech-based emotion sensing on the mobile platform, noise is a factor that cannot be ignored. So that forms the outline: in the next 40 minutes I will first talk about how we designed a noise-resilient speech feature extraction method -- in particular, since pitch is a very important feature, how we designed noise-resilient pitch detection -- then how we designed the system to classify emotions based on the extracted speech features, and finally how we apply the system on a mobile platform.
Okay. So just to give a notion of what pitch is, it's defined as the highness or lowness of a tone as perceived by the ear. So let's play this sample.
>> 309. 309.
>> Na Yang: Do you know which of those has the higher or lower pitch frequency?
>> [indiscernible].
>> 309.
>> Na Yang: 309. Are you sure? How many think that's A? Okay. And the rest
are B? Okay. Actually B. Because 309 -- so we really need the computer to
help us do this pitch detection. It's a very simple illustration. So pitch can be used in a variety of ways: for example, we can detect pitch and do emotion sensing, which is the first one; also [indiscernible] we can do speech recognition; and also in music we can do automatic music transcription or music information retrieval. So this plot shows what pitch looks like in the frequency domain. The airstream is generated by our lungs, like the power supply, and the pitch is generated by the vibration of the vocal cords, which act like a filter. So in the frequency domain we get these spectral peaks. The first dominant peak here is the pitch value we want to detect, and all the peaks that follow are called harmonics. After the signal goes through the vocal tract, which acts like a resonator, it's shaped by ups and downs, so the spectrum on the top shows what the speech signal coming out of our mouths looks like in the frequency domain. It's not necessarily the case that the pitch has the highest amplitude, and that adds difficulty to detecting pitch. Compared with that signal, this signal is what we generate with zero dB [indiscernible], so we cannot tell which peaks are from the noise signal and which are from the speech signal. Let's bring back the clean speech signal. We can see the pitch located at 192 hertz, if you map it to the horizontal frequency axis, and it's followed by the first harmonic, the second, third and fourth. But we can see the peaks corresponding to the noise can also be very high in terms of amplitude.
So the key notion of the algorithm we proposed, called BaNa, is that we just use peak frequencies, not amplitudes. Our algorithm is based on calculating the frequency ratios between the selected spectral peaks. The first step is to select the five peaks with the lowest frequencies; in this example these one, two, three, four, five peaks will be selected. This looks a little bit complicated, but I will explain it in more detail. These one, two, three, four and five peaks are summarized in this table, and we calculate the frequency ratio between each pair of them; for example, 2.04 is calculated here. Then we map all these values to this table, which gives the ideal, expected harmonic ratios. For example, the first harmonic should be located at twice the pitch value: ideally for human speech, harmonics are placed at integer multiples of the pitch value, so for example 100, 200, 300. But at this point we cannot tell whether the five selected peaks correspond to noise or speech, and which order of harmonic they belong to. So we calculate the frequency ratios and get these pitch candidates. And we combine two additional pitch candidates. One is from the cepstrum method, because our ratio-based method focuses on the low-frequency values, and by including the cepstrum pitch candidate we have a better view of the global information in the frequency domain. The other additional pitch candidate is the lowest selected frequency itself, because -- for those of you expert in the signal processing domain -- for some speech signals, if we look at the spectrum there's only one dominant peak, and we cannot calculate a ratio when there's only one. By combining all the pitch candidates calculated here and above, we put them all into a table and count how many close candidates there are, and we use that count as a confidence score. For example, we can see this candidate, 198, has the highest confidence score. The higher the confidence score, the more likely that is the real pitch value.
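A minimal sketch of the harmonic-ratio idea just described, not the actual BaNa implementation: the ratio table size, tolerances, peak values, and function names here are illustrative assumptions.

```python
import numpy as np

# Expected frequency ratios between harmonic p and harmonic q (p > q) for ideal
# voiced speech, where harmonic k sits at k * F0 (e.g. 2/1 = 2.0, 3/2 = 1.5, ...).
# A simplified, hypothetical version of the ratio table mentioned in the talk.
HARMONIC_RATIOS = {(p, q): p / q for p in range(2, 6) for q in range(1, p)}

def pitch_candidates(peak_freqs_hz, tolerance=0.05):
    """Derive pitch (F0) candidates from the 5 lowest-frequency spectral peaks.

    For every pair of peaks, if their frequency ratio is close to an ideal
    harmonic ratio p/q, the pair is interpreted as the p-th and q-th harmonics
    and the implied F0 = f_low / q is recorded as a candidate.
    """
    peaks = sorted(peak_freqs_hz)[:5]
    candidates = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            ratio = peaks[j] / peaks[i]
            for (p, q), ideal in HARMONIC_RATIOS.items():
                if abs(ratio - ideal) / ideal < tolerance:
                    candidates.append(peaks[i] / q)   # implied fundamental
    return candidates

def score_candidates(candidates, tolerance_hz=10.0):
    """Confidence score = number of candidates that agree within a tolerance."""
    scored = []
    for c in candidates:
        votes = sum(1 for other in candidates if abs(other - c) <= tolerance_hz)
        scored.append((c, votes))
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Example: peaks near 198, 396, 594, 792 Hz plus one noise peak at 310 Hz.
# The candidate around 198 Hz collects the most votes.
print(score_candidates(pitch_candidates([198, 310, 396, 594, 792]))[:3])
```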
The final step of the BaNa algorithm is to build a cost function to calculate the cost between pitch candidates in neighboring frames. The first term of the cost function is the frequency difference: if the difference between two candidates is small, which is more likely for human speech, the cost is low. The second term is the confidence score: the higher the confidence score, the lower the cost. We go through all the frames, determine the final pitch from all these candidates, and that forms our BaNa algorithm.
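A rough sketch of how such a frame-to-frame cost could be minimized with dynamic programming; the weights and the exact cost terms are illustrative assumptions, not the values used by BaNa.

```python
import numpy as np

def select_pitch_track(frames, w_jump=1.0, w_conf=20.0):
    """Pick one pitch value per frame from scored candidates via dynamic programming.

    frames: list of lists of (freq_hz, confidence) candidates, one list per frame.
    Transition cost = w_jump * |f_t - f_{t-1}|; each candidate also earns a bonus
    of -w_conf * confidence, so small pitch jumps and high confidence are preferred.
    """
    best = [[-w_conf * c for _, c in frames[0]]]   # accumulated cost per candidate
    back = [[None] * len(frames[0])]               # backpointers
    for t in range(1, len(frames)):
        costs, links = [], []
        for f, c in frames[t]:
            trans = [best[t - 1][j] + w_jump * abs(f - fp)
                     for j, (fp, _) in enumerate(frames[t - 1])]
            j_best = int(np.argmin(trans))
            costs.append(trans[j_best] - w_conf * c)
            links.append(j_best)
        best.append(costs)
        back.append(links)
    # Backtrack the cheapest path through the candidates
    i = int(np.argmin(best[-1]))
    track = []
    for t in range(len(frames) - 1, -1, -1):
        track.append(frames[t][i][0])
        i = back[t][i] if back[t][i] is not None else i
    return track[::-1]

# Example: three frames, each with two (frequency, confidence) candidates
frames = [[(198.0, 6), (99.0, 2)], [(200.0, 5), (400.0, 3)], [(196.0, 6), (310.0, 1)]]
print(select_pitch_track(frames))   # -> a smooth track near 198 Hz
```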
So to evaluate the BaNa algorithm, we introduce one error metric called the gross pitch error rate. If the detected pitch deviates by more than 10% from the real pitch value, we say that frame's pitch is wrongly detected. The gross pitch error rate is the percentage of such frames, so a higher value is worse.
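A small sketch of that metric as just described (a frame counts as an error when the detected pitch deviates by more than 10% from the ground truth); the voicing convention and example numbers are assumptions for illustration.

```python
import numpy as np

def gross_pitch_error(detected_hz, ground_truth_hz, threshold=0.10):
    """Gross pitch error rate: the fraction of voiced frames whose detected
    pitch deviates from the ground-truth pitch by more than 10%. Lower is better."""
    detected = np.asarray(detected_hz, dtype=float)
    truth = np.asarray(ground_truth_hz, dtype=float)
    voiced = truth > 0                       # frames that have a ground-truth pitch
    rel_err = np.abs(detected[voiced] - truth[voiced]) / truth[voiced]
    return float(np.mean(rel_err > threshold))

# Example: 1 of 4 voiced frames deviates by more than 10%  -> GPE = 0.25
print(gross_pitch_error([198, 150, 200, 205, 0], [200, 200, 201, 199, 0]))
```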
And we need a ground truth. For example, I can play this audio file again.
>> 309.
>> Na Yang: So that is the sample from the quiz earlier. We take the detected pitch values from three very well-performing algorithms and average them to be the ground truth. For the noise, we add noise to the clean speech; for example, 108 in this sample.
>> 108.
>> 108.
>> Na Yang: White noise.
>> Na Yang: So we validate our system with different types of noise and at different noise levels.
>> 108.
>> Na Yang: That's the cleanest one.
>> 108. 108.
>> Na Yang: That's zero dB, the noisiest one. And we compare our BaNa algorithm with both classic and very recent algorithms. The algorithms listed in blue are the ones I will compare with later. Because babble noise is the most common type of noise, we show it here: we vary the noise level from zero dB SNR, which is the worst, where the gross pitch error rate is highest, down to the cleanest case at 20 dB SNR. Our BaNa algorithm performs better than all the previous algorithms. This just shows the result on one dataset; in our research we validated on three datasets in total, and I present one of them here.
>> [indiscernible].
>> Na Yang: That one has 20 samples. But by arranging all different types of noise and all different noise levels, it's still a large dataset. And we have others, CSTR and Keele, which maybe you're quite familiar with; those have more than 100 clean samples. If we multiply that by one, two, three, four, five different SNR values and by eight different noise types, that's a huge dataset.
>> Are they all samples with noise added, or do any come with environmental noise?
>> Na Yang: With a type of noise added.
>> They all start off with clean samples.
>> Na Yang: Yes.
>> Are the lengths typical of the examples you showed?
>> Na Yang: Some samples are around ten seconds or some are two seconds, three
seconds.
>> Are they all prompted speech or are they conversational speech.
>> Na Yang: They're all prompted speech. That makes it easier to get the ground-truth pitch value. There are also datasets with speech recorded in real noisy scenarios, but for those it's hard to get the ground truth. So PEFAC and [indiscernible] are the two most powerful algorithms we compare against. We compare the performance at zero dB for different types of noise, and we can see BaNa gets the best performance for four out of eight noise types.
Also, the BaNa algorithm is open source. We made it available on our group's website, and we also developed an application that can visualize your pitch value; if you're interested you're more than welcome to download it.
So the second topic is how we can use these extracted pitch features to do emotion classification. Emotions can be classified along the arousal dimension, active or passive, or along the valence dimension, positive or negative. Some people also use categories; for example, all these blocks here show different emotion categories. In our research we only pick the six basic emotions, because that makes it easier to compare with other work that also uses the same six emotions, and these six emotions are widely used in psychology studies. There are many datasets out there with emotional speech. Some are acted, some are natural conversation, but for natural conversation we have to have a human coder label the emotions, so we just use acted ones, where actors or actresses are invited to perform different types of emotions. Some datasets use audio only, some audio plus visual. Of course, it is more complicated for the system if we combine other modalities, so we just use audio. Also, one of the challenges that previous researchers pointed out is that if you use both audio and visual, some speakers tend to express their emotion only by facial expression, so their audio alone doesn't tell us much about the emotions. So we just use audio. We only focus on English, so we chose the LDC dataset; there are other datasets out there for different languages. The LDC dataset just contains numbers and dates, so the speech content has neutral meaning but is expressed in different ways. For example, here is a male speaker with different emotions.
>> 108.
>> Na Yang: What's happening?
>> August 18th!
>> Na Yang: You can hear the difference.
>> December 1st. March 21st, 2001. December 12th.
>> Na Yang: That's a very good dataset, and we use it especially because a lot of the literature uses it, so it's a good way to compare our results with other benchmarks. Also, through a collaboration between the University of Rochester and the University of Georgia, we collected over 10,000 samples from undergraduates, expressed in different ways. But of course they're not professionals. So if you listen to their recordings.
>> 506. 502. October 12th. October 1st. 4,001. 203.
>> Na Yang: From my point of view, the emotions sound quite similar, especially since those were recorded in a relatively noisy environment. So --
>>: Sounded like that was [indiscernible] speech. What conditions were they -- were they recording over the phone, or --
>> Na Yang: Yeah, I think --
>>: In a lab?
>> Na Yang: I think the quality is not that good.
>>: But they did it over a phone?
>> Na Yang: Yes.
>>: It's eight kilohertz.
>> Na Yang: Yeah. So I'll show the results on these two datasets.
>> Do you have numbers on how well people do with classifying this?
>> Na Yang: They just do the recording, but we don't have human coders actually label whether the recording really matched that emotion. Yeah. So these are the speech features we used, both in the frequency domain and in the energy domain. These are just very basic features that people widely use. We also include the difference of pitch and the difference of energy, because that can give us a better picture of how the pitch, the tone, and the energy change over time. For all these features we extract five statistic values, plus the speaking rate, so in total our feature set has 121 dimensions.
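A reduced sketch of how per-frame contours could be turned into the kind of utterance-level statistics described here; the particular five statistics, contour names, and dimensions below are assumptions, not the exact 121-dimensional feature set from the talk.

```python
import numpy as np

def contour_stats(x):
    """Five summary statistics over a per-frame contour (e.g. pitch or energy)."""
    x = np.asarray(x, dtype=float)
    return [x.mean(), x.std(), x.min(), x.max(), x.max() - x.min()]

def utterance_features(pitch_hz, energy, speaking_rate):
    """Illustrative utterance-level feature vector: statistics of pitch, energy,
    and their frame-to-frame differences (deltas), plus the speaking rate."""
    feats = []
    for contour in (pitch_hz, energy, np.diff(pitch_hz), np.diff(energy)):
        feats.extend(contour_stats(contour))
    feats.append(speaking_rate)
    return np.array(feats)          # 4 contours x 5 stats + 1 = 21 dimensions

# Example with made-up contours
f0 = [180, 185, 190, 210, 220, 200]
en = [0.2, 0.5, 0.6, 0.8, 0.7, 0.3]
print(utterance_features(f0, en, speaking_rate=3.5).shape)   # (21,)
```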
The classifier we use is a support vector machine. The reason is that, compared with unsupervised learning methods, we can take advantage of the labels we have for the emotions; also we use an RBF kernel, which can better deal with linearly inseparable data, and we have the C parameter that can be tuned to prevent overfitting. Of course, there are other methods out there.
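A hedged scikit-learn sketch of an RBF-kernel SVM trained one-vs-rest with a tunable C, in the spirit of the setup described; the feature matrix, labels, and parameter values are made up for illustration and are not the talk's configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical training data: rows are utterance-level feature vectors,
# labels are six basic emotions.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 21))
y = rng.choice(["anger", "disgust", "fear", "happy", "sad", "neutral"], size=120)

# RBF-kernel SVM trained one-vs-rest; C trades off margin width against
# training error, which helps prevent overfitting as mentioned above.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, decision_function_shape="ovr"))
clf.fit(X, y)

scores = clf.decision_function(X[:3])   # one confidence score per emotion class
print(scores.shape)                      # (3, 6)
```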
Deep networks, for example, are the method used by Persy Evans, an intern; that could be added on top of our system. Now, if we listen to this sample -- here comes the second quiz -- what's the emotion for this speech sample?
>> June 28th. May 29th. September 3rd. 505. 312.
>> Na Yang: Yep.
>> [indiscernible].
>> Na Yang: Yep.
>> October 12th!
>> Na Yang: The second one.
>> 8,005! 906! 203!
>> Na Yang: So how many think that's -- okay. But I saw some didn't -- okay. So here comes the divergence. For some samples we can clearly see what they want to express, but some samples are relatively ambiguous. So our approach uses this concept to just throw away those ambiguous samples.
>> Is the sample itself ambiguous, or is it that for some of those it's harder to determine the difference between, say, angry and disgusted?
>> Na Yang: Yeah, some of them are thrown away.
>> You're throwing away based on people labeling them ambiguous or by the
algorithm coming up with ambiguous things.
>> Na Yang: By the system -- by the confidence of the system's output. That sounds like a really simple idea, but we want to see how it can help us improve the accuracy of the system. Yeah. So ours is quite simple: we use support vector machines with one-against-all classifiers for the different emotions, for example happy or not. We train the system using some part of the LDC dataset and do cross validation to test on new samples, or we use speaker-independent methods to test the performance of the system on a new speaker. The fusion step, as you can see here, just compares the one-against-all output with the highest confidence score against a threshold gamma. Gamma can be controlled by the user: if that value is greater than gamma, then we are confident enough to classify that sample into a certain emotion category; otherwise we just reject that sample.
Now in our system we use three additional enhancement strategies -- speaker normalization, oversampling the training set, and feature selection -- to further improve the accuracy. This plot shows the accuracy versus the rejection rate, where the rejection rate means what percentage of samples we throw away. We can see that as we throw away more samples, higher accuracy can be gained, with the curve growing from around 80% if we throw away nothing to around 92% if we throw away half of the samples. And 80% should be compared with one out of six, because we classify one emotion out of six emotion categories; that baseline is around 17 percent.
So we can see that it's a huge improvement. This is based on cross validation on the LDC dataset. The red curve is the gender-independent case, the blue curve is gender-dependent male, and the black curve is gender-dependent female. We can see that if we train the system using only female speech and test it on females, we get better accuracy. But still, we are trading higher system accuracy against throwing away some samples, and that's based on the assumption that for some speech-based systems or applications, we can throw away some samples if we are not confident about them.
>>: Seems like there's an important thing to add, maybe bringing up what Steve was bringing up earlier. A, it probably matters whether the classifier's notion of confidence corresponds to human judges' notion of confidence, because if those are different it might put you in a funny place -- you might be rejecting things that humans don't find ambiguous. You want to make sure those are aligned; that's one thing. Another thing would be to ask what that notion of confidence looks like when you're looking at, say, natural speech, because I would expect that acted samples tend to be quite extreme.
>>: That's why we use that.
>>: Natural speech, it would be interesting to see.
>> Na Yang: That actually brings us to our current ongoing research. To answer your first question, we are comparing the results of our system with human coders: we are uploading all the samples to Amazon Mechanical Turk and having human coders label them, and it will be interesting to compare how the system does. And to answer your second question, right now we're just working with the acted dataset, because it's widely used and I'll show the performance comparison with other papers later. That also brings up our future work, because we are collaborating with some psychologists who are doing interesting user studies using real conversations between family members, or conflicts involving teenagers as an interesting group. So, yeah, that should be interesting.
>>: I'm curious about the accuracy. Is it based only on the samples that are left after rejection, or on all samples?
>> Na Yang: That's a very good question. We just ignore the samples that are thrown away; it's based only on the samples we are confident enough to classify into an emotion.
>>: You estimate the accuracy based on what's left.
>> Na Yang: Yes.
>>: That explains the monotonic?
>> Na Yang: Yes, thanks for adding that.
>>: In the previous slide, you mentioned speaker normalization. How do you do this?
>> Na Yang: Speaker normalization?
>>: Yes.
>> Na Yang: It's done per speaker, because we have each speaker express different emotions, with around 100 samples per speaker. For each metric, which is a statistic value of a feature, we compute the mean and standard deviation and do z-score normalization.
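A small sketch of per-speaker z-score normalization as just described; the data, speaker labels, and the epsilon guard are assumptions added for illustration.

```python
import numpy as np

def speaker_zscore(features, speaker_ids):
    """Per-speaker z-score normalization: for each speaker, subtract that
    speaker's mean and divide by that speaker's standard deviation,
    feature dimension by feature dimension."""
    features = np.asarray(features, dtype=float)
    normalized = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        rows = np.asarray(speaker_ids) == spk
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0) + 1e-8   # avoid division by zero
        normalized[rows] = (features[rows] - mu) / sigma
    return normalized

# Example: two speakers with different baseline pitch ranges
X = np.array([[180.0, 0.5], [220.0, 0.7],    # speaker A
              [110.0, 0.4], [130.0, 0.6]])   # speaker B
print(speaker_zscore(X, ["A", "A", "B", "B"]))
```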
>>: [indiscernible].
>> Na Yang: Yeah, so that's the challenge: if a new speaker comes, we don't know the normal speaking rate for them. I think that's quite a widely shared challenge right now.
>>: Thank you.
>> Na Yang: Thank you. So that's why we normally base the accuracy on throwing away some data. Here comes the interesting corpus I talked about, the UGA dataset. We can see the performance drops a lot, from around 80% with no data rejected to around 45%. We also compare our results with other work; because the error measurement metrics are different for different work, the numbers are not directly comparable. Our system is listed above and the numbers from the EmotionSense paper are listed below, and the numbers shown in bold indicate which system is doing better. This is based on speaker-independent training, because the voice of the new speaker is not used for training the system, so we can see the performance drops by around 40 percent.
>>: Some level of data rejection?
>> Na Yang: No --
>>: No data rejection in this case.
>> Na Yang: Yes. It's the starting point of the curve.
>>: It seems like for your classifier-level accuracy, your numbers are in the nineties.
>> Na Yang: That's a very good question. Classifier level means we just measure the accuracy for each one-against-all classifier.
>>: Right. But in your previous figures, when you had zero percent data rejection, it started off in like the 70s or 80s, it seemed.
>> Na Yang: At the classifier level we just measure the performance of, say, happy or not, so that's just a binary classification; we are comparing with a 50/50 baseline. Decision level means after the fusion -- here's the decision level, the final outcome.
>>: Can you go several slides forward, to the second database results? Follow up.
>> Na Yang: Here.
>>: Yes. So you get the recognition, obviously, because the left is [indiscernible] and it's higher in emotional contrast. This correct classification rate -- is it weighted or not weighted?
>> Na Yang: Not weighted.
>>: What is the proportion of different emotions in your dataset?
>> Na Yang: Pretty much an equal amount of different emotions in the dataset. That's almost balanced. Yeah, neutral is relatively small.
>>: It's pretty much weighted then?
>> Na Yang: Yeah. Yeah. Right. Since we're on that: you have a term here called decision-level classification rate. We just sum up the correctly classified samples over all the emotions and divide by all the samples we tested. Yeah. So thanks.
>>: And the number of users recognized correctly? The percentage?
>> Na Yang: Percentage, yes. So this plot shows how noise influences our results. We are comparing with the red line here, where we train and test on clean speech. If we train and test on noisy speech, the performance drops; if we train on clean and test on noisy speech, the performance drops a lot. The two curves in between are when we train and test on the same type of noise; then the performance doesn't drop too much. So the conclusion is that if we test on noisy speech, we would rather train on noisy speech.
So for the mobile implementation: during my previous internship here at Microsoft Research three years ago, I was honored to collaborate with my mentor [indiscernible] in the connections group. We developed a prototype of a mobile emotion sensor. It's quite a simple one that can only classify the user as happy or not happy. We capture the user's voice in real time, and we train the system offline, so we can train on the LDC dataset. Then we apply the trained model to the extracted speech features and get the predicted emotion. This work was presented at the Microsoft Research TechFest in 2012 and we won an award for the Microsoft [indiscernible].
That's my mentor holding it up. Here comes the GUI time. That one was just a very simple prototype for a binary decision, and this one implements the entire system; of course this is the version from my lab. We can visualize the extracted speech features and select which model to use. Okay, we are here. With this GUI we can, for example, record or load a sample from the LDC or UGA dataset. Just a random one.
>>: 4,012!
>> Na Yang: So the real emotion is anger, acted by a female speaker. We set the threshold gamma to zero because we don't want to throw away the sample. We can extract the features -- it takes some time. Sorry, I didn't drag the window here. So these are the speech features extracted. Now we choose to test the sample using the model trained on the LDC dataset. Okay, drag here. You can see the result plotted on the four quadrants: that's an angry emotion. I can show this GUI in more detail after the talk.
So that can just -- okay, here it comes with happy or other emotions. To summarize, my research is about how to combat noise in a mobile gaming or mobile voice-based application scenario. That motivated us to design a noise-resilient speech feature extraction method; we applied the BaNa algorithm and also state-of-the-art speech feature extraction methods to classify emotions based on a standard dataset, and applied that to real users. Finally, I showed a very simple prototype for how that can be used on mobile platforms. The app is called [indiscernible]. These are the papers; I also have a background in sensor networks and wireless communications, and these two journal papers are still in review.
For future work, besides the pitch detection algorithm, we are still working on how to improve other speech feature extraction methods, like MFCC and other features, in noisy environments. We also want to continue to implement the entire system on mobile platforms. And, as mentioned, we are continuing two projects, one on Mechanical Turk and the other using the real data gathered through the collaboration with the psychology department. So my research is actually based on the first two stages, sensing and interpreting; the entire field of affective computing spans a really broad range of research and applications. We can also work on visualization, designing interesting ways to visualize key emotions, and we can take interventions whenever necessary to help people, for example people with depression, by monitoring them long term and providing intervention.
But let's step back a little and look ahead. At the end of my talk I want to share some notes, "Reasons to be cheerful, part four," from this blog article. Sometimes we want to step back and rethink whether our emotion sensing work is really tailored to a person's needs, or whether the user may just feel compelled to do the suggested tasks; whether we only prompt the user to be cheerful, or actually improve their emotional state. The author did quite an interesting survey of the apps currently on the market that try to cheer you up, and by trying all these different kinds of apps, she asks whether an app just presents something generic or can really tailor to that person's personal needs and really improve their emotional state. I think that's a very good guideline for my research in the future.
Okay. So, some possibilities for applying this emotion sensing system to Microsoft products: it can be applied across different devices or across different services, and any device with a mic can use this system. One of the interesting ones is the smart watch -- emotional [indiscernible] was proposed by Mary -- because that's also an emerging market right now. All the currently available products, for example smart bands or smart watches, just focus on fitness, for example monitoring the heart rate. But if you imagine your watch can listen to your voice and really understand your feelings, that can help to improve your emotional fitness. So that's a very interesting concept I want to bring up here.
For different services, we can use this voice-based emotion sensing in gaming scenarios like Kinect, in call centers to monitor representatives working under pressure, in voice search, for example on the Bing platform, and in targeted ads. And of course Cortana: if you imagine your digital assistant can understand your feelings, that can provide a scenario similar to the movie Her -- not just communicating with you and telling you what mail is there, but also comforting you when you are down. That should be a very interesting scenario. And Skype Translator: this is a demo of Skype Translator at the Code Conference.
A person speaking English is communicating with another person speaking German; we can see the big smile on the screen and you can tell the emotion. But what if you can only hear the person's voice, and it's in a totally different language you know nothing about? If we can detect the emotions and tell the user on this side, they can communicate much better. And that sums up my talk. I want to thank all my collaborators, my advisor Wendi Heinzelman and Melissa Sturge-Apple in the psychology department. Our work has been featured in several media outlets, and I want to thank you all. I welcome any questions you may have. [applause]
>> Mary Czerwinski: Got a lot already.
Is emotion interpreted the same way by people in different cultures?
>> Na Yang: That's one footnote I put on the Skype Translator slide, and I forgot to mention it: emotion expression differs from culture to culture, especially between different countries. So it might be very useful if we can convey emotions through the system to help you understand what the other person's feeling is.
>>: If that's true, then when you did the Mechanical Turk studies, did you kind of try to pick Turkers of the same culture as the samples?
>> Na Yang: Yes, that's exactly what we are doing now. We're making sure, yeah.
>>: Doesn't necessarily mean that they're from the same --
>> Na Yang: We can -- yeah, I know that.
[laughter].
>>: You guys [indiscernible] this emotion -- let's take a step back and say we have a perfect emotion detector. We know the emotion for sure. You mentioned that one of the applications is helping the person to improve their emotional fitness. What else can be -- basically, what applications could we have with a perfect emotion detection system? I want to see that.
>> Na Yang: That's a really interesting question. For humans, if we talk face to face, it's quite simple: within several seconds we can catch the other person's emotions. But we think a system is needed to do it in an unobtrusive, objective way. For example, as I mentioned, it can help children who suffer from autism better communicate their emotions, so there are needs. Also, in the gaming scenario, if a person's emotion can be reflected in his or her character, that should be really interesting. But of course that's all assuming the emotion sensing system is perfect, because if it makes a wrong decision, that can sometimes be annoying.
>>: So you compared the trained actors, who did the emotions more strongly, versus the college students, where you're saying it wasn't as clean. What does a natural person just speaking naturally sound like -- more like the trained actor, or is there more ambiguity there?
>> Na Yang: Actually, we did a very early study trying to analyze some real data collected in a lab environment, a naturalistic way of collecting some communication between a child and their parent. A student in the psychology department actually labeled the emotions, and most of the samples were labeled as neutral rather than happy or upset. And it's just a scale -- you can score up or down. I think most of the speech is neutral or ambiguous, with very subtly expressed emotion that is hard to sense, and I think that's why we want to throw away such samples automatically.
>>: Seems like if you had a user talking to Cortana, wouldn't most of their speech be neutral -- is that what you're saying?
>> Na Yang: That's one of the challenges: most speech is neutral. We want to do long-term monitoring. Yes. Yes.
>>: Is there a notion of an emotion for a group of people? Like if you were recording at a party, or in the case where you're adding noise to a clean sample -- what if the babble itself had an emotion, like happy babble versus sad babble, like you recorded it at an Alcoholics Anonymous meeting? Might it alter the interpretation toward different types of emotions?
>> Na Yang: Whenever I play my [indiscernible] on Kinect, my home becomes the entertainment center, and I think the gaming scenario usually involves multiple people. Evelyn has done some research where we want to validate the system with multiple speakers, with some background music or [indiscernible] in the room, and also noise from the air conditioner. So there are all kinds of noise and reverberation in the room. That's still an open question, and it has to be handled by source separation of multiple people; a lot of papers are working on that. So that involves multiple challenges. Right now we're just focusing on a single speaker.
>>: So you focused on speech here. Is there a sense in the community, among people who work on multiple modalities, whether more of the signal is in the visual channel or the speech channel? And are they orthogonal, or do you expect the same samples to be hard in the same way, so you're not getting more by doing both?
>> Na Yang: Yeah, I think we can improve the accuracy by combining other modalities. We have to determine how to combine the decisions from different modalities if they do not agree. I think we can also sample the speech or video several times and see what the majority detected emotion is, so there are different options there. But for a sophisticated and more realistic system, of course, we have to combine other modalities. Yes?
>>: I was wondering what your thoughts are on whether what you have now could extend beyond the six basic emotions, or where you see that working and not working, and why.
>> Na Yang: If we look at the six emotions, they are placed quite distant from each other; that makes life easier when differentiating them. But, for example, anger and disgust, two emotions within the six we're currently studying, are already relatively close to each other. And some emotional states combine several emotions. Some other researchers also give a confidence score for each of the six emotions, so sometimes maybe a speech sample is 70 percent happy, or combined with some subtle emotions like relaxed or hesitant. So that should be an interesting question, because in real scenarios emotion is so complicated you can't just say it's one emotion. Yeah, I hope that answers your question.
>>: When the psychologist was labeling the natural speech from the
parent/child interaction, were they listening to the direct audio of the
English or was it garbled? Like were they able to hear the words the person is
saying as well.
>> Na Yang: We usually use a flash card, like the LDC.
>>: No when you said there's a student in the psychology department labeling
the natural interactions between a parent and teenager, were they actually
listening.
>> Na Yang: Just to the audio.
>>: Then the problem is you don't know how much is in the content versus the speech. I wonder if you could actually garble it -- put in some garbled version that retains the pitch and, for instance, the MFCCs, but where the speech content is gone. A much harder test for them, of course, but then they have to label based on what they're hearing, and maybe what they're seeing, but not the words.
>> Na Yang: Yep.
>> Mary Czerwinski: All right. Let's thank our speaker again.
[applause]
>> Na Yang: Thank you very much for coming.