>> Zhengyou Zhang: Good morning, and it's my pleasure to introduce Carlos
Busso. Busso, sorry. He's finishing his Ph.D. at USC, and he got his master's
degree from the University of Chile. And your defense will be next week, right?
>> Carlos Busso: No, next Friday. This Friday.
>> Zhengyou Zhang: So please.
>> Carlos Busso: Thank you very much. So first of all, thank you very much for
the invitation to be here at Microsoft to present part of the work from my Ph.D.
What I will present today is the analysis of multimodal behavior using instrumented
spaces. So let me start by showing some examples of the type of data we are
working on.
>>: Piece of rubber and hold it like that and just smokes around.
>>: Burnt hair. That's my favorite.
>>: So you have no problem with other people lighting up whatever they want?
>>: Actually I'm --
>>: And stinking up a space?
>>: I think you shouldn't be allowed to smoke.
>> Carlos Busso: So this video was recorded in our smart room, in which we use
non-intrusive sensors to record the data. The second video.
>>: You have a business here. What the hell is this?
>>: The business? The business doesn't inspire me.
>>: Oh, must you be inspired?
>>: Yes. I like it for --
>> Carlos Busso: So the second video was recorded between two actors, okay.
And a nice feature of that recording is that we put markers on the face and head
and the hands. So we have more detailed information about the gestures.
>>: They're in two different rooms, right?
>> Carlos Busso: Same room.
>>: That dress looks like it comes from Asia.
>> Carlos Busso: And here, in even more detail, we have just one speaker reading
some sentences with different emotions. So yeah, here let me -- they were
talking to each other, so we put two cameras looking at each of them. But they
were in the same room.
So these three videos show different aspects of human communication that we
will address during this talk.
So what we envision is a system that will be able to capture every single thing
that happens during a discussion. Everything that a person could sense in the
room, we want to capture. From the environment, we want to extract who is in the
room, the locations, the identity of the participants. Also at the group level, we
want to analyze the dynamics of the discussion. And at the individual level we
also want to model different aspects, for example the emotions that are expressed
by each of the participants.
And some examples in which this framework could be very useful are analyzing
teamwork collaboration; also, for applications such as retrieval, this kind of
annotation would be very useful; and also for focus groups and observational
practice, for example therapy sessions, okay?
>>: (Inaudible) what you use it for?
>> Carlos Busso: You can provide annotations for post (inaudible), for example,
post analysis. One example here is therapy sessions, for example couples therapy.
They spend a lot of time annotating, for example, who is speaking; they spend a lot
of time analyzing, for example, whether they are pointing to each other, the different
gestures that for their purposes are important. So if we can somehow provide
annotation of some of those aspects, that would be very beneficial for them.
So in my Ph.D. my interest was in studying human behavior at multiple levels.
We start from the very beginning, with how we acquire the data. And this includes
source localization, speaker identification and so on -- building the system. Then,
how we use this information from different modalities to start detecting who is
speaking, for example, or the location of the participants. This is multimodal
fusion.
The next step is how we make use of the information that we have to track, for
example, the interaction, to model the user. Going into even more detail, we want
to analyze the relationships between gestures and speech, and emotions. And
finally, if we are able to understand the relationship between gesture and speech,
for example, we can synthesize believable agents that could be incorporated in the
room, so we would not only have speakers but you could also have agents there
that could be used, for example, to provide feedback to the participants.
While the first three blocks here were studied using the smart room environment
I will show in the next slide, these two blocks were studied using this controlled
environment with a motion-capture system. And our goal here is to bring smartness
to the Smartroom.
So this is the big picture that we have in our lab, which many of my colleagues
are working on right now, and my research addresses some of these blocks.
My presentation will follow the same structure that I just mentioned, so we will
start with the acquisition, how we use the data, how we model each of the users,
and then we will move to a finer analysis of gestures and speech.
And then I will briefly mention how we can synthesize some aspects of expressive
behavior.
So this is the room that we have. From the very beginning we decided to use
non-invasive sensors in this room to try to infer as much as possible about the
users. So in this setting we have a table here, we have four cameras on the
ceiling, and we have an omnidirectional camera behind the table which, by the
use of mirrors, can acquire these kinds of images.
For the audio modalities we have a 16-microphone array, which is also in the
middle of the table. In addition to that, we have extra high-quality microphones
there to have a clean signal from the participants.
So let me explain very briefly how we handle each modality. And let me just
clarify that this part was done in collaboration with other students in the
computer vision laboratory.
From the ceiling cameras, what we're interested in is the location of the
participants. How do we do that? We learn a background model of the room, and
then, by detecting moving regions, we create silhouettes like this. Then from
these silhouettes we create 3D hulls, and we take the local maxima of the hulls
and model those as the heads of the participants. You can see here that, with
this approach, most of the noise appears near the participants. For example, if
you move a paper on the desk, that will be considered as motion and that will
create some noise here.
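A minimal sketch of the ceiling-camera pipeline described above (background silhouettes, a 3D hull, local maxima treated as heads), in Python with NumPy. The voting threshold, the voxel grid, and the take-the-highest-voxel-per-column rule are illustrative assumptions, not details from the talk.

```python
import numpy as np

def head_candidates(silhouettes, projections, grid, thresh=0.7):
    """Toy visual-hull head detector.

    silhouettes : list of HxW binary masks, one per ceiling camera
    projections : list of functions mapping (N,3) world points -> (N,2) pixels
    grid        : (N,3) array of voxel centers covering the room
    """
    votes = np.zeros(len(grid))
    for mask, proj in zip(silhouettes, projections):
        uv = np.round(proj(grid)).astype(int)            # project voxels into the image
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = np.zeros(len(grid), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]] > 0
        votes += hit
    occupied = grid[votes >= thresh * len(silhouettes)]  # voxels inside most silhouettes
    # take the highest occupied voxel in each (x, y) column as a head candidate
    heads = {}
    for x, y, z in occupied:
        key = (round(x, 1), round(y, 1))
        heads[key] = max(z, heads.get(key, -np.inf))
    return [(x, y, z) for (x, y), z in heads.items()]
```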
From the omnidirectional camera what we are interested in is the faces of the
participants -- the angle of each of the participants. And the way we do this is we
use the OpenCV library to track the faces. So the output of this modality will be
the angle of each of the participants. From the microphone array we are interested
in the acoustic source, and the approach that we follow here is the time difference
of arrival approach, in which we estimate the delay between each pair of
microphones.
So there are 16 microphones, so there are 240 pairs, and we estimate the delay
for each of those pairs. And then we detect the source: each of these pairs gives
us a direction estimate, and we take the mode -- we say that the direction of the
acoustic source is given by the maximum over the results from all of the pairs.
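A hedged sketch of the time-difference-of-arrival idea: the talk only says that delays are estimated for each microphone pair and the source direction is taken as the maximum over the pairwise results, so the GCC-PHAT delay estimator and the Gaussian angle-voting below are assumptions used to make the sketch concrete.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau):
    """Estimate the time delay between two microphone signals with GCC-PHAT."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)        # phase transform weighting
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs       # delay in seconds

def source_angle(frames, mic_xy, fs, c=343.0, n_angles=360):
    """Vote over candidate angles with the delays of all microphone pairs."""
    angles = np.linspace(-np.pi, np.pi, n_angles, endpoint=False)
    votes = np.zeros(n_angles)
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            d = mic_xy[j] - mic_xy[i]
            tau = gcc_phat(frames[i], frames[j], fs, np.linalg.norm(d) / c)
            # predicted pairwise delay for a far-field source at each candidate angle
            pred = (d[0] * np.cos(angles) + d[1] * np.sin(angles)) / c
            votes += np.exp(-0.5 * ((pred - tau) / 1e-4) ** 2)
    return angles[np.argmax(votes)]
```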
For speaker ID, here again we use these microphones, and we train Gaussian
mixture models with spectral features, MFCCs, and we also use a background
model to detect silence periods. While most of the techniques that I have just
mentioned are very standard, the effort that we made in order to make this system
work in real time was a very big challenge. Just as an example, in order to make
the microphone array work we implemented 240 threads which were synchronized
in order to do the processing in parallel.
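A small sketch of the speaker-ID component as described: per-speaker Gaussian mixture models trained on spectral (MFCC) features plus a background model used to reject silence. The scikit-learn API and the 16-component diagonal GMMs are implementation assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class SpeakerID:
    """GMM speaker identification plus a background model for silence/unknown."""

    def __init__(self, n_components=16):
        self.n = n_components
        self.models = {}        # speaker name -> GaussianMixture
        self.background = None  # trained on pooled speech + room noise

    def enroll(self, name, mfcc):                       # mfcc: (frames, coeffs)
        self.models[name] = GaussianMixture(self.n, covariance_type='diag').fit(mfcc)

    def fit_background(self, mfcc):
        self.background = GaussianMixture(self.n, covariance_type='diag').fit(mfcc)

    def classify(self, mfcc):
        scores = {name: gmm.score(mfcc) for name, gmm in self.models.items()}
        best = max(scores, key=scores.get)
        if self.background is not None and self.background.score(mfcc) > scores[best]:
            return None, scores                         # silence / non-speech segment
        return best, scores
```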
Okay. So the next question is how we can fuse this information to start answering
some of the questions that we have. In particular, what we want to know is: where
are the subjects, who are the subjects, and who is speaking?
It is well known that visual modalities have better spatial resolution, so we rely
only on cameras to track the location of the participants, and then we use the
acoustic modalities, as we show here, to detect the active participant and for
participant identification.
So let me address each of these problems. The first problem is how we detect the
locations. We implement an approach in which we have a background Gaussian
model in the middle of the room, which we adapt to each of the measurements.
This is done sequentially for each of the speakers. And when you have a consistent
measurement you assign it to a speaker, and this is done with thresholds.
And we use the ceiling cameras here, and we correct the positions from the ceiling
cameras by using the omnidirectional camera, as we show in the next slide. So let
me play some video which shows the algorithms. So here we detect the participants
at the beginning, when they are entering the room. Okay. And as you see, there is
some noise near the participants. So why do we need two sets of cameras to do
this tracking? For example, in this case we only use the ceiling cameras, and what
will happen is that even if you put a threshold, for example that two participants
cannot be too close, at some point you will have a false participant, for example.
Let me show the video here. One particular case in which they separate from each
other and you create a false participant. But if you know that there is only one face
in that direction you can compensate for that, and here you see, for example, that
you are still able to solve the problem that you had before.
So you see, multiple modalities give you more robust information. Okay. So we
know where the participants are. Now the next question is: what is the identity of
each of these participants, and who is speaking? And we address these two
problems together.
So from the microphone array we have the direction of the source, okay. And we
also know the location of the participants. So what we do here is to compute the
distance between the source and the participants. And then, based on that, we
estimate the probability for each of the participants -- the probability that speaker A
is speaking given the microphone array measurement.
At the same time, we have the information from the speaker ID, so we have the
probability that speaker J is speaking given the measurement from the speaker ID.
The only problem that we have here is that -- oh, first of all, let me say that these
two modalities are independent, because one of them gives you information about
the spatial location of the source and the other gives you information about the
spectral properties of the speech from that speaker. So these two are independent
and you can multiply them to fuse them. The only problem is that we may know, for
example, that this is the participant that is speaking, but we don't know the identity
of that person.
From the speaker ID, on the other hand, we may know that Matt is speaking. So
the thing that we don't know is the seating arrangement -- this part. But we can
estimate it by using correlation with some physical constraints, for example that
two participants cannot be sitting in the same location. So we update this matrix
all the time -- for every phrase we update these assignments.
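A rough sketch of the fusion step as described: the camera-tracked directions and the microphone-array direction give a probability per seat, the speaker-ID likelihoods give a probability per identity, the two are multiplied because they are independent, and a seat-to-identity assignment matrix is maintained under the one-person-per-seat constraint. The Hungarian-algorithm update and the concentration parameter kappa are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def active_speaker(track_angles, src_angle, id_loglik, seat_assign, kappa=8.0):
    """Fuse microphone-array direction and speaker-ID scores.

    track_angles : (S,) angles of the tracked participants (from the cameras)
    src_angle    : acoustic source angle from the microphone array
    id_loglik    : (S,) speaker-ID log-likelihoods, ordered by identity
    seat_assign  : (S, S) current soft assignment of identities to seats
    """
    # P(seat s is speaking | array): closer to the acoustic direction -> higher
    d = np.angle(np.exp(1j * (track_angles - src_angle)))   # wrapped angular error
    p_seat = np.exp(-kappa * d ** 2)
    p_seat /= p_seat.sum()
    # P(identity j is speaking | speaker ID)
    p_id = np.exp(id_loglik - id_loglik.max())
    p_id /= p_id.sum()
    # the two measurements are independent, so map identities to seats and multiply
    p_id_at_seat = seat_assign @ p_id
    fused = p_seat * p_id_at_seat
    return int(np.argmax(fused)), fused / fused.sum()

def update_assignment(cooccurrence):
    """One identity per seat: solve the assignment on accumulated co-occurrence counts."""
    seats, ids = linear_sum_assignment(-cooccurrence)
    hard = np.zeros_like(cooccurrence, dtype=float)
    hard[seats, ids] = 1.0
    return hard
```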
So the performance of the system in general is pretty high, 75 percent, depending
on the metric that I use to evaluate it. And we tested this with three meetings of
20 minutes with four participants in the room. You can have a different number --
you don't need to have four participants; we can detect the number of speakers
automatically. And this data was very challenging in the sense that you have
casual conversation with interruptions and overlaps.
This is running in real time. So let me play.
>>: Please ask the university to give us their name. So unless there is a specific
request --
>>: Listen, if you're walking down the street and you see somebody robbing a
jewelry store, don't you feel obligated to take out your cell phone and call 911? I
mean, if you...
>> Carlos Busso: So this thing is working in real time, but you can still answer
questions that don't need to be answered in real time; for example, if you want to
annotate different behaviors, you can do that. Anything for retrieval, for example,
you can do in post processing.
In fact, we are working on a new tracking system based on particle filters. But at
this point it's not working in real time.
So now that we have this information, we can start answering other interesting
questions. For example, we can start modeling the group dynamics during the
discussions. The goal here is to automatically track the dynamics -- the flow of
interaction -- and the way we do this is to estimate different statistics for the
participants: for example, the number of turns that each participant took during
the discussion, the duration of the turns, and also the transitions between the
participants, okay -- so who speaks after whom.
This gives information about the flow of interaction. And the way we evaluate this
is that we hand annotate the speaker segments, so we compare this with the
manual annotation.
So on your left you see the results from the manual annotation, and here on the
right you see what is estimated by our system, and you will see they are pretty
similar. We keep basically the same information.
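The turn statistics are simple counts over the active-speaker segments; a minimal sketch, assuming segments come in as (speaker, start, end) tuples:

```python
import numpy as np
from collections import Counter

def turn_statistics(segments):
    """Turn counts, durations, and who-speaks-after-whom transitions.

    segments : list of (speaker, start, end) tuples sorted by start time,
               e.g. the output of the active-speaker module above.
    """
    turns, durations, transitions = Counter(), Counter(), Counter()
    prev = None
    for spk, start, end in segments:
        turns[spk] += 1
        durations[spk] += end - start
        if prev is not None and prev != spk:
            transitions[(prev, spk)] += 1
        prev = spk
    return turns, durations, transitions

# Example: a dominant speaker A and an active listener C with many short turns
segs = [('A', 0, 8), ('C', 8, 9), ('A', 9, 16), ('B', 16, 20), ('C', 20, 21), ('A', 21, 30)]
turns, durs, trans = turn_statistics(segs)
print(turns)   # Counter({'A': 3, 'C': 2, 'B': 1})
print(durs)    # Counter({'A': 24, 'B': 4, 'C': 2})
print(trans)   # who followed whom, e.g. ('C', 'A') occurs twice
```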
So from this we can start modeling the users. For example, here we see that one
of the speakers spoke most of the time, so we can conclude that this speaker was
dominating the discussion.
For speaker three, you see here that the duration of the turns was small, which
means that this particular speaker was actively listening: he had many turns, but
just to agree with the main speaker, who was being dominant. Okay.
And by analyzing the transitions between speakers you can build this kind of
graph, so here you can see, for example, that the discussion was mainly between
speaker one and speaker three and also speaker two, okay. But we see that
speaker four was not involved in the discussion.
And you can estimate this over time, for example every minute, and you can see
here, for example, that speaker three in this case, even though he didn't have long
turns, was active during the whole discussion, while speaker four was not active at
all. However, this speaker four might be the boss -- could be the boss -- and we're
interested in retrieving information from him.
So we can use this information to identify the portions of the discussion in which
he was active. And if you have many recordings, if you have a database with
different discussions, you can start labeling each speaker. For example, you can
label this speaker as a quiet person, so this could be useful for user modeling.
>>: (Inaudible) just manually listening --
>> Carlos Busso: Yes. We manually annotated when each of the participants was
speaking. So some remarks from this section. First of all, multimodality improves
the robustness and the (inaudible) of the whole system, and this intelligent
environment provides a suitable platform to answer these kinds of questions.
So what we want to do now is go to the individual level -- what can we say at the
individual level? Okay. Human behavior is multimodal, okay; we convey intentions,
emotions and desires, so it is not only important what is being said, but also how it
is said. A very important part of my research was on emotion, so I will present
some of our research here on how we can detect emotions.
So the question is: why study emotions? It is well known that emotions affect the
way we make decisions. They also affect the way we relate to and express
ourselves with others, okay. And based on our emotions, people will react
differently to us. So emotion is an important part of human interaction.
In the context of this multimodal environment, we may want to detect hot spots in
the discussions, for example; also, if we are analyzing teamwork, having emotional
annotations would be very useful. Also, since people rely on emotions to make
decisions, if we want to understand what decisions people are making, we may
want to have this kind of annotation. And as I said before, for group interaction,
for example therapy, it would be nice to have annotation of emotional labels.
What is the state of the art in emotion recognition? This is a field which, from an
(inaudible) perspective, started about five to ten years ago.
And so far the standard approach is: you collect a database, you train your models
on that database, and you report those results. However, when you move those
models to real applications you find problems, for example that there is too much
variability. Some of the problems are speaker dependency. Also, from the very
beginning, people do not agree on which emotional descriptors you should use:
sadness, happiness, or do you use, for example, activation and valence? So the
bottom line is that emotional models do not generalize.
So what we propose here is a more scalable framework that could be implemented
in spaces like the Smartroom. So instead of discriminating between different
emotions, what we want to detect is whether speech is emotional or neutral, so
this is a binary problem.
Okay. And our approach, what we are proposing here, is that instead of training
emotional models, we train neutral models. Why is that? First of all, neutral models
have less variability than emotional models, and also you have more data to train
robust models. So what we are proposing here is that you train neutral models
using big corpora -- you have the Wall Street Journal, TV data, databases which
you can use to train those models. Let me give an example of how we implement
this. You have input speech, so you assess the model using this input speech, and
then, instead of using the features from the speech directly, you use the fitness
measure from the model to make the classification.
So this is a two-step approach in which, as a first step, you estimate the fitness,
and then you use the fitness for classification, to set this decision boundary. So the
nice part here is that this block is completely independent of the emotional
descriptors that you are using -- you are training neutral models. It is completely
independent of the speaker if you use a big corpus to train this part.
However, for this part you still need an emotional database to set this boundary.
I will address how we deal with that. Okay. Next I will show how we can use this
particular framework with one popular feature, which is the fundamental frequency
of the speech. So first of all we train on different statistics of the pitch using a
GMM, okay -- so this is our neutral model, trained with the Wall Street Journal, the
(inaudible) section of that corpus.
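A compact sketch of the two-step scheme: a neutral GMM trained on sentence-level pitch statistics from a large neutral corpus, and a second-stage classifier that sees only the fitness (log-likelihood) of a sentence under that model. The talk describes the second stage as essentially a threshold; the logistic regression below is just one way of learning that threshold and is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

# Step 1: neutral reference model, trained once on a large neutral corpus
# (e.g. read speech such as the Wall Street Journal corpus).
def fit_neutral_model(neutral_pitch_stats, n_components=4):
    # neutral_pitch_stats: (N, d) sentence-level pitch statistics (mean, median, range, ...)
    return GaussianMixture(n_components, covariance_type='full').fit(neutral_pitch_stats)

# Step 2: the classifier sees only the fitness of each sentence under the
# neutral model, not the raw pitch statistics themselves.
def fitness(neutral_gmm, pitch_stats):
    return neutral_gmm.score_samples(pitch_stats).reshape(-1, 1)   # log-likelihoods

def train_detector(neutral_gmm, pitch_stats, labels):
    # labels: 1 = emotional, 0 = neutral, from whatever emotional corpora are available
    return LogisticRegression().fit(fitness(neutral_gmm, pitch_stats), labels)

def is_emotional(neutral_gmm, detector, pitch_stats):
    return detector.predict(fitness(neutral_gmm, pitch_stats))
```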
Then we use a classifier for the classification in this stage. And since the second
block depends on the emotional database that you are using, our approach was to
use several databases together, so we have different emotions. As you can see,
there are three corpora here, different speakers, and even different languages.
For example, this database was recorded in German. So you have here different
sources of variability.
Okay. So the next question is which aspects of the pitch you want to use here to
train the neutral models. And the way we address this is by finding the most
emotionally prominent ones using a logistic regression framework. So this is the
general framework of logistic regression, and let me just highlight why we use this
framework.
The likelihood is modeled with this form, okay, and the nice thing is that if you take
the ratio between two models which are nested, for example, and you take minus
two times the log of that, it will be chi-squared distributed. So you can use that
statistic in order to compare different models. Okay. For example, here in this
particular problem, what we are interested in is whether a variable in our model is
useful or not.
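A small illustration of the nested-model test being described: fit the model with and without the candidate pitch statistic, take minus two times the log-likelihood ratio, and compare it against a chi-squared distribution with one degree of freedom. The scikit-learn logistic regression with a very large C (to approximate an unregularized maximum-likelihood fit) is an implementation assumption.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression

def feature_adds_information(X_base, x_new, y):
    """Nested-model likelihood-ratio test: does adding one pitch statistic help?"""
    def loglik(X):
        model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)   # ~unregularized MLE
        p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    ll_small = loglik(X_base)
    ll_big = loglik(np.column_stack([X_base, x_new]))
    lr = -2 * (ll_small - ll_big)          # -2 log likelihood ratio
    p_value = chi2.sf(lr, df=1)            # one extra parameter -> 1 degree of freedom
    return lr, p_value
```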
So how do we use this framework? Well, first of all, we estimate 39 different
statistics of the pitch, ranging from the mean, the maximum, the curvature, the
slope of the pitch for each speech segment -- so we have 39 different statistics.
And the first question that we asked was: when you add one at a time, what is the
improvement in the log likelihood, okay?
And we take the average for each of the cases. This is (inaudible); this is a binary
problem, so we have more than just neutral versus one particular class. And what
you see here is that the most prominent aspects using this approach are the mean,
the median -- global statistics of the pitch. However, one problem with this approach
is that, for example, these two are very correlated. There's a high level of correlation
between them, since you include one at a time, so this is not good for classification.
So the next thing that we did was to run forward feature selection, where we keep
adding features to the model until the improvement is not statistically significant.
And then we count the number of times that each of these features was selected.
And you can see here that the median was still the top feature; however, the mean
was never selected here.
>>: (Inaudible) experiments, or I'm not sure I'm understanding when you say
count the number of times --
>> Carlos Busso: Yes. There are three databases. And each database has many
different emotional classes. So we are running this experiment for each of these
classes, so there are many --
>>: For each database?
>> Carlos Busso: For each database. So in total there are about 20 cases.
>>: So (inaudible).
>> Carlos Busso: Yes. So the idea here is to find this in a more reliable way -- not
to do feature selection for one particular database, because that may fit that
particular database but in general may not fit all of them.
>>: So one of the questions: these features, like the median, are sort of overall
statistics of the speech. Are they related in any way to the speaker or the history
of that speaker, or are you looking at them sort of at a moment in time? Like, you
know, there are big differences between male and female (inaudible). Are you
somehow taking into account (inaudible)?
>> Carlos Busso: Well, I think I have a slide showing that -- well, but it's a hidden
slide. Yeah, we can talk about it later, but we normalize -- we have a normalization
first for each speaker.
>>: So you compute the mean over a period of time --
>> Carlos Busso: Yes. All these data are --
>>: (Inaudible) for the entire utterance?
>> Carlos Busso: For the whole utterance, yes. We have some experiments
analyzing the data in smaller units, for example voiced regions, okay. So you know
that in speech there are unvoiced segments in which the pitch is zero, basically,
so we model each voiced region -- we extract features for each voiced region --
but the performance there is also lower, by about one percent. I can show some
results.
>>: You assume that the emotional state for each utterance is one emotional state?
You assume that you have the same person angry, neutral and sad -- the same
person?
>> Carlos Busso: Well, the only assumption that we make is that we have neutral
speech from that person in order to do that normalization, nothing more than that.
We don't require the person to be angry. So that is the only thing that we ask --
and this is (inaudible) in many applications.
>>: So I wonder, from a machine learning perspective, why do you divide the
databases and then count the number of times --
>> Carlos Busso: Yes.
>>: Why don't you just --
>> Carlos Busso: Yeah. If we do that -- my feeling is that if you do that, you will
select the best features for that particular task.
>>: What I'm saying is you can (inaudible) a lot of data in, so all the (inaudible)
rather than divide (inaudible) and then count how many --
>> Carlos Busso: The thing is that different emotions are reflected differently in
the speech itself. So by doing this -- yeah, you can do that, it's also --
>>: (Inaudible).
>> Carlos Busso: Yeah, well, I think -- yeah. I think I did that: what I also did was
to put everything, all the emotional classes, into one category, and then compare
that to neutral. And we extracted the features from that. But I don't remember
whether the features are the same -- whether the best features, the order and the
ranking, are the same. I don't remember that. I can check that data.
But the goal of doing this individually for each database is to try to select the
features that are most important in each of these cases. That's why we count the
number of times.
>>: So it would be interesting to see, for instance, if this generalizes, because your
goal is to produce something that (inaudible).
>> Carlos Busso: You will --
>>: So you could hold out one data set, you know, and do this kind of thing where
you merge all the other ones and train, and then test on the new data set; or you
can do this kind of individual selection -- and see which one works better?
>> Carlos Busso: I will show that in two slides -- actually in this slide.
So based on these two experiments we select the best features from the pitch,
and then what we are comparing here is our model with a conventional approach.
So let me explain the difference between these two to make things clear. In the
conventional approach you take these features from the speech and you use these
features for classification. What we're proposing here, again, is to add this extra
block, which is trained as a neutral model, and instead of using the features
themselves, we use the fitness measure to make the classification.
So using the three databases -- these are the performances for the three
databases -- you see that our system is significantly better than the conventional
approach. And for me what is even more appealing is that when we add a new
database, this one in Spanish, which was not in the training at all -- it's completely
different from what we trained on -- our system is still able to achieve high
accuracy. Okay. And you see here that the performance for the conventional
approach is much lower.
And the reason why this happens is that emotional Spanish speech is still different
from neutral speech, so our system is still able to discriminate here. However,
since the conventional models -- these decision boundaries -- are trained with the
speech features themselves, they do not generalize.
So this is one example in which we used pitch features, but actually I also
implemented these ideas using spectral features, and the results are very
interesting. And you can use it also for duration and energy, which are also known
to convey emotional information.
>>: (Inaudible) by using the features (inaudible)?
>> Carlos Busso: Yes. In both cases.
>>: Seven features?
>> Carlos Busso: Seven features. And that's another thing: we are using only
seven features. If you read the previous work, what people usually do -- an
approach I don't like at all -- is they come up with thousands of features, and I
mean thousands of features, and then using feature selection they come down to
hundreds of features and they use those to train. And if you see the size of the
databases -- I don't know, 500 sentences -- you see there is a huge risk of
overfitting. And if you transfer those models to a real application, it will fail.
>>: One question here: it's not clear to me, is the power coming from the fact that
you're having this abstraction where you say, I'm going to take care of the neutral
first, identify that, and then factor it into the model, or is the power coming just
from the fact that the GMM itself, you know, is a more powerful machine than
those --
>> Carlos Busso: Well, we are using (inaudible) analysis in both cases.
>>: Right. But in that case you have a GMM in front of it.
>> Carlos Busso: We have a GMM -- yeah, what I think is that the improvement
comes from the new features that you are getting from the GMM. So look at it this
way. For example --
>>: If I took those features -- like if I took those same features, right -- you have
two things going on here. One is looking for the neutral, right?
>> Carlos Busso: Yes.
>>: And the second one is introducing the GMM into that, and I'm trying to tell
apart which one of these two gives you the gain, right? Because, for instance, I
could -- probably, I don't know, but I could probably think of figuring out an
approach where you could use a GMM, you know, in that kind of setting in a
conventional approach, without sort of taking care of the neutral first, just looking
at the whole classification problem. Does that make sense?
>>: I think what Carlos is doing with the GMM is to find the difference of the data
from the neutral model.
>> Carlos Busso: It's a fitness measure that we have here. And actually we
implemented it with GMMs also, for example for spectral features we put a GMM
in this block, okay, and then we have the likelihood and we use the likelihood as
the input of our classification, and it still works and gets good results.
>>: It's not looking for neutral models.
>> Carlos Busso: We are not doing classification here. This is not a classifier.
This is just a model from which we get the likelihood of the new input, and we use
that information for classification.
>>: So how do you carry out the experiments in the conventional approach, I
mean what exactly do you (inaudible) in the conventional approach here? What
features do you use?
>> Carlos Busso: The same features, same seven features.
>>: So what's the difference here?
>> Carlos Busso: The difference is that here -- okay, let me. So the features on
which you train your classifier are not the features from the speech but the fitness
measure from this block. For example, if you have an emotional model -- sorry, if
you have emotional speech, your model will say this is unlikely to be neutral.
Okay.
>>: The GMM itself -- what is its underlying space? It's the space of the features
that you have, the features are the (inaudible) input and output?
>> Carlos Busso: The input is these features.
>>: Those features right.
>> Carlos Busso: And the output is a likelihood, basically.
>>: How different this is from neutral.
>> Carlos Busso: Yes.
>>: And then the classification is based on just that likelihood, or on that likelihood
and those features?
>> Carlos Busso: No, just the likelihood.
>>: (Inaudible).
(Inaudible talking over).
>> Carlos Busso: It's a way of processing the features before the classification.
And let me explain why I think this is working. You have a neutral model, okay,
and if you have any input speech which differs in any aspect from this neutral
model, the likelihood of that particular speech will be low, okay. So you can use
that for classification.
Now, here in this case, for example, if you have neutral and happy in your training
data, you will set the line, the classification boundary, in order to maximize the
accuracy on that particular set for those features. However, if you get speech full
of surprise, for example, which will be different from neutral and from happy, you
will not be able to recognize it, because it will fall somewhere else in this space.
However, here you will be able to say, okay, this is still different from neutral, and
the likelihood will show that.
So the threshold set on the likelihood here is more robust than a threshold set on
the speech features themselves.
>>: (Inaudible) directly use the GMM.
>>: For specification?
>>: Yeah.
>> Carlos Busso: The thing is that you would then need emotional speech, an
emotional database, so this part would be completely dependent on the emotions.
And one of the points I made, the main one, is that you don't have enough
emotional data.
>>: Just for neutral?
>> Carlos Busso: Just neutral.
>>: A neutral space.
>> Carlos Busso: That is one of the nice things about this approach: you don't
have enough emotional databases to train robust models, but you do have, for
example, the Wall Street Journal, which is huge, so you can use that database to
create robust models.
>>: So (inaudible) number, right?
>> Carlos Busso: Yes.
>>: (Inaudible) classification?
>> Carlos Busso: Yes.
>>: But in addition to that you have (inaudible).
>> Carlos Busso: No, that's it.
>>: So use it here.
>>: Just threshold.
>> Carlos Busso: Yeah.
>>: But (inaudible).
>> Carlos Busso: Well, yeah.
>>: (Inaudible).
>>: (Inaudible).
>> Carlos Busso: Here you have (inaudible) because you are building one GMM
for each.
>>: This is one (inaudible) for each feature.
>> Carlos Busso: Yes.
>>: Different. Okay.
>>: Basically he does the classification using the GMM on the left side and on
the right side it's just the (inaudible). It's a different classification surface.
>>: Right. That's right.
>> Carlos Busso: So this is (inaudible) because, if you are able to extract clean
speech for this model, this approach could be easily implemented in this
environment.
>>: (Inaudible).
>> Carlos Busso: (Inaudible).
>>: (Inaudible).
>>: (Inaudible).
>> Carlos Busso: No, no.
>>: (Inaudible) one for each.
>> Carlos Busso: One for each of these. One for each of these.
>>: (Inaudible) uses one.
>>: (Inaudible).
>>: (Inaudible).
>>: Yeah. Okay.
>>: (Inaudible)
>> Carlos Busso: Sorry about that. This is my picture for neutral models in
general. For (inaudible) I use the same picture. Sorry about that. So --
>>: I'm sorry. Just a bit curious. So when you use the GMM, one dimension, but
are these similar features to what you use for speech recognition?
>> Carlos Busso: Speech recognition?
>>: Yes, I (inaudible).
>> Carlos Busso: I haven't worked on speech recognition.
>>: Sorry in the slide.
>> Carlos Busso: In the slide.
>>: Speaker verification?
>> Carlos Busso: Oh, speaker identification, yeah. No, for that we use MFCCs,
spectral properties, instead of the pitch, to recognize the identity of the
participants.
>>: Not even trying to use the same speaker verification as the feature --
>> Carlos Busso: Yes. We use -- as I said, we also implemented this with HMM
modeling, training with MFCCs and mel filterbank features. And our results from
that -- I can show some results later -- show that the mel filterbank features give
better performance than MFCCs. So, in order to estimate MFCCs, you first
estimate the mel filterbank energies -- sorry. You estimate the energy in each of
the filters on the mel scale, right. Then you apply the cosine transform and then
you basically get the MFCCs. But what our results indicate is that in that
transformation you lose some information about emotion. Remember, this is a
different problem: you want to recognize emotions, okay. So MFCCs seem not to
be the best feature for emotion recognition.
>>: Are there, like -- I'm just looking for a second at these seven features -- do
you have anything about duration?
>> Carlos Busso: Here, no. But duration is also important.
>>: But for emotion, right, like speed and --
>> Carlos Busso: We have some analysis on that, and we have shown that
duration is an important aspect. The goal here was to show, with one popular
feature, how this system works. Of course, if you add different modalities -- that's
part of my future work; what we can do in the future is to fuse different aspects
of the speech, MFCCs. Even the speech is multimodal: you have the pitch, you
have the spectral properties, you have duration, and you can use all these
features to improve the performance here.
Okay. So the Smartroom provides a perfect platform to answer some of these
questions. But if you want to go into more detail -- to analyze, for example, the
relationship between gestures and speech -- you have to do face tracking; you
need to track facial features, which is challenging. I will address that in the next
slides. So here, instead of waiting, or working, to have good detailed information
from the face, what we rely on for this analysis are motion capture databases.
As I said before, this data was collected using dyadic interactions. We have ten
actors in five sessions, and we collect markers from the face, the head and the
hands. So let me play the type of data that we get from this.
>>: You have a business here. What the hell is this?
>>: The business? The business doesn't inspire me.
>>: Oh, must you be inspired?
>>: Yes.
>> Carlos Busso: So this is the same video that you saw at the beginning; here
you have detailed facial information. And another reason why we are using this
type of data is that we still don't know how we encode this information in our
faces. So by having this detailed information we can learn, first of all, what are
the features, what are the areas that we have to look at if we want to implement
this in this environment.
The way we see it is that we have some emotions and desires that we want to
transmit; those are encoded in our communicative channels, for example our
facial expressions, speech, hand motions, posture, which carry this information,
and the listener will decode each of these aspects and will make inferences about
our state.
So with this type of data, there are many questions that we can address --
actually all of these are examples of our (inaudible). And again the goal here is to
propose new guidelines for the Smartroom.
In my particular research I focus on the first three modules, which I will briefly
explain.
Okay. So the first thing I show here in this slide is that we are using different
modalities to convey the same goal -- we are using different modalities to express
the same goal. So there must be a -- excuse me.
So there must be a relationship, a coordination, between gestures and speech,
for example. And if we learn to model that correctly, we can propose guidelines
for synthesis and recognition, okay. So let me explain one of the studies that we
did.
So here what we did is extract features from the speech -- MFCCs and prosodic
features -- and from the face we extract features from the markers. So for example
we get the head motion, okay, the eyebrows, the lips; all these features were
extracted from the markers. And then we consider each of these markers as a
feature, and we cluster them into three different regions, so all the results are
provided for these three different regions. So the proposed approach here is to
map the speech onto the facial features and then compute correlations.
The way we do this is by using an affine minimum mean square error estimator.
So basically we estimate the facial features from the speech and then we compare
them with the true facial features. In this case, what we use is the correlation. So
these are some of the results that we obtained; dark colors mean highly correlated.
And what we are seeing here, for both MFCCs and prosodic features, is that there
is a high correlation between speech and gestures. Okay. And even some gestures
that we may guess are not correlated with what we are saying still show a high
level of correlation here. And we will use this in the last part of my talk to show
that we can synthesize head motion just from speech, for example.
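A minimal sketch of the analysis just described: fit an affine least-squares map from frame-level speech features to the facial (marker) features and report the correlation between each true facial feature and its speech-based estimate. Feature names and dimensions are placeholders.

```python
import numpy as np

def affine_mmse_fit(speech_feats, face_feats):
    """Least-squares affine map from speech features to facial (marker) features."""
    X = np.column_stack([speech_feats, np.ones(len(speech_feats))])  # add bias term
    W, *_ = np.linalg.lstsq(X, face_feats, rcond=None)
    return W

def correlation_with_speech(speech_feats, face_feats):
    """Frame-wise correlation between the true markers and their speech-based estimate."""
    W = affine_mmse_fit(speech_feats, face_feats)
    X = np.column_stack([speech_feats, np.ones(len(speech_feats))])
    pred = X @ W
    r = [np.corrcoef(face_feats[:, k], pred[:, k])[0, 1] for k in range(face_feats.shape[1])]
    return np.array(r)   # one correlation per facial feature (head, eyebrow, lip regions)
```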
Okay. So another important point is that we are using the same modalities to
express more than one goal. We're using the face to express emotion and also to
communicate our intents. So one of the studies that we developed during my time
at USC was on the interplay between linguistic and affective goals, okay? And our
hypothesis is that some parts of the face will be constrained by the articulation
and will be less free to express other aspects, for example emotions, while other
areas of the face have more degrees of freedom to express emotion. So the
approach that we follow is, first of all, we have to quantify the facial activity. We
analyze the facial activity in different emotions. And then we compare neutral and
expressive facial expressions with similar content.
So in our data -- some portion of the data -- we asked one speaker to repeat the
same sentence in emotional and neutral speech. So here we can -- well, I will
explain in the next slide how we do it.
But basically we can remove the lexical content and study the emotional
modulation. For the activation -- the way we estimated the activation is basically
by the variance of each marker. So that's how we quantify the activity in the face.
So the conclusion, as you can see here, is that the jaw area is the one that is
most active. And this is of course given by the articulation.
Also you can see here, for example, the emotional dependency: for example, in
angry and happy the face is more active than in neutral. But if you analyze this in
more detail you will find, for example, that in the upper region, going from neutral
to emotional, for example angry and happy, you increase the activity in the face
by almost 100 percent. However, if you look, for example, at the lower part of the
face, you still increase the activity, but only by 30 percent.
Okay. So this gives some evidence that the forehead area has more degrees of
freedom to express emotion. So in order to address this problem in more detail,
again what we did was to compare emotional versus neutral realizations of the
same sentence. So basically we find the optimal alignment path between the two
sentences.
And we use this path in order to align the features. For example, this is one
feature, one of the markers, for angry, and this is for neutral. And after alignment
we can compare these two cases frame by frame.
So what we did here is to extract the correlation between these two cases. Okay.
I don't know if I'm being clear. So one of these is neutral and the other one is
emotional, so the comparison again will be between neutral and emotional.
>>: (Inaudible).
>> Carlos Busso: In the speech.
>>: In the speech.
>> Carlos Busso: In the speech. We use the speech for the time warping. Okay.
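A sketch of this alignment-and-comparison step, assuming a standard dynamic time warping on the speech features (the talk says the speech is used for the time warping) and then a frame-by-frame correlation of one marker trajectory across the neutral and emotional readings:

```python
import numpy as np

def dtw_path(a, b):
    """Plain dynamic time warping between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

def aligned_marker_correlation(speech_neu, speech_emo, marker_neu, marker_emo):
    """Warp on the speech features, then correlate one marker trajectory frame by frame."""
    idx = np.array(dtw_path(speech_neu, speech_emo))
    return np.corrcoef(marker_neu[idx[:, 0]], marker_emo[idx[:, 1]])[0, 1]
```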
So here, for example, are the correlations between neutral and sad, neutral and
happy, and neutral and angry. These are the classes that we analyze here.
And here what you observe in the lower region -- again, dark means highly
correlated and light colors mean no correlation -- is that there is a strong
correlation in the jaw area, okay. Basically what this means is that it is following
the speech, so the articulation plays a key role in both cases.
However, you see there is not much correlation in the upper part of the face,
which means that regardless of what we are saying, we can communicate other
non-verbal cues, for example emotion. So here you see, for example, that in the
upper face the correlation is much lower than in other regions.
So this has implications for emotion recognition: for example, if we want to
recognize emotions we may want to focus on the middle and upper part of the
face instead of the lips, which still may convey emotional information but will be
constrained by the articulatory goals, okay, so you may have confusion there.
So we did some analysis in which we consider only the middle and upper part of
the face. And here our goal was to analyze the strengths and limitations of the
different modalities -- speech and facial expressions -- and see what happens
when we fuse the information. Again, this is controlled data, so this cannot be
immediately applied to real applications. We will discuss this in the next slides.
So for the features, okay, we extract prosodic features from the speech, and from
the face what we did was to split the face into different regions, and then we use
PCA in order to reduce the dimensionality of each of these regions. And then we
use a support vector machine for each of the cases that we analyze. So these are
the results. This is just from the speech, this is from facial expressions, and this is
when you fuse all the modalities. This block should be here. Sorry about that.
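A schematic version of this classifier setup: PCA per facial region to reduce dimensionality, an SVM per region and one for the prosodic speech features, and a simple late fusion by averaging class posteriors. The fusion rule and the hyperparameters are assumptions; the talk only states that PCA, SVMs, and modality fusion were used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def region_classifier(n_components=10):
    # PCA reduces the dimensionality of the markers in one facial region,
    # then an SVM classifies the emotion from that region alone.
    return make_pipeline(StandardScaler(), PCA(n_components), SVC(kernel='rbf', probability=True))

def train_and_fuse(region_feats, speech_feats, labels):
    """Train one classifier per face region plus one on prosodic speech features,
    then fuse them by averaging the class posteriors (a simple late fusion)."""
    clfs = [region_classifier().fit(feats, labels)
            for feats in list(region_feats) + [speech_feats]]

    def predict(region_test, speech_test):
        probs = [c.predict_proba(f) for c, f in zip(clfs, list(region_test) + [speech_test])]
        return np.mean(probs, axis=0).argmax(axis=1)

    return predict
```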
And some of the insights that we can get from this experiment: first of all, with
only speech the performance is lower, but facial expression is about 85 percent
considering only the upper regions. Okay. However, when you fuse them you
increase the recognition, which means that emotion is communicated in a
multimodal way, so if you want to recognize emotion in a robust way, you have to
consider different modalities.
Some results: for example, in the speech domain there is confusion between
happiness and anger. Okay. But you don't see this confusion in the facial domain.
So when you fuse them you are able to avoid the confusion between these two
classes. The same thing happens here between neutral and happy, which are
confused in the facial domain but here they're not confused at all. So multimodality
can again give you more robust information.
And this is because some emotions are recognized better in one particular domain
than in another. But still, our main challenge is that if you want to implement this in
a system like the Smartroom, one of the problems is that there is no texture in
parts of the face. We have lips, eyebrows, eyes -- we can detect those features.
However, if you want to detect the muscles that are in this area, it will be harder.
Also, you have to compensate for head orientation and speaker variability. If you
focus on this speaker, you see the pose between these two cases is not actually
different; however, if you check the face in these two frames, this one is four times
bigger than this one in terms of pixels. Of course you can compensate for that, and
you can put in a better camera to have higher resolution images. But still, this is
something that you need to consider if you want to go to a real application, to
implement this in the Smartroom.
Okay. So just before I finish my talk, let me show how we can -- so far we used
the motion capture system to analyze the relationship between gestures and
speech. We also analyzed the interplay between facial and linguistic goals, and
we also showed how we can use that for emotion recognition.
So let me now explain very briefly how we can use this to create, to synthesize,
some aspects of human behavior. In particular, what we are interested in is head
motion synthesis -- synthesizing head motion based on prosodic features. And
again, if we look at our big picture here, we can use that kind of information to
provide feedback to the user in the Smartroom.
What we learned from the analysis is that head motion and speech are (inaudible),
so we can use that for synthesis. So, a very brief explanation of the system. What
we do here is extract features from the speech, we learn the relationship between
head motion and speech using a GMM framework, and we use this to generate
the most likely sequence of head motions.
After that, we interpolate in order to have a smooth sequence of transitions, of
head motions, and then we use this sequence to synthesize the different head
motions. And if you are interested in this work, I can give more detail on
how we model each of these blocks.
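A hedged sketch of this synthesis pipeline: a joint GMM over prosodic features and head pose learned from the motion-capture data, a per-frame head pose taken from the mixture component that best explains the prosody, and a smoothing pass standing in for the interpolation step. The conditional-mean regression and the moving-average smoother are assumptions about details the talk does not spell out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_model(prosody, head_pose, n_components=8):
    """Joint GMM over [prosodic features, head pose] learned from the motion-capture data."""
    return GaussianMixture(n_components, covariance_type='full').fit(
        np.column_stack([prosody, head_pose]))

def synthesize_head_motion(gmm, prosody, window=5):
    """Per frame, take the conditional mean of the head pose from the mixture component
    that best explains the prosodic frame, then smooth the resulting sequence."""
    d = prosody.shape[1]
    poses = []
    for x in prosody:
        best, best_ll = None, -np.inf
        for k in range(gmm.n_components):
            mu, cov, w = gmm.means_[k], gmm.covariances_[k], gmm.weights_[k]
            mu_x, mu_y = mu[:d], mu[d:]
            cxx, cyx = cov[:d, :d], cov[d:, :d]
            diff = x - mu_x
            ll = (np.log(w) - 0.5 * np.linalg.slogdet(cxx)[1]
                  - 0.5 * diff @ np.linalg.solve(cxx, diff))
            if ll > best_ll:
                best_ll = ll
                best = mu_y + cyx @ np.linalg.solve(cxx, diff)
        poses.append(best)
    poses = np.array(poses)
    kernel = np.ones(window) / window       # stand-in for the interpolation/smoothing step
    return np.column_stack([np.convolve(poses[:, j], kernel, mode='same')
                            for j in range(poses.shape[1])])
```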
So here are some examples of the (inaudible) -- some of our results.
Here, please pay attention to the synchronization between gestures and speech.
Okay.
>>: And say you just abandoned them?
>> Carlos Busso: Only based on prosodic features.
>>: We lost them at the last turnoff.
>> Carlos Busso: Okay.
>>: Eat your dessert. It tastes yummy. And so you just abandoned them.
>> Carlos Busso: So you see here we were successful in modeling the relationship
between the gestures and the speech. And this framework could be extended to
other parts of the face, for example the eyebrows. So the point here is that you
want to consider speech when you synthesize other aspects, and the reason is
that there is a coordination between these two channels. By modeling these two
modalities together you can create natural animations.
So, one more slide from here. We asked 17 subjects to assess how natural these
animations were perceived to be, and the only thing that we changed between
animations was the head motion. So what you notice here is that when you don't
add head motion at all, the perception of naturalness is very low compared to the
other cases. And our system, which is very challenging in most of the cases, was
perceived even better than the original head motion sequences, which was very
appealing for us.
>>: (Inaudible).
>> Carlos Busso: The original was the captured data from the participant.
>>: (Inaudible).
>> Carlos Busso: Yes, human data. Yeah, the original was the sequence of head
motions that was actually produced during the recording.
So to conclude my (inaudible): we have developed a Smartroom. We addressed
how we extract different modalities from the data acquisition, how we fuse
information, and how we use that information to answer different questions, for
example the group dynamics and also, at the individual level, different types of
behavior. Still, it is challenging to collect gestures from the participants, so in
parallel what we did was to use motion capture data to do a finer analysis. We
showed that there is a relationship between gesture and speech, and we showed
that there is an interplay between different communicative goals.
And finally, we can use this information for applications like emotion recognition
and synthesis. So we hope that what we are presenting here provides new
guidelines for the Smartroom. And if we come back to this big picture here, there
are still many challenges; for example, I'm sure there is a better way of processing
each of the modalities and of fusing the information. There are still many open
challenges there.
Again, if you are interested in speech recognition or emotion, for example, you will
need a clean signal from the participants. And if you are not willing to break the
assumption that you don't want, for example, a microphone for each of the
participants -- you want this room to be non-invasive in its sensors -- you will have
to deal with far-field speech, which is challenging; that will challenge these two
applications.
Furthermore, as I already said, it is still a challenge how we track information from
gestures and gaze, for example. And there is still a world of opportunities in how
we can use our knowledge from the motion capture system to design better
human-machine interfaces.
But the main conclusion from my whole dissertation is that multimodality provides
the big picture -- not only one view, it provides a bigger picture -- so multimodality
is the answer to many of these questions. Much of our work has been published in
conferences and journals, and this is a list of selected publications. Thank you
very much.
(Applause).