>> Zhengyou Zhang: Good morning, and it's my pleasure to introduce Carlos Busso. He's finishing his Ph.D. at USC, and he got his master's degree from the University of Chile, and your defense is next week, right? >> Carlos Busso: No, next Friday. This Friday. >> Zhengyou Zhang: So please. >> Carlos Busso: Thank you very much. So first of all, thank you very much for the invitation to be here at Microsoft to present part of the work I did during my Ph.D. What I will present today is multimodal analysis of human behavior using instrumented spaces. So let me start by showing some examples of the type of data we are working on. >>: Piece of rubber and hold it like that and it just smokes around. >>: Burnt hair. That's my favorite. >>: So you have no problem with other people lighting up whatever they want? >>: Actually I'm -- >>: And stinking up a space? >>: I think you shouldn't be allowed to smoke. >> Carlos Busso: So this video was recorded in our smart room, in which we use non-intrusive sensors to record the data. The second video. >>: You have a business here. What the hell is this? >>: The business? The business doesn't inspire me. >>: Oh, must you be inspired? >>: Yes. I like it for -- >> Carlos Busso: So the second video was recorded between two actors, okay. And a nice feature of that recording is that we put markers on the face, the head, and the hands, so we have more detailed information about the gestures. >>: They're in two different rooms, right? >> Carlos Busso: Same room. >>: That dress looks like it comes from Asia. >> Carlos Busso: And here, in even more detail, we have just one speaker reading sentences with different emotions. So yeah, here, let me -- they were talking to each other, so we put two cameras looking at each of them, but they were in the same room. So these three videos show different aspects of human communication that we will address during this talk.

What we envision is a system that is able to capture every single thing that happens during a discussion. Everything that a person could sense in the room, we want to capture. Starting from the environment, we want to know who is in the room, their locations, the identities of the participants. At the group level, we want to analyze the dynamics of the discussion. And at the individual level we also want to model different aspects, for example the emotions that are expressed by each of the participants. Some examples in which this framework could be very useful are analyzing teamwork and collaboration; also, for applications such as retrieval, this kind of annotation would be very useful; and also focus groups and observation of practice, for example therapy sessions, okay? >>: (Inaudible) what you use it for? >> Carlos Busso: You can provide annotations for post (inaudible), for example post-analysis. One example here is therapy sessions, for example couples therapy. The therapists spend a lot of time annotating who is speaking, analyzing whether the participants are pointing to each other, different gestures that are important for their purposes. So if we can somehow provide annotations for some of those aspects, that would be very beneficial for them.

So in my Ph.D. my interest was in studying human behavior at multiple levels. We start from the very beginning, with how we acquire the data. This includes source localization, speaker identification and so on, and building the system.
Then, how we use this information from the different modalities to start detecting who is speaking, for example, or the locations of the participants; this is multimodal fusion. The next step is how we make use of the information that we have to track, for example, the interaction and to model the users. Going into even more detail, we want to analyze the relationship between gestures and speech, and emotions. And finally, if we are able to understand the relationship between gesture and speech, for example, we will be able to synthesize believable agents that could be incorporated into the room, so we would not only have speakers but also agents that could be used, for example, to provide feedback to the participants. While the first three blocks here were studied using the smart room environment I will show in the next slide, these two blocks were addressed using a controlled environment with a motion-capture system. And our goal here is to bring smartness to the Smartroom. So this is the big picture that we have in our lab, which many of my colleagues are working on right now, and my research addresses some of these blocks. My presentation will follow the same structure that I just mentioned, so we will start with the acquisition, how we fuse the data, how we model each of the users, and then we will move to a finer analysis of gestures and speech. And then I will briefly mention how we can synthesize some aspects of expressive behavior.

So this is the room that we have. From the very beginning we decided to use non-invasive sensors in this room to infer as much as possible about the users. In this setting we have a table here, we have four cameras on the ceiling, and we have an omnidirectional camera behind the table which, by the use of mirrors, gives us these kinds of images. For the audio modality we have a 16-microphone array, which is also in the middle of the table. In addition to that, we have extra high-quality microphones there to get a clean signal from the participants. So let me explain very briefly how we handle each modality. Let me just clarify that this part was done in collaboration with other students in the computer vision laboratory.

From the ceiling cameras, what we're interested in is the location of the participants. How we do that: we learn a background model of the room, and then, by detecting moving regions, we create silhouettes like this. From these silhouettes we create 3D hulls, and then we take the local maxima of the hulls and model those as the heads of the participants. You can see here that, with this approach, most of the noise appears near the participants. For example, if you move a paper on the desk, that will be detected as motion and will create some noise here. From the omnidirectional camera what we are interested in is the faces of the participants, the angle of each of the participants. The way we do this is with the OpenCV library to track the faces, so the output of this modality is the angle of each of the participants. From the microphone array we are interested in the acoustic source, and the approach that we follow here is a time-difference-of-arrival approach in which we estimate the delay between each pair of microphones, as sketched below.
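To make the time-difference-of-arrival step concrete, here is a minimal sketch of estimating the delay for one microphone pair. The talk only states that the delay between each pair is estimated; the generalized cross-correlation with PHAT weighting used below is a common choice, not necessarily the one in the actual system, and the function name, sampling rate, and toy signals are illustrative.

```python
import numpy as np

def gcc_phat(sig_a, sig_b, fs, max_tau=None):
    """Estimate the delay (seconds) of sig_b relative to sig_a.

    Generalized cross-correlation with phase transform (PHAT) weighting,
    a standard way to estimate the time difference of arrival between a
    pair of microphones.
    """
    n = len(sig_a) + len(sig_b)              # zero-pad to avoid circular wrap
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = B * np.conj(A)
    cross /= np.abs(cross) + 1e-12           # PHAT: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)

# Toy usage: the second microphone hears the same signal 2 ms later.
fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)              # 1 s of noise as a stand-in signal
mic_a = clean
mic_b = np.roll(clean, int(0.002 * fs))      # simulate a 2 ms propagation delay
print("estimated delay (ms):", 1000 * gcc_phat(mic_a, mic_b, fs, max_tau=0.01))
```

Each microphone pair would yield one such delay, which can then be converted to a direction estimate; the talk takes the dominant direction over all the pairs.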
So with 16 microphones there are 240 pairs, and we estimate the delay for each of these pairs of microphones. Then we detect the source: for each of these pairs we get a direction estimate, and we take the mode, the angle; we say that the direction of the acoustic source is given by the maximum of the results over all the pairs. For speaker ID, here again we use this microphone, and we train Gaussian mixture models with spectral features, MFCCs, and we also implement a background model to detect silence periods. While most of the techniques that I have just mentioned are very standard, the effort we made to make this system work in real time was a big challenge. For example, just to make the microphone array work, we implemented 240 threads which were synchronized so the processing could run in parallel.

Okay. Then the next question is how we can fuse this information to start answering some of the questions that we have. In particular, what we want to know is: where are the subjects, who are the subjects, and who is speaking? It is well known that visual modalities have better spatial resolution, so we rely only on the cameras to track the locations of the participants, and then we use the acoustic modalities, as we show here, for active-participant detection and participant identification. So let me address each of these problems. The first problem is how we detect the locations. We implement an approach in which we have a background Gaussian model in the middle of the room, which we adapt to each of the measurements. This is done sequentially for each of the speakers, and when you have a consistent measurement you assign it to a speaker; this is done with thresholds. We use the ceiling cameras here, and we correct the positions from the ceiling cameras by using the omnidirectional camera, as I show in the next slide. So let me play a video which shows the algorithm. Here we detect the participants at the beginning, when they are entering the room. And as you can see, there is some noise near the participants. So why do we need two sets of cameras for this tracking? For example, in this case we only use the ceiling cameras, and even if you apply thresholds, for example that two participants cannot be too close, at some point you will get a false speaker. Let me show the video here: one particular case in which they separate from each other and you create a false participant. But if you know that there is only one face in that direction, you can compensate for that, and here you see that you are still able to solve the problem you had before. So you see that multiple modalities give you robust information.

Okay. So we know where the participants are. Now the next question is: what is the identity of each of these participants, and who is speaking? We address these two problems together. From the microphone array we have the direction of the source, and we also know the locations of the participants. So what we do here is to compute the distance between the source and each of the participants, and based on that we estimate the probability for each of the participants, the probability that speaker A is speaking given the microphone-array measurement.
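The talk does not give the exact formula for turning the distance between the estimated source direction and each tracked participant into a probability; the sketch below assumes a simple Gaussian kernel over the angular distance, with the tolerance `sigma` and the participant angles purely illustrative.

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return np.abs((a - b + 180.0) % 360.0 - 180.0)

def speaking_prob_from_direction(source_angle, participant_angles, sigma=15.0):
    """P(participant i is the active speaker | microphone-array direction).

    The array gives one direction estimate; each tracked participant has a
    known angle around the table.  Here the likelihood of each participant
    is a Gaussian kernel on the angular distance (sigma is an assumed
    tolerance in degrees), normalized so the probabilities sum to one.
    """
    d = angular_distance(source_angle, np.asarray(participant_angles, float))
    likelihood = np.exp(-0.5 * (d / sigma) ** 2)
    return likelihood / likelihood.sum()

# Toy usage: four participants around the table, source estimated at 100 deg.
angles = [45.0, 110.0, 200.0, 300.0]
print(speaking_prob_from_direction(100.0, angles))
```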
At the same time, we have the information from the speaker ID, so we have the probability that speaker J is speaking given the speaker-ID measurement. First of all, let me say that these two modalities are independent, because one of them gives you information about the spatial location of the source and the other gives you information about the spectral properties of the speech from that speaker. So these two are independent and you can multiply them to fuse them. The only problem is that we may know, for example, that this is the participant who is speaking, but we don't know the identity of that person; while from the speaker ID, for example, we may know that Matt is speaking. So the thing we do not know is the seating arrangement, this part. But we can estimate it by using correlation with some physical constraints, for example that two participants cannot be sitting in the same location. So we update this matrix all the time; for every phrase we update these assignments. The performance of the system in general is pretty high, 75 percent, depending on the metric that I use to evaluate it. And we tested this with three meetings of 20 minutes with four participants in the room. You don't need to have four participants; we can detect the number of speakers automatically. This data was very challenging in the sense that you have casual conversation with interruptions and overlap. This is running in real time. So let me play. >>: Please ask the university to give us their name. So unless there is a specific request -- >>: Listen, if you're walking down the street and you see somebody robbing a jewelry store, don't you feel obligated to take out your cell phone and call 911? I mean, if you... >> Carlos Busso: So this is working in real time, but you can still answer questions that don't need to be answered in real time; for example, if you want to annotate different behaviors, you can do that. For retrieval, for example, you can do it as post-processing. In fact, we are working on a new tracking system based on particle filters, but at this point it's not working in real time.

So now that we have this information, we can start answering other interesting questions. For example, we can start modeling the group dynamics during the discussion. The goal here is to automatically extract the dynamics, the flow of the interaction, and the way we do that is to estimate different statistics for the participants: for example, the number of turns that each participant took during the discussion, the duration of the turns, and also the transitions between the participants, that is, who speaks after whom. This gives information about the flow of the interaction (a minimal sketch of these statistics follows below). The way we evaluate this is that we hand-annotated when each speaker was talking, so we can compare against a manual annotation. On the left you see the results from the manual annotation, and here on the right you see what is estimated by our system, and you can see they are pretty much similar; we preserve the same information. So from this we can start modeling the users. For example, here we see that one of the speakers spoke most of the time, so, as studies have shown, we can conclude that this speaker was dominating the discussion.
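As a rough illustration of the turn statistics just described (number of turns, turn durations, and who speaks after whom), the sketch below computes them from a frame-level active-speaker sequence such as the one produced by the fusion step; the frame period and the labels are assumed for the example.

```python
import numpy as np
from collections import Counter, defaultdict

def turn_statistics(active_speaker, frame_period=0.1):
    """Turn counts, mean turn duration, and speaker-transition counts.

    `active_speaker` is a per-frame label sequence (None for silence),
    e.g. the output of the audio-visual active-speaker detector; the frame
    period (seconds) is an assumed value for illustration.  Silence-only
    gaps are skipped, so a turn resumed by the same speaker is merged.
    """
    turns = []                                  # list of [speaker, n_frames]
    for spk in active_speaker:
        if spk is None:
            continue
        if turns and turns[-1][0] == spk:
            turns[-1][1] += 1                   # same speaker keeps the turn
        else:
            turns.append([spk, 1])              # a new turn starts

    turn_count = Counter(spk for spk, _ in turns)
    durations = defaultdict(list)
    for spk, n_frames in turns:
        durations[spk].append(n_frames * frame_period)
    mean_duration = {spk: float(np.mean(d)) for spk, d in durations.items()}

    # Who speaks after whom: counts over consecutive, distinct turns.
    transitions = Counter(
        (turns[i][0], turns[i + 1][0]) for i in range(len(turns) - 1)
    )
    return turn_count, mean_duration, transitions

# Toy usage with three speakers and some silence frames.
seq = ["s1"] * 40 + [None] * 5 + ["s3"] * 8 + ["s1"] * 30 + ["s2"] * 12
print(turn_statistics(seq))
```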
For speaker three, you see here that the duration of the turns was small, which means that this particular speaker was actively listening: he had many turns, but just to agree with the main speaker, who was being dominant. Okay. And by annotating the transitions between speakers you can build this kind of graph, so here you can see, for example, that the discussion was mainly between speaker one and speaker three, and also speaker two. But we see that speaker four was not involved in the discussion. And you can estimate this over time, for example every minute, and you can see here that speaker three, even though he didn't have long turns, was active during the whole discussion, while speaker four was not really active at all. However, speaker four might be the boss, and we may be interested in retrieving information from him. So we can use this information to identify the portions of the discussion in which he was active. And if you have many recordings, a database with different discussions, you can start labeling each speaker. For example, you can label this speaker as a quiet person, and this could be useful for user modeling. >>: (Inaudible) just manually listening -- >> Carlos Busso: Yes. We manually listened and annotated when each of the participants was speaking. So, some remarks from this section: first of all, multimodality improves the robustness and the (inaudible) of the whole system, and this intelligent environment provides a suitable platform to answer these kinds of questions.

What we want to do now is go to the individual level, what we can say about each individual. Human behavior is multimodal: we convey intentions, emotions, and desires, so it is not only important what is said, but also how it is said. A very important part of my research was on emotion, so I will present some of our research here on how we can detect emotions. So the question is: why study emotions? It is well known that emotion affects the way we make decisions. It also affects the way we relate to and express ourselves with others, and based on our emotions, people react differently to us. So emotion is an important part of human interaction. In the context of this multimodal environment, we may want to detect hot spots in the discussion, for example; also, if we are analyzing teamwork, having emotional annotations would be very useful. Also, since people rely on emotions to make decisions, if we want to understand what decisions people are making, we may want to have this kind of annotation. And as I said before, for group interaction, for example therapy, it would be nice to have annotations of the emotional content. What is the state of the art? This is a new field which started, from a (inaudible) perspective, about five to ten years ago. So far the standard approach is that you collect a database, you train your models on that database, and you report those results. However, when you take those models to real applications, you find problems; for example, there is too much variability. Some of the problems are speaker dependency, and, from the very beginning, the emotional descriptors: people do not agree on which emotional descriptors you should use. Sadness and happiness, or do you use, for example, activation and valence?
So the bottom line is that emotional models do not generalize well. What we propose here is a more scalable framework that could be implemented in spaces like the Smartroom. Instead of discriminating between different emotions, what we want to detect is whether the speech is emotional or neutral, so this is a binary problem. Okay. And our approach is that, instead of training emotional models, we train neutral models. Why is that? First of all, neutral speech has less variability than emotional speech, and you also have more data to train robust models. So what we are proposing is that you train neutral models using big corpora; you have the Wall Street Journal, TV data, databases which you can use to train those models. Let me give an example of how we implement this. You have input speech, and you assess the model using this input speech; then, rather than using the features from the speech directly, you use the fitness measure from the model to make the classification. So this is a two-step approach in which, as a first step, you estimate the fitness, and then you use the fitness for classification, to set this decision line. The nice part here is that this block is completely independent of the emotional descriptors that you are using, since you are training neutral models. It is also essentially independent of the speaker if you use a big corpus to train this part. However, for this part you still need an emotional database to set this line; I will address how we deal with that.

Next I will show how we can use this framework with one popular feature, which is the fundamental frequency, the pitch. First of all, we model different statistics of the pitch using a GMM; this is our neutral model, trained with the Wall Street Journal, the (inaudible) section of that corpus. Then we use a classifier for the classification in this stage. And since the second block depends on the emotional database that you are using, our approach was to use several databases together, so we have different emotions. As you can see, there are three corpora here, with different speakers and even different languages; for example, this database was recorded in German. So you have different sources of variability here. Okay. So the next question is which aspects of the pitch you want to use here to train the neutral models. The way we address this is that we find the most emotionally prominent ones using a logistic regression framework. This is the general framework of logistic regression, and let me just highlight why we use it. The likelihood is modeled with this form, and the nice thing is that if you take the ratio between two models which are nested, and you take minus two times the log of that, it will be chi-squared distributed. So you can use that statistic to compare different models. For example, in this particular problem we are interested in whether a variable in our model is useful or not. So how do we use this framework? First of all, we estimate 39 different statistics of the pitch, ranging from the mean, the maximum, the curvature, the slope, over the sentence or over each speech segment; so we have 39 different statistics.
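Before moving to the feature analysis, here is a minimal sketch of the two-step idea just described: a GMM trained only on neutral pitch statistics, with the classification made from the model's fitness (log-likelihood) rather than from the raw features. The random stand-in data, the number of mixture components, and the simple threshold used in place of a trained classifier on the fitness measure are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# --- Step 1: train the neutral model on pitch statistics from a large
# neutral corpus (the talk uses Wall Street Journal data; here random
# numbers stand in for sentence-level pitch statistics such as the median).
rng = np.random.default_rng(0)
neutral_feats = rng.normal(loc=0.0, scale=1.0, size=(5000, 7))   # 7 features
neutral_model = GaussianMixture(n_components=8, random_state=0)
neutral_model.fit(neutral_feats)

# --- Step 2: the classifier never sees the raw features, only the fitness
# (log-likelihood) under the neutral model.  A small emotional+neutral set
# is used here just to set a decision threshold (a simplifying assumption;
# the talk trains a classifier on the fitness measure).
dev_neutral = rng.normal(0.0, 1.0, size=(200, 7))
dev_emotional = rng.normal(0.8, 1.6, size=(200, 7))               # shifted, noisier
ll_neutral = neutral_model.score_samples(dev_neutral)
ll_emotional = neutral_model.score_samples(dev_emotional)
threshold = 0.5 * (ll_neutral.mean() + ll_emotional.mean())

def is_emotional(pitch_stats):
    """Flag an utterance as emotional if it fits the neutral model poorly."""
    ll = neutral_model.score_samples(np.atleast_2d(pitch_stats))[0]
    return ll < threshold

print(is_emotional(rng.normal(0.8, 1.6, size=7)))   # likely True
print(is_emotional(rng.normal(0.0, 1.0, size=7)))   # likely False
```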
The first question we asked was: when you add one feature at a time, what is the improvement in the log likelihood? And we take the average over the cases. This is (inaudible), this is binary, so here it is neutral versus one particular class. What you see here is that the most prominent aspects using this approach are the mean and the median, gross statistics of the pitch. However, one problem with this approach is that, for example, these two are highly correlated; there is a high level of correlation between them, since you include one feature at a time, so this is not good for classification. So the next thing we did was to run forward feature selection: we keep adding features to the model until the improvement is not statistically significant, and then we count the number of times that each of these features was selected. You can see here that the median was still the top feature; however, the mean was never selected here. >>: (Inaudible) experiments, or I'm not sure I'm understanding when you say count the number of times -- >> Carlos Busso: Yes. There are three databases, and each database has many different emotional classes. So we are running this experiment for each of these classes, so there are many -- >>: For each database? >> Carlos Busso: For each database. So in total there are about 20 cases. >>: So (inaudible). >> Carlos Busso: Yes. The idea here is to find the features in a more reliable way, to not do feature selection for one particular database, because that may fit that particular database but in general may not fit all of them. >>: So one of the questions: these features, like the median, are sort of overall statistics of the speech. Are they related in any way to the speaker, or the history for that speaker, or are you looking at them at a moment in time? You know, there are big differences between male and female (inaudible). Are you somehow taking into account (inaudible)? >> Carlos Busso: Well, I think I have a slide showing that -- well, it's a hidden slide. We could talk about it later, but we normalize; we have a normalization first for each speaker. >>: So you compute the mean over a period of time -- >> Carlos Busso: Yes. All these statistics are -- >>: (Inaudible) for the entire utterance? >> Carlos Busso: For the whole utterance, yes. We also have some experiments analyzing the data in smaller units, for example voiced regions. You know that speech has segments in which the pitch is basically zero, so we model each voiced region and extract the statistics for each voiced region, but the performance is a bit lower than what I show here. I can show some results. >>: You assume that the emotional state for each utterance is one emotional state; do you assume that you have the same person angry, neutral, and sad? >> Carlos Busso: Well, the only assumption that we make is that we have neutral speech from that person in order to do that normalization, nothing more than that. We don't require the person to be angry. That is the only thing that we ask, and this is (inaudible) in many applications. >>: So I wonder, from a machine learning perspective, why do you divide the database and then count the number of times -- >> Carlos Busso: Yes. >>: Why don't you just -- >> Carlos Busso: Yeah. If we do that, my feeling is that you will select the best features for that particular task.
>>: What I'm saying is you can (inaudible) a lot of data in so all the (inaudible) rather than divide (inaudible) and then count how many -- >> Carlos Busso: The thing is that different emotions are reflected differently in the speech itself. So by doing this -- yeah, you could also do it that way -- >>: (Inaudible). >> Carlos Busso: Yeah, well, I think I did that as well: what I also did was to put all the emotional classes into one category and compare that to neutral, and we extracted the features from that. But I don't remember whether the best features are the same, whether the order and the ranking are the same; I can check that data. But the goal of doing this individually for each database is to try to select the features that are most important in each of these cases. That's why we count the number of times. >>: So it would be interesting to see, for instance, if this generalizes, because your goal is to produce something that (inaudible). >> Carlos Busso: You will -- >>: So you could hold out one data set and do this kind of thing where you merge all the other ones and train, and then test on the new data set; or you can do this individually, and see which one works better. >> Carlos Busso: I will show that part in two slides. Actually, in this slide. So based on these two experiments we select the best features from the pitch, and then what we are comparing here is our model against a conventional approach. Let me explain the difference between the two to make things clear. In the conventional approach, you take these features from the speech and you use them directly for classification. What we are proposing, again, is to add this extra block, which is trained as a neutral model, and instead of using the features themselves, we use the fitness measure to make the classification. Using the three databases, these are the performances, and you see that our system is significantly better than the conventional approach. And for me, what is even more appealing is that when we add a new database, this one in Spanish, which was not in the training at all and is completely different from what we trained on, our system is still able to achieve high accuracy, while you see here that the performance of the conventional approach is much lower. The reason this happens is that the emotional Spanish speech is still different from neutral speech, so our system is still able to discriminate it; however, since the conventional approach trained its decision lines with the speech features themselves, it does not generalize. So this is one example using pitch features, but actually I also implemented these ideas using spectral features, and the results are very interesting. You can also use it for duration and energy, which are also known to convey emotional information. >>: (Inaudible) by using the features (inaudible)? >> Carlos Busso: Yes, in both cases. >>: Seven features? >> Carlos Busso: Seven features. And that's another point: we are using only seven features.
If you read the previous work, what people usually do, an approach I don't like at all, is they come up with thousands of features, and I mean thousands of features; then, using feature selection, they come down to hundreds of features and they use those to train. And if you look at the size of the databases, maybe 500 sentences, you see there is a huge risk of overfitting. If you transfer those models to real applications, they will fail. >>: One question here, it's not clear to me: is the power coming from the fact that you have this abstraction, where you say I'm going to take care of the neutral first, identify that, and then factor it into the model, or is the power coming just from the fact that the GMM itself is a more powerful machine than those -- >> Carlos Busso: Well, we are using (inaudible) analysis in both cases. >>: Right. But in that case you have a GMM in front of it. >> Carlos Busso: We have a GMM -- yeah, what I think is that the improvement comes from the new features that you are getting from the GMM. Look at it this way. For example -- >>: If I took those same features, right -- you have two things going on here. One is the looking for neutral, right? >> Carlos Busso: Yes. >>: And the second one is introducing the GMM into that, and I'm trying to tease apart which one of these two gives you the gain, because for instance I could probably think of an approach where you use a GMM in that kind of setting, in a conventional approach, without taking care of the neutral first, just looking at the whole classification problem. Does that make sense? >>: I think what Carlos is doing with the GMM is to find the difference of the data from the neutral model. >> Carlos Busso: It's a fitness measure that we have here. And actually we implemented it with a GMM as well; for example, for spectral features we put a GMM in this block, then we have the likelihood, and we use the likelihood as the input to our classification, and it still works and gets good results. >>: It's not looking for neutral models. >> Carlos Busso: We are not doing classification there. This is not a classifier. This is just a model from which we get the likelihood of the new input, and we use that information for classification. >>: So how do you do the conventional approach in these experiments, I mean what exactly do you use in the conventional approach here? What features do you use? >> Carlos Busso: The same features, the same seven features. >>: So what's the difference here? >> Carlos Busso: The difference is that here -- okay, let me. The features on which you train your classifier are not the features from the speech but the fitness measure from this block. For example, if you have emotional speech, your model will say this is unlikely to be neutral. Okay. >>: The GMM itself, its underlying space is the space of the features that you have, the features (inaudible) input and output? >> Carlos Busso: The input is these features. >>: Those features, right. >> Carlos Busso: And the output is basically a likelihood. >>: How different this is from neutral. >> Carlos Busso: Yes. >>: And then the classification is based on just that likelihood, or that likelihood and those features? >> Carlos Busso: No, just the likelihood. >>: (Inaudible). (Inaudible talking over).
>> Carlos Busso: It's a way of processing the features before the classification. And let me explain why I think this works. You have a neutral model, and if you have any input speech which differs in any aspect from this neutral model, the likelihood of that particular speech will be low, okay. So you can use that for classification. Now, in this case, if you have, for example, neutral and happy in your training data, you will set the line, the classification, in order to maximize the accuracy on that particular set for those features. However, if you have speech full of surprise, for example, which will differ from both neutral and happy, you will not be able to recognize it, because it will fall somewhere else in this space. However, here you will still be able to say, okay, this is different from neutral, and the likelihood will reflect that. So setting the threshold here is more robust than setting the threshold on the speech features themselves. >>: (Inaudible) directly use the GMM. >>: For classification? >>: Yeah. >> Carlos Busso: The thing is that you would then need emotional speech, an emotional database, so this part would be completely dependent on the emotions. And one of my points, the main one, is that you don't have enough emotional data. >>: Just for neutral? >> Carlos Busso: Just neutral. >>: A neutral space. >> Carlos Busso: That's one of the nice things about this approach: you don't have enough emotional data to train robust models, but you do have, for example, the Wall Street Journal, which is huge, so you can use that database to create robust models. >>: So (inaudible) number, right? >> Carlos Busso: Yes. >>: (Inaudible) classification? >> Carlos Busso: Yes. >>: But in addition to that you have (inaudible). >> Carlos Busso: No, that's it. >>: So use it here. >>: Just a threshold. >> Carlos Busso: Yeah. >>: But (inaudible). >> Carlos Busso: Well, yeah. >>: (Inaudible). >>: (Inaudible). >> Carlos Busso: Here you have (inaudible) because you are building one GMM for each. >>: This is one (inaudible) for each feature. >> Carlos Busso: Yes. >>: Different. Okay. >>: Basically he does the classification using the GMM on the left side, and on the right side it's just the (inaudible). It's a different classification surface. >>: Right. That's right. >> Carlos Busso: So this (inaudible), because if you are able to extract clean speech, this approach could be easily implemented in this environment. >>: (Inaudible). >> Carlos Busso: (Inaudible). >>: (Inaudible). >>: (Inaudible). >> Carlos Busso: No, no. >>: (Inaudible) one for each. >> Carlos Busso: One for each of these. One for each of these. >>: (Inaudible) uses one. >>: (Inaudible). >>: (Inaudible). >>: Yeah. Okay. >>: (Inaudible) >> Carlos Busso: Sorry about that. This is my picture for the neutral model in general; for (inaudible) I use the same picture. Sorry about that. So -- >>: I'm sorry, just a bit curious. So you use the GMM in one dimension, but are these similar to the features you use for the speech recognition? >> Carlos Busso: Speech recognition? >>: Yes, I (inaudible). >> Carlos Busso: I haven't worked on speech recognition. >>: Sorry, in the slide. >> Carlos Busso: In the slide. >>: Speaker verification? >> Carlos Busso: Oh, speaker identification, yeah.
No, for that we use MFCCs, spectral properties, instead of the pitch, to recognize the identity of the participants. >>: Not even trying to use the same speaker verification as the feature -- >> Carlos Busso: Yes. As I said, we also implemented this with HMMs, trained with MFCCs and mel filterbank energies, and I can show some results from that later. The results show that the mel filterbank energies give better performance than MFCCs. To estimate MFCCs, you first estimate the mel filterbank energies -- mel (inaudible), sorry. You estimate the energy in each of the filters on the mel scale, then you apply the cosine transform, and then you basically get the MFCCs. What our results indicate is that in that transformation you lose some information about emotion. Remember, this is a different problem: you want to recognize emotions, okay. So MFCCs seem not to be the best features for emotion recognition. >>: I'm just looking for a second at these seven features -- do you have anything about duration? >> Carlos Busso: Here, no. But duration is also important. >>: But for emotion, right, like speed and -- >> Carlos Busso: We have some analysis on that, and we have shown that duration is an important aspect. The goal here was to show, with one popular feature, how this system works. Of course, if you add different modalities -- and that is part of my future work -- you can fuse different aspects of the speech, MFCCs and so on. Even speech by itself is multimodal: you have the pitch, the spectral properties, the duration, and you can use all these features to improve the performance here.

Okay. So the Smartroom provides a good platform to answer some of these questions. But if you want to go into more detail, for example to analyze the relationship between gestures and speech, you need to track facial features, which is challenging; I will address that later. So here, instead of waiting until we have good, detailed information from the face, for this analysis we rely on motion-capture databases. As I said before, this data was collected using dyadic interactions. We have ten actors, five sessions, and we collected markers from the face, the head, and the hands. So let me play the type of data that we get. >>: You have a business here. What the hell is this? >>: The business? The business doesn't inspire me. >>: Oh, must you be inspired? >>: Yes. >> Carlos Busso: So this is the same video that you saw at the beginning, but here you have detailed facial information. Another reason why we are using this type of data is that we still don't know how we encode this information in our faces. By having this detailed information we can learn, first of all, what the features are, what the areas are that we have to look at if we want to implement this in the Smartroom environment. The way we see it is that we have emotions and desires that we want to transmit; those are encoded in our communication channels, for example our facial expressions, speech, hand motions, and posture, which carry this information, and the listener decodes each of these aspects and makes inferences about our state. So with this type of data there are many questions we can ask -- actually, all of these are examples from our (inaudible). And again, the goal here is to propose new guidelines for the Smartroom. In my particular research I focused on the first three of these.
I will briefly explain them. Okay. The first thing I show here in this slide is that we are using different modalities to express the same goal. So there must be a relationship, a coordination, between gestures and speech, for example. And if we learn to model that correctly, we can propose guidelines for synthesis and recognition, okay. So let me explain one of the studies that we did. Here what we did is extract features from the speech, MFCCs and prosodic features, and from the face we extract features from the markers. For example, we get the head motion, the eyebrows, the lips; all these features were extracted from the markers. We consider each of these markers as a feature, and then we cluster them, so all the results are provided for these three different facial regions. The approach here is to map the speech onto the facial features and then compute correlations. The way we do this is by using an affine minimum mean-square-error estimator: basically, we approximate the facial features from the speech and then we compare the estimate with the actual facial features, in this case using the correlation. These are some of the results that we obtained; dark colors mean highly correlated. What we see here, for both MFCCs and prosodic features, is that there is a high correlation between speech and gesture. And even some gestures that we might guess are not correlated with what we are saying still show high levels of correlation here. We will use this in the last part of my talk to show that we can synthesize head motion just from speech, for example.

Okay. Another important point is that we are using the same modalities to express more than one goal: we're using the face to express emotion and also to communicate our intents. So one of the studies that we developed at USC was on the interplay between linguistic and affective goals, okay? Our hypothesis is that some parts of the face will be constrained by the articulation and will be less free to express other aspects, for example emotions, while in other areas of the face we have more degrees of freedom to express emotion. The approach that we followed: first of all, we quantify the facial activity and analyze it for different emotions. Then we compare neutral and expressive facial behavior with similar content. In a portion of our data we asked one speaker to repeat the same sentences with emotional and neutral speech. I will explain in the next slide how we do it, but basically this lets us remove the lexical content and study the emotional modulation. For the activation, the way we estimate it is basically by the variance of each marker; that is how we quantify the activity in the face (a small sketch of this computation follows below). The conclusion, as you can see here, is that the jaw area is the most active, and this is of course because of the articulation. You can also see that there is an emotional dependency; for example, in angry and happy the face is more active than in neutral.
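A small sketch of the activity measure just described: facial activity quantified as the variance of each marker trajectory, averaged within a region, and compared between an emotional and a neutral recording. The marker count, the region grouping, and the random data standing in for the motion-capture streams are illustrative assumptions.

```python
import numpy as np

def region_activity(markers, regions):
    """Mean per-marker variance within each facial region.

    `markers` has shape (frames, n_markers, 3); the activity of a marker is
    the variance of its trajectory summed over x, y, z, matching the idea of
    quantifying facial activity by marker variance.  `regions` maps a region
    name to the marker indices it contains (an illustrative grouping).
    """
    per_marker = markers.var(axis=0).sum(axis=1)          # (n_markers,)
    return {name: float(per_marker[idx].mean()) for name, idx in regions.items()}

# Toy data: 300 frames, 30 markers; the "emotional" take is made more active.
rng = np.random.default_rng(1)
neutral = rng.normal(scale=1.0, size=(300, 30, 3))
emotional = rng.normal(scale=1.4, size=(300, 30, 3))

regions = {"upper": np.arange(0, 10),      # forehead / eyebrows
           "middle": np.arange(10, 20),    # cheeks
           "lower": np.arange(20, 30)}     # jaw / lips

act_n = region_activity(neutral, regions)
act_e = region_activity(emotional, regions)
for name in regions:
    change = 100.0 * (act_e[name] / act_n[name] - 1.0)
    print(f"{name}: {change:+.0f}% activity relative to neutral")
```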
But if you analyze this in more detail, you will find, for example, that in the upper region, going from neutral to emotional, for example angry and happy, the activity increases by almost 100 percent. However, in the lower part of the face, the activity still increases, but only by about 30 percent. So this gives some evidence that the forehead area has more degrees of freedom to express emotion. To address this problem in more detail, again what we did was to compare emotional versus neutral realizations of the same sentence. Basically, we find the optimal path between the two sentences and use this path to align the features. For example, this is one of the markers for angry and this is for neutral; after alignment we can compare these two cases frame by frame. What we did here is to extract the correlation between these two cases. I don't know if I'm being clear: one of these is neutral and the other one is emotional, so the comparison again is between neutral and emotional. >>: (Inaudible). >> Carlos Busso: In the speech. >>: In the speech. >> Carlos Busso: In the speech. We use the speech for the time alignment. Okay. So here, for example, are the correlations between neutral and sad, neutral and happy, and neutral and angry; these are the classes that we analyze here. Again, dark colors mean highly correlated and light colors mean no correlation. You see here that there is a strong correlation in the jaw area. Basically, this means that the lower face is following the speech, so the articulation plays a key role in both cases. However, you see that there is not much correlation in the upper part of the face, which means that regardless of what we are saying, we can communicate other nonverbal cues there, for example emotion. So you see that in the upper face the correlation is much lower than in the other regions. This has implications for emotion recognition: if we want to recognize emotions, we may want to focus on the middle and upper part of the face instead of the lips, which still may convey emotional information but will be constrained by the articulatory goals, so you may have confusion there.

So we did some analysis in which we consider only the middle and upper part of the face. Here our goal was to analyze the strengths and limitations of the different modalities, speech and facial expressions, and see what happens when we fuse the information. Again, this is controlled data, so this cannot be immediately applied to real applications; we will discuss that in the next slides. For the features, we extract prosodic features from the speech, and for the face what we did was to split the face into different regions, and then use PCA to reduce the dimensionality of each of these regions. Then we use a support vector machine for each of the cases that we analyze (a sketch of this pipeline follows below). So these are the results: this is from speech only, this is from facial expressions, and this is when you fuse all the modalities. This block should be over here, sorry about that.
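Here is a minimal sketch of the classification pipeline just described: PCA applied per facial region to reduce dimensionality, the reduced facial features concatenated with prosodic features, and a support vector machine trained on the result. The dimensions, the random stand-in data (so the printed accuracy is meaningless), and the added feature scaling are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-ins: per-utterance feature vectors for two facial regions plus
# prosodic features, and four emotion labels.  Dimensions are illustrative.
rng = np.random.default_rng(2)
n = 400
upper_face = rng.normal(size=(n, 60))      # e.g. flattened upper-face markers
middle_face = rng.normal(size=(n, 60))
prosody = rng.normal(size=(n, 7))
labels = rng.integers(0, 4, size=n)        # neutral / angry / happy / sad

# Reduce the dimensionality of each facial region separately with PCA,
# then concatenate with the speech features (a feature-level fusion sketch).
pca_upper = PCA(n_components=10).fit(upper_face)
pca_middle = PCA(n_components=10).fit(middle_face)
fused = np.hstack([pca_upper.transform(upper_face),
                   pca_middle.transform(middle_face),
                   prosody])

# One SVM per condition; here just the fused case, with scaling first.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(fused[:300], labels[:300])
print("toy accuracy:", clf.score(fused[300:], labels[300:]))
```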
Some of the insights that we can get from this experiment: first of all, for example, with only speech the performance is lower, while with facial expressions it is about 85 percent, considering only the upper regions. Okay. However, when you fuse them you increase the recognition rate, which means that emotion is communicated in a multimodal way; so if you want to recognize emotion robustly, you have to consider different modalities. Some results: for example, in the speech domain there is confusion between happiness and anger, but you don't see that confusion in the facial domain, so when you fuse them you are able to avoid the confusion between these two classes. The same thing happens here between neutral and happy, which are confused in the facial domain but here are not confused at all. So multimodality can again give you more robust information, and this is because some emotions are recognized better in one particular domain than in another. But still, our main challenge is that if you want to implement this in a system like the Smartroom, one of the problems is that there is little texture in the face: we have lips, eyebrows, eyes, and we can detect those features, but if you want to detect the muscles that are in this area, it will be harder. You also have to compensate for head orientation and speaker variability. If you focus on this speaker, you see that the pose between these two cases is not actually that different; however, if you check the face in these two frames, this one is four times bigger than this one in terms of pixels. Of course you can compensate for that, and you can put in a better camera to get higher-resolution images, but this is still something you need to consider if you want to go to real applications and implement this in the Smartroom.

Okay. So just before I finish my talk: so far we have used the motion-capture data to analyze the relationship between gesture and speech, we also analyzed the interplay between affective and linguistic goals, and we also showed how we can use that for emotion recognition. So let me now explain very briefly how we can use this to synthesize some aspects of human behavior. In particular, what we are interested in is head motion synthesis, here head motion based on prosodic features. And again, if we look at our big picture, we can use that kind of capability to provide feedback to the users in the Smartroom. What we learned from the analysis is that gestures and speech are (inaudible), so we can use that for synthesis. A very brief explanation of the system: we extract features from the speech, we learn the relationship between head motion and speech using a GMM framework, and we use this to generate the most likely sequence of head motions. After that, we interpolate in order to have a smooth sequence of head motions, and then we use this sequence to synthesize the animation. If you are interested in this work, I can give more detail on how we model each of these blocks. So here are some examples of our results. Please pay attention to the synchronization between gestures and speech. Okay. >>: And say you just abandoned them? >> Carlos Busso: Only based on prosodic features. >>: We lost them at the last turnoff. >> Carlos Busso: Okay. >>: Eat your dessert. It tastes yummy.
And so you just abandoned them. >> Carlos Busso: So you see here that we were successful in modeling the relationship between the gestures and the speech. And this framework could be extended to other parts of the face, for example the eyebrows. The point here is that you want to consider speech when you synthesize other aspects of the face, and the reason is that there is a coordination between these two. By modeling these two modalities together, you can create more natural avatars. One more slide here: we asked 17 subjects to assess how natural these animations were perceived to be, and the only thing that we changed between animations was the head motion. What you notice here is that when you don't add head motion at all, the perceived naturalness is very low compared to the other cases. And our system, in most of the cases, was perceived even better than the original head motion sequences, which was very appealing for us. >>: (Inaudible). >> Carlos Busso: The original was the captured data from the participant. >>: (Inaudible). >> Carlos Busso: Yes, human data. Yeah, the original was the sequence of head motions that was produced during the recording.

So to conclude: in my (inaudible) we have developed a Smartroom; we addressed everything from how we extract the different modalities during data acquisition, to how we fuse the information and how we use that information to answer different questions, for example about the group dynamics and also, at the individual level, different types of behavior. There are still challenges in how we collect gesture information from the participants, so in parallel what we did was to collect motion-capture data to make a finer analysis. We showed that there is a relationship between gesture and speech, we showed that there is interplay between different communicative goals, and finally that we can use this information for applications like emotion recognition and synthesis. So we hope that what we presented here provides new guidelines for the Smartroom. And if we come back to this big picture, there are still many challenges; for example, I'm sure there are better ways of processing each of the modalities and of fusing the information. There are still many open challenges there. Again, if you are interested in speech recognition or emotion recognition, for example, you will need a clean signal from the participants. And if you are not willing to break the assumption that you don't want a microphone for each participant, because you want this room to be non-invasive with its sensors, you will have to deal with far-field speech, which is challenging for these two applications. Furthermore, as I already said, there are still challenges in how we track information from gestures and gaze, for example. And there is still a world of opportunities in how we can use our knowledge from motion-capture systems to design better virtual humans. But the main conclusion from my whole dissertation is that multimodality provides a bigger picture than any single modality, so multimodality is the answer to many of these questions. Much of this work has been published in conferences and journals; this is a list of selected publications. Thank you very much. (Applause).