>> Zhengyou Zhang: It's my pleasure to introduce Radu Horaud. He's a research director at INRIA, leading the Perception Group at INRIA Grenoble, and he has done a lot of work. Today he has two interesting pieces of work. He was also a program chair in 2001 and has held a lot of other editorial duties.

>> Radu Horaud: Thank you, thank you, and thanks everybody for attending my talk. Today there will be two talks, two short talks. When I submitted a few suggestions to Zhengyou, he told me, you should present two instead of one, there will be more people. So I don't know if you came for the first talk or the second talk. Anyway, the first talk will address the problem of audiovisual fusion, in particular the detection and localization of what we may call 3-D AV objects, using unsupervised clustering. This work was mainly performed by Vasille, a Ph.D. student, with Lorance (phonetic), who is here, and also other contributors.

Let's see what we mean by audiovisual clustering. What we'd like to address here is the problem of audiovisual perception in general. By that I mean something not necessarily reduced to, for instance, detecting speakers or detecting speech and combining this with visual features, which is what is usually done. We'd like to put this in a more general framework, call it computational audiovisual analysis, and we are basically interested in objects that can be both seen and heard. Many objects around us can be both seen and heard. I'm from a computer vision background, so when I started to do this I thought, well, how could we put visual data together with auditory data? It turns out that, to start with, this is an intrinsically difficult problem, because the two sensorial stimuli come in very different formats. The main feature of vision is that it provides dense data, and the light sources themselves are not relevant for the task, because what you actually look at are objects that reflect the light. So it's reflections that are relevant in vision. Audition is completely the other way around: reflections of sound are really a problem, and what you actually want to do is detect the acoustic sources. And auditory data is by nature sparse information, as opposed to vision. So our approach, which is described in two recent papers, is to use binocular vision and binaural hearing and to combine the two within a framework of 3-D fusion. The approach is based on finding the 3-D location in space of AV objects. What we propose is a generative probabilistic model which at the end boils down to an EM algorithm that maximizes the expected complete-data likelihood. On one side there is -- do you see this? On one side there is the observed data, which is the audio and the visual observations, and on the other side is what is called the missing or hidden data, which are the categories, the audiovisual objects that we would like to detect. So it's a completely unsupervised approach; we are not given the audiovisual objects in advance. Just to show you the kind of data that we process: about a year ago we put together a database. You can access it -- I don't have Internet access here -- but this link goes to what we call the CAVA dataset.
Just to show you one sample of this dataset: it consists of synchronized stereo pairs, and the audio recording is done with a binaural pair of microphones. There are two people who are visible and two other people, on both ends of the table, who are hidden. They produce only auditory data; they don't produce visual data. Sometimes two people speak at a time, sometimes only one person speaks, et cetera. So this is one example of the things we'd like to process. I'll show you in a while how we actually extract AV objects from this kind of thing.

I'd like to go very quickly through what we mean by binocular and binaural observations. This is pretty straightforward. We process each image pair so that we can extract interest points, and then we match them. This is classical epipolar geometry, some rectification, and then we have a number of 2-D observations. Through a very simple-minded matching technique we obtain 3-D observations in UVD space, where U and V are the location of a pixel, say in the left image, and D is the disparity. What I show here are the XYZ positions, or equivalently the UVD positions. You can see roughly the three speakers; this is the background, and these are some outliers produced by the stereo matching process. I will call the visual observations F: F1 through FM, et cetera. Similarly we have binaural observations. What we use here is the interaural time difference (ITD). This is just to show you how this is processed: this is the left microphone, this is the right microphone, from these we get the correlogram, and then we detect ITDs. These are 1-D observations, and I will refer to the auditory observations as G: G1, et cetera.

Our generative model is based on the fact that both the audio and the visual observations are mappings from 3-D space: to a 1-D space in the auditory case and to a 3-D (UVD) space in the visual one. So this function G -- sorry -- S is the coordinates of an AV object out in the world, for instance a speaking person, and the auditory observations GK are produced by this well-known function, which is simply sound propagation in space, where SM1 and SM2 are the positions of the two microphones, localized in visual space. Similarly we have the 3-D visual disparities. Here I just copied classical things: the UVD are given by this projective mapping, where B is the distance between the two cameras. This corresponds to a rectified camera pair, but it generalizes to any other camera configuration. With this, the auditory and the visual data are put on an equal footing; we do not privilege either of the two modalities. You may notice also that this is very useful for calibrating the camera-microphone setup, because if you know, for instance, the location of an acoustic source S, then the unknowns become SM1 and SM2, and you can very easily estimate the positions of the microphones in camera space. This is in fact what we did. Okay. So now we have a multiple-speaker model. The unknown parameters of the method are the speakers' positions, and there can be an arbitrary number N of speakers.
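[Editor's note: as a sketch, these are the two mappings being described, written out under assumed conventions (speed of sound c, a rectified camera pair with focal length f and baseline B); the exact parameterization in the paper may differ.]

```latex
% Auditory mapping: an AV object at S produces an ITD determined by the
% difference of its distances to the two microphones (c = speed of sound):
g(S) = \frac{1}{c}\left(\|S - S_{M_1}\| - \|S - S_{M_2}\|\right)

% Visual mapping: the same object S = (x, y, z) projects into UVD space;
% for a rectified camera pair with focal length f and baseline B:
F(S) = (u, v, d) = \left(\frac{f x}{z},\; \frac{f y}{z},\; \frac{f B}{z}\right)
```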
And then, because we do not know to which category each observation belongs, we have to introduce latent variables, which are the hidden data I mentioned before. I will write A for the hidden variables -- the assignment variables -- for the video part and A prime for the audio part. This is classical notation: AM equal to N means that visual observation M is assigned to speaker N. We also introduce an outlier class N plus 1, and we have exactly the same thing for the auditory part. The likelihood model that we use is also quite classical: we consider a Gaussian model for the likelihood of an observation belonging to a speaker, whether it is a visual observation or an auditory observation. We also tried the T distribution, and in fact the methodology we present is completely independent of the choice between the normal and the T distribution. We also added an outlier model. This is a uniform distribution, and as opposed to the Gaussian or the T distribution, the uniform component of the mixture model is in our case treated as a nonparametric distribution. At the end we have the model parameters that I wrote down here. In the classical mixture model you have to deal with the means, but here the means are replaced by the speaker positions. And then, because the visual data is in 3-D, we have three-by-three covariance matrices, and for the auditory data we have scalar variances. This is the parameter set that we have to determine.

If you try to formally derive the maximum likelihood in the presence of the hidden variables, you end up with this quadratic form. What is interesting to remember here is that the two modalities, the auditory and the visual, are linked by the two terms of this maximization problem -- maybe I should use this. In the first term you can see the function that maps, let's say, a speaker or AV object onto the visual observation space, and here the function that maps the same AV object onto the auditory observations. So SN is the parameter that links the two terms in this expected complete-data likelihood. There are other terms, due to the uniform distribution, that can be omitted because they are constant with respect to the parameters over which we take the maximization or the minimization. And the hidden variables disappear, but they are represented by their probabilities: the alpha MN and alpha prime KN stand for the classical posterior probabilities of a visual or auditory observation belonging to speaker N or being an outlier.

Now, we have to take some distance from the classical EM algorithm, mainly because of these nonlinear functions F and G. The effect of this nonlinearity is twofold. First, when you take the derivatives of this function with respect to S, the maximization becomes a nonlinear problem, and it is no longer possible to make this minimization in closed form, first over the means and then over the covariances, as in standard EM. So we have to perform this minimization all at once, over the entire parameter set.
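[Editor's note: written out, the quadratic form being described should look roughly as follows (a reconstruction from the definitions in the talk, with the constant terms from the uniform outlier component dropped). The speaker position s_n appears in both sums, which is what ties the two modalities together.]

```latex
% Expected complete-data likelihood criterion, minimized over the speaker
% positions s_n, the 3x3 covariances Sigma_n, and the scalar variances
% sigma_n^2; alpha_{mn} and alpha'_{kn} are the posterior probabilities
% computed in the E step:
\min_{\{s_n, \Sigma_n, \sigma_n\}} \;
\sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_{mn}
\Big[ \big(f_m - F(s_n)\big)^{\top} \Sigma_n^{-1} \big(f_m - F(s_n)\big)
      + \log \lvert \Sigma_n \rvert \Big]
+ \sum_{k=1}^{K} \sum_{n=1}^{N} \alpha'_{kn}
\left[ \frac{\big(g_k - g(s_n)\big)^{2}}{\sigma_n^{2}} + \log \sigma_n^{2} \right]
```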
So although the E step is exactly the same, the M step becomes generalized: this is a generalized EM (GEM), where the M step consists of one step of a nonlinear minimization, Newton-Raphson for instance, and this replaces the closed-form solution of standard EM. There are theoretical results showing that this GEM algorithm has the same convergence properties as EM, in the sense that it improves the likelihood at each iteration exactly like EM does. Basically, the algorithm that we implemented starts by initializing the number of clusters and the cluster centers. This is a very crucial issue, and currently we do not have a nice, elegant solution to the problem of initializing the number of clusters. We would like to consider the general case where people come in and out and the number of AV objects changes over time in the scene. We experimented with face detection to initialize the cluster centers, but it does a very bad job, in fact, because as soon as someone turns his head away from the camera, face detection is not effective anymore. So we are in the process of replacing the face detector that initializes the algorithm with something more reliable, based on optical flow, motion changes in the images, et cetera. Yes?

>> In your objective function, can you explain what F is, actually? How do you map the speaker location to the 3-D locations of the stereo data?

>> Radu Horaud: Yes, in fact I gave -- maybe I can come back --

>> Because the points can be almost anywhere, right? The 3-D points. So I'm just wondering how you actually map this particular location to the visual data that you have.

>> Radu Horaud: Yeah, I think I understand the question. Basically this is the function F: the function F will take a position S, XYZ, and project it into UVD space. So your observations are in UVD space, and what you would like to get is XYZ. But there are many observations associated with a face -- in our case there are interest points all over the face -- so we assume that these should be clustered together, and at the end the relevant information is the center of this cluster. The assumption that these points are Gaussian is clearly not valid from a practical point of view, but this is the only probabilistic model that we --

>> Do you know in advance how many objects you have in the scene?

>> Radu Horaud: Yeah, this is exactly what I was talking about just now. In this case, yes. At time zero we start with a face detector, and we completely rely on it: it says there are three faces, we take the centers of these face detections as the cluster centers, and we say we have three faces, three clusters. But this clearly is not very good. Yes?

>> How does it handle the (inaudible) outside (inaudible)?

>> Radu Horaud: You will not see a lot. In fact, they are not AV objects, so they would not be detected as such. You'll see in a while that they have an ITD associated with them, as do other noises in the room -- someone walking around, et cetera. They are detected in the sense that the associated ITD is present, but then they will be treated as outliers. As a matter of fact, it's a very interesting question, because you'll see another example where someone speaks while he walks. While I do this, you concentrate on my voice, you have your auditory attention on it, but you don't hear my steps.
Although the system will hear the steps, and you will see that, in fact, in this second example that I'll show you in a while, there are two auditory sources.

>> On the microphones (inaudible).

>> Radu Horaud: I'll show you the setup in a while. In all the examples, we have two cameras and two microphones.

>> You mentioned the face detector does a poor job, and it looks like you don't have any control. You don't do any tracking in association with --

>> Radu Horaud: I'll show you in a while. Maybe I'll just go through this and then I will explain to you exactly how.

>> (Inaudible).

>> Radu Horaud: Yes, the audio -- what do you mean?

>> (Inaudible).

>> Radu Horaud: Yeah, yeah, it is. I will show in a while. Sorry, maybe I should have shown this --

>> There's different things. The function would be different (inaudible).

>> Radu Horaud: Yes, but this is handled by the auditory part. In fact, I forgot to mention that this is done in collaboration with the speech and hearing group at the University of Sheffield, and they have taken care of this. For each ITD that we get, they have a different function. Sometimes we use a dummy head, and sometimes we also have recordings where we actually put two tiny microphones on someone's head. But of course they modeled this, and this is, I would say, done beforehand.

The way we treat the data is that we split each sequence -- a sequence is binocular-binaural data -- into time intervals, each about one-eighth of a second, which roughly corresponds to three video frames. Right now the face detector runs in a static manner, but we plan to use the frames within each one-eighth of a second to detect some local motion and to rely more on this feature than on face detection. For roughly each time interval there are about 1,000 visual observations and 10 ITDs: about 1,000 interest points in 3-D and about 10 ITDs. As I said, the visual observations are quite dense, while the auditory observations are much more sparse. In this case we take the 10 most prominent ITDs, but there are many reverberations and so on -- so among the many ITDs, three of them correspond to the speakers.

So this is the kind of result that we obtain. Let me explain a little; I'll show you the two sequences in a while. For instance, in the first sequence there are about 166 time intervals, out of which only 89 have AV information -- some of them have only visual information, et cetera. And we have been able to correctly detect 75 out of these 89. Here we have the rate of missed AV objects, which is roughly .16. I think these are percentages, if I remember well. I don't remember exactly.

>> (Inaudible).

>> Radu Horaud: Good question. No, I think it's .16%.

>> Eight or nine --

>> Radu Horaud: I'm sorry, yes, 16%. And these are the false detections, so about 14%, you're right. And in this case, as you will see in a while -- this is the walking example -- we have many more false alarms, about 43%. I don't remember exactly the reason, but it's basically because sometimes the person walks without speaking, and these are detected as AV objects.

So what I'm going to show you here is a run of the EM algorithm. One time interval is processed, and what you'll see are the iterations of the EM algorithm. What is marked with a white dot is the AV object being detected; it's marked white on the right camera and blue on the left camera.
Please note that because of the presence of many interest points on the t-shirt here, although we initialized this person with the head, eventually the cluster center is on the t-shirt, not on the -- so this is the situation in 3-D. These are the stereo data, and superimposed on the stereo data are the locations of the three persons. The system is also able to say speaking/non-speaking. So this is one example.

This is another example, where only one person is speaking -- in this case it's the person in the middle. Although it was initialized on the face, the ITD in fact only gives a direction, right? So eventually the position of the speaker is given by this, which is not necessarily, let's say, the mouth, the actual object that is the acoustic source. Okay. Maybe I will just interrupt the presentation to show you the same thing with sound. The first part of the video goes through the algorithm. Here you can see the one-dimensional auditory observations: each bar corresponds to an ITD. Here, for instance, there are three or four or five ITDs, something like this.

[Demonstration audio]

>> Radu Horaud: We had someone else taping, so there's nothing.

[Demonstration audio]

>> Radu Horaud: So this is the case where in fact there are two auditory sources, the voice and the steps, and the system tends to locate the person in the middle of the two.

>> So I guess it would be a problem if there are two persons talking and walking at the same time? Clustering might take them --

>> Radu Horaud: I think there was another question about whether we have a temporal model. Currently we do not have a temporal model. What we do is simply run the EM algorithm within a time interval of about one-eighth of a second, and then we take the output of this algorithm to initialize the next step. So it's a very simple temporal model; we do not have a proper temporal model. What we show here is just our method running on a sequence. To answer your question: currently we just take very short time intervals and process them in sequence. So it should be able, I think, to detect two people walking at the same time, provided that we have a good initialization at the beginning. For instance, if I start with a face detector that sees only one person and there is a second one, then the system will think there is only one AV object out there, and it will try to find only one AV object. So initialization is really a very important thing here. We started to work on this topic about a year and a half ago, and these are the results that are currently available. We discovered that initialization is very important. At the end I will show more about this. So now just let me -- yes?

>> The steps -- just taking ITD, the steps actually -- there's no conflict between this (inaudible) and the steps right there. So actually it's helping, not hurting.

>> Radu Horaud: There is confusion, because they correspond to the same ITD. Yes? The feet are not seen in the image.

>> How are the microphones positioned?

>> Radu Horaud: Like this. Like this. And then it means that the same ITD corresponds to two different AV objects.
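[Editor's note: to make this ambiguity concrete, here is a small illustrative sketch (not the speaker's code; the microphone positions and the speed of sound are assumed values) of the ITD function from the first part of the talk. Any two points with the same coordinate along the microphone axis and the same radial distance from that axis, such as a mouth and the feet below it, produce exactly the same ITD.]

```python
import numpy as np

C = 343.0  # assumed speed of sound in air, m/s

# Two microphones on a horizontal axis, e.g., the two ears of the dummy head.
m1 = np.array([-0.1, 0.0, 0.0])
m2 = np.array([+0.1, 0.0, 0.0])

def itd(s):
    """Interaural time difference for a source at 3-D position s."""
    return (np.linalg.norm(s - m1) - np.linalg.norm(s - m2)) / C

# A mouth and the feet of the same person: same coordinate along the
# microphone axis (x) and same distance from that axis, different height.
mouth = np.array([0.3, 0.0, 1.5])   # radius sqrt(0.0^2 + 1.5^2) = 1.5
feet  = np.array([0.3, -1.2, 0.9])  # radius sqrt(1.2^2 + 0.9^2) = 1.5

print(itd(mouth), itd(feet))  # identical ITDs: one ITD, two AV objects
```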
>> So when it detects the person and the person is not talking, just walking, it adds up for you?

>> Radu Horaud: Well, no. The system relies on initialization. In this case the system sees one face, so it starts by saying, okay, I have to detect only one cluster, which means I have to detect only one AV object. And there are ITDs that correspond to the voice, ITDs that correspond to the feet, and other ITDs.

>> You only have two microphones (inaudible) with the feet.

>> Radu Horaud: They would be the same. So the system is unable to say there are two AV objects.

>> But the (inaudible) is different. It says (inaudible) I think it gives us (inaudible), and then when it steps, it gives a step and the foot position. So then the average (inaudible) the same audio observation, audio sources, but gives a different visualization.

>> (Inaudible) in fact, whether it's moving or not.

>> Radu Horaud: No.

>> Because the vision point is here. The vision point is on the body. The visual input -- the vision point is over here in the mouth area.

>> Acoustically speaking, if the steps were coming from his mouth, he could not -- [MULTIPLE SPEAKERS]

>> The mouth is the input, the visual data.

>> Radu Horaud: The reason it marks the person in the middle is that it's roughly halfway between the feet and the mouth.

>> I think it's just because it's getting features from the shirt --

>> Radu Horaud: Maybe a combination, yeah. But, well, an ITD gives you a direction, okay? Within this direction we initialized the cluster center with a face, although it seems to prefer to go down; I don't know exactly why. But if I had told the system at initialization, hey, there are two auditory sources, there are two AV objects out there, it would anyway fail to find one of the objects, because there is no visual data associated with it.

Anyway, now I will show a little bit how the data were gathered. The point here was to record this AV data from the perspective of an active speaker. The dataset that is available on this website is described in a paper that will appear soon at ICMI this year. We will also provide associated software for audio feature extraction and video feature extraction, and maybe in the future we will also put the audiovisual fusion algorithm alongside it. So I don't know if this -- but basically this is the setup. This is a dummy head with two microphones, and we have a helmet on which we put the two cameras. There are two possibilities: either we use this dummy head, or we put the helmet on a physical person's head, in which case they wear the microphones in their ears. And this device here is associated with this camera and provides the six-degrees-of-freedom motion of the head. So we have stereo pairs at 20 frames per second, binaural audio at 44.1 kilohertz, and the head position and orientation at 25 per second as well. These cameras and this camera are perfectly synchronized. And this is the general view of the system. At that time, because the auditory software was running on a Windows computer and our software ran on a Linux computer, we had difficulties synchronizing all the computers through NTP. So for audiovisual synchronization we simply, systematically, introduced a clap.
But I think now we are able to get around this problem and synchronize all the computers so that we have synchronized data.

So I'm done with the first talk. The method does audiovisual fusion in the 3-D domain based on unsupervised clustering using the GEM algorithm, and we plan to extend it to deal with a varying number of AV objects, meaning the number of AV objects can vary over time: people coming in and going out, and not only people but other kinds of AV objects. The reason we chose a binocular-binaural setup is that it has strong links with neurophysiologists who study audiovisual attention, which is in fact quite a new field in neurophysiology: studying binocular and binaural perception within the framework of attention. A more sophisticated model should include eye and head motions. I don't have a photograph here, but we've built a robotic head that has binocular eye movements, and active perception and attention will be studied within this framework. So that is the reason we chose a binocular-binaural setup. I'm done with the first talk; let's see if we have more questions on this. Yes, please.

>> I was hoping you could talk a little bit about the applications where you see this.

>> Radu Horaud: Yeah, the application that we target here is robot-human interaction. We'd like to mount such an audiovisual head onto a robot. We have a collaboration with a robotics company in Paris, France, with a small humanoid robot; they are very interested in studying the social interaction of this tiny robot with people. The first task the robot will have to do is to be able to distinguish between, let's say, people and other things in the room. From the visual point of view alone this is very difficult, so we thought that combining audio and visual would be a nice way to distinguish, let's say, living objects from artifacts. This is maybe one of the motivations for doing this kind of work. We also have a more theoretical motivation. I'm from a computer vision background, and Zhengyou just told me that when he started to do audio he thought it was easy, but in fact it's not easy at all. In some respects the auditory people discovered the probabilistic setup much earlier than the computer vision people did, but somehow they neglected some other things, like the simple physical and geometric facts about auditory perception. So when we started this and tried to put the two together, we wanted to emphasize these things.

>> I'm interested in the (inaudible) and this system (inaudible), for example, usually EM, the first thing, because you cannot get communication with them. But the key here, I'm just curious, for example, that was only one facet. So you get this still robust --

>> Radu Horaud: When I talk about initialization, there are two things. One problem, with, let's say, Gaussian mixture models, is the number of components.

>> The number of --

>> Radu Horaud: The number of components. So this is one initialization issue. The other initialization issue is: what are the means and covariances of the components, the initial values? We started to address the number of components. In fact, we use the BIC criterion to determine the number of components just based on the auditory sources, and we have very nice results there.
Although we have to run this over a short period of time, like one or two seconds, to determine it. But of course this gives you the number of auditory sources, and over time the number of auditory sources may change: someone else comes in, et cetera. So it's not, I think, just a question of initialization, as you say. It's a much more general issue: how do you allow such a system, all of a sudden, to say to the small robot, a new person I haven't heard for the last five minutes is coming into the room? These new observations -- how do you take them into account? So I think it goes beyond properly initializing EM; it's a much more difficult, more general problem.

>> My concern is about the audio. When you have multiple sources, it's really difficult to (inaudible), it's very difficult to localize them, for example, or do whatever source separation. And here I was surprised to see that with only two microphones we are able to localize two speakers who are active at the same time. So maybe they are using some kind of speech cues to express this or something.

>> Radu Horaud: No, no, we just detect as many ITDs as possible out there. And then, because in this case we know in advance that we have to look for three speakers -- because we have a face detector at the start -- we know we want... I should not call them speakers, because there are five speakers; I prefer to call them AV objects. We know that there are AV objects out there, so the unsupervised clustering technique that I described will consider four clusters, possibly: three for AV objects and one for outliers. And it turns out that this fourth component is very important for what you are talking about. And the fact that visual observations are systematically associated with auditory observations through the generative model which I showed you, which is the exact physics of the problem, is, I think, very important. It leads to a generalized EM algorithm, but it does a very good job of detecting several AV objects simultaneously. Okay? If you are interested, send me an e-mail and I can send you a paper that describes this in detail.

So now I switch to the other talk. Okay. Let's see -- I don't know what the resolution is. Maybe I should start with the -- so although it's a completely different application, it will boil down to EM as well. A very different instance of EM, but it boils down to EM as well. What I'm going to present here was in fact presented at CVPR two weeks ago in Anchorage, and the title of the talk matches exactly the title of the CVPR paper. What we want to do here: we have two objects, two (inaudible) objects. In this case, for instance, we have a small wooden mannequin and a person. I'm showing you the images of these, but what we have are voxel data gathered with a multiple-camera system. What you see here is the voxel data for the first object and the voxel data for the second object, and the output of the algorithm is a dense one-to-one matching between all the voxels in the first set and all the voxels in the second set. So why are we interested in shape matching? It turns out that it is a fundamental problem in computer vision: not only how do you describe shape, but how do you compare shapes. It is also useful for object recognition and indexing, motion capture, and gesture recognition.
There are also links with computer graphics, where, for instance, people are interested in shape morphing and texture morphing so that they can get around classical motion capture techniques. All these applications and problems are relevant to the problem at hand, which is shape matching.

The methodology that I'm going to present -- this part is much shorter than the previous talk, because I was supposed to give two short talks -- goes as follows. We have, as I said, two 3-D shapes, represented by voxels in this work, but in fact we extended it to meshes, and they could also be 2-D shapes represented by clouds of points or by silhouettes or whatever you like. Because a shape is described not only by the positions of the point sets but also by the local geometry and topology of these points, in our case the 3-D shape is strictly equivalent to a sparse graph representation, and therefore shape matching becomes a problem of graph matching. We will use spectral graph theory and Laplacian embedding to represent such graphs in an isometric embedding space. This is the key methodology that we've developed. Once we do this, what is interesting is that the shape matching -- we start with two articulated shapes, so there is no transformation that maps one shape onto the other -- but thanks to this piece of methodology, articulated shape matching becomes a point registration problem; more precisely, a rigid point registration problem. Although not in the classical sense: the shapes are not projected onto 2-D or 3-D space but onto a higher-dimensional space, and the dimensionality of that space corresponds roughly to the complexity of the shape. This rigid point registration problem is then solved in a probabilistic framework, with maximum likelihood with missing data. You can look at the CVPR paper if you want more details.

The central idea of the method is to embed a sparse graph into an isometric space. Basically, this is a graph that I'm representing here, and the nodes of this graph will become points in a space spanned by the eigenvectors of the matrix associated with the graph. In classical spectral graph matching, the matrix that represents the graph is usually the adjacency matrix. A number of people have addressed the graph matching problem by looking for a permutation matrix between the two graphs. This is written analytically as the expression at the top, where -- I lost my -- okay. Let's say that the two graphs are represented by matrices A and B. P is a permutation that takes one matrix and maps it onto the other, and the problem is to find the P that minimizes the Frobenius norm of the difference between these two matrices. Spectral graph matching is a very nice method that to my knowledge was introduced by Umeyama, I think in a 1988 paper, so it's 20 years old. He showed that if the eigenvalues of the adjacency matrices are distinct and can be ordered -- which is a very strong assumption -- then the problem becomes a point matching problem, as opposed to the matrix alignment problem here.
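[Editor's note: in symbols, the expression at the top should be the classical permutation-based graph matching objective, reconstructed here from the talk.]

```latex
% Graph matching as matrix alignment: A and B are the matrices (adjacency
% or Laplacian) of the two graphs, P ranges over permutation matrices,
% and ||.||_F is the Frobenius norm:
\min_{P} \; \left\| P A P^{\top} - B \right\|_{F}

% Umeyama's spectral approach: if A = U_A \Lambda_A U_A^{\top} and
% B = U_B \Lambda_B U_B^{\top}, and the eigenvalues are distinct and can
% be consistently ordered, the rows of U_A and U_B become points whose
% alignment (up to sign) solves the matching, i.e., a point matching problem.
```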
The nodes in the two graphs are the X's for the first graph and the Y's for the other, and if you are able to order the eigenvalues, then the matching problem is essentially solved. There are as many eigenvalues and eigenvectors as there are nodes in the graph, and then all you have to do is look for the orthogonal matrix that aligns the two clouds of points. So on paper the problem is solved like this. I should also mention that there are other problems, due to the fact that when you compute the eigenvectors there is a sign ambiguity, so in practice it's not that simple. The problem then boils down to the Hungarian algorithm, which solves the bipartite matching problem, and it has N-cubed complexity, where N is the number of nodes.

>> So X (inaudible)?

>> Radu Horaud: Sorry.

>> Sorry.

>> Radu Horaud: The nodes in the graph become points in a Euclidean space. These are X1, X2 for one graph.

Now, for various reasons that I won't detail here, one should use the Laplacian matrix rather than the adjacency matrix; it is a much richer description of the geometry and local topology of the graph. This is how the Laplacian matrix is built. First of all, let's define W IJ: if node I belongs to the neighborhood of node J, you take the Euclidean distance between I and J and apply a Gaussian kernel with some width parameter that you have to specify; W IJ is zero otherwise. So clearly this acts only locally on the point set. The diagonal terms are computed like this: each is the sum over the corresponding row, and the matrix is symmetric. From these you build the Laplacian matrix. Another important point: rather than taking all the eigenvalues, you skip the first one and then take the first K eigenvalues, and K is the dimension of the space into which you are going to project the graph. There's no reason for K to be 2 or 3; it can be any number. And then the question is: are these, let's say, K smallest eigenvalues of the Laplacian matrix distinct? If they are distinct, can they be ordered? This is just an example of a voxel set that contains about 20,000 voxels, so there are 20,000 eigenvalues and 20,000 eigenvectors. This is the shape projected into the K-dimensional space. What we show here are the actual values of the eigenvalues, and you can see -- of course this is also flattened by the resolution -- that for the first 10 or 20 eigenvalues you have to go to the fifth significant digit after the decimal point to rank them. Analyzing the geometric and algebraic multiplicities of this kind of matrix may be a topic in its own right, but we didn't want to do this. So the question is how you get around this problem, because now you cannot apply classical spectral graph matching: the eigenvalues cannot be ordered anymore. Why is this important? Because the ordering of the eigenvalues gives you an order on the eigenvectors. If on one side you have X, Y, Z and on the other side you have Y, X, Z, then it is not possible to compare the two shapes, because the transformation is not isometric. So this is why it's important to --

>> Is it essentially zero?

>> Radu Horaud: In theory -- sorry.
In theory, the smallest eigenvalue is zero, so you have to skip it. The other eigenvalues are not zero, but in practice they are very small.

>> D is what?

>> Radu Horaud: D is this matrix.

>> Sorry.

>> Radu Horaud: W is this matrix, and D is this matrix. And then you take the eigenvalues and eigenvectors of the matrix L.

>> So D IJ is just the Euclidean distance?

>> Radu Horaud: You can take any distance. You can take the --

>> Zero. D is the (inaudible) matrix.

>> Radu Horaud: Sorry, there is confusion in the notation: this and this are two different things. Sorry.

>> D IJ on the top, I assume you --

>> Radu Horaud: You can use any distance. You can use the geodesic distance or the Manhattan distance, for instance; the voxels are regularly spaced, so you can use the Manhattan distance. You can use any distance -- the norm-1 distance, or the shortest path between two voxels.

>> Your eigenvalues would depend on how you define the distance, right?

>> Radu Horaud: Maybe, yeah, yeah. As I told you -- yeah, sure, maybe.

>> I assume that, for example, if you just summarize your distance, it doesn't take away anything, then you should be relatively (inaudible) in the three-dimensional information in terms of XYZ. If everything just --

>> Radu Horaud: In fact, with this representation -- yeah, intuitively you may think of each node as having a location in space, XYZ, but in fact we do not use this. We use the relative distances between node pairs. What the Laplacian matrix contains is not the XYZ coordinates of the points but these distances: the entry at position IJ of the Laplacian matrix tells you what the distance is between node I and node J, according to this definition.

>> How do you define the neighborhood, whether I is in the neighborhood of J -- what kind of area?

>> Radu Horaud: In the case of the voxels, I think we use -- sorry, voxels -- it is a 27 -- a 26-neighborhood, in terms of voxels. It would be like an eight-neighborhood in 2-D.

>> So increasing or shrinking the neighborhood has some effect on --

>> Radu Horaud: Yes, of course. Of course, yes. The reason is that you would like to be invariant to this kind of motion, right? When you have this pose and then this other pose, ideally you'd like to have the same graph representation. And it's a compromise: the size of the neighborhood is one parameter, and the size of the kernel is another parameter. It turns out that --

>> So the exponential function is (inaudible)?

>> Radu Horaud: Yes. Yes. Yes. Anyway, did I answer your questions? Okay.

Okay. So the key innovation here, as opposed to this idea, is in the ordering of the eigenvectors. Let's say we have two graphs; these two graphs will be represented by two sets of eigenvectors, and we'd like to order them consistently, so that one, two, three match one, two, three. Again, the details are in the paper, but we noticed -- what is called the eigenfunction here is in fact an eigenvector -- that if you consider the histograms of the eigenvectors, it is very easy to see that a histogram is invariant to the order in which you consider the eigenvectors.
So we simply noticed that the histograms of the eigenfunctions -- the eigenvectors -- are very reliable signatures of the eigenspace. What I'm showing on the left are the histograms of some eigenvectors associated with the first shape, and these are associated with the second shape, shown ordered by eigenvalue. You may notice that although the first two eigenvalues induce the correct ordering, here there is a problem, because this one should match this one and this one should match this one. Notice also that this histogram is reversed -- it is the mirror of this histogram -- which is simply due to the fact that when you compute an eigenvector you don't know whether you should take the eigenvector or its negative. So by comparing these histograms you can simply find a one-to-one mapping between the eigenvectors of the first graph and the eigenvectors of the second graph, and possibly disregard some of them; for instance, this one is disregarded and this one is disregarded.

>> So this represents one eigenvector?

>> Radu Horaud: This is the histogram of an eigenvector, and an eigenvector has -- for instance, if the graph has 20,000 voxels, an eigenvector has 20,000 components, right?

>> So the axis -- it depends --

>> Radu Horaud: Sorry?

>> The axis, the horizontal axis.

>> Radu Horaud: A histogram.

>> But what does 200 mean? 200 what?

>> It's components?

>> Radu Horaud: Yeah, these are the components -- the counts of components.

>> You have 20,000 components, right?

>> Radu Horaud: I think simply --

>> Is it the index --

>> Radu Horaud: No, these are the numbers of components that fall into each bin. For instance, here it means there are 125 components that give this value.

>> The eigenvector and dataset (inaudible) this dataset.

>> So one to a thousand, that's the value, and the horizontal one counts?

>> Radu Horaud: Yeah.

>> Because sometimes you would do it the other way around.

>> Radu Horaud: Okay.

>> In the range of the small eigenvalues, how do you know it's related to the motion and not to noise?

>> Radu Horaud: That's a good question. At some point you have to start with something. You have 20,000 eigenvalues; we take the 20 smallest ones and assume that the eigenspace we want to analyze is within this 20-dimensional space. Then we take all the eigenvectors -- 20 eigenvectors on one side, 20 eigenvectors on the other side -- we start by ordering them by eigenvalue, then we compute the histograms, then the distances between the histograms, and then we match them. This is an algorithm in N to the power 3; for N equal to 20 it's not that expensive, it's quite fast, in fact. It gives you the number of histograms that match, and we put a threshold on the quality of the histogram match. This gives the final dimension. In the example I'm showing you, the dimension chosen is 5. But I agree that this is quite ad hoc, and we are trying to figure out how to do this analysis in a more systematic way.
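[Editor's note: a minimal sketch of the pipeline as described so far (an illustration, not the authors' code; the neighborhood radius, kernel width, bin range, and greedy matching below are assumptions). It builds the graph Laplacian from a point set, keeps the K smallest non-trivial eigenvectors, and aligns two eigenbases by comparing histograms while trying both signs of each eigenvector.]

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplacian_embedding(points, radius=1.0, sigma=0.5, k=20):
    """Embed a point set via the graph Laplacian L = D - W.

    W_ij = exp(-d_ij^2 / sigma^2) if points i and j are neighbors
    (d_ij < radius) and 0 otherwise; D is diagonal with the row sums.
    The first, near-zero eigenvector is skipped, as in the talk.
    """
    d = cdist(points, points)
    W = np.where(d < radius, np.exp(-d**2 / sigma**2), 0.0)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1:k + 1]       # skip the trivial constant eigenvector

def hist_signature(v, bins=50):
    """Histogram of an eigenvector's components: invariant to the
    unreliable eigenvalue ordering, but not to the sign of v."""
    h, _ = np.histogram(v, bins=bins, range=(-1.0, 1.0))
    return h / h.sum()

def align_eigenbases(E1, E2):
    """Match the eigenvectors of shape 1 to those of shape 2 by histogram
    distance, trying both signs. Greedy here for brevity; an exact
    assignment (e.g., the Hungarian algorithm, the N^3 step mentioned in
    the talk) plus a threshold on the match cost would then fix the final
    embedding dimension (5 in the example shown)."""
    matches, used = [], set()
    for i in range(E1.shape[1]):
        h1 = hist_signature(E1[:, i])
        best = None
        for j in range(E2.shape[1]):
            if j in used:
                continue
            for s in (+1, -1):
                cost = np.abs(h1 - hist_signature(s * E2[:, j])).sum()
                if best is None or cost < best[0]:
                    best = (cost, j, s)
        used.add(best[1])
        matches.append((i, best[1], best[2], best[0]))
    return matches
```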
If you are on the theoretical side, you might like to understand the properties of matrices of very large size with multiple eigenvalues and eigenvectors, et cetera. There are algorithms that give you the exact characteristic polynomial of a very large matrix, but factorizing this polynomial to find the multiplicities of the eigenvalues can take a week of computation. It's very expensive, very costly. So we decided to skip all this theoretical part and do something more pragmatic. But I agree that there are some problems.

>> It also depends on the division of motion and what's your --

>> Radu Horaud: We do not have any motion here. We just take one position and then another position.

>> No motion?

>> Radu Horaud: No, no: you take one position, then another position, and you try to match them.

So now -- okay, we did two things: we reduced the dimensionality of the problem, and we aligned the two eigenbases. The eigenfunction alignment chooses the dimension K of the embedding; it is in the range six to ten, in fact five in this example. Now we have two sets of points that have different cardinalities: there is no reason the two voxel sets should have exactly the same number of points. And because there are so many points, there will be multiple matches and missing shape parts. For instance, in one position I'm like this and in the other position I'm like this; this shape part will be missing from one dataset. And there is noise, there are outliers, et cetera, that we have to handle. So now that we have a way to choose the embeddings, choose the dimensionality, and get an initial alignment of the eigenbases, we are still left with the problems of multiple matches, missing shape parts, et cetera. This is why we are now dealing with a point registration problem. Okay, I'm going to skip this. Again, we are in the framework of maximum likelihood, and more specifically, because of the presence of missing data, we cannot directly solve the maximum likelihood problem; here the missing data are the assignments between the two voxel sets. Let me just show you what the system is supposed to do. Let's say we have two sets of points. The first thing we do is apply a K-by-K isometric mapping -- in this case a rotation, a K-by-K orthogonal matrix with determinant equal to 1 -- to the first set. This set then becomes transparent and is overlapped on the second set. Each point in the second set is the mean of a cluster, and we have a covariance that represents the size and shape of the cluster. Then we have to decide whether the points from the first set are inliers or outliers; sometimes there are multiple matches or empty clusters. These are the kinds of things we want to resolve. And so we take this robust matching problem and put it in the framework of density estimation. Unlike many techniques in point registration, we treat the two point sets in a nonsymmetric way: one point set -- in general the largest one -- corresponds to the observations, the other to the cluster centers, and the problem is to assign each observation to one of the M clusters (a minimal sketch of this registration loop follows below).
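[Editor's note: a minimal sketch of that registration loop (an illustration under the spherical-covariance and uniform-outlier assumptions described in the talk, not the authors' code; the outlier density would be one over the volume of the working space). The M step estimates the orthogonal transformation first and then the variance given that transformation, which is what makes this ECM rather than plain EM.]

```python
import numpy as np

def register(X, Y, n_iter=50, outlier_density=0.01):
    """Rigid point registration in K dimensions by ECM.

    X: (M, K) observations; Y: (N, K) cluster centers.
    Model: x ~ sum_n N(Q y_n, sigma^2 I) plus a uniform outlier component.
    """
    M, K = X.shape
    N = Y.shape[0]
    Q = np.eye(K)     # initial alignment: identity, eigenbases pre-aligned
    sigma2 = 10.0     # initial covariance taken very large

    for _ in range(n_iter):
        # E step: posterior probability that observation m belongs to
        # cluster n, with an extra column for the uniform outlier class.
        d2 = ((X[:, None, :] - (Y @ Q.T)[None, :, :]) ** 2).sum(-1)  # (M, N)
        g = np.exp(-0.5 * d2 / sigma2) / (2 * np.pi * sigma2) ** (K / 2)
        post = np.hstack([g, np.full((M, 1), outlier_density)])
        post /= post.sum(axis=1, keepdims=True)
        a = post[:, :N]                      # inlier responsibilities

        # CM step 1: orthogonal Q (det +1) by weighted Procrustes (SVD of
        # the responsibility-weighted cross-covariance, Kabsch-style).
        H = Y.T @ a.T @ X                    # (K, K)
        U, _, Vt = np.linalg.svd(H)
        D = np.eye(K)
        D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))
        Q = Vt.T @ D @ U.T

        # CM step 2: update the spherical variance given the new Q.
        d2 = ((X[:, None, :] - (Y @ Q.T)[None, :, :]) ** 2).sum(-1)
        sigma2 = (a * d2).sum() / (K * a.sum())
    return Q, sigma2, a
```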
And also, as in the previous talk, we have to choose the distribution for the noise component in the mixture. Some people add a Gaussian component as an outlier cluster, which is not really efficient from a practical point of view -- I don't want to go into the details here. Other people don't add an outlier component but simply use a mixture of T distributions, because it has the reputation of being more robust to outliers. We decided, as in the previous talk, to add a uniform component to the mixture. And so, without going into detail, this is exactly the same kind of modeling, and it goes through the same kind of formulation. So point registration now becomes an algorithm where we initialize the transformation Q and the covariance sigma. Thanks to the eigenfunction alignment that we did before, Q is initialized to the identity matrix -- I will show you the results; it provides a very nice initial alignment -- so all you have to initialize are the covariances, and you can take them very large. The E step computes the posterior probabilities, and the M step estimates the transformation and then the covariance. I don't want to go into detail, but this is an important difference from classical EM, because you cannot compute the transformation independently of the covariance. So formally it is not EM, it's ECM -- Expectation Conditional Maximization -- because the maximization over Q is conditioned on the current value of the covariances, et cetera.

Just to show you an example -- whoops. This is a simulation. Those are the cluster centers, the green dots are the observations, and the red dots are outliers. You may notice -- this is a random initialization -- that sometimes the outliers are much closer to a cluster center than the inliers, et cetera. What I will show you now is the evolution of the EM algorithm in this case. In this example the inliers are not corrupted by any noise; this is why we obtain a perfect match. And you notice that all the outliers were disregarded, although in some cases they are very close to the inliers. There are approximately 15 inliers and 10 outliers, and the outliers went to the uniform component of the mixture model. We tried this with lots of cases, lots of noise, and it works very nicely.

So now this has to be applied to our problem. This is the video that shows the technique. These are six images of the object in one position, and these are six images of the same object in a different position. In one position the hands are like this, and in the other they touch each other, so there is a topological difference in the shape. One reason we wanted to show this example is to show how such an algorithm handles these topological changes. This is the voxel reconstruction of the first shape, and this is the voxel reconstruction of the second one. These are the embeddings: on the left we show the eigenfunctions that correspond to the various eigenvectors of the embeddings. Sometimes the embeddings have very weird shapes. This is the initial -- whoops -- maybe, I don't know, maybe I stopped it too... sorry, maybe I should go back. Yeah, so this is the alignment as provided by the eigenfunction alignment that I showed you. The EM algorithm starts here. These are the various projections in three-dimensional space.
Here the results are shown separately, because otherwise, with a dense matching like this, it is very difficult to see. But believe me, the matchings are correct; the colors help you a lot. You can see that this leg goes to this leg -- they're crossed -- et cetera. This is a nice example with a hand, where one hand is open and in the other one a finger is bent, I think, and touches another one. This is the example I showed you at the beginning, with two different objects, an artifact and a person. And this is a sequence where someone actually runs. Again, we don't need any temporal model, because the method can handle topological changes and very different articulated poses of the same shape. In fact, this model doesn't need any initialization or any temporal model.

So -- five minutes -- too late, anyway. The method we propose embeds the articulated shapes in a K-dimensional Euclidean space, and what is nice here is that K is not chosen in advance; it depends on the complexity of the shape. Typically, for people, the dimension is of the order of five or six. Matching consists of two steps: aligning the two K-dimensional eigenbases based on the histograms I showed you, and then finding the optimal transformation that aligns the largest subsets of points between the two sets, which boils down to a rigid point registration problem. It handles missing shape parts, noisy data, and bad points; these are explicitly taken into account by a carefully designed uniform component and the associated EM algorithm. I just want to mention that the dimensionality remains a problem, because if you have very complex shapes and you want K equal to 10 or 15 or something like this, then computing the covariance matrices in dimension 15 is an issue. I don't know if you're familiar with EM, but at some point you have to compute the determinants of these matrices, raised to the power 15 over 2 or something like this, in the covariances, and that can be a numerical problem. So in all the examples that I showed you, we used a spherical covariance. If you want to use a nonspherical covariance, the EM is much more complex, because you run into a nonconvex optimization problem. I don't know if you know the classical algorithms that align point sets in 3-D, for instance; they use quaternions, and in K-dimensional space the quaternion representation is not possible anymore. Plus, because of the presence of the Mahalanobis distance instead of the Euclidean distance, you end up with a nonconvex optimization problem. So it's not that simple to go from spherical covariances to nonisotropic covariances. Okay.

>> Zhengyou Zhang: Thank you very much.

>> Radu Horaud: Just to mention, I have a demo. If you are interested to see this working on my computer, I can show you.

(Applause)