>> Zhengyou Zhang: It's my pleasure to introduce Radu Horaud. He's a research
director at INRIA, leading the Perception Group at INRIA Grenoble, and he's
done a lot of work. Today he has two interesting pieces of work. He was also a
program chair in 2001 and has held a lot of other editorial duties.
>> Radu Horaud: Thank you, thank you, and thanks everybody for attending my
talk. Today there will be two talks, two short talks. When I submitted a few
suggestions to Zhengyou, he told me I should present two instead of one; there
would be more people that way.
So I don't know if you came for the first talk or the second talk. Anyway, the
first talk will address the problem of audiovisual fusion, in particular the
detection and localization of what we may call 3-D AV (audiovisual) objects,
using unsupervised clustering. This work was mainly performed by Vasil
Khalidov, a Ph.D. student co-advised with Florence Forbes, who is here, along
with other contributors.
And let's see what we mean by audiovisual clustering. What we'd like to
address here is the problem of audiovisual perception in general. By
audiovisual perception in general, I mean something that is not necessarily
reduced to, for instance, detecting speakers or detecting speech activity and
combining this with visual features, which is what is usually done.
We would like to put this in a more general framework, let's call it
computational audiovisual analysis, and we are basically interested in objects
that can be both seen and heard.
Many objects around us can be both seen and heard. And I'm from a computer
vision background, so when I started to do this I thought, well, how could we
put visual data together with auditory data? It turns out that, to start with,
this is an intrinsically difficult problem, because the two sensory stimuli
come in very different formats.
Also, the main feature of vision is that it provides dense data. And the light
sources that actually produce the light are completely irrelevant for the
task, because what you actually look at is objects that reflect the light.
So it's reflections that are relevant in vision. Audition is completely the
other way around: reflections of sound are really a problem, and what you
really want to do is detect the acoustic sources.
And auditory data is by nature sparse information, as opposed to vision. So
our approach, which is described in two recent papers, is to use binocular
vision and binaural hearing and to combine the two within the framework of 3-D
fusion. The approach is based on finding the 3-D location in space of AV
objects.
What we propose is a generative probabilistic model, which in the end boils
down to an EM algorithm that maximizes the expected complete-data likelihood.
On one side -- do you see this? On one side there is the observed data, which
is the audio and the visual observations, and on the other side is what is
called the missing or hidden data, which are the categories, the audiovisual
objects that we would like to detect. So it's a completely unsupervised
approach; we are not given the audiovisual objects in advance.
Just to show you the kind of data that we process: about a year ago we put
together a database. In fact, you can access it -- I don't have Internet
access here, but anyway, this link goes to what we call the CAVA dataset.
To show you one sample of this dataset: it consists of synchronized stereo
image pairs, and the audio recording is done with a binaural pair of
microphones.
There are two people who are visible, and two other people at both ends of the
table who are hidden; they produce only auditory data, not visual data.
Sometimes two people speak at a time, sometimes only one person speaks, et
cetera. So this is one example of the things that we like to process.
It will go on like this. I'll show you in a while how we actually extract AV
objects from this kind of data.
I'd like to go very quickly through what we mean by binocular and binaural
observations. This is pretty straightforward. We process each image pair so
that we can extract interest points, and then we match them. This is classical
epipolar geometry, some rectification, and then we have a number of 2-D
observations. Through a very simple-minded matching technique, we obtain 3-D
observations in UVD space, where U and V give the location of a pixel in, say,
the left image and D is the disparity.
So what I show here are the XYZ positions, or equivalently, as we said, the
UVD coordinates.
You can see roughly the three speakers; this is the background, and these are
some outliers produced by the stereo matching process. I will call the visual
observations F: F1 through FM, et cetera. And similarly we have binaural
observations, where what we use is the interaural time difference (ITD).
This is just to show you how this is processed: this is the left microphone,
this the right microphone. From these we compute the correlogram and then we
detect ITDs. These are 1-D observations, and I will refer to the auditory
observations as G: G1, et cetera.
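As a rough illustration of the correlogram-and-peak-picking step just
described, here is a minimal sketch of ITD candidate extraction by
cross-correlation. It is not the Sheffield pipeline used in this work; the
sample rate, lag window, and peak selection below are illustrative
assumptions.

```python
import numpy as np

def itd_candidates(left, right, fs=44100, max_itd_s=1e-3, n_peaks=10):
    """Cross-correlate one left/right audio frame and return the lags of
    the strongest correlation values as ITD candidates, in seconds."""
    max_lag = int(max_itd_s * fs)
    corr = np.correlate(left, right, mode="full")
    mid = len(left) - 1                        # index of zero lag
    window = corr[mid - max_lag : mid + max_lag + 1]
    strongest = np.argsort(window)[::-1][:n_peaks]
    return (strongest - max_lag) / fs          # signed lags in seconds
```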
Our generative model is based on the fact that both the audio and the visual
observations are mappings from 3-D space: to a 1-D space in the auditory case
and to a 3-D space in the visual one.
So S is the coordinates of an AV object out in the world, for instance a
speaking person, and the auditory observations GK are produced by this
well-known function, which is simply sound propagation in space, where SM1 and
SM2 are the positions of the two microphones.
They are localized in visual space. And similarly we have the 3-D visual
disparities. Here I just copied classical things: the UVD are given by this
projective mapping, where B is the distance between the two cameras.
This corresponds to a rectified camera pair, but it generalizes to any other
camera configuration. With this, the auditory and visual data are put on an
equal footing; we do not privilege either of the two modalities.
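Here is a minimal sketch of the two mappings just described, under the
assumptions that the stereo pair is rectified and that the ITD is the
difference of propagation delays to the two microphones. The camera parameters
and the speed of sound are placeholder values, not the ones used in the talk.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def f_visual(s, focal=1.0, cu=0.0, cv=0.0, baseline=0.2):
    """Map a 3-D point s = (x, y, z) to (u, v, d): pinhole projection
    into the left image plus the disparity d = focal * B / z of a
    rectified pair with baseline B."""
    x, y, z = s
    return np.array([focal * x / z + cu,
                     focal * y / z + cv,
                     focal * baseline / z])

def g_auditory(s, m1, m2):
    """Map a 3-D point s to an ITD: the difference of the propagation
    delays from s to the two microphone positions m1 and m2."""
    s, m1, m2 = (np.asarray(a, dtype=float) for a in (s, m1, m2))
    return (np.linalg.norm(s - m1) - np.linalg.norm(s - m2)) / C_SOUND
```

Note how the calibration remark that follows drops out of the same equations:
if the source S is known, g_auditory becomes a constraint on m1 and m2.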
You may also notice that this is very useful for calibrating the
camera-microphone setup, because if you know, for instance, the location of an
acoustic source S, then the unknowns become SM1 and SM2, and you can very
easily estimate the positions of the microphones in camera space.
And this is in fact what we did. Okay. So now we have a multiple-speaker
model. The unknown parameters of the method are the speakers' positions; there
can be an arbitrary number N of speakers. And because we do not know the
categories, we do not know to which category each observation belongs, we have
to introduce latent variables, which are the hidden data I mentioned before.
I will use A for the hidden variables of the video part and A-prime for the
hidden, or assignment, variables of the audio part. This is classical
notation: the variable AM equal to N means that visual observation M is
assigned to speaker N. We also introduce an outlier class, N plus 1, and we
have exactly the same thing for the auditory part.
The likelihood model that we use is also quite classical. We consider a
Gaussian model for the likelihood of an observation belonging to a speaker,
whether it is a visual or an auditory observation. We also tried the
t-distribution, and in fact the methodology we present is completely
independent of the choice between the normal and the t-distribution.
And we added an outlier model: a uniform distribution. As opposed to the
Gaussian or the t-distribution, the uniform component of the mixture model is
in our case treated as a nonparametric distribution.
At the end we have the model parameters that I wrote down here. In the
classical mixture model you have to deal with the means, but here the means
are replaced by the speaker positions. Because the visual data is in 3-D, we
have three-by-three covariance matrices, and for the auditory data we have
scalar variances.
So this is the parameter set that we have to determine. If you formally derive
the maximum likelihood in the presence of the hidden variables, you end up
with this quadratic form. What is interesting to remember here is that the two
modalities, auditory and visual, are linked through the two terms of this
maximization problem -- maybe I should use this.
In the first term appears the function that maps, let's say, a speaker or AV
object onto the visual observation space, and in the second the function that
maps the same AV object onto the auditory observations. So SN is the parameter
that links the two terms of this expected complete-data likelihood. There are
other terms, due to the uniform distribution, that can be omitted because they
are constant with respect to the parameters over which we take the
maximization or the minimization.
The hidden variables disappear, but they are represented by their
probabilities: these alpha-MN and alpha-prime-KN stand for the classical
posterior probabilities of a visual or auditory observation belonging to
speaker N or being an outlier.
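To make the structure of this criterion concrete, here is a plausible
reconstruction in LaTeX, assuming the Gaussian case. The notation follows the
talk (F and G for the two mappings, s_n for the speaker positions, alphas for
the posteriors), but the exact constant and normalization terms on the slide
may differ.

```latex
\max_{\{s_n,\,\Sigma_n,\,\sigma_n\}}
  -\tfrac{1}{2}\sum_{m=1}^{M}\sum_{n=1}^{N}\alpha_{mn}
    \Big[(f_m - F(s_n))^{\top}\Sigma_n^{-1}(f_m - F(s_n)) + \ln|\Sigma_n|\Big]
  -\tfrac{1}{2}\sum_{k=1}^{K}\sum_{n=1}^{N}\alpha'_{kn}
    \Big[\frac{(g_k - G(s_n))^2}{\sigma_n^2} + \ln\sigma_n^2\Big]
```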
Now, we have to take some distance from the classical EM algorithm, mainly
because of these nonlinear functions F and G. The effect of this nonlinearity
is twofold. First, when you take the derivatives of this criterion with
respect to S, the maximization becomes a nonlinear problem, and it is no
longer possible to carry out the minimization in closed form, as in standard
EM, where the cluster means and the covariances have separate closed-form
updates.
So we have to perform this minimization all at once, over the entire parameter
set. Although the E step is exactly the same, the algorithm becomes a
generalized EM (GEM), where the M step consists of one step of a nonlinear
minimization, Newton-Raphson for instance, and this replaces the closed-form
solution of the standard M step.
There are theoretical results showing that this GEM algorithm has the same
convergence properties as EM, in the sense that it improves the likelihood at
each iteration exactly like EM does.
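Here is a minimal sketch of one such GEM iteration, reusing the hypothetical
f_visual and g_auditory mappings from the earlier sketch. It assumes spherical
covariances, fixed outlier densities u_f and u_g, global microphone positions
M1 and M2, and replaces the Newton-Raphson step with a single finite-difference
gradient step; the actual implementation is certainly more refined.

```python
import numpy as np

def posteriors(obs, centers, var, outlier_density):
    """E step for one modality: posterior probability that each
    observation belongs to each cluster, plus a uniform outlier class."""
    k = obs.shape[1]
    d2 = np.square(obs[:, None, :] - centers[None, :, :]).sum(-1)
    lik = np.exp(-0.5 * d2 / var) / (2.0 * np.pi * var) ** (k / 2.0)
    lik = np.hstack([lik, np.full((obs.shape[0], 1), outlier_density)])
    return lik / lik.sum(axis=1, keepdims=True)

def gem_iteration(S, var_f, var_g, f_obs, g_obs, u_f, u_g, step=0.1):
    """One generalized-EM iteration: exact E step, then a single
    gradient step on the speaker positions S (no closed form exists,
    because F and G are nonlinear). g_obs is a K-by-1 array of ITDs."""
    F = np.array([f_visual(s) for s in S])
    G = np.array([[g_auditory(s, M1, M2)] for s in S])
    a = posteriors(f_obs, F, var_f, u_f)      # visual responsibilities
    b = posteriors(g_obs, G, var_g, u_g)      # auditory responsibilities

    def neg_q(x):                             # criterion as a function of S
        S2 = x.reshape(S.shape)
        F2 = np.array([f_visual(s) for s in S2])
        G2 = np.array([[g_auditory(s, M1, M2)] for s in S2])
        qf = (a[:, :-1] * np.square(f_obs[:, None] - F2).sum(-1)).sum() / var_f
        qg = (b[:, :-1] * np.square(g_obs[:, None] - G2).sum(-1)).sum() / var_g
        return qf + qg

    # generalized M step: one finite-difference gradient step on S
    x, grad, eps = S.ravel().copy(), np.zeros(S.size), 1e-5
    for i in range(S.size):
        x[i] += eps; up = neg_q(x)
        x[i] -= 2 * eps; down = neg_q(x)
        x[i] += eps; grad[i] = (up - down) / (2 * eps)
    return (S.ravel() - step * grad).reshape(S.shape), a, b
```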
And so basically the algorithm that we implemented starts by initializing the
number of clusters and the cluster centers.
This is a very crucial issue, and currently we do not have a nice, elegant
solution to the problem of initializing the number of clusters. We would like
to consider the general case where people come in and go out and the number of
AV objects in the scene changes over time. We experimented with face detection
to initialize the cluster centers, but it does a very bad job, in fact,
because as soon as someone turns his head away from the camera, face detection
is no longer effective.
So we are in the process of replacing this face detector that initializes the
algorithm with something more reliable, based on optical flow, on changes in
the images, et cetera. Yes?
>> In your objective function, can you explain what F is, actually? How do
you map the speaker location to the 3-D locations of the stereo data?
>> Radu Horaud: Yes, in fact I gave -- maybe I can come back --
>> Because the points can be almost anywhere, right? The 3-D points. So I'm
just wondering how you actually map this particular location to the visual
data that you have.
>> Radu Horaud: Yeah, I mean, I think I understand the question. Basically,
the function F will take a position S, in XYZ, and project it into UVD space.
So your observations are in UVD space, and what you would like to get is XYZ.
But there are many observations associated with a face; in our case these are
the interest points on the face.
So we assume that these should be clustered together, and at the end the
relevant information is the center of this cluster. The assumption that these
points are Gaussian is clearly not valid from a practical point of view, but
this is the only probabilistic model that we --
>> Do you know in advance how many objects you have in the scene?
>> Radu Horaud: Yeah, this is exactly what I was talking about just now. In
this case, yes. At time zero we start with a face detector, and we completely
rely on it: it says there are three faces, we take the centers of these face
detections as the cluster centers, and we say we have three faces, three
clusters. But this is clearly not very good. Yes?
>> How does it handle the (inaudible) outside (inaudible).
>> Radu Horaud: You will not see a lot. In fact, they are not AV objects, so
they will not be detected as such. You'll see in a while that they have an ITD
associated with them, as do other noises in the room, someone walking around,
et cetera.
They are detected in the sense that the associated ITD is present, but then
they are treated as outliers. As a matter of fact, it's a very interesting
question, because you'll see another example where someone speaks while he
walks. While I do this, you concentrate on my voice, your auditory attention
is on it, and you don't hear my steps. But the system will hear the steps, and
you will see that in this second example, which I'll show you in a while,
there are in fact two auditory sources.
>> On the microphones (inaudible).
>> Radu Horaud: I'll show you the setup in a while. In all the examples we
have two cameras and two microphones.
>> You mentioned the face detector does a poor job. And it looks like you
don't have any control; you don't do any tracking in association with --
>> Radu Horaud: I'll show you in a while. Maybe I'll just go through this and
then I will explain exactly how.
>> (Inaudible).
>> Radu Horaud: Yes, the audio -- what do you mean?
>> (Inaudible).
>> Radu Horaud: Yeah, yeah, it is. I will show it in a while. Sorry, maybe I
should have shown this --
>> There's different things. The function would be different (inaudible)
>> Radu Horaud: Yes, but this is handled by the auditory part. In fact, I
forgot to mention that this is done in collaboration with the Speech and
Hearing group at the University of Sheffield, and they have taken care of
this.
So for the ITDs that we get, they have a dedicated mapping function. Sometimes
we use a dummy head, and sometimes we also make recordings with two tiny
microphones on an actual person's head.
But, of course, they modelled this, and this is, I would say, done beforehand.
The way we treat the data is that we split each sequence, which is binocular
and binaural data, into time intervals, each about one-eighth of a second.
This roughly corresponds to three video frames. And this is why -- so right
now the face detector runs in a static manner, but we plan to use the frames
within each one-eighth-of-a-second interval to detect some local motion and
rely more on that feature than on face detection.
For roughly each time interval there are about 1,000 visual observations and
10 ITDs; that is, 1,000 interest points in 3-D and about 10 ITDs. As I said,
the visual observations are quite dense, whereas the auditory observations are
much sparser. In this case we take the 10 most prominent ITDs, but there are
many reflections and so on, so among the many ITDs only three correspond to
the speakers.
This is the kind of result that we obtain. Let me explain it a little bit.
I'll show you the two sequences in a while. In the first sequence, for
instance, there are about 166 time intervals, out of which only 89 have AV
information.
Some of them have only visual information, et cetera. And we have been able to
correctly detect 75 out of these 89. So here we have the rate of missed AV
objects, which is roughly .16. I think these are percentages, if I remember
well. I don't remember exactly.
>> (Inaudible).
>> Radu Horaud: Good question. No, I think it's .16%.
>> Eight or nine --
>> Radu Horaud: I'm sorry, yes, 16%. And these are the false detections, so
it's about 14%.
You're right. And in this case, as you will see in a while, this is the
walking example; we have many more false alarms, about 43%. I don't remember
exactly the reason, but it's basically because sometimes the person walks
without speaking, and these are detected as AV objects.
So what I'm going to show you here is a run of the EM algorithm on one time
interval; what you'll see are the iterations of the EM algorithm.
What is marked with a white dot is the AV object being detected: it's marked
white on the right camera and blue on the left camera. Please note that,
because of the presence of many interest points on the T-shirt here, although
we initialized this person with the head, eventually the cluster center is on
the T-shirt. So this is the situation in 3-D: these are the stereo data, and
superimposed on them are the locations of the three persons.
And the system is also able to say speaking/nonspeaking. So this is one
example. This is another example where only one person is speaking; in fact,
in this case it is the person in the middle who is speaking.
And although it was initialized on the face, the ITD in fact corresponds to
the voice, so eventually the position of the speaker is given by this, which
is not necessarily, let's say, the mouth, the actual acoustic source.
Okay. Maybe I will just interrupt the presentation to show you the same thing
with sound. The first part of the video goes through the algorithm. Maybe I
can just -- here you can see the one-dimensional auditory observations: each
bar corresponds to an ITD. Here, for instance, there are, I don't know, four
or five ITDs or something like this.
[Demonstration audio]
>> Radu Horaud:
We had someone else taping, so there's nothing.
[Demonstration audio]
>> Radu Horaud: So this is the case where in fact there are two auditory
sources, the voice and the steps. And the system tends to actually locate the
person in the middle of the two.
>> So I guess it would be a problem if there's two persons talking and walking
at the same time? Clustering might take them --
>> Radu Horaud: I think there was another question about whether we have a
temporal model. Currently we do not have a temporal model. What we do, we
simply run the EM algorithm within a time interval of about one-eighth of a
second, and then we take the output of this algorithm to initialize the next
step.
So it's a very simple temporal model; we do not actually have a proper
temporal model. What we show here is just our method running on a sequence.
So to answer your question: currently we just take very short time intervals
and process them in sequence. So it should be able, I think, to detect two
people walking at the same time, provided we have a good initialization at the
beginning.
For instance, if I start with a face detector that sees only one person and
there is a second one, then the system will think there's only one AV object
out there and it will try to find only one AV object.
So initialization is really a very, very important thing here.
We started to work on this topic about a year and a half ago, and these are
the results that are currently available. We discovered, in fact, that
initialization is very important. At the end I will say more about this.
So now just let me -- yes?
>> The steps, just taking the ITD, the steps actually -- there's no conflict
between this (inaudible) and the steps right there. So actually it's helping,
not hurting.
>> Radu Horaud: There is confusion because they correspond to the same ITD.
Yes? The feet are not seen in the image.
>> How are the microphones positioned?
>> Radu Horaud: Like this. Like this. And then it means that the same ITD
corresponds to two different AV objects.
>> So when it detects the person and the person is not talking, just walking,
it adds up for you?
>> Radu Horaud: Well, no. The system relies on initialization. So in this
case the system sees one face. So it starts by saying okay I have to detect
only one cluster, which means I have to detect only one AV object. So there
are ITDs that correspond to the voice, ITDs that correspond to the feet and
other ITDs.
>> You only have two microphones (inaudible) with the feet.
>> Radu Horaud: They would be the same. So the system is unable to say there
are two AV objects.
>> But the (inaudible) is different. It says (inaudible) I think it gives us
(inaudible), and then when he steps, it gives a step and the foot position. So
then the average (inaudible) the same audio observation, audio sources, but
gives a different visualization.
>> (Inaudible) in fact, whether it's moving or not.
>> Radu Horaud: No.
>> Because the vision point is here. The vision point is on the body. The
visual input -- the vision point is over here in the middle area.
>> Acoustically speaking, if the steps were coming from his mouth, he could
not --
[MULTIPLE SPEAKERS]
>> The mouth is the input, the visual data.
>> Radu Horaud: The reason it marks the person in the middle is that it's
roughly halfway between the feet and the mouth.
>> I think it's just because of the features, getting features from the
shirt --
>> Radu Horaud: Maybe a combination, yeah. But, well, I mean, an ITD gives you
a direction, okay? And within this direction we initialized the cluster center
with a face, although it seems to prefer to go down; I don't know why,
exactly. But if I had told the system at initialization, hey, there are two
auditory sources, there are two AV objects out there, it would anyway fail to
find one of the objects, because there is no visual data associated with it.
Anyway, now I will show a little bit of how the data were gathered. One point
here was to record this AV data from the perspective of an active speaker. The
dataset that is available on this website is described in a paper that will
appear soon at ICMI this year.
It also provides associated software for audio feature extraction and video
feature extraction, and maybe in the future we will also put the audiovisual
fusion algorithm alongside it. I don't know if this -- but basically this is
the setup. This is a dummy head with two microphones, and we have a helmet on
which we put the two cameras. So there are two possibilities: either we use
this dummy head, or we put the helmet on someone's head, a real person, in
which case they wear the microphones in their ears.
And this device here is in fact associated with this camera, and it provides
the six-degrees-of-freedom motion of the head.
So we have stereo pairs at 25 frames per second, binaural audio at
44.1 kilohertz, and the head position and orientation at 25 per second as
well. These cameras and this one are perfectly synchronized.
And this is the general view of the system. At that time, because the auditory
software was running on a Windows computer and our software ran on a Linux
computer, we had difficulties synchronizing all the computers through NTP.
So for audiovisual synchronization we simply, systematically introduced a
clap. But I think now we're able to get around this problem and synchronize
all the computers so that we have synchronized data.
So I'm done with the first talk. The method does audiovisual fusion in the 3-D
domain, based on unsupervised clustering using the GEM algorithm, and we plan
to extend it to deal with a varying number of AV objects, meaning the number
of AV objects can vary over time: people coming in and going out, and not only
people but other kinds of AV objects.
One reason we chose the binocular-binaural setup is that it has strong links
with neurophysiologists who study audiovisual attention, which is in fact
quite a new field in neurophysiology: studying binocular and binaural
perception within the framework of attention.
A more sophisticated model should include eye and head motions. I don't have a
photograph here, but we've built a robotic head that has binocular eye
movements, and active perception and attention will be studied within this
framework. So this is the reason we chose the binocular-binaural setup.
So I'm done with the first talk. Let's see if we have more questions on this.
Yes, please.
>> I was hoping you could talk a little bit about the applications where you
see this being used.
>> Radu Horaud: Yeah, the application that we target here is robot-human
interaction. We'd like to mount such an audiovisual head onto a robot, so we
have a collaboration with a robotics company in Paris, France; it's a small
humanoid robot, and they're very interested in studying the social interaction
of this tiny robot with people.
The first task the robot will have to do is distinguish between, let's say,
people and other things in the room. From the visual point of view alone this
is very difficult, so we thought that combining audio and vision would be a
nice way to distinguish between, let's say, living objects and artifacts. So
this is maybe one of the motivations for doing this kind of work.
We also have a more theoretical motivation. I'm from a computer vision
background, and Zhengyou just told me that when he started to do audio he
thought it was easy, but in fact it's not easy at all. In some respects,
auditory people discovered the probabilistic setup much earlier than the
computer vision people did, but somehow they neglected some other things, like
the simple physical and geometric facts about auditory perception.
So when we started this and tried to put it together, we wanted to emphasize
these things.
>> I'm interested in the (inaudible) and this system (inaudible), for example,
usually EM, the first thing, because you cannot get communication with them.
But the key here, I'm just curious: for example, there was only one face. So
you get this still robust --
>> Radu Horaud: When I talk about initialization, there are two things. One
problem, with, let's say, Gaussian mixture models, is the number of
components.
>> The number of --
>> Radu Horaud: The number of components. So this is one initialization issue.
The other initialization issue is the initial values of the means and
covariances of the components.
We have started to address the number of components. In fact, we use the BIC
criterion to determine the number of components, just based on the auditory
sources, and we have very nice results there, although we have to run this
over a short period of time, like one or two seconds.
But, of course, this will give you the number of auditory sources, and over
time the number of auditory sources may change; someone else comes in, et
cetera. So it's not, I think, just a question of initialization, as you say. I
think it's a much more general issue: how do you allow such a system to all of
a sudden say to the small robot, some new person I haven't heard for the last
five minutes is coming into the room?
So these new observations, how do you take them into account? I think it goes
beyond properly initializing EM; it's a much more difficult, more general
problem.
>> My concern is about the audio: when you have multiple sources, it's really
difficult to localize them, for example, or to do any kind of source
separation. And here I was surprised to see that with only two microphones you
are able to localize two speakers who are active at the same time. So maybe
you are using some kind of speech cues to extract this, or something.
>> Radu Horaud: No, no, we just detect as many ITDs as there are out there.
And in this case we know in advance that we have to look for three speakers,
because we have a face detector at the start. Actually, I should not call them
speakers, because there are five speakers; I prefer to call them AV objects.
So we know that there are three AV objects out there, and the unsupervised
clustering technique that I described will consider four clusters: three for
AV objects and one for outliers. It turns out that this fourth component is
very important for what you are talking about. And the fact that visual
observations are systematically associated with auditory observations through
the generative model I showed you, which is the exact physics of the problem,
is, I think, very important. It leads to a generalized EM algorithm, but it
does a very good job of detecting several AV objects simultaneously.
Okay? If you are interested, send me an e-mail and I can send you a paper that
describes this in detail.
So now I switch to the other talk. Okay. Let's see; I don't know what the
resolution is. Maybe I should start with the -- so although it's a completely
different application, it boils down to EM as well. A very different instance
of EM, but EM nonetheless.
What I'm going to present here was in fact presented at CVPR two weeks ago in
Anchorage. The title of the talk matches exactly the title of the CVPR paper.
What we want to do here: we have two articulated objects. In this case, for
instance, we have a small wooden mannequin and a person. I'm showing you
images of them, but what we actually have are voxel data gathered with a
multiple-camera system.
What you see here is the voxel data for the first object and the voxel data
for the second object, and the output of the algorithm is the dense,
one-to-one matching between all the voxels in the first set and all the voxels
in the second set.
So why are we interested in shape matching? It turns out to be a fundamental
problem in computer vision: not only how do you describe shapes, but how do
you compare them. It's also useful for object recognition and indexing, motion
capture, and gesture recognition. There are also links with computer graphics,
for instance, where people are interested in shape morphing and texture
morphing as a way around classical motion capture techniques.
So all these applications and problems are relevant to the problem at hand,
which is shape matching. The presentation of the methodology will be much
shorter than in the previous talk, because I was supposed to give two short
talks.
The method that we've implemented goes as follows. As I said, we have two 3-D
shapes, represented by voxels in this work, but we have in fact extended it to
meshes, and they could also be 2-D shapes, for instance represented by clouds
of points or by silhouettes or whatever you like.
Because a shape is described not only by the positions of the points but also
by their local geometry and topology, in our case the 3-D shape is strictly
equivalent to a sparse graph representation. Therefore shape matching becomes
a problem of graph matching, and we use spectral graph theory and Laplacian
embedding to represent such graphs in an isometric space. This is the key
methodology that we've developed.
Once we do this, what is interesting is the following. We start with two
articulated shapes, and these two shapes do not correspond in the sense that
there is no single rigid transformation that maps one shape onto the other.
But with this piece of methodology, articulated shape matching becomes a point
registration problem; more precisely, a rigid point registration problem.
It is not rigid registration in the classical sense, though: the shapes are
projected not onto 2-D or 3-D space but onto a higher-dimensional space, whose
dimensionality corresponds roughly to the complexity of the shape. This rigid
point registration problem is then solved in a probabilistic framework, as
maximum likelihood with missing data. You can look at the CVPR paper if you
want more details.
The central idea of the method is to embed a sparse graph into an isometric
space. Basically, this is a graph that I'm representing here, and the nodes of
this graph become points in a space spanned by the eigenvectors of the matrix
associated with the graph.
In classical spectral graph matching, the matrix that represents the graph is
the adjacency matrix. A number of people have addressed the graph matching
problem by looking for a permutation matrix between the two graphs. This is
written analytically as the expression at the top, where -- I lost my --
Okay. Let's say that the two graphs are represented by matrix A and matrix B.
P is a permutation that takes one matrix and maps it onto the other, and the
problem is to find the P that minimizes the Frobenius norm of the difference
between these two matrices. Spectral graph matching is a very nice method
that, to my knowledge, was introduced by Umeyama; I think the paper is from
1988, so it's 20 years old.
He showed that if the eigenvalues of the adjacency matrices are distinct and
can be ordered, which is a very strong assumption, then the problem becomes a
point matching problem, as opposed to the matrix alignment problem here. The
nodes of the two graphs become points, the X's for the first graph and the Y's
for the second, and if you are able to order the eigenvalues, then the
matching problem is essentially solved.
There are as many eigenvalues and eigenvectors as there are nodes in the
graph, and then all you have to do is look for the orthogonal matrix that
aligns the two clouds of points. So on paper the problem is solved like this.
I should also mention that there are other problems, due to the fact that when
you compute the eigenvectors there is a sign ambiguity; in practice it's not
that simple. The assignment itself boils down to the Hungarian algorithm,
which solves the bipartite matching problem and has cubic complexity in the
number of nodes N.
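As a concrete illustration, here is a minimal sketch of this spectral matching
idea in the spirit of Umeyama's 1988 paper, using absolute values of the
eigenvector matrices as one common way to sidestep the sign ambiguity. This is
a simplified reading of the method, not the exact algorithm from either paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def spectral_graph_match(A, B):
    """Match the nodes of two same-size graphs given their symmetric
    adjacency matrices A and B. Embeds each graph by its eigenvectors,
    takes absolute values to remove the sign ambiguity, and solves the
    resulting bipartite assignment with the Hungarian algorithm, O(N^3).
    Assumes distinct, consistently ordered eigenvalues."""
    _, Ua = np.linalg.eigh(A)   # eigh sorts eigenvalues in ascending order
    _, Ub = np.linalg.eigh(B)
    similarity = np.abs(Ub) @ np.abs(Ua).T           # node-to-node similarity
    rows, cols = linear_sum_assignment(-similarity)  # maximize similarity
    return cols   # cols[i] = node of A assigned to node i of B
```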
>> So X (inaudible).
>> Radu Horaud: Sorry.
>> Sorry.
>> Radu Horaud: The nodes of the graph become points in a Euclidean space. So
these are X1, X2 for one graph.
Now, for various reasons that I won't detail here, one has to use the
Laplacian matrix rather than the adjacency matrix; it is a much richer
description of the geometry and local topology of the graph.
This is how the Laplacian matrix is built. First of all, let's define WIJ: if
node I belongs to the neighborhood of node J, you take the Euclidean distance
between I and J and pass it through a Gaussian kernel, with some parameter
that you have to specify; and WIJ is zero otherwise. So clearly this acts only
locally on the point set. The diagonal terms D are then computed as the sum
over each row; the matrix is symmetric.
From this you can build the Laplacian matrix. Another important point is that
rather than taking all the eigenvalues, you skip the first one and then take
the first K eigenvalues.
K is the dimension of the space into which you are going to project the graph;
there's no reason for K to be 2 or 3, it can be any number. And then the
question is: are the K smallest eigenvalues of the Laplacian matrix distinct,
and if they are distinct, can they be ordered?
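Here is a minimal sketch of this construction, assuming the unnormalized
Laplacian L = D - W (the talk does not say whether a normalized variant is
used) and a precomputed neighborhood list, e.g. the 26-connected voxel
neighborhood mentioned later.

```python
import numpy as np

def laplacian_embedding(points, neighbors, sigma=1.0, k=5):
    """Build the Gaussian-weighted graph Laplacian described in the talk:
    w_ij = exp(-||x_i - x_j||^2 / sigma^2) when j is a neighbor of i and
    0 otherwise, d_ii = sum_j w_ij, L = D - W. Returns the embedding
    given by the k smallest nontrivial eigenvectors (the first,
    near-zero eigenvalue is skipped)."""
    n = len(points)
    W = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            d2 = np.sum((points[i] - points[j]) ** 2)
            W[i, j] = W[j, i] = np.exp(-d2 / sigma ** 2)
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)       # ascending eigenvalues
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]
```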
This is just an example of a voxel set that contains about 20,000 voxels, so
there are 20,000 eigenvalues and 20,000 eigenvectors. This is the shape
projected onto a K-dimensional space. What we show here are the values of the
eigenvalues, and you can see (this is also flattened by the resolution) that
for the first 10 or 20 eigenvalues you have to go to the fifth significant
digit to rank them. Analyzing the geometric and algebraic multiplicities of
this kind of matrix might be a topic in its own right, but we didn't want to
do this. So the question is how to get around this problem, because you cannot
apply classical spectral graph matching anymore: the eigenvalues cannot be
ordered.
Why is this important? Because the ordering of the eigenvalues gives you an
ordering of the eigenvectors. If on one side you have XYZ, let's say, and on
the other side you have YXZ, then it is not possible to compare the two
shapes, because the transformation between them is not isometric.
So this is why it's important to --
>> Is it essentially zero?
>> Radu Horaud: In theory -- sorry. In theory, the smallest eigenvalue is
zero, so you have to skip it. The other eigenvalues are not zero, but in
practice they are very small.
>> D is what?
>> Radu Horaud: D is this matrix.
>> Sorry.
>> Radu Horaud: W is this matrix, and D is this matrix. And then you take the
eigenvalues and eigenvectors of matrix L.
>> So DIJ is just the Euclidean distance?
>> Radu Horaud: You can take any distance. You can take the -- you can
take --
>> Zero. D is the (inaudible) matrix.
>> Radu Horaud: Sorry, there is confusion in the notation: this and this are
two different things. Sorry.
>> DIJ on the top. I assume you --
>> Radu Horaud: You can use any distance. You can use the geodesic distance or
the Manhattan distance, for instance; voxels are regularly spaced, so you can
use the Manhattan distance. You can use any distance; you can use the L1 norm.
>> Because you've --
>> Radu Horaud: You can use the path, right, between two voxels: shortest
paths between two voxels.
>> Your eigenvalues, your eigenvalues would depend on how you define the
distance, right?
>> Radu Horaud: Maybe, yeah, yeah. As I told you -- yeah, sure. Maybe.
>> I assume that, for example, if you just use this distance, it doesn't take
anything away; you should still be relatively (inaudible) the
three-dimensional information in terms of XYZ. If everything just --
>> Radu Horaud: In fact, with this representation -- yeah, intuitively you may
think of each node as having a location in space, XYZ. But in fact we do not
use this. What we use is the relative distance between node pairs. So what the
Laplacian matrix contains is not the XYZ coordinates of the points but these
distances, okay? The entry at position IJ of the Laplacian matrix tells you
what the distance is between node I and node J, according to this definition.
>> How do you define the neighborhood, whether I is in the neighborhood of J?
What criterion do you use?
>> Radu Horaud: In the case of the voxels, I think we use -- sorry -- it is
the 26-neighborhood, in terms of voxels. It would be like an 8-neighborhood in
2-D.
>> So increasing or shrinking the neighborhood has some effect on --
>> Radu Horaud: Yes, of course. Of course, yes. The reason is that you would
like to be invariant to this kind of motion, right? With this pose and then
with this pose, ideally you'd like to have the same graph representation.
And it's a compromise: the size of the neighborhood is one parameter, and the
size of the kernel is another. It turns out that --
>> So the exponential function is (inaudible).
>> Radu Horaud: Yes. Yes. Yes. Anyway, did I answer your questions? Okay.
Okay. So the key innovation here, as opposed to this idea, is in how we order
the eigenvectors. Let's say we have two graphs; these two graphs are
represented by two sets of eigenvectors, and we'd like to order them
consistently, so that one, two, three match one, two, three. Again, the
details are in the paper, but note that what is called here an eigenfunction
is in fact an eigenvector. If you consider the histogram of an eigenvector's
components, it is very easy to see that the histogram is invariant to the
order in which you consider the eigenvectors. We simply noticed that the
histograms of the eigenfunctions, the eigenvectors, are a very reliable
signature of the eigenspace. So what I'm showing on the left are the
histograms of some eigenvectors associated with the first shape.
And these are associated with the second shape, shown ordered by eigenvalue.
You may notice that although the first two eigenvalues induce the correct
ordering, here there is a problem, because this one should match this one and
this one should match this one. Also notice that this histogram is reversed:
it is the mirror image of this one, which is simply due to the fact that when
you compute an eigenvector you don't know whether you should take the
eigenvector or its negative.
So by comparing these histograms you can simply find a one-to-one mapping
between the eigenvectors of the first graph and the eigenvectors of the second
graph, and maybe disregard some of them; for instance, this one is disregarded
and this one is disregarded.
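Here is a minimal sketch of this histogram-based alignment, assuming fixed
histogram bins and an L1 distance between histograms; the paper may use a
different distance and a different way of rejecting unmatched eigenvectors.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def signature(evec, bins=50, rng=(-0.05, 0.05)):
    """Histogram of an eigenvector's components: invariant to the order
    of the eigenvectors; a sign flip simply mirrors it."""
    h, _ = np.histogram(evec, bins=bins, range=rng)
    return h / h.sum()

def align_eigenvectors(U1, U2):
    """Match columns of U1 to columns of U2 by histogram comparison,
    testing both signs of each candidate to resolve the sign ambiguity.
    Returns the matching and the chosen sign for each matched pair."""
    k = U1.shape[1]
    cost, sign = np.zeros((k, k)), np.ones((k, k))
    for i in range(k):
        h1 = signature(U1[:, i])
        for j in range(k):
            d_pos = np.abs(h1 - signature(U2[:, j])).sum()
            d_neg = np.abs(h1 - signature(-U2[:, j])).sum()
            cost[i, j] = min(d_pos, d_neg)
            if d_neg < d_pos:
                sign[i, j] = -1.0
    rows, cols = linear_sum_assignment(cost)
    return cols, sign[rows, cols]
```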
>> So this represents one eigenvector?
>> Radu Horaud: This is the histogram of one eigenvector. And an eigenvector
has as many components as there are voxels in the graph; with 20,000 voxels,
an eigenvector has 20,000 components, right?
>> So the way the axis is, it depends --
>> Radu Horaud: Sorry?
>> The axis, the start axis.
>> Radu Horaud: It's a histogram.
>> But what does 200 mean? 200 what? 200?
>> It's components?
>> Radu Horaud: Yeah, these are the components.
>> You have 20,000 components, right?
>> Radu Horaud: I think simply --
>> Is it the index --
>> Radu Horaud: No, these are the numbers of components in each histogram bin:
the number of components that take a given value. So here, for instance, it
means there are 125 components that take this value.
>> The eigenvector and dataset (inaudible) this data set.
>> So the one to a thousand, that's the value, but the horizontal one is
counts?
>> Radu Horaud: Yeah.
>> Because sometimes you would do it the other way around.
>> Radu Horaud: Okay.
>> For the range of the small eigenvalues, how do you know they are related to
the motion and not to noise?
>> Radu Horaud: That's a good question. At some point, again, you have to
start with something. You have 20,000 eigenvalues; we take the 20 smallest
ones and assume that the eigenspace we want to analyze lies within this
20-dimensional space.
Then we take all the eigenvectors, so 20 eigenvectors on one side and 20
eigenvectors on the other side. We start by ordering them by eigenvalue, then
we compute the histograms, compute the distances between the histograms, and
match them. This is an algorithm in N to the power 3.
Okay, so for 20 it's not that expensive; it's quite fast, in fact. It gives
you the number of histograms that match, and then we put a threshold on the
quality of the match. This gives you the final dimension. In the example I'm
showing you, the dimension chosen is 5. But I agree that this is quite ad hoc,
and we are trying to figure out how to do this analysis in a more systematic
way.
On the theoretical side, if you would like to understand the properties of
very large matrices with multiple eigenvalues and eigenvectors, et cetera:
there are algorithms that give you the exact characteristic polynomial of a
very large matrix, for instance, but factorizing this polynomial to find the
multiplicities of the eigenvalues can take a week of computation. So it's very
expensive, very costly.
So we decided to skip all this theoretical part and do something more
pragmatic. But I agree that there are some problems.
>> It also depends on the amount of motion and what's your --
>> Radu Horaud: We do not have any motion here. We just take one position and
then another position.
>> No motion?
>> Radu Horaud: No, no, you take one position and then another position and
you try to match them.
So now -- okay. We did two things: we reduced the dimensionality of the
problem and we aligned the two eigenbases. The eigenfunction alignment chooses
the dimension K of the embedding, which is in the range six to ten; in fact,
five in this example.
And now we have two sets of points with different cardinalities: there is no
reason for the two voxel sets to have exactly the same number of points.
And because there are so many points, there will be multiple matches and
missing shape parts. For instance, in one position I'm like this and in the
other position I'm like this; this shape part will then be missing from one
dataset.
And there is noise, there are outliers, et cetera, that we have to handle. So
now we have a way to choose the embeddings, choose the dimensionality, and
obtain an initial alignment of the eigenbases. However, we are still left with
the problem of multiple matches, missing shape parts, et cetera.
This is why we are now dealing with a point registration problem, and -- okay,
I'm going to skip this. Again, we are in the framework of maximum likelihood.
More specifically, because of the presence of missing data, we cannot directly
solve the maximum likelihood problem; here the missing data are the
assignments between the two voxel sets.
Let me show you what the system is supposed to do. Say we have two sets of
points. The first thing we do is apply a K-by-K isometric mapping, in this
case a rotation: a K-by-K orthogonal matrix with determinant equal to 1,
applied to the first set.
This set then becomes transparent and is overlapped on the second set. Each
point in the second set is the mean of a cluster, and we have a covariance
that represents the size and shape of the cluster. Then we have to decide
whether the points from the first set are inliers or outliers; sometimes there
are multiple matches, or empty clusters. These are the kinds of things we want
to resolve.
So we cast this robust matching problem in the framework of density
estimation. Unlike many techniques in point registration, we treat the two
sets in a nonsymmetric way: one point set, in general the largest one,
corresponds to the observations, and the other to the cluster centers.
The problem is then to assign the observations to the M clusters. And also, as
in the previous talk, we have to choose the distribution for the noise
component of the mixture. Some people add a Gaussian component as an outlier
cluster, which is not really efficient from a practical point of view.
I don't want to go into the details here. Other people don't add an outlier
component but simply use a mixture of t-distributions, because it has the
reputation of being more robust to outliers. We decided, as in the previous
talk, to add a uniform component to the mixture. And therefore, without going
into detail, this is exactly the same kind of modeling, and it goes through
the same kind of formulation.
So point registration now becomes an algorithm where we initialize the
transformation Q and the covariances sigma. Thanks to the eigenfunction
alignment that we did before, Q is initialized to the identity matrix; I will
show you in the results that this gives a very nice initial alignment.
Then all you have to initialize are the covariances, and you can take them
very large. The E step computes the posterior probabilities, and the M step
estimates the transformation and then the covariances. I don't want to go into
detail, but this is an important difference from classical EM, because you
cannot compute the transformation independently of the covariances. So
formally it is not EM; it's ECM, Expectation Conditional Maximization, because
the maximization over the transformation Q is conditioned on the current
values of the covariances, et cetera.
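Here is a minimal sketch of this registration loop, under simplifying
assumptions: spherical covariances (as in the examples shown), a fixed uniform
outlier density, and a closed-form weighted-Procrustes update for the rotation
given the current responsibilities. The constants and the exact
conditional-maximization schedule are illustrative, not the paper's.

```python
import numpy as np

def ecm_register(X, Y, n_iter=50, var0=1.0, outlier_density=1e-3):
    """Rigid point registration by ECM with a uniform outlier class.
    X (n x k) are observations, Y (m x k) are cluster centers, R is a
    k-by-k rotation initialized to the identity (the eigenfunction
    alignment has already roughly aligned the embeddings)."""
    n, k = X.shape
    R, var = np.eye(k), var0
    for _ in range(n_iter):
        # E step: posteriors over clusters plus the uniform outlier class
        d2 = np.square((X @ R.T)[:, None, :] - Y[None, :, :]).sum(-1)
        lik = np.exp(-0.5 * d2 / var) / (2.0 * np.pi * var) ** (k / 2.0)
        post = np.hstack([lik, np.full((n, 1), outlier_density)])
        post /= post.sum(axis=1, keepdims=True)
        w = post[:, :-1]                    # inlier responsibilities
        # CM step 1: rotation given the variance (weighted Procrustes)
        U, _, Vt = np.linalg.svd(X.T @ w @ Y)
        S = np.eye(k)
        S[-1, -1] = np.sign(np.linalg.det(U @ Vt))  # enforce det(R) = +1
        R = (U @ S @ Vt).T
        # CM step 2: variance given the new rotation
        d2 = np.square((X @ R.T)[:, None, :] - Y[None, :, :]).sum(-1)
        var = (w * d2).sum() / (k * w.sum())
    return R, var, post
```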
Just to show you an example -- whoops. So this is a simulation. Those are the
cluster centers, the green dots are the observations, and the red dots are
outliers. This is a random initialization, and you may notice that sometimes
the outliers are much closer to a cluster center than the inliers, et cetera.
What I will show you now is the evolution of the EM algorithm in this case.
In this example the inliers are not corrupted by any noise, which is why we
obtain a perfect match at the end. And you'll notice that all the outliers
were disregarded, although in some cases they're very close to the inliers;
there are approximately 15 inliers and 10 outliers, and they went to the
uniform component of the mixture model.
We tried this with lots of cases, lots of noise, and it works very nicely.
So now this has to be applied to our problem. This is the video that shows the
technique. These are -- sorry -- six images of the object in one position, and
these are six images of the same object in a different position.
In one position the hands are like this, and in the other they touch each
other, so there's a topological difference in the shape. One reason we wanted
to show this example is to show how such an algorithm handles these
topological changes.
This is a voxel reconstruction of the first shape, and this is the voxel
reconstruction of the second one. And then these are the embeddings; on the
left we show the eigenfunctions that correspond to various eigenvectors of the
embeddings.
Sometimes the embeddings have very weird shapes. This is the initial -- maybe,
I don't know, maybe I stopped it too... sorry, maybe I should go back. Yeah,
so this is the alignment as provided by the eigenfunction alignment that I
showed you; the EM algorithm starts here.
These are various projections in three-dimensional space. The results are
shown independently, because otherwise, with dense matching like this, it's
very difficult to see. But believe me, the matchings are correct, and the
colors help you a lot.
This leg goes to this leg; they're crossed, et cetera. Here's a nice example
with a hand, where one hand is open and in the other one a finger is bent, I
think, and touches another one.
This is the example I showed you at the beginning, with two different objects,
an artifact and a person. And this is a sequence where someone actually runs.
Again, we don't need any temporal model, because the method can handle
topological changes and very different articulated poses of the same shape. In
fact, this model doesn't need any initialization.
So, five minutes -- too late, anyway. The method we propose embeds articulated
shapes in a K-dimensional Euclidean space, and what is nice here is that K is
not chosen in advance; it depends on the complexity of the shape.
Typically, for people, the dimension is of the order of five or six. Matching
consists of two steps: aligning the two K-dimensional eigenbases based on the
histograms I showed you, and then finding the optimal transformation that
aligns the largest subsets of points between the two sets, which boils down to
a rigid point registration problem.
And it handles missing shape parts, noisy data, and bad points, which are
explicitly taken into account by a carefully designed uniform component and
the associated EM algorithm. I just want to mention that the dimensionality
remains a problem, because if you have very complex shapes you may want K
equal to 10 or 15 or something like this, and then computing covariance
matrices in dimension 15 -- I don't know if you're familiar with EM, but at
some point you have to compute the determinants of these matrices, and in
dimension 15 you get terms like a power of 15 over 2 in the covariances.
So it can be a problem, a numerical problem. In all the examples I showed you,
we used spherical covariances. If you want to use nonspherical covariances,
the EM becomes much more complex, because you run into a nonconvex
optimization problem. I don't know if you know the classical algorithms that
align point sets in 3-D, for instance; they use quaternions, and in
K-dimensional space the quaternion representation is no longer possible.
Moreover, because of the presence of the Mahalanobis distance instead of the
Euclidean distance, you end up with a nonconvex optimization problem. So it's
not that simple to go from spherical covariances to nonisotropic covariances.
Okay.
>> Zhengyou Zhang: Thank you very much.
>> Radu Horaud: Just to mention I have a demo. So if you are interested to
see this working on my computer, I can show you.
(Applause)