&gt;&gt; Misha Bilenko: So today we have Dominique Perrault-Joncas coming across the
frozen lake all the way. Dominique has gone to McGill before he came to University of
Washington. He recently won the Birnbaum prize from this department and he will tell
us today all about manifold preservation and metric learning.
&gt;&gt; Dominique Perrault-Joncas: Thank you, Misha. So I would first like to thank Misha and
[inaudible] for organizing this talk and for inviting me here. I am going to talk today about
metric learning and preserving the intrinsic geometry. This is joint work with Professor
Marina Meila at the University of Washington, who is my PhD advisor.
A brief overview. I will give some background on the problem. I will cover some
differential geometry theory just to get you acquainted with the framework, because I
think it is good to be reminded of it even if you are somewhat familiar with differential
geometry. I will discuss the discretized problem and the algorithm, and then I am going
to show some examples and applications. I will give a brief summary and discuss some
future research, and maybe I will go through the consistency of the push forward metric,
but maybe not; I am not sure that that is the most important thing to do.
So the problem we are trying to tackle here is the curse of dimensionality. There is an
abundance of large data sets that are high dimensional, and this leads to problems in
terms of interpretation, computation and analysis. So the idea is to try to reduce the
dimension of the data set, and underlying this is the idea of finding some low dimensional
representation of the data that will preserve all of the important information, or at least
most of it. So that is the goal today.
So I will start with giving you a toy example. It is going to kind of follow us through the
talk and so it is worth going over what this example is. So effectively what we are going
to be looking at are images, so it is going to be faces, gray faces, 64 x 64 pixels and so,
and we're going to have about 700 or so of those images. We have a fairly high
dimensional data set. I mean it is still small by today's standards but I think it captures
the idea, but this data set has some very interesting attributes, which is that it represents
one specific face that is only moving right to left and up and down. So in this data set
what we have implicitly is only 2 degrees of freedom and so we are going to try to
leverage this attribute of the data set to help us in our analysis.
&gt;&gt;: So I understand, it is all images of the same face taken from different points of view?
&gt;&gt; Dominique Perrault-Joncas: Let me show you the data set itself. So this is one
embedding of this data set. Every single ellipse on here is one of the faces and I am
showing you a subset of those faces. So it is always this particular face as being rotated
one way or the other or that is tilted up or down.
&gt;&gt;: So the difference is just a point of view of the camera with respect to the face?
&gt;&gt; Dominique Perrault-Joncas: Exactly. It is either the head or the camera depending on
how you want to see it and we have 2 degrees of freedom even though in terms of
representation every image is 64 x 64. The idea is to effectively find an embedding like
this one and try to use this embedding to do analysis or to do some interpretation. So in
this case it is very easy, and we found that one dimension represents the angle of the head
in terms of left and right and the other dimension represents the angle of the head from
up-and-down.
So, so far so good, and actually there are a lot of embeddings that exist. I am showing you
a subset here, and they all do various things. The question then is which one makes the
most sense. The underlying idea for dimensionality reduction, as I said before, is to
assume that the data lies on a low dimensional smooth manifold that lives in a high
dimensional space: the high dimensional space is the space in which the data is observed,
and the low dimensional manifold is what the data set would look like if we had every
point. And the idea is to recover this. Now there exist standard techniques such as PCA,
principal component analysis, that make the assumption that the manifold is a linear
space, and so they just project onto a linear space. But in practice this is often violated.
The data set will curve and twirl and do many things in high dimensional space, and this
has led to nonlinear dimensionality reduction, which means trying to find how the data
set curves in the high dimensional space and trying to take that into account when you
project it into the low dimensional space.
So there exist already many algorithms to do that. I have already shown three of them,
and this list is certainly not exhaustive, but it covers some of the important ideas that
have been developed in this field. So locally linear embedding is a fairly--well, they all
work with the same idea, which is that you use local information and you try to propagate
it in such a way as to recover the full manifold. What locally linear embedding does,
effectively, is try to express every point as a linear combination of its neighbors, and it
uses that to recover a lower dimensional representation that preserves the local weighted
average. The Laplacian eigenmaps and the diffusion map work slightly differently: what
they do is define a random walk on the set of points and then use that diffusion process,
so to speak, to define an embedding. So it does an eigendecomposition of the random
walk, or the diffusion process, and uses this eigendecomposition as the embedding.
And then you have other algorithms like local tangent space alignment, which is based on
the idea of doing local PCA and then patching those PCAs together into one embedding,
or Isomap, which tries to preserve a very specific geometrical object, the geodesics:
distances on the manifold are what you are trying to preserve when you do Isomap. But
there are known problems with these methods. Often they will fail to recover the
geometry, and there exists no formal framework to compare them. I showed you a few,
but there really is a very, very long list of methods, and it is not clear when you start
which one makes sense to use and why. So to answer those two problems, what we first
have to do is actually define what it means to recover a low dimensional manifold.
For example, if I use a very simple, classic example of a manifold, the Swiss roll with a
hole in the middle of it, we have two potential embeddings. This would be the Isomap
and this would be local tangent space alignment, and it is not clear from this evidence
which one is correct. We kind of know that this one is probably wrong in some sense, but
we don't have a framework to say whether this one is correct or not. So underlying all of
this we are assuming that recovering the geometry is an important goal, and we know that
this is possible: if there is a manifold, we have theoretical guarantees that there exists at
least one embedding of the data. If we have a manifold of dimension D, we have a
guarantee that there exists an embedding of this object into a space of dimension 2D;
effectively, any manifold of dimension D you can embed in R^2D (this is the Whitney
embedding theorem).
So we know that there is a way. We know that there is an option for us. But rather than
try to do what a lot of manifold embedding methods are doing, which is to try to find the
proper embedding that preserves what they care the most about, we are going to try to go
around this and say well, is there a way to correct an embedding? Is there a way to say
well, I have distorted my space and I am going to recover the proper metric even though
my embedding is wrong? And this is called recovering the Riemannian metric. So to
formalize things a little bit, when I am talking about an embedding, I mean that I have
some manifold and I am finding a map from that manifold to a new manifold. That
means it is going to send every point from one manifold to a point on the other manifold,
or to the image of the map. It also means that, since it is an embedding, it is
differentiable, which means that it also maps points in the tangent space of the original
manifold to points in the tangent space of the manifold into which you are embedding,
which in our case is going to be Euclidean space.
This is important because geometry is defined in terms of the Riemannian metric, which
is just an inner product that is defined on the tangent space. So what that means is that if
you know the inner product on the tangent space for a given geometry, then you know
everything you care about. The idea for us, then, is to ask how to recover what the inner
product should be in the embedding so that it corresponds to the original geometry. To
drive home the point of what it means to have the Riemannian metric: it means that you
can take inner products, and that means I can do all of the usual geometry. I can compute
angles between vectors in the tangent space, which means that if I have two lines crossing
each other, I can say at what angle they are crossing. It also means I can compute the
length of a line on the manifold, and it means I can compute the volume of a subset of the
manifold or of the full manifold. So effectively I have the volume element, which is quite
important if you are trying to define, say, a Hilbert space on the manifold, which is kind
of the key to doing kernels for Gaussian processes. So if you have a space of functions
on your manifold and you want to use those functions to do any form of regression or
classification, what you actually need is the volume element. This is where we start
thinking a little more in terms of learning and statistics and less in terms of geometry.
The key idea is that all of this is encoded in the Riemannian metric.
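To make "all of the usual geometry" concrete, here is a minimal sketch in NumPy of how a quadratic form g at a point yields inner products, angles, and curve lengths (the volume element would be sqrt(det g)); the helper names here are hypothetical, not from the talk:

```python
import numpy as np

def metric_inner(g, u, v):
    """Inner product of tangent vectors u, v under the Riemannian metric g."""
    return u @ g @ v

def angle(g, u, v):
    """Angle between tangent vectors u and v measured with the metric g."""
    c = metric_inner(g, u, v) / np.sqrt(metric_inner(g, u, u) * metric_inner(g, v, v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def curve_length(g_at, path):
    """Length of a discrete path (list of points), where g_at(p) returns the
    metric at p; sums the metric norms of the straight segments."""
    total = 0.0
    for a, b in zip(path[:-1], path[1:]):
        v = b - a
        g = g_at((a + b) / 2)           # metric at the segment midpoint
        total += np.sqrt(v @ g @ v)
    return total
```

With the identity metric these reduce to ordinary Euclidean angles and lengths; a non-identity g at each point is exactly the corrected quadratic form discussed below.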
Before we go forward it is worth defining one more thing, which is isometry. If I have an
embedding or a map from one manifold to another, we say that the manifolds are
isometric if the map satisfies a simple condition: under the Jacobian of the map, the inner
products in both spaces are equal. So all we are saying here is that if you know your
inner product then you have defined your geometry, and the point of having an isometric
embedding is just a question of preserving the inner product. That means that, given an
embedding, I can try to define what is called a push forward: if I have the original
manifold and I am mapping it into a new space, and the new space has its own inner
product, how do I correct that inner product so that it is equal to the original one? And
this is effectively how you do it: by looking at the inverse of the Jacobian, or the
pseudo-inverse depending on whether it is full rank, so that the inner product in the
embedded space is now equal to the inner product in the original space. Yeah.
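As a rough sketch of this correction, assuming NumPy and that the Jacobian J of the embedding at a point were available (in practice it must itself be estimated), the push forward metric H is built from the pseudo-inverse of J so that inner products taken with H reproduce the original ones:

```python
import numpy as np

def pushforward_metric(J, g=None):
    """Push forward (corrected) metric H for an embedding with Jacobian J.

    J : (m, d) Jacobian of the embedding at one point
        (d intrinsic dimensions mapped into m embedding dimensions).
    g : (d, d) metric in the original coordinates (identity if None).
    Returns H (m, m) such that (J a)^T H (J b) = a^T g b for all a, b,
    i.e. the embedding becomes isometric once inner products use H.
    """
    m, d = J.shape
    if g is None:
        g = np.eye(d)
    Jp = np.linalg.pinv(J)      # (d, m) pseudo-inverse of the Jacobian
    return Jp.T @ g @ Jp        # rank-d quadratic form on the embedding space
```

Since J has rank d < m, H is a rank-d quadratic form: this is exactly why the corrected metric is drawn as a (possibly degenerate) ellipse at each embedded point.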
&gt;&gt;: Can you give an example of an isometric map?
&gt;&gt; Dominique Perrault-Joncas: I mean the identity map would be an isometric map.
You could unfold the--I don't know if that is what you were thinking of, but if you have
the Swiss roll you could just unroll it and it would preserve all of the geometry. It
wouldn't have the same curvature, but that is because the curvature you are losing is the
extrinsic curvature rather than the intrinsic curvature, so that would be an example. But
in effect, what I am doing will be to define isometries. Instead of trying to find the map
that creates the isometry, I went the other way around and asked: given a map, how do I
turn it into an isometry? What Riemannian metric must have existed on the manifold into
which I am mapping, so that now it is isometric?
&gt;&gt;: Why does that apply to, let's say, images, like a collection of images?
&gt;&gt; Dominique Perrault-Joncas: Because if you are mapping into Euclidean space,
two-dimensional Euclidean space, the Euclidean space has its own metric, which is the
identity metric, and that metric no longer preserves the inner product in the embedding.
The idea is to ask what the metric in the Euclidean space should have been so that it now
preserves the inner product. When you are doing an embedding there is the question of
whether the manifold into which you are embedding already has a metric; usually it is an
implicit metric, and for Euclidean space it is the identity one. But I could have defined
Euclidean space with a different metric so that when I mapped, it turned out to be an
isometric embedding. So it is kind of a separation: the space into which you are mapping
gives you the topology, and the metric is what gives you the geometry.
&gt;&gt;: [inaudible] eventually [inaudible] do this embedding [inaudible] which represents it
and [inaudible] vector which represents the metric.
&gt;&gt; Dominique Perrault-Joncas: It's going to be, it's not going to be a vector. It is going
to be a quadratic form that is going to define the inner product at that point.
&gt;&gt;: So it is actually the dimension that you map to is higher than the dimension of the
Euclidean dimension of the space because you also have this [inaudible].
&gt;&gt; Dominique Perrault-Joncas: Right, so in terms of memory, yes. You are keeping
around D squared numbers: if you are mapping into dimension D, you also have to keep
around this quadratic form, which is D by D. And the hope is that D is so much smaller
than the original space that you are still doing a considerable dimensionality reduction;
that is the underlying idea.
&gt;&gt;: [inaudible] required for volume preservation always? Because it seems like in the
practical application, practical situations as long as values are reserved that [inaudible]
good properties [inaudible], do you need it…
&gt;&gt; Dominique Perrault-Joncas: I don't think you need, no. I think it is an interesting
question that I actually am wondering about because you are right. You tend to define
your Hilbert space just in terms of your volume, but as it turns out I think it is actually
simpler to just get both at the same time. It is true that you actually don't need your
tangent space; it can be useful in some cases and I know examples where it is useful, but
it turns out that it is actually very easy to obtain the metric, or obtain the object that
contains it.
So the idea is: how am I going to find this push forward metric, which is the corrected
metric for the embedding? The idea is to use the Laplace-Beltrami operator, which is
defined in terms of the metric and has some very nice properties. One property that
comes out of this work is that it contains all of the geometry of the object and at the same
time it is coordinate free. It can be expressed in local coordinates, but it is coordinate
free, which means that if you have computed the Laplacian for the object, then however
you represent the object the Laplacian stays the same, and it contains all of the important
geometry. So if you want to recover the metric given the Laplacian, all you need to do is
apply the Laplacian to products of coordinates, and that will tell you the metric, or
actually the inverse of the metric, for those pairs of coordinates. You compute it for all
pairs of coordinates of the embedding of your manifold, and it will give you the full
inverse of the quadratic form. That means that if you have an embedding, the same trick
applies, and that by definition gives you--well, you still need to prove it, but this will
give you the push forward metric. The point is that if you know the Laplacian, you know
how to correct your space. And so in practical terms, and I think I have explained this
already: given a sample of my manifold, sampled according to some density, and an
embedding of that manifold in Euclidean space or some other space, I can recover the
full geometry by simply applying this trick locally. I will recover at every point a
quadratic form that expresses the inner product at that point, and that gives me the
volume element or any other geometrical object I am interested in. So the idea is to get
an approximation of the Laplacian on the discrete space that will then give me all of the
geometry I am interested in. And this already exists. It is a common trick that probably
most of you are familiar with, which is to construct a graph Laplacian by applying a
kernel between every pair of points. Here I am using the classic Gaussian kernel and
then defining this quantity, which is just a random walk on the graph, and this is called
the normalized graph Laplacian. If you take the limit as you let the bandwidth go to zero,
this is known to converge to the Laplace-Beltrami operator on the manifold. Now here I
am assuming that the sampling density is constant. If it isn't, there is a trick to
re-normalize the graph so that the graph Laplacian still converges to the…
&gt;&gt;: How would you know that the [inaudible]?
&gt;&gt; Dominique Perrault-Joncas: I am sorry, what?
&gt;&gt;: How could you tell [inaudible] sampling [inaudible] based on [inaudible]?
&gt;&gt; Dominique Perrault-Joncas: Well, that is kind of the idea of re-normalizing the graph:
you can re-normalize it without knowing the density, and it will give you the same object
in the limit, so you don't actually need to know whether it is uniform or not. I am just
trying to avoid being overly complex on this slide, but there is a trick that guarantees that
you will have the right Laplacian.
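A minimal sketch of the renormalization trick just described (the construction associated with diffusion maps), assuming NumPy and a dense kernel matrix; eps is the squared bandwidth, and the constant in front of the limiting operator is ignored:

```python
import numpy as np

def renormalized_graph_laplacian(X, eps):
    """Graph Laplacian that converges (up to sign and a constant) to the
    Laplace-Beltrami operator even under non-uniform sampling density.

    X   : (n, m) sample points in the ambient space.
    eps : squared Gaussian kernel bandwidth.
    Returns L (n, n); L @ f approximates the Laplacian of f at the samples.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)               # Gaussian kernel between every pair
    q = K.sum(axis=1)                   # kernel density estimate at each point
    K1 = K / np.outer(q, q)             # divide out the density (renormalization)
    d = K1.sum(axis=1)
    P = K1 / d[:, None]                 # random-walk (row-stochastic) normalization
    return (np.eye(len(X)) - P) / eps   # normalized graph Laplacian
```

Without the division by `np.outer(q, q)` this is the plain normalized graph Laplacian, which is only correct for uniform sampling; with it, the density cancels in the limit, which is the point made above.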
So this means that for us to recover the Riemannian metric it is a simple case of obtaining
the graph Laplacian, finding an embedding, and then applying the Laplacian to the
coordinates of this embedding. This will give us the geometry so we will end up with an
embedding of every point with a quadratic form at every point. And this is what it looks
like. So I am taking this object, which I call the hourglass, and then I am using, I believe,
the diffusion map or the Laplacian eigenmap to embed it, which changes the form: it
makes this part less curved and this part a little flatter, and so what I get effectively is
this object. Every line here expresses the graph; it is just telling you what the neighbors
are, so you still have the notion of topology, and every ellipse represents the quadratic
form at that point. If an ellipse in that space is circular, that means that locally the
embedding is isometric. If the ellipse is distorted, it is telling you how much you have
distorted your geometry at that point. So we are finding that in the middle this is fairly
isometric, but as I move towards the edge of the hourglass I am starting to see that the
metric is being effectively stretched, which means that every unit of length there is worth
more than it appears, so the region is actually being compressed; what you are seeing is
the opposite of what is happening. The larger the metric becomes, the more every unit is
worth, so it means that region is compressed compared to the original metric. And here at
the edge we are starting to see some artifacts, because the curvature is becoming difficult
to handle: when you are estimating the Laplacian, you start connecting points that
shouldn't be connected, because the manifold is not sampled densely enough there.
There is an implicit assumption that this works well provided the manifold is smooth and
doesn't curve too fast; if it does, you need more sample points.
Okay, so what does that mean in terms of the original data set I presented to you? Well, it
means that if I look at the three embeddings that I first showed, we get a clearer idea of
what is happening. If I use the Isomap, I see that the embedding is effectively fairly close
to isometric. It is not quite correct: it is being stretched in this direction and compressed
in that direction, but otherwise it seems like it is preserving the geometry of the data set.
If I am using local tangent space alignment, I see quite a fair bit of distortion at the edges
here, and finally the Laplacian eigenmap, which is still one of my favorites, has a lot of
distortion in it. So this now allows us to make actual explicit statements about what is
happening in the embedding.
&gt;&gt;: [inaudible] as far as the shape of the [inaudible]. So if the [inaudible] is distorted by
the [inaudible]. It is distorted but it is distorted in the same fashion or point that we can
actually…
&gt;&gt; Dominique Perrault-Joncas: Right, and actually that is an important point, because if
it is only being linearly distorted, with a different scale in each dimension, it is easy to
use one point and then rescale everything, which I will do in a second.
&gt;&gt;: [inaudible]. This is local time and space [inaudible] so because [inaudible].
Isometric [inaudible].
&gt;&gt; Dominique Perrault-Joncas: I mean that is kind of what I am trying to--I am using the
code that I'm trying to show is that actually we think that we are doing a good job. Even
the algorithms that are trying to be isometric.
&gt;&gt;: [inaudible].
&gt;&gt; Dominique Perrault-Joncas: This one?
&gt;&gt;: No. [inaudible].
&gt;&gt; Dominique Perrault-Joncas: [inaudible] Okay, yeah?
&gt;&gt;: [inaudible].
&gt;&gt; Dominique Perrault-Joncas: I haven't tried it. To be able to define an algorithm that
will always isometrically embed your data for any manifold is equivalent to Nash's
theorem, and so I take for granted that we don't have an algorithm that will be isometric.
There exist theorems showing you can embed isometrically if you are in L2, so if you
have an infinite dimensional space it is easier to construct something. But if you are
trying to embed in a Euclidean space of lower dimension, it is very hard. You could use
Nash's embedding theorem, which guarantees an isometry, but it means twirling around
the extra dimensions, and so numerically it is very, very unstable.
&gt;&gt;: [inaudible]. To be isometric [inaudible]. [inaudible] is not isometric. [inaudible].
Non-spatial [inaudible]. Distance.
&gt;&gt; Dominique Perrault-Joncas: So I am not sure I am going to fully address your
question, but I think that the main point that comes out of this is assuming that you do
have an algorithm that will always give you an isometric embedding that preserves
everything that you want, most likely you are giving yourself too much trouble, because
the idea is it is true that this embedding is no longer isometric, but I do know by how
much. That means I can still do computation in this embedding without having to worry
about the fact that if I ignore the quadratic form it is not isometric, which means that I
can instead of trying to find the best embedding, I can find a quick embedding and then
correct it. So the idea is that it can save you time. Isomap is already known to be a
computationally intensive approach to the problem. You could instead use something as
simple as random projections, which are super fast; as long as you have enough of them
you are defining an embedding, and then you can correct it.
So that is where I am heading. I am not trying to say that there doesn't exist a perfect
algorithm. Maybe there is. I don't think it is easy to find one that will work all of the
time, but I am trying to say that maybe we shouldn't try to find one, because we can do
better. Given an embedding, we now know how to do all of the geometric operations you
may care to do. So finding a quick and efficient algorithm and then correcting it is, I
think, the best way to approach the problem. I don't know if that is how you feel. Does
that answer your question?
&gt;&gt;: [inaudible] how this approach can [inaudible]. [inaudible] approach.
&gt;&gt; Dominique Perrault-Joncas: Yeah, I think we should cover that off-line. But to make
a final point, and I will continue after that and we can talk: the geodesic is not the thing I
am most interested in. The volume element, as Misha pointed out, may be the only thing
I actually care about. LTSA seems to work initially but actually turns out to be slightly
distorted; this can be corrected, as we pointed out, because if the distortion is linear along
the two dimensions it is easy to use the metric to rescale everything, and that is what I
did here. I have something that looks isometric but is actually slightly distorted, and then
I used one point to locally transform everything, and it turns out that pretty much
everywhere it is now isometric. But if I have something that is distorted too much, like
the [inaudible] Isomap, then I can only apply this locally: if I use one point here to
correct the space, I will be able to correct around that point, but the rest of the space
remains distorted. That is very intuitive.
But this is where I think it becomes slightly more interesting, which is to say: actually, I
don't need to worry about the embedding, because if I know the metric for a given
embedding I can do computations. So what I do is consider an original manifold with a
curve on it whose geodesic length I am trying to compute, and I can do this in, say, the
Isomap embedding that is supposed to preserve geodesics, or in any other space. If I just
use the distance in the embedding, I get wildly different results across embeddings. If I
use the shortest path in the graph defined by the embedding, I still get very different
results. But if I correct the distances by the metric, I effectively get the same thing.
Now, it is not perfect, but I am seeing that all of the embeddings now have the same
geometry: I can compute the same geodesic irrespective of which embedding I take, and
I can do this for volumes as well.
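Correcting graph distances by the metric can be sketched like this, assuming NumPy, an already-estimated metric H at each embedded point, and a neighborhood graph: each edge is measured with the metric averaged over its endpoints, and Dijkstra's shortest path then gives the geodesic estimate that should agree across embeddings:

```python
import numpy as np
from heapq import heappush, heappop

def corrected_geodesic(Y, H, edges, src, dst):
    """Shortest-path geodesic where each edge is measured with the estimated
    Riemannian metric rather than the raw embedding distance.

    Y : (n, s) embedding, H : (n, s, s) metric at each point,
    edges : list of (i, j) neighbor pairs, src/dst : endpoint indices.
    """
    n = len(Y)
    adj = [[] for _ in range(n)]
    for i, j in edges:
        v = Y[j] - Y[i]
        w = np.sqrt(v @ ((H[i] + H[j]) / 2) @ v)   # metric-corrected edge length
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = np.full(n, np.inf)
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:                                      # Dijkstra on the graph
        d0, u = heappop(pq)
        if u == dst:
            return d0
        if d0 > dist[u]:
            continue
        for nxt, w in adj[u]:
            if d0 + w < dist[nxt]:
                dist[nxt] = d0 + w
                heappush(pq, (dist[nxt], nxt))
    return dist[dst]
```

If an embedding stretches the manifold by some factor, the metric shrinks each edge by the same factor, so the corrected path length recovers the intrinsic geodesic distance.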
&gt;&gt;: [inaudible] the kernel that you used to compute to approximate the Laplacian is the
right, well it's the Laplacian that goes to the metric, the metric of interest [inaudible].
&gt;&gt; Dominique Perrault-Joncas: Yeah. So the idea is that the kernel, the Laplacian that I
am using to define the metric, is kind of key, because it contains all of the geometry. If
there are errors it may no longer be exactly the original geometry, only an approximation
of it, but in and of itself it will always compute the same geometry, and that is actually a
very important point, because it might turn out that the original geometry is not what you
care about. That just means that you have to find a different Laplacian, and that object
will give you the same geometry irrespective of what embedding you have. So I can play
the same trick with the volume element, and effectively I am finding the same thing, with
more errors, in part because I am using a coarser [inaudible] of the space. Effectively I
cannot even compute the volume element for some embeddings, because it doesn't
always make sense, but if I use the correction I am pretty much always in the same
ballpark.
So that kind of ends the geometry part, and now I want to move on and explain what it
means in terms of learning. Now that I have established that the Laplacian effectively
contains all of the geometry, the idea is: how can you use that to your advantage? It is
known that in a Euclidean space this operator, which is the usual Laplace operator,
defines a special Gaussian process if you look at this SPDE, this stochastic partial
differential equation. So I am assuming that this here is Gaussian white noise. First, this
is a linear operator, which means the solution is a linear combination of the Gaussian
white noise, so it will be Gaussian; then the question is what its covariance is, and you
can show that it is a Matérn Gaussian process. So the idea is that this operator effectively
defines what a Matérn Gaussian process is, and then you can use that to your advantage
by saying: actually, I can replace the Laplacian by the Laplacian of my manifold, and
now I have been able to define a Matérn Gaussian process on the manifold, and I can use
that covariance matrix to do semi-supervised learning, whereby I know the value of the
function at certain points and I am trying to predict it over the rest of the manifold.
&gt;&gt;: [inaudible] Gaussian process [inaudible].
&gt;&gt; Dominique Perrault-Joncas: So the Matérn refers to the covariance matrix, or the
covariance function. A classic one is the squared exponential covariance function, but it
has the disadvantage that it is infinitely differentiable, so it is a very smooth process.
The Matérn is one where you can actually control how smooth it is: the alpha parameter
here determines how many derivatives the Gaussian process will have.
&gt;&gt;: Oh. Okay. I get it.
&gt;&gt; Dominique Perrault-Joncas: So it gives you much more control, when you are trying
to do prediction, over how smooth your kernel, or your Gaussian process, is, depending
on how you want to think about it. So given the Laplacian, now I can define a Matérn
process and I can define a covariance matrix. Actually, this should have been with
respect to the manifold. But I can then do learning: predict what the value of a function
should be at other points given what I know, based on the geometry. So implicitly I am
learning the geometry of the manifold and using it to propagate the information. You can
think of this as just being a kernel regularizer for your regression or your classification.
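A rough sketch of this construction, assuming NumPy, a symmetric positive semi-definite graph Laplacian, and an integer smoothness parameter alpha (the helper names are hypothetical): the Matérn-type prior has precision (kappa^2 I + L)^alpha, and semi-supervised prediction is the posterior mean of this Gaussian field given noisy labels at a few points:

```python
import numpy as np

def matern_precision(L, kappa, alpha):
    """Matérn-type precision matrix built from a (symmetric PSD) graph
    Laplacian L: Q = (kappa^2 I + L)^alpha, with integer alpha controlling
    the smoothness of the resulting Gaussian process."""
    n = len(L)
    return np.linalg.matrix_power(kappa**2 * np.eye(n) + L, alpha)

def gp_predict(Q, labeled_idx, y_labeled, noise=1e-2):
    """Posterior mean of a zero-mean Gaussian field with precision Q,
    observed with Gaussian noise at labeled_idx (semi-supervised regression)."""
    n = len(Q)
    A = Q.copy()
    A[labeled_idx, labeled_idx] += 1.0 / noise   # add data precision on diagonal
    b = np.zeros(n)
    b[labeled_idx] = y_labeled / noise
    return np.linalg.solve(A, b)                 # propagate labels via the prior
```

Replacing the Euclidean Laplacian by a manifold (graph) Laplacian here is exactly the substitution described above: the prior, and hence the way label information propagates, then follows the learned geometry.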
What is actually interesting is that, because we now have the metric, you can also define
the precision matrix not only on the points but throughout the embedded space. That
means that instead of just learning on the points, I can learn around the points, so if a
new point arrives and I know where to embed it, I will immediately know what the value
of that point should be, or what the class of that point should be. So this is a slight
departure from traditional semi-supervised learning in that I am actually doing inductive
learning, but in the embedding space.
So here is an example of how this works out. I am using the three embeddings that we
have followed throughout for the faces data set. What I am doing is embedding the
points for each embedding of interest and then applying a Matérn kernel with respect to
the embedding space, so I am actually assuming that this is the correct geometry, and
applying the Matérn kernel to predict what the value of the function should be. So
effectively I am keeping a little less than a quarter of the values of the head's position
and using them to predict what the value should be elsewhere.
So if I am using the embedding space, I am getting quite a fair bit of error, but if I am
actually using the intrinsic geometry I am doing a lot better. So that is the idea: the
original geometry of the data, for this particular data set, turns out to be the right way to
propagate the information.
&gt;&gt;: [inaudible].
&gt;&gt; Dominique Perrault-Joncas: This is how wrong a particular point is: these are angles,
going from 0 to 180, and it is telling you the error compared to what the actual value at
that point should be.
&gt;&gt;: So these are the points so you have you say, I don't remember how many points…
&gt;&gt; Dominique Perrault-Joncas: There are about 700.
&gt;&gt;: 700 you use as your training sample [inaudible] and so these are just a test points?
&gt;&gt; Dominique Perrault-Joncas: I am actually showing you both the test and the training
set. The training points are about a quarter of them, and I probably should remove them,
because they will all effectively be right: you have the right value at those points, so they
are not contributing to the error.
&gt;&gt;: [inaudible] Eigen maps these seem to not be performing well. Is it because of the
sparsity on the right side? So what is…?
&gt;&gt; Dominique Perrault-Joncas: I just think that what it is doing is it's not recovering the
right geometry and so it is actually, well first the information doesn't propagate, like the
kernel was a stationary kernel, and so effectively by stretching the space it's, it is not
propagating the information the right way. But you are right that the sparsity here is the
problem and what is happening is that effectively you are doing regression on part of the
space that is now part of the ambient space, because you would assume that there is no
manifold here. But because your kernel is defined for the whole of the embedding space,
you are not using this space intelligently. Whereas if you define it here, you are only
defining it on the space you know. So this is actually not inductive; the lower one here is
just transductive, which means I can only compute it on the points I know. I think it is
both the fact that you are computing it with respect to the ambient
space and that you are distorting the space too much.
&gt;&gt;: So [inaudible] I can see [inaudible] from [inaudible] Gaussian [inaudible].
&gt;&gt; Dominique Perrault-Joncas: You raise a very important point which is that I think as
I mentioned earlier the original geometry is not necessarily the best geometry. I am
focusing on this because it is the first step saying well first can I recover it. But second
what comes out of this is the fact that all the geometry is contained in the Laplacian: if
you think a different geometry is more appropriate, you should focus on defining a
Laplacian that is appropriate for that geometry. Now when you say you used the first
eigenvectors of the Laplacian for your embedding and that helps with clustering, what
you are doing implicitly is defining your geometry. So the map that uses the eigenvectors
to embed is effectively this map for this particular data set, and this is now a different
geometry. It turns out that here it is the wrong geometry, but for clustering it tends to be
better. So the question, and it is actually one of my open problems, is how we can use
what we are trying to do to select what geometry we want, what Laplacian we want, for
this particular problem. I think this is what comes out of this. I am focusing on
recovering the geometry, but an offshoot of this is that we now have a way to characterize
geometry and we should think about what kind of operation we want which then leads to
what kind of Laplacian you want.
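The eigenvector embedding mentioned here, a Laplacian Eigenmaps-style map built from the first non-trivial eigenvectors of a normalized graph Laplacian, can be sketched like this; the function name and the normalization choices are illustrative assumptions, not code from the talk.

```python
import numpy as np

def laplacian_eigenmap(W, dim):
    """Embed with the first non-trivial eigenvectors of the graph Laplacian."""
    D = W.sum(axis=1)
    # Symmetric normalized Laplacian; same spectrum as the random-walk one.
    Ls = np.eye(len(W)) - W / np.sqrt(D[:, None] * D[None, :])
    vals, vecs = np.linalg.eigh(Ls)          # eigenvalues in ascending order
    # Skip the trivial constant direction, keep the next `dim` eigenvectors,
    # and convert back to random-walk eigenvectors.
    return vecs[:, 1:dim + 1] / np.sqrt(D)[:, None]
```

Choosing which affinity matrix W feeds this map is exactly the implicit choice of geometry being discussed.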
&gt;&gt;: Did you ever think about the underlying process that generated the data? Like when
I look at this problem to me it looks like it is natural to think about some kind of model
that generates the data, whatever presents different views of the same sculpture head,
and essentially what you are trying to do is recover the parameters of, well, not really
parameters, but the settings of that model when it generated this data. It's one of
the settings of the model, I need [inaudible] rotation at that degree, but there are also a lot
of other things that could correspond to other dimensions [inaudible]. How does
this map to that view of the world?
&gt;&gt; Dominique Perrault-Joncas: So here I am using the very straightforward vision of the
sampling process. I am assuming IID according to some distribution on the manifold, but
you make a very interesting point that there might be a different process that is at play
and is of interest. And I think the best answer to that is to say well, can I get an idea of
what operator is generating this, hopefully a linear operator, like a translation operator in
your case, or in this case. And if you do that in the same way that this defines a Gaussian
process, if you have a linear operator, you can define a different Gaussian process that is
representing the transfer of information better. Actually there is a paper on this by
Schölkopf, I think a 2008 paper, that says well, if we look at, say, a Kalman filter, which is
actually a generative process more in line with what you are saying, you can turn this into
a differential operator. That operator leads to a Gaussian process which is also equivalent
to a kernel, and so there is a kernel that is intrinsic to your process. And so what I am
using here is I am using this operator because it's the simplest one to use if all I am doing
is trying to interpolate between points. But if I know that there is something generating
the points, I am going to use a different operator that tries to mimic what I think is a
generator, and that will lead to a different regressor. I will get a different covariance matrix
and I will be able to use that to effectively transfer the information. But hopefully, I mean
hopefully this would be with respect to the Laplacian that represented the geometry, but it
might be that it isn't. And if it isn't that is when you start saying well, if I want to do what
you said but I want to do it in a lower dimensional space, I have to be able to represent
this operator with respect to the correct space. And then I can start computing the metric
and what is the operator with respect to the metric. So I guess what I am saying is that for your
operator to become geometry invariant, you may need to compute the metric
explicitly. In this problem I didn't go through the whole issue of saying what the
metric is at every point, because I only use the Laplacian, but there are cases where the
process might require you to find a coordinate system, which is an embedding, and then
compute the metric there to express the correct operator. Maybe I am telling you more
than what you really wanted to know, but yes, I think there is a natural way to think about
what you are saying, which is in terms of what operator generated the process.
I guess I summarized a fair bit already, but the original idea of this work was to say well,
we know that embedding algorithms fail in some sense at recovering the geometry
of the manifold, and most of the solutions have been about trying to find one algorithm that
does better than the others. Instead what we try to say is: actually, don't worry
about trying to find the correct algorithm; try to find how to correct it in a meaningful
way by defining a metric that is faithful to the geometry, the intrinsic geometry of the
data or the geometry of interest to you. This means that it frees you from having to use
more complex algorithms. You can use simple algorithms and then recover the geometry
for that embedding. It also means that we have kind of unified all of the methods by
saying that they can be made equivalent through this object and now we can say to what
extent. So when I say they are equivalent, I mean only in the asymptotic limit. That
means that there might be more bias, or more variance, for a particular embedding in the
discrete case, and so we can start trying to say, well, how much does that matter in
practice. And now we kind of have a tool to compare the various manifold learning
methods through the metric.
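The metric correction summarized here can be sketched as follows, assuming a discrete graph Laplacian L and embedding coordinates F. This is a rough illustration of estimating the dual metric pointwise via the identity h^ij = ½[Δ(fᵢfⱼ) − fᵢΔfⱼ − fⱼΔfᵢ] and then pseudo-inverting it; function names and conventions are assumptions, not the speaker's code.

```python
import numpy as np

def dual_metric(L, F):
    """Estimate the dual (inverse) Riemannian metric at every point.

    L : (n, n) discrete graph Laplacian acting on functions sampled at n points.
    F : (n, d) embedding coordinates f_1, ..., f_d.
    Uses h^{ij} = 1/2 [ L(f_i f_j) - f_i L(f_j) - f_j L(f_i) ], up to the
    sign/scale conventions of the chosen discrete Laplacian.
    """
    n, d = F.shape
    LF = L @ F
    H = np.empty((n, d, d))
    for i in range(d):
        for j in range(d):
            H[:, i, j] = 0.5 * (L @ (F[:, i] * F[:, j])
                                - F[:, i] * LF[:, j]
                                - F[:, j] * LF[:, i])
    return H  # symmetric d x d quadratic form at each of the n points

def metric(H, rank):
    """Pseudo-invert each (symmetric) dual metric, keeping `rank` directions."""
    n, d, _ = H.shape
    G = np.empty_like(H)
    for k in range(n):
        U, s, Vt = np.linalg.svd(H[k])
        s_inv = np.where(np.arange(d) < rank, 1.0 / np.maximum(s, 1e-12), 0.0)
        G[k] = (U * s_inv) @ Vt   # rank-truncated pseudo-inverse
    return G
```

The D² storage cost mentioned below is visible here: a d-by-d quadratic form travels with every embedded point.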
And now it also gives us a natural way to define Gaussian processes or regularizers, by
which I mean L2-type regularizers such as kernels that are faithful to the intrinsic
geometry or the geometry of interest. And the only challenge we get out of this is that
now, if we actually work in a coordinate chart, we have to carry around a
quadratic form that is going to be of dimension D², and it also adds some complexity to
the implementation of any method. So where I see this going, the question to me now is
more about trying to say maybe the intrinsic geometry is not the correct one; what I care
about is the geometry that helps me the most with my problem and can I put a prior on
what type of geometry of interest and use the points then to find which one is the correct
one to use here. And how do I define a Laplacian that is robust to noise or variability? This
is kind of a tricky question, and it is why people don't always like manifold learning:
the Laplacian or the embedding can be notoriously sensitive to noise. And
I think that there is some work being done in that respect which is to try to think about
what geometry you are trying to learn. So it is kind of related to this, but there it is some
very nice work by Stefan [inaudible] that is trying to define a geometry with respect to
groups by saying well, what are the natural invariants in my space and that gives you
much nicer embedding, much nicer geometry.
And then, I didn't discuss this, but I can show the consistency of this algorithm, but only for
one embedding, and that is the Laplacian Eigenmap. So for me it is a question of can
I extend that to general embeddings, and then try to think about what the biases of my
method are and [inaudible]. So I think that sums it up.
[applause].
&gt;&gt; Misha Bilenko: Questions?
&gt;&gt;: If typically you would use [inaudible] supervised where you are [inaudible] so if you
wanted to push back [inaudible] the label dimension [inaudible] into the algorithm how
straightforward or not would it be given all the [inaudible] in the supervised metric
learning [inaudible]?
&gt;&gt; Dominique Perrault-Joncas: So the idea is that if I don't know, if I don't have a
supervised context, I will go unsupervised and I will just assume that one geometry is
more interesting than another either by arguments or just because it is the one I have. If I
do have a semi-supervised learning problem, that's where I am thinking about well, I can
just do the unsupervised part and then use it afterwards to help me with my supervised
problem, or I can try to define, you know, a subset of geometries and then try to find the one that
matches the best. Implicitly that's what people already are doing. What I am thinking of
here is this here. Here I have the bandwidth. That is going to define my graph. That is
going to help define my Laplacian. Now if you are in semi-supervised learning, you are
going to try to use cross validation or any other method of model selection to find what is
the optimal epsilon, and that is in itself defining a geometry. The geometry changes as
you increase your bandwidth. It becomes less and less interesting at some point, but if it
is too localized it becomes a geometry where the points are disconnected.
And so I think this is already being done in effect.
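The role of the bandwidth can be made concrete with a small sketch: a Gaussian affinity with bandwidth eps defines the graph, and the graph in turn defines a random-walk Laplacian. The name and normalization here are illustrative assumptions.

```python
import numpy as np

def graph_laplacian(X, eps):
    """Random-walk graph Laplacian from a Gaussian affinity with bandwidth eps."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-d2 / eps)            # heat-kernel weights; eps sets the locality
    P = W / W.sum(axis=1)[:, None]   # row-stochastic random-walk matrix
    return (np.eye(len(X)) - P) / eps
```

Cross-validating a downstream semi-supervised task over a grid of `eps` values is then, implicitly, model selection over geometries, which is the point being made.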
&gt;&gt;: [inaudible] when you have the locality being influenced by the supervision.
&gt;&gt;: But you can look at in a different way you could say since this all this process is
unsupervised, so you could say one of the [inaudible] [inaudible] different purposes you
want to compress it. So you want to find [inaudible] that would be useful for many tasks
that you don't know [inaudible]. [inaudible] why do it this way [inaudible] might go
directly to find a representation that will be best for the task that I am interested in but if
you have multiple tasks or even unknown tasks this is where this kind of [inaudible]
representation might be interesting.
&gt;&gt;: I guess another question would be then can you have sort of like a midway where
you would have representation but then the task could come in and that [inaudible]'s
solution where you have it, you have part of the [inaudible] but then it could be corrected
using this [inaudible] other task?
&gt;&gt;: You have to keep also the original data but if you keep the original data so what is
the point of this? [inaudible]
&gt;&gt;: Well it depends on what that [inaudible] is [inaudible] if it's just compression or if
it's, if it's just compression then sure as long as [inaudible].
&gt;&gt;: [inaudible] between the top and the [inaudible].
&gt;&gt; Dominique Perrault-Joncas: If it's not very related you will just move in the space of
potential geometries you are learning. I mean, implicitly here, as I said, you can rescale the
graph so that the sampling density doesn't matter, but if you don't rescale it, it does
matter. And there is actually a parameter that lets you tweak that, so you can let the
sampling density affect the embedding. All you are really asking is to what extent
do I want the sampling density to
affect the representation of my data, and I try to decide that with respect to a task or not. If I do it
without respect to a task, then I say, well, maybe I know I will be doing clustering, but I
don't know what the clustering is yet, and so I am going to let it stay in effect.
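The rescaling parameter referred to here can be illustrated with the density renormalization used in diffusion-maps-style constructions, where an exponent alpha controls how much the sampling density affects the resulting Laplacian. This is a sketch under that assumption, not the speaker's code.

```python
import numpy as np

def renormalized_laplacian(X, eps, alpha=1.0):
    """Graph Laplacian with density renormalization exponent alpha in [0, 1].

    alpha = 1 cancels the effect of the sampling density (Laplace-Beltrami
    limit); alpha = 0 leaves it in fully; intermediate values interpolate.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    q = W.sum(axis=1)                       # kernel density estimate
    W_a = W / (q[:, None] ** alpha * q[None, :] ** alpha)
    P = W_a / W_a.sum(axis=1)[:, None]      # row-stochastic again
    return (np.eye(len(X)) - P) / eps
```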
&gt;&gt;: [inaudible] making assumptions here [inaudible] making the assumption that the task
that will come will be expressed on the manifold and not on its embedding in the high
dimensional space. Because if, you know, we assume that I have a classification task, and if,
to actually solve it, I need [inaudible] embedding in the manifold of the original space,
well, I just dumped this information when I did this [inaudible].
&gt;&gt; Dominique Perrault-Joncas: Wait. Are you implying that the ambient space of the
original is important? Is that which you mean? [inaudible] the objective…
&gt;&gt;: [inaudible] make it would make an assumption what you would think is a reasonable
assumption that if I have low dimensional data embedded in high dimensional space, then
for most tasks we really only care on the low dimensional data nature of the task and we
are not, we do not care about the high dimension. So think about your faces, right? If the
task, eventually someone tells you the task is pick a specific pixel and say whether the
intensity on this intensity is higher or lower than the threshold. This has nothing to do
with a low dimensional representation of the data. And any such embedding that you will
find even if it [inaudible] embedding it will fail miserably on this task, so you make that
line such that any reasonable task that you will ask to do with respect to this data will
actually care about this low dimensional information and not the high dimension.
&gt;&gt; Dominique Perrault-Joncas: I think we just agree. My question being that the
ambient space of the data is important and that is what you are saying here I think. It is a
fundamental question of whether the full space has some important information to tell
you.
&gt;&gt;: [inaudible] have to make the assumption that the ultimate, that the task you want to
perform is sufficient statistics for that task are the intrinsic, are the measurements in the
intrinsic low dimensional space.
&gt;&gt;: It seems like a very reasonable assumption. This is going back to what you are
saying obviously once I compressed, okay I lost something. Can somebody tell me
exactly [inaudible] on this?
&gt;&gt;: [inaudible] compressed halfway in such a way that the rest of the compression could
then be performed by the task.
&gt;&gt;: Only if it is a [inaudible] less compression.
&gt;&gt; Dominique Perrault-Joncas: I think what is explicit here is which part you are losing,
which is the extrinsic geometry which is the ambient space, but you are preserving all of
the intrinsic geometry. It turns out that for high dimensional data it is often the case that
the ambient space is actually washing out the information. If you try to do supervised
learning in the high dimensional space, you can't propagate information, because you
have so many dimensions to move around that nothing gets propagated, or it all gets
propagated uniformly. So it tends to perform better. It might not be the right one, but we
just don't have the tools to really express what is happening in high dimensional space.
And so I feel that in many cases it is the best assumption we can
make, because it is so hard to deal with high dimensions and with what it means to
propagate information in high dimensions. What is the covariance that is natural for high
dimensional space?
&gt;&gt;: So you are saying that you were doing more work currently on comparing the
[inaudible] kernels roughly? [inaudible] playing with the [inaudible] kernel [inaudible].
&gt;&gt; Dominique Perrault-Joncas: It is the same idea of like using this [inaudible] GP and
then comparing it to people who are doing kernels but who are using Laplacian
regularizer. Actually Zhou and Belkin just came out with a paper in [inaudible] saying
they consider the higher order regularization of the Laplacian. That defines a kernel and
this kernel is equivalent to the GP process if you remove this K. So it's really an
equivalent approach is what I'm saying. Was there more that you wanted to know about
this or… Okay. And this goes back to the Skolokoff paper that I mentioned that says
there is a link between Gaussian process, kernels and differential operator. You can
move between the three pretty much as you will, or a regularizer.
&gt;&gt; Misha Bilenko: Thanks Dominique.
&gt;&gt; Dominique Perrault-Joncas: Thank you.
[applause].
```