>> Misha Bilenko: So today we have Dominique Perrault-Joncas coming across the frozen lake all the way. Dominique went to McGill before he came to the University of Washington. He recently won the Birnbaum prize from this department, and he will tell us today all about manifold preservation and metric learning. >> Dominique Perrault-Joncas: Thank you Misha. So I would first like to thank Misha and [inaudible] for organizing this talk and for inviting me here. I'm going to talk today about metric learning and preserving the intrinsic geometry. This is joint work with Marina Meila, Professor Meila at the University of Washington, who is my advisor for my PhD right now. A brief overview: I will give some background on the problem. I will cover some differential geometry theory just to get you acquainted with the framework, because I think it is good to be reminded of it even if you are somewhat familiar with differential geometry. I will discuss the discretized problem and the algorithm, and then I am going to show some examples and applications. I will give a brief summary and show some future research, and maybe I will go through the consistency of the pushforward metric, but maybe not; I am not sure that that is the most important thing to do. So the problem we are trying to tackle here is the curse of dimensionality: there is an abundance of large data sets that are high dimensional, and this leads to problems in terms of interpretation, computation, and analysis. So the idea is to try to reduce the dimension of the data set, and underlying this is the idea of finding some low dimensional representation of the data that will preserve all of the important information, or at least most of it. So that is the goal today. I will start by giving you a toy example. It is going to follow us through the talk, so it is worth going over what this example is.
So effectively what we are going to be looking at are images: gray-scale faces, 64 x 64 pixels, and we're going to have about 700 or so of those images. So we have a fairly high dimensional data set. It is still small by today's standards, but I think it captures the idea. This data set has a very interesting attribute, which is that it represents one specific face that is only moving right to left and up and down. So in this data set what we have implicitly is only 2 degrees of freedom, and we are going to try to leverage this attribute of the data set to help us in our analysis. >>: So I understand, it is all images of the same face taken from different points of view? >> Dominique Perrault-Joncas: Let me show you the data set itself. So this is one embedding of this data set. Every single ellipse on here is one of the faces, and I am showing you a subset of those faces. So it is always this particular face, rotated one way or the other, or tilted up or down. >>: So the difference is just the point of view of the camera with respect to the face? >> Dominique Perrault-Joncas: Exactly. It is either the head or the camera moving, depending on how you want to see it, and we have 2 degrees of freedom even though in terms of representation every image is 64 x 64. The idea is to effectively find an embedding like this one and try to use this embedding to do analysis or some interpretation. So in this case it is very easy, and we found that one dimension represents the angle of the head in terms of left and right, and the other dimension represents the angle of the head up and down. So, so far so good. And actually there are a lot of embeddings that exist; I am showing you a subset here, and they all do various things. The question then is which one makes the most sense.
The underlying idea for dimensionality reduction, as I said before, is to assume that the data lies on a low dimensional smooth manifold that lives in a high dimensional space; the high dimensional space is the space of the data, and the low dimensional manifold is the data set, or what the data set would look like if we had every point. And the idea is to recover this. Now there exist standard techniques, such as PCA, principal component analysis, that make the assumption that the manifold is a linear space, and so they just project onto a linear space. But in practice this is often violated. The data set will curve and twirl and do many things in high dimensional space, and this has led to nonlinear dimensionality reduction, which means trying to find how the data set curves in the high dimensional space and trying to take that into account when you project it into the low dimensional space. So there already exist many algorithms to do that. I have already shown three of them; this list is certainly not exhaustive, but it covers some of the important ideas that have been developed in this field. They all work with the same idea, which is that you use local information and you try to propagate it in such a way as to recover the full manifold. Locally linear embedding, what it does effectively is it tries to express every point as a linear combination of its neighbors; it then uses that to recover a lower dimensional representation that preserves the local weighted average. The Laplacian eigenmap and the diffusion map work slightly differently: what they do is define a random walk on the set of points and then use that diffusion process, so to speak, to define an embedding. So they do an eigendecomposition of the random walk, or the diffusion process, and use this eigendecomposition as the embedding.
And then you have other algorithms, like local tangent space alignment, which is based on the idea of doing local PCA and then patching those PCAs together into one embedding, or Isomap, which tries to preserve a very specific geometrical object, the geodesics: distances on the manifold are what you are trying to preserve when you do Isomap. But there are known problems with these methods. Often they will fail to recover the geometry, and there exists no formal framework to compare them. I showed you a few, but there really is a very, very long list of methods, and it is not clear when you start which one makes sense to use and why. To answer those two problems, what we first have to do is actually define what it means to recover a low dimensional manifold. For example, if I use a very simple manifold, a classic example, which is a Swiss roll with a hole in the middle of it, we have two potential embeddings. This would be Isomap and this would be local tangent space alignment, and it is not clear from this evidence which one is correct. We kind of know that this one is probably wrong in some sense, but we don't have a framework to say whether this is correct or not. Underlying all of this, we are assuming that recovering the geometry is an important goal, and we know that this is possible: if there is a manifold, we have theoretical guarantees that there exists at least an embedding of the data. So if we have a manifold of dimension d, we have a guarantee that there exists an embedding of this object into a lower dimensional space of dimension 2d; any manifold of dimension d you can embed in R^2d, effectively. So we know that there is a way. We know that there is an option for us.
But rather than try to do what a lot of manifold embedding methods are doing, which is to try to find the proper embedding that preserves what they care the most about, we are going to try to go around this and say: well, is there a way to correct an embedding? Is there a way to say, I have distorted my space, and I am going to recover the proper metric even though my embedding is wrong? And this is called recovering the Riemannian metric. To formalize things a little bit, when I am talking about an embedding I mean that I have some manifold and I am finding a map from that manifold to a new manifold. That means it is going to send every point from one manifold to a point on the other manifold, or to the image of the map. It also means, if it is an embedding, that it is differentiable, which means that it also maps points in the tangent space of the original manifold to points in the tangent space of the manifold in which you are embedding, which in our case is going to be Euclidean space. This is important because geometry is defined in terms of the Riemannian metric, which is just an inner product defined on the tangent space. So if you know the inner product on the tangent space for a given geometry, then you know everything you care about. The idea then for us would be to say: how do I recover what the inner product should be in the embedding so that it corresponds to the original geometry? To drive home the point of what it means to have the Riemannian metric: it means that you can do inner products, and that means I can do all of the usual geometry. I can compute angles between vectors in the tangent space, which means if I have two lines crossing each other, I can say at what angle they are crossing each other.
It also means I can compute the length of a line on the manifold, and it means I can compute the volume of a subset of the manifold or of the full manifold. So effectively I have the volume element, which is quite important if you're trying to, say, define a Hilbert space on the manifold, which is kind of the key to doing kernels for Gaussian processes. So if you have a space of functions on your manifold and you want to use those functions to do any form of regression or classification, what you actually need is the volume element. This is where we start thinking a little more in terms of learning and statistics and less in terms of geometry. The key idea is that all of this is encoded in the Riemannian metric. Before we go forward it is still worth defining one more thing, which is isometry. If I have an embedding, or a map from one manifold to another, we say that the manifolds are isometric if this map satisfies a simple condition: under the Jacobian of the map, the inner products in both spaces are equal. So all we are saying here is: if you know your inner product then you define your geometry, and the point of having an isometric embedding is just a question of preserving the inner product. That means that given an embedding, I can try to define what is called a pushforward, which is to say: if I have the original manifold and I am mapping into a new space, and the new space has its own inner product, how do I correct the inner product so that it is equal to the original one? And this is effectively how you do it: by looking at the inverse of the Jacobian, or the pseudo-inverse depending on whether it is full rank, so that the inner product in the embedded space is now equal to your inner product in the original space. Yeah. >>: Can you give an example of an isometric map? >> Dominique Perrault-Joncas: I mean, the identity map would be an isometric map.
You could unfold the--I don't know if that is what you were thinking of, but if you have the Swiss roll you could just unroll it and it would preserve all of the geometry. It wouldn't have the same curvature, but that is because the curvature you are losing is the extrinsic curvature rather than the intrinsic curvature. So that would be an example. But in effect, what I am doing will be to define isometries. Instead of trying to find the map that creates the isometry, I went the other way around and said: given a map, how do I turn it into an isometry? What Riemannian metric must have existed on the manifold into which I am mapping so that it is now isometric? >>: Why does that apply to, let's say, images, like a collection of images? >> Dominique Perrault-Joncas: Because if you are mapping into Euclidean space, two-dimensional Euclidean space, the Euclidean space has its own metric, which is the identity metric, and that metric no longer preserves the inner product in the embedding. The idea is to say what the metric in the Euclidean space should have been so that it now preserves the inner product. When you are doing an embedding there is the question of whether the manifold into which you are embedding already has a metric, and usually it is an implicit metric; for Euclidean space it is the identity one. But I could have defined Euclidean space with a different metric so that when I mapped, it turned out to be an isometric embedding. So the space into which you are mapping gives you something like the topology, and the metric gives you the geometry. >>: [inaudible] eventually [inaudible] do this embedding [inaudible] which represents it and [inaudible] vector which represents the metric. >> Dominique Perrault-Joncas: It's not going to be a vector. It is going to be a quadratic form that defines the inner product at that point.
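The pushforward metric just described can be written compactly (the notation here is mine, not from the talk's slides): for an embedding f of the manifold M into R^s,

```latex
% Pushforward metric for an embedding f : M -> R^s.
% J = df_p is the Jacobian of f at the point p, and J^{+} its
% Moore-Penrose pseudo-inverse (the inverse when J has full rank).
% g_p is the Riemannian metric of M at p, a d x d positive-definite
% quadratic form in local coordinates.
\langle u, v \rangle_{h_{f(p)}}
  \;=\; \langle J^{+} u,\; J^{+} v \rangle_{g_p},
\qquad \text{i.e.} \qquad
H_{f(p)} \;=\; (J^{+})^{\top}\, g_p\, J^{+}.
```

With this choice of H, the embedding f becomes an isometry by construction: inner products of tangent vectors measured with H in the embedding space equal the original inner products on the manifold.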
>>: So actually the dimension that you map to is higher than the Euclidean dimension of the space, because you also have this [inaudible]. >> Dominique Perrault-Joncas: Right, so in terms of memory, yes. You are keeping around d squared numbers: if you are mapping into dimension d, you also have to keep around this quadratic form, which is d squared. And the hope is that d is so much smaller than the original space that you are still doing a considerable dimensionality reduction; that is the underlying idea. >>: [inaudible] required for volume preservation always? Because it seems like in practical situations, as long as volumes are preserved [inaudible] good properties [inaudible], do you need it… >> Dominique Perrault-Joncas: I don't think you need it, no. I think it is an interesting question that I actually am wondering about, because you are right: you tend to define your Hilbert space just in terms of your volume. But as it turns out, I think it is actually simpler to just get both at the same time. It is true that you don't actually need your tangent space; it can be useful in some cases, and I know examples where it is useful, but it turns out that it is actually very easy to obtain the metric, or obtain the object that contains it. So the idea is: how am I going to find this pushforward metric, which is the corrected metric for the embedding? The idea is to use the Laplace-Beltrami operator, which is defined in terms of the metric and has some very nice properties. One property that comes out of this work is that it contains all of the geometry of the object while at the same time being coordinate free. It can be expressed in local coordinates, but it is coordinate free, which means that if you have computed the Laplacian for the object, then however you represent the object, the Laplacian stays the same, and it contains all of the important geometry.
So what happens is, if you want to recover the metric given the Laplacian, all you need to do is apply the Laplacian to a product of coordinates, and that will tell you what the metric is; well, actually the inverse of the metric for those two coordinates. So you compute it for all pairs of coordinates of the embedding, or of your manifold, and it will give you the full inverse of the quadratic form. This means that if you have an embedding, the same trick applies, and that by definition gives you (you still need to prove it, but this will give you) the pushforward metric. The point is: if you know the Laplacian, you know how to correct your space. So in practical terms, and I think I have explained this already: given a sample of my manifold, sampled according to some density, and an embedding of that manifold in Euclidean space or some space, I can recover the full geometry by simply applying this trick locally. I will recover at every point a quadratic form that expresses the inner product at that point. It gives me the volume element or any other geometrical object that I am interested in. So the idea is to get an approximation of the Laplacian on the discrete space that will then give me all of the geometry that I am interested in. And the idea is that this already exists. It is a common trick that probably some of you, or most of you, are familiar with, which is to construct a graph Laplacian by applying a kernel between every pair of points. Here I am using the classic Gaussian kernel, and then defining this thing here, which is just a random walk on the graph, and this is called the normalized graph Laplacian. If you take the limit as you let the bandwidth go to zero, this is known to converge to the Laplace-Beltrami operator on the manifold. Now, here I am actually assuming that the sampling density is constant.
If it isn't, there is a trick to re-normalize the graph so that the graph Laplacian still converges to the… >>: How would you know that the [inaudible]? >> Dominique Perrault-Joncas: I am sorry, what? >>: How could you tell [inaudible] sampling [inaudible] based on [inaudible]? >> Dominique Perrault-Joncas: Well, that is kind of the idea of re-normalizing the graph: you can re-normalize it without knowing what the density is, and it will give you the same object in the limit, so you don't actually need to know whether it is uniform or not. I am showing it this way just to avoid being overly complex in this slide, but there is a trick that guarantees that you will have the right Laplacian. So this means that for us to recover the Riemannian metric, it is a simple case of obtaining the graph Laplacian, finding an embedding, and then applying the Laplacian to the coordinates of this embedding. This gives us the geometry, so we end up with an embedding of every point together with a quadratic form at every point. And this is what it looks like. So I am taking this object, which I call the hourglass, and then I am using, I believe, the diffusion map or the eigenmap to embed this object, which changes the form somewhat: it makes this part less curved and this part a little flatter, and so what I get effectively is this object. Every line here is expressing the graph; it is just telling you what your neighbors are, so you still have this notion of topology, and every ellipse represents the quadratic form at that point. And if an ellipse in that space is circular, that means that locally the embedding is isometric.
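The recipe just described (build a graph Laplacian, take any embedding, apply the Laplacian to products of the embedding coordinates, invert the resulting quadratic form) can be sketched in a few lines of numpy. This is an illustrative sketch, not the speaker's implementation: the function name, the bandwidth value, and the particular density-renormalization constants are my own choices.

```python
import numpy as np

def riemannian_metric(X, Y, eps=0.5):
    """Estimate the pushforward (Riemannian) metric of an embedding.

    X : (n, D) array of points sampled from the manifold
    Y : (n, d) array, an embedding of the same points (from any method)
    eps : Gaussian kernel bandwidth (a tuning parameter)
    Returns H : (n, d, d), the estimated metric at each embedded point.
    """
    # Gaussian kernel between all pairs of points
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)
    # Density renormalization (the "trick" mentioned above), so the limit
    # is the Laplace-Beltrami operator even under non-uniform sampling
    q = W.sum(1)
    W = W / np.outer(q, q)
    # Random-walk normalized graph Laplacian
    d = W.sum(1)
    L = (W / d[:, None] - np.eye(len(X))) / eps
    # Apply L to products of embedding coordinates to get the *dual* metric:
    #   h^{ij} = 0.5 * [ L(f_i f_j) - f_i L(f_j) - f_j L(f_i) ]
    n, dim = Y.shape
    LY = L @ Y
    Hdual = np.empty((n, dim, dim))
    for i in range(dim):
        for j in range(dim):
            Hdual[:, i, j] = 0.5 * (L @ (Y[:, i] * Y[:, j])
                                    - Y[:, i] * LY[:, j]
                                    - Y[:, j] * LY[:, i])
    # The metric itself is the (pseudo-)inverse of the dual metric
    return np.linalg.pinv(Hdual)
```

The returned quadratic forms are exactly the ellipses drawn on the hourglass figure: a circular form means the embedding is locally isometric there, an elongated one shows the direction and amount of distortion.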
If the ellipse is distorted, it is telling you how much you have distorted your geometry at that point. So we are finding that in the middle this is fairly isometric, but as I move towards the edge of the hourglass I am starting to see that the metric is being effectively stretched, which means the space is being compressed; so what you are seeing is actually the opposite of what is happening. The larger the metric becomes, the more every unit of length is worth, so it means the space is being compressed compared to the original metric. And here at the edge we are starting to see some artifacts, because it is starting to be difficult to handle the curvature. When you are defining the Laplacian, you start to see points being connected that shouldn't be, because the manifold is not sampled densely enough. There is an implicit idea that this works well provided that the manifold is smooth and doesn't curve too fast; if it does, you need more sample points. Okay, so what does that mean in terms of the original data set I presented to you? Well, it means that if I look at the three embeddings that I first showed, we get a clearer idea of what is happening. If I use Isomap, I am seeing that effectively the embedding is fairly close to isometric. It is not quite correct: it is being stretched in this direction and compressed in that direction, but otherwise it seems like it is preserving the geometry of the data set. If I am using local tangent space alignment, I am seeing quite a fair bit of distortion at the edges here. And finally the Laplacian eigenmap, which is still one of my favorites, has a lot of distortion in it. So this now allows us to make actual explicit statements about what is happening in the embedding. >>: [inaudible] as far as the shape of the [inaudible]. So if the [inaudible] is distorted by the [inaudible].
It is distorted, but it is distorted in the same fashion at every point, so we can actually… >> Dominique Perrault-Joncas: Right, and actually that is an important point, because if it is only being linearly distorted along a given dimension, it is easy to use one point and then rescale everything, which I will do in a second. >>: [inaudible]. This is local time and space [inaudible] so because [inaudible]. Isometric [inaudible]. >> Dominique Perrault-Joncas: That is kind of what I am trying to show: we can actually tell whether we are doing a good job, even for the algorithms that are trying to be isometric. >>: [inaudible]. >> Dominique Perrault-Joncas: This one? >>: No. [inaudible]. >> Dominique Perrault-Joncas: [inaudible] Okay, yeah? >>: [inaudible]. >> Dominique Perrault-Joncas: I haven't tried it. To be able to define an algorithm that will always isometrically embed your data for any manifold is equivalent to Nash's embedding theorem, and so I take for granted that we don't have an algorithm that will always be isometric. There exist theorems where you can show an embedding is isometric if you are in L2, so if you have an infinite-dimensional space it is easier to construct something. But if you are trying to embed in a Euclidean space of lower dimension, it is very hard. You could use Nash's embedding theorem, which guarantees an isometric embedding, but it means twirling around many dimensions, and so numerically it is very, very unstable. >>: [inaudible]. To be isometric [inaudible]. [inaudible] is not isometric. [inaudible]. Non-spatial [inaudible]. Distance.
>> Dominique Perrault-Joncas: So I am not sure I am going to fully address your question, but I think the main point that comes out of this is: even assuming that you have an algorithm that will always give you an isometric embedding that preserves everything that you want, most likely you are giving yourself too much trouble. The idea is that, true, this embedding is no longer isometric, but I know by how much. That means I can still do computations in this embedding without having to worry about the fact that, if I ignore the quadratic form, it is not isometric. So instead of trying to find the best embedding, I can find a quick embedding and then correct it. The idea is that it can save you time. Isomap is already known to be a computationally intensive approach to the problem. You could instead use something as simple as random projections, which are super fast; as long as you have enough of them, you are defining an embedding, and then you can correct it. So that is where I am heading. I am not trying to say that there doesn't exist a perfect algorithm; maybe there is. I don't think it is easy to find one that will work all of the time, but I am trying to say that maybe we shouldn't try to find one, because we can do better. Given an embedding, we now know how to do all of the geometric operations you may care to do. So finding a quick and efficient algorithm and then correcting it is, I think, the best way to approach the problem. I don't know if that is how you feel. Does that answer your question? >>: [inaudible] how this approach can [inaudible]. [inaudible] approach. >> Dominique Perrault-Joncas: Yeah, I think we should cover that off-line. But to make a final point, and I will continue after that and we can talk: the geodesic is not the thing that I am the most interested in. The volume element, as Misha pointed out, may be the only thing that I actually care about.
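As an illustration of what "correcting" a computation with the metric means, here is a minimal sketch (function name and discretization choices are mine): the length of a polyline through the embedding, measured with the estimated quadratic forms instead of the plain Euclidean norm.

```python
import numpy as np

def corrected_path_length(path, H):
    """Length of a polyline in an embedding, corrected by the metric.

    path : (m, d) array of consecutive points along the path
    H    : (m, d, d) array, the estimated Riemannian metric at each point
    Each segment v is measured as sqrt(v^T H v), with H averaged over the
    segment's two endpoints, instead of the Euclidean norm ||v||.
    """
    diffs = np.diff(path, axis=0)          # segment vectors v_k
    Hmid = 0.5 * (H[:-1] + H[1:])          # metric averaged per segment
    seg = np.einsum('ki,kij,kj->k', diffs, Hmid, diffs)
    return np.sqrt(np.maximum(seg, 0.0)).sum()
```

When H is the identity at every point, this reduces to ordinary Euclidean arc length; where the embedding has compressed a region, H grows and the corrected length recovers what the distance should have been in the intrinsic geometry, which is why the different embeddings end up agreeing once corrected.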
LTSA seems to work initially but actually turns out to be slightly distorted. This can be corrected, as we pointed out: if the distortion is along the two dimensions, it is easy to use the metric to rescale everything, and that is what I did here. I have something that looks isometric but actually is slightly distorted, and then I used one point to transform everything, and it turns out that pretty much everywhere it is now isometric. But if I have something that is distorting too much, like the [inaudible] Isomap, then I can only apply this locally: if I use one point here to correct the space, I will be able to correct around that point, but the rest of the space remains distorted. That is very intuitive. But this is where I think it becomes slightly more interesting, which is to say: actually, I don't need to worry about the embedding; if I know the metric for a given embedding, I can do computations. So what I do is I consider an original manifold with a line around it, and I am trying to compute the geodesic. I can do it, say, in Isomap, which is supposed to preserve the geodesics, or in any other space. If I use the naive distance in the embedding, I am getting wildly different results. If I am using the shortest path in the graph defined by the embedding, I get very different results. But if I correct the distance by the metric, I am effectively getting the same thing. Now it is not perfect, but I am seeing that all of the embeddings now have the same geometry. I can compute the same geodesic irrespective of which embedding I am taking, and I can do this for volumes. >>: [inaudible] the kernel that you used to approximate the Laplacian is the right, well, it's the Laplacian that gives you the metric, the metric of interest [inaudible]. >> Dominique Perrault-Joncas: Yeah. So the idea is that the kernel, the Laplacian that I am using to define the metric, is kind of key, because it contains all of the geometry.
It doesn't--if there are errors it may no longer be the original geometry; it is an approximation of it, but in and of itself it will always compute the same geometry. That is actually a very important point, because it might turn out that the original geometry is not what you care about. That just means that you have to find a different Laplacian, and that object will give you the same geometry irrespective of what embedding you have. So I can play the same trick with the volume element, and effectively I am finding the same thing, with more errors, in part because I am using a coarser [inaudible] of the space. Effectively I cannot even compute the volume element for some embeddings, because it doesn't always make sense, but if I use the correction I am pretty much always in the same ballpark. So that kind of ends the geometry part, and now I am trying to move on and explain what it means in terms of learning. Now that I have established that the Laplacian contains all of the geometry, the idea is how you can use that to your advantage. It is known that in Euclidean space this operator, which involves the usual Laplace operator, if you look at this SPDE, defines a special Gaussian process. So I am assuming that this here is Gaussian white noise, and you can show, first, that this is a linear operator, which means the solution is a linear combination of the Gaussian white noise, and so it will be Gaussian. Then the question is what the covariance matrix is, and you can show that it is a Matérn Gaussian process.
So the idea is that this operator effectively defines what a Matérn Gaussian process is, and then you can use that to your advantage by saying: actually, I can replace the Laplacian by the Laplacian of my manifold, and now I have been able to define a Matérn Gaussian process on the manifold. I can use that covariance matrix to do semi-supervised learning, whereby I know the value of the function at certain points and I am trying to predict it over the rest of the manifold. >>: [inaudible] Gaussian process [inaudible]. >> Dominique Perrault-Joncas: So the Matérn refers to the covariance matrix, or the covariance function. A classic one is the squared exponential covariance function, but it has the disadvantage that it is infinitely differentiable, so it is a very smooth process. The Matérn is one where you can actually control how smooth it is: the alpha parameter here determines how many derivatives the Gaussian process will have. >>: Oh. Okay. I get it. >> Dominique Perrault-Joncas: So it gives you much more control, when you are trying to do prediction, over how smooth your kernel, or your Gaussian process, is, depending on how you want to think about it. So given the Laplacian, I can now define a Matérn process and I can define a covariance matrix. Actually, this should have been with respect to the manifold. But I can then do learning: predict what the value of a function should be at other points given what I know, based on the geometry. So implicitly I am learning the geometry of the manifold and I am using that to propagate the information. You can think of this as just being a kernel regularizer for your regression or your classification. What is actually interesting is that, because we now have the metric, you can also define the precision matrix not only on the points but throughout the embedded space.
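One way to realize this Matérn construction on a point cloud is to take a graph Laplacian L and form the covariance as a negative power of (kappa^2 I + L), mirroring the SPDE view in which applying (kappa^2 - Laplacian)^(alpha/2) to the process yields white noise. This is a sketch under my own naming and scaling conventions, not the talk's exact construction:

```python
import numpy as np

def matern_covariance(X, eps=0.5, kappa=1.0, alpha=2):
    """Matern-type covariance on a point cloud via the graph Laplacian.

    If A^(alpha/2) f = white noise with A = kappa^2 I + L, then
    Cov(f) = A^(-alpha). The alpha parameter controls the smoothness
    of the resulting Gaussian process.
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps)
    d = W.sum(1)
    # Symmetric normalized (positive semi-definite) graph Laplacian
    L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))
    A = kappa ** 2 * np.eye(len(X)) + L
    # Covariance = A^{-alpha}, computed via eigendecomposition for clarity
    evals, evecs = np.linalg.eigh(A)
    return (evecs * evals ** (-alpha)) @ evecs.T

def gp_regression(K, train_idx, y_train, noise=1e-3):
    """GP posterior mean at all points, given values at train_idx."""
    Ktt = K[np.ix_(train_idx, train_idx)] + noise * np.eye(len(train_idx))
    return K[:, train_idx] @ np.linalg.solve(Ktt, y_train)
```

This is the semi-supervised setup described above: observe the function (say, a head angle) at a subset of points, then let the Laplacian-derived covariance propagate those values over the rest of the point cloud along the learned geometry.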
That means that instead of just learning on the points, I can learn around the points. So if a new point arrives and I know where to embed it, I will immediately know what the value at that point should be, or what the class of that point should be. This is a slight departure from traditional semi-supervised learning, in that I am actually doing inductive learning, but in the embedding space. So here is an example of how this works out. I am using the three embeddings that we have followed throughout for the faces data set, and what I am doing is embedding the points for each embedding of interest and then applying a Matérn kernel with respect to the embedding space. So I am actually assuming that this is the correct geometry and applying a Gaussian Matérn kernel to predict what the value of the function should be. What I am doing effectively is keeping a little less than a quarter of the values of the head's position, and I am using them to predict what the value should be elsewhere. If I am using the embedding space, I am getting quite a fair bit of error, but if I am using the intrinsic geometry I am doing a lot better. So that is the idea: the original geometry of the data, for this particular data set, turns out to be the right way to propagate the information. >>: [inaudible]. >> Dominique Perrault-Joncas: This is how wrong a particular point is; because these are angles, it goes from 0 to 180, and it is telling you how far off you are compared to what the actual value at that point should be. >>: So these are the points, so you have, you say, I don't remember how many points… >> Dominique Perrault-Joncas: There are about 700. >>: 700 you use as your training sample [inaudible] and so these are just test points? >> Dominique Perrault-Joncas: I am actually showing you both the test and the training set.
So the test points are about, yeah, I probably should remove them, but I am using about a quarter of the points for training, and those will all effectively be right: you are going to have the right value at those points, so they are not contributing to the error. >>: [inaudible] eigenmaps these seem to not be performing well. Is it because of the sparsity on the right side? So what is…? >> Dominique Perrault-Joncas: I just think that it's not recovering the right geometry. First, the information doesn't propagate: the kernel is a stationary kernel, and so by stretching the space it is not propagating the information the right way. But you are right that the sparsity here is a problem, and what is happening is that effectively you are doing regression on a part of the space that is really part of the ambient space, because you would assume that there is no manifold here. Because your kernel is defined on the whole of the ambient space, you are not using this space intelligently. Whereas if you define it here, you are only defining it on the space you know; so this one is actually not inductive. The lower one here is just transductive, which means I can only compute it on the points I know. I think it is both the fact that you are computing it with respect to the ambient space and also that you are distorting the space too much. >>: So [inaudible] I can see [inaudible] from [inaudible] Gaussian [inaudible]. >> Dominique Perrault-Joncas: You raise a very important point, which is that, as I mentioned earlier, the original geometry is not necessarily the best geometry. I am focusing on it because the first step is asking: can I recover it? But second, what comes out of this is the fact that all the geometry is contained in the Laplacian; if you think a different geometry is more appropriate, you should focus on defining a Laplacian that is appropriate for that geometry.
Now, when you say you use your first eigenvectors of the Laplacian for your embedding and that helps with clustering, what you are doing implicitly is defining your geometry. The map that uses the eigenvectors to embed is effectively this map for this particular data set, and this is now a different geometry. It turns out that here it is the wrong geometry, but for clustering it tends to be better. So the question, and it is actually one of my open problems, is how can we use what we are trying to do to select what geometry we want, what Laplacian we want, for this particular problem. I think this is what comes out of this. I am focusing on recovering the geometry, but an offshoot of this is that we now have a way to characterize geometry, and we should think about what kind of operator we want, which then leads to what kind of Laplacian you want. >>: Did you ever think about the underlying process that generated the data? Like, when I look at this problem, to me it looks natural to think about some kind of model that generates the data, whatever presents different views of the same sculpture head, and essentially what you are trying to do is recover the parameters of, well, not really parameters, but the settings of that model when it generated this data. One of the settings of the model I need [inaudible] rotation at that degree, but there are also a lot of other things that could correspond to other dimensions [inaudible]. How does this map to that view of the world? >> Dominique Perrault-Joncas: So here I am using a very straightforward view of the sampling process. I am assuming IID sampling according to some distribution on the manifold, but you make a very interesting point that there might be a different process at play that is of interest. 
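The embedding referred to here, using the first nontrivial eigenvectors of a graph Laplacian, can be sketched as follows. This is a minimal Laplacian-eigenmaps-style illustration on synthetic data; the bandwidth value and the choice of symmetric normalization are assumptions, not taken from the talk.

```python
import numpy as np

def laplacian_eigenmaps(X, eps, d=2):
    """Embed X with the first d nontrivial eigenvectors of a normalized
    graph Laplacian built from a Gaussian kernel with bandwidth eps."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps)                  # kernel weight matrix
    dinv = 1.0 / np.sqrt(W.sum(axis=1))
    # symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(X)) - dinv[:, None] * W * dinv[None, :]
    vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                # skip the trivial bottom eigenvector

# noisy circle: a 1-D manifold sitting in 2-D
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 150)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * rng.normal(size=(150, 2))
Y = laplacian_eigenmaps(X, eps=0.2, d=2)
```

Changing the kernel, the normalization, or the bandwidth changes the Laplacian, and hence, as the speaker says, implicitly changes the geometry being defined.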
And I think the best answer to that is to say, well, can I get an idea of what operator is generating this, hopefully a linear operator, like a translation operator in your case, or in this case. And if you do that, in the same way that this defines a Gaussian process, if you have a linear operator you can define a different Gaussian process that represents the transfer of information better. Actually there is a paper on this by Schölkopf, I think a 2008 paper, that says, well, if we look at say a Kalman filter, which is actually a generative process more in line with what you are saying, you can turn this into a differential operator. That operator leads to a Gaussian process which is also equivalent to a kernel, and so there is a kernel that is intrinsic to your process. And so what I am using here is this operator, because it is the simplest one to use if all I am doing is interpolating between points. But if I know that there is something generating the points, I am going to use a different operator that tries to mimic what I think is the generator, and that will lead to a different regressor. I will get a different covariance matrix and I will be able to use that to effectively transfer the information. Hopefully this would be with respect to the Laplacian that represents the geometry, but it might be that it isn't. And if it isn't, that is when you start saying, well, if I want to do what you said but in a lower dimensional space, I have to be able to represent this operator with respect to the correct space. And then I can start computing the metric and what the operator is with respect to the metric. So I guess what I am saying is, for your operator to become geometry invariant, you may need to compute the metric explicitly. 
In this problem I didn't go through the whole issue of saying what the metric is at every point, because I only use the Laplacian, but there are cases where the process might require you to find a coordinate system, which is an embedding, and then compute the metric there to express the correct operator. Maybe I am telling you more than what you really wanted to know, but yes, I think there is a natural way to think about what you are saying, which is in terms of what operator generated the process. I guess I have summarized a fair bit already, but the original idea of this work was to say, well, we know that the embedding algorithms fail in some sense at recovering the geometry of the manifold, and most of the solutions have been about trying to find one algorithm that does better than another. Instead, what we say is: actually, don't worry about trying to find the correct algorithm; try to find how to correct it in a meaningful way, by defining a metric that is faithful to the geometry, the intrinsic geometry of the data, or the geometry of interest to you. This means that it frees you from having to use more complex algorithms. You can use simple algorithms and then recover the geometry for that embedding. It also means that we have in a sense unified all of the methods by saying that they can be made equivalent through this object, and now we can say to what extent; when I say they are equivalent, I mean only in the asymptotic limit. That means that there might be more bias, or more variance, for a particular embedding in the discrete case, and so we can start asking how much that matters in practice. And now we have a tool to compare the various manifold learning methods through the metric. It also gives us a natural way to define Gaussian processes or regularizers, by which I mean L2-type regularizers such as kernels, that are faithful to the intrinsic geometry or the geometry of interest. 
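The correction described above, attaching to each embedded point a metric recovered from the Laplacian, can be sketched as follows. This is a rough sketch under stated assumptions, not the talk's implementation: it uses the identity h̃_ij = ½(Δ(f_i f_j) − f_i Δf_j − f_j Δf_i) for the dual metric, a simplified kernel-graph operator stands in for a consistent Laplacian estimate, and the data and embedding are synthetic.

```python
import numpy as np

def dual_metric(L, F):
    """Estimate the dual metric h~ at every point from an operator L (n x n)
    approximating the Laplace-Beltrami operator and embedding coordinates
    F (n x d). The pushforward (Riemannian) metric is its pseudo-inverse."""
    n, d = F.shape
    LF = L @ F
    H = np.empty((n, d, d))
    for i in range(d):
        for j in range(d):
            # h~_ij = 0.5 * (L(f_i f_j) - f_i L f_j - f_j L f_i)
            H[:, i, j] = 0.5 * (L @ (F[:, i] * F[:, j])
                                - F[:, i] * LF[:, j]
                                - F[:, j] * LF[:, i])
    return H

# toy stand-in: kernel-graph operator with sign chosen so that L ~ Delta
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-D2 / 0.5)
L = W - np.diag(W.sum(axis=1))
H = dual_metric(L, X[:, :2])   # treat the first two coordinates as the embedding
g = np.linalg.pinv(H)          # metric (the carried quadratic form) per point
```

This makes concrete the cost mentioned next: in a d-dimensional chart you carry a d x d quadratic form around with every point.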
And the only challenge we get out of this is that now, if we actually work in a coordinate chart, we have to carry around a quadratic form of dimension d², and it also adds some complexity to the implementation of any method. So where I see this going, the question to me now is more about saying maybe the intrinsic geometry is not the correct one; what I care about is the geometry that helps me the most with my problem, and can I put a prior on what type of geometry is of interest and use the points to find which one is the correct one to use here. And how do I define a Laplacian that is robust to noise or variability? This is a tricky question, and it is why people don't always like manifold learning: the Laplacian, or the embedding, can be notoriously sensitive to noise. And I think there is some work being done in that respect, which is to think about what geometry you are trying to learn. So it is related to this, but there is some very nice work by Stefan [inaudible] that tries to define a geometry with respect to groups by asking, well, what are the natural invariants in my space, and that gives you a much nicer embedding, a much nicer geometry. And then I didn't discuss this, but I can show the consistency of this algorithm, although only for one embedding, and that is the Laplacian eigenmap; so for me it is a question of whether I can extend that to general embeddings, and then think about what the biases of my method are and [inaudible]. So I think that sums it up. [applause]. >> Misha Bilenko: Questions? >>: If typically you would use [inaudible] supervised where you are [inaudible] so if you wanted to push back [inaudible] the label dimension [inaudible] into the algorithm how straightforward or not would it be given all the [inaudible] in the supervised metric learning [inaudible]? 
>> Dominique Perrault-Joncas: So the idea is that if I don't have a supervised context, I will go unsupervised and I will just assume that one geometry is more interesting than another, either by argument or just because it is the one I have. If I do have a semi-supervised learning problem, that is where I am thinking, well, I can just do the unsupervised part and then use it afterwards to help me with the supervised part, or I can try to define, you know, a subset of geometries and then try to find the one that matches best. Implicitly, that is what people are already doing. What I am thinking of here is this: here I have the bandwidth. That is going to define my graph; that is going to help define my Laplacian. Now, if you are in semi-supervised learning, you are going to use cross-validation or some other method of model selection to find the optimal epsilon, and that is in itself defining a geometry. The geometry changes as you increase your bandwidth. It becomes less and less interesting at some point, but if it is too localized it becomes a geometry where the points are disconnected. And so I think this is already being done in effect. >>: [inaudible] when you have the locality being influenced by the supervision. >>: But you can look at it in a different way. You could say, since all this process is unsupervised, you could say one of the [inaudible] [inaudible] different purposes you want to compress it. So you want to find [inaudible] that would be useful for many tasks that you don't know [inaudible]. [inaudible] why do it this way [inaudible] might go directly to find a representation that will be best for the task that I am interested in, but if you have multiple tasks, or even unknown tasks, this is where this kind of [inaudible] representation might be interesting. 
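The bandwidth point above, that epsilon defines the graph and hence the geometry, down to whether the points are even connected, can be illustrated with a small sketch. The data, bandwidth values, and zero-eigenvalue threshold below are all synthetic choices for illustration.

```python
import numpy as np

def graph_laplacian(X, eps):
    """Unnormalized graph Laplacian from a Gaussian kernel with bandwidth eps."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def n_components(X, eps, thresh=1e-8):
    """Connected components = number of (numerically) zero Laplacian eigenvalues."""
    vals = np.linalg.eigvalsh(graph_laplacian(X, eps))
    return int((vals < thresh).sum())

# two well-separated clusters: a small bandwidth disconnects them,
# a large one merges them into a single connected geometry
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (30, 2)),
               rng.normal(5.0, 0.05, (30, 2))])
# n_components(X, eps=0.01) -> 2, while n_components(X, eps=50.0) -> 1
```

Selecting epsilon by cross-validation, as mentioned above, is therefore implicitly selecting among these geometries.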
>>: I guess another question would be then, can you have sort of a midway, where you would have a representation, but then the task could come in and that [inaudible]'s solution where you have it, you have part of the [inaudible], but then it could be corrected using this [inaudible] other task? >>: You have to keep also the original data, but if you keep the original data, what is the point of this? [inaudible] >>: Well, it depends on what that [inaudible] is [inaudible] if it's just compression or if it's, if it's just compression then sure as long as [inaudible]. >>: [inaudible] between the top and the [inaudible]. >> Dominique Perrault-Joncas: If it's not very related, you will just move in the space of potential geometries you are learning. I mean, implicitly here, as I said, you can rescale the graph so that the sampling density doesn't matter, but if you don't rescale it, it does matter. And there is actually a parameter that lets you tweak that, so you can let the sampling density affect the embedding. That is a clearer way of asking the question: to what extent do I want the sampling density to affect the representation of my data, and do I decide that with respect to a task or not? If I do it without respect to a task, then I say, well, maybe I know I will be doing clustering, but I don't know what the clustering is yet, and so I am going to let the density have an effect. >>: [inaudible] making assumptions here [inaudible] making the assumption that the task that will come will be expressed on the manifold and not on its embedding in the high dimensional space, because if, you know, we assume that I have a classification task, if to actually solve it I need [inaudible] embedding in the manifold of the original space, well, I just dumped this information when I did this [inaudible]. >> Dominique Perrault-Joncas: Wait. Are you implying that the ambient space of the original data is important? Is that what you mean? 
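The tweakable parameter mentioned above, which controls how much the sampling density influences the graph, can be sketched as a diffusion-maps-style renormalization. The convention below is an assumption for illustration: alpha = 0 leaves the density's influence in place, alpha = 1 divides it out so that only the geometry remains.

```python
import numpy as np

def renormalized_kernel(X, eps, alpha=1.0):
    """Gaussian kernel renormalized by a power of the estimated density,
    returned as a row-stochastic transition matrix."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps)
    q = W.sum(axis=1)                              # proportional to a density estimate
    Wa = W / (q[:, None] ** alpha * q[None, :] ** alpha)
    return Wa / Wa.sum(axis=1, keepdims=True)      # normalize rows to sum to 1

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
P = renormalized_kernel(X, eps=0.5, alpha=1.0)     # density divided out
```

Setting alpha between 0 and 1 lets the sampling density affect the embedding to a chosen degree, which is the knob being discussed.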
[inaudible] the objective… >>: [inaudible] it would make an assumption, what you would think is a reasonable assumption, that if I have low dimensional data embedded in a high dimensional space, then for most tasks we really only care about the low dimensional nature of the task and we do not care about the high dimension. So think about your faces, right? If, eventually, someone tells you the task is: pick a specific pixel and say whether the intensity of that pixel is higher or lower than a threshold. This has nothing to do with a low dimensional representation of the data, and any such embedding that you find, even if it [inaudible] embedding, will fail miserably on this task. So you are betting that any reasonable task you will be asked to do with respect to this data will actually care about this low dimensional information and not the high dimension. >> Dominique Perrault-Joncas: I think we just agree. My question being whether the ambient space of the data is important, and that is what you are saying here, I think. It is a fundamental question of whether the full space has some important information to tell you. >>: [inaudible] have to make the assumption that the ultimate, that the sufficient statistics for the task you want to perform are the measurements in the intrinsic low dimensional space. >>: It seems like a very reasonable assumption. This is going back to what you were saying: obviously, once I compressed, okay, I lost something. Can somebody tell me exactly [inaudible] on this? >>: [inaudible] compressed halfway in such a way that the rest of the compression could then be performed by the task. >>: Only if it is a [inaudible] less compression. >> Dominique Perrault-Joncas: I think what is explicit here is which part you are losing, which is the extrinsic geometry, which is the ambient space, but you are preserving all of the intrinsic geometry. 
It turns out that for high dimensional data it is often the case that the ambient space is actually washing out the information. If you try to do supervised learning in the high dimensional space, you can't propagate information, because you have so many dimensions to move around that nothing gets propagated, or it all gets propagated uniformly. So this tends to perform better. It might not be the right assumption, but we just don't have the tools to really express what is happening in high dimensional space. And so I feel that in many cases it turns out to be the best assumption we can make, because it is so hard to deal with high dimension and what it means to propagate information in high dimension. What is the covariance that is natural for a high dimensional space? >>: So you are saying that you were doing more work currently on comparing the [inaudible] kernels roughly? [inaudible] playing with the [inaudible] kernel [inaudible]. >> Dominique Perrault-Joncas: It is the same idea of using this [inaudible] GP and then comparing it to people who are doing kernels but who are using a Laplacian regularizer. Actually, Zhou and Belkin just came out with a paper in [inaudible] where they consider higher order regularization with the Laplacian. That defines a kernel, and this kernel is equivalent to the GP if you remove this K. So it is really an equivalent approach, is what I am saying. Was there more that you wanted to know about this, or… Okay. And this goes back to the Schölkopf paper that I mentioned, which says there is a link between Gaussian processes, kernels and differential operators, or regularizers. You can move between them pretty much as you will. >> Misha Bilenko: Thanks Dominique. >> Dominique Perrault-Joncas: Thank you. [applause].