
>> Chris Burges: Okay. We're going to get going. It's a pleasure to have Sumit Chopra here
today. He's visiting us for three days from the Courant Institute of Mathematical Sciences in New
York University. He's finishing up his PhD with Yann LeCun and he's going to talk about Energy
Based Models. And if you'd like to chat with him and you haven't had the chance yet there's still
a few open slots on the calendar. Just approach me after the talk and we'll set that up. Thank
you.
>> Sumit Chopra: Thank you very much, Chris. So yeah, I'll be talking about my work that I've been involved in over the last three-and-a-half, four years as part of my PhD under the
supervision of Professor Yann LeCun from NYU. And this is in collaboration with my colleagues I
had from the Computer Science Department and a part of it with our friends from the Economics
Department, namely Tumpy(phonetic), Professor Caplin and Professor Lee.
So yeah, the work was primarily in two parts. The first is learning in a relational setting. So we proposed a bunch of novel algorithms that can do regression in a relational setting in particular. And the second was learning a similarity metric discriminatively. And the underlying theme that connects these two works is the Energy Based Model framework, which
we've applied to both the examples. Hopefully I'll be able to communicate to you by the end that
we can do a bunch of cool things with such a framework.
So just to motivate you towards the two problems. In many real-world problems you can't assume that the data is independently and identically distributed from an underlying distribution D that you don't know. Examples include automatic fraud detection, viral marketing, collaborative filtering, web-page classification and many more -- real estate price prediction in particular. So what we have in these is samples related to each other in complex ways. And these relationships between samples influence each other's values of the unknown variables.
So for example, consider the web-page classification problem. You're given a bunch of web pages and their contents, and the problem is to label each web page, as in whether it is a commercial web page or a university web page and so on.
So consider a web page, along with its contents and its label, and suppose you also know the
links that this web page connects to. Right. So with the underlying assumption that linked web pages tend to discuss similar topics, with this link information you can say something about the labels of these other web pages, as well. Or in other words, there's a lot of information in this link structure that should be exploited, and not just an IID kind of thing. So the question is: can we exploit such information in addition to the individual features?
And as far as similarity metric is concerned. So suppose I give you a bunch of images and I ask
you the following question and that is: Give me a mapping that maps these images to a low
dimensional output space so that similar images in the input space are mapped to nearby points
in the output space and dissimilar images are mapped to far away points in the output space.
And note that the criterion of similarity and dissimilarity could be anything, as in this is something that I'll be giving you. For instance I could say that two airplanes are similar if they differ by one azimuthal angle or one elevation angle. Or I could say that two airplanes are similar if they have
the same lighting conditions.
So the bottom line is that the mapping should only be faithful to the similarity measure that I give
you and should ignore all the irrelevant transformations. And the third thing that I ask from you is
some sort of out-of-sample guarantee: given a new image whose relationship with respect to the training data you don't know, can you map this new image faithfully without retraining the system again? So that's a question that I'll be answering. And you can view this problem as equivalent to searching for a good feature space, whereby you would end up with similar objects clustered together, and hence classification or regression becomes easier in that space.
So very briefly, as I said, the underlying theme is Energy Based Models. So just a brief introduction of what they are. Suppose you're given a variable X and a variable to be predicted, Y. What the Energy Based Model says is that you assign an energy E to these two variables -- a scalar, un-normalized energy -- and that sort of captures the dependencies between these variables. And this energy function can be viewed as some sort of compatibility measure. So lower energy would imply high compatibility between the two values of the variables and high energy implies low compatibility. So in particular in this case you are given an image of an animal, which is observed, and you have the set of labels Y, which you want to infer, and your correct energy function should assign a low energy to the animal class and a high energy to all the others. And note that we don't really care about
whether the energy of an airplane is higher than the car or the car is higher than the airplane. All
we need to do is we need to ensure that the correct energy is lower than all the incorrect
answers.
So inference now for a new sample X would simply involve searching for a Y that produces the minimum energy. To learn such an energy function -- as far as learning is concerned -- it boils down to looking for an energy function that assigns low energies to the correct answers and high energies to the incorrect answers according to this inference algorithm.
So what this boils down to is the following. You have this observed variable XI, and suppose initially you start with an energy function where you have a higher energy given to YI, the correct answer, and a lower energy given to some other incorrect answer, YI bar, which we call the most offending incorrect answer, because this is like the most troublesome answer for your machine -- it is the incorrect answer with the least energy. And this is exactly what your machine would be producing right now as its inference.
So the learning should involve pushing down on the energies of the correct answers and pulling up on the energies of the incorrect answers to get this sort of desired energy surface. And this can be done by minimizing a loss functional with respect to the set of parameters, W, that define this energy function. That is the broad-based idea behind energy-based learning, and details can be found in the tutorial that we recently wrote on energy-based learning.
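To make that inference-and-learning picture concrete, here is a minimal sketch in Python (not the actual system: the linear energy, the toy data, and the update rule are illustrative assumptions). It scores each candidate label with an energy, predicts by taking the minimum-energy label, and then pushes down the energy of the correct answer while pulling up the energy of the most offending incorrect answer.

```python
import numpy as np

def energy(W, x, y):
    # E(W, X, Y): one linear scoring row per candidate label (an illustrative choice).
    return float(W[y] @ x)

def infer(W, x, num_labels):
    # Inference: pick the label Y with the minimum energy.
    return min(range(num_labels), key=lambda y: energy(W, x, y))

def update(W, x, y_correct, num_labels, lr=0.1):
    # Most offending incorrect answer: the lowest-energy label among the wrong ones.
    wrong = [y for y in range(num_labels) if y != y_correct]
    y_bar = min(wrong, key=lambda y: energy(W, x, y))
    # Push down the energy of the correct answer, pull up the most offending one.
    W[y_correct] -= lr * x
    W[y_bar] += lr * x
    return W

# Toy usage: 3 candidate labels, 4-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x, y_true = rng.normal(size=4), 1
for _ in range(20):
    W = update(W, x, y_true, num_labels=3)
print(infer(W, x, num_labels=3))  # converges to the correct label, 1
```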
So yeah, coming to the first part of the talk, which involves relational factor graph models for doing relational regression. As I said before, samples are not assumed to be IID in such a setting; rather they are related in complex ways. Furthermore, these dependencies could either be direct, as in given as part of the data, as in the case of the link structure of web pages, or they could be hidden, as in not given to you as part of the data. So now it is like a two-phase problem. First you
need to infer these relationships, one. And second, use these relationships to do some form of
collective prediction.
So in particular we apply our framework to the real estate price prediction problem, and I'll talk in detail about this problem. In fact I'll present my framework in the light of this problem for easier understanding. But yeah, I'd like to say that this is a fairly general framework and can be applied to other data, as well. For example we are right now trying to extend it to the social network data from Slashdot, for instance.
So yeah, of course a lot of previous work has been done in this area, more recently, but the trouble with most of these algorithms is that they only cater to classification problems, as in the outputs are discrete. And it's not straightforward to generalize them to continuous variables and hence use them for regression. So to this end we propose a novel framework for relational regression using factor graphs, we propose efficient inference and learning algorithms for the same, and being in an energy-based setting we are able to handle non-exponential families of functions as well -- not necessarily log-linear -- and apply it to the problem of real estate.
So -- yeah, so the question is how is this real estate price prediction relational? Well, clearly my
poor one bedroom, one bathroom house will be much cheaper than for example Chris' five
bathroom, five bedroom house. So -- or in other words, this aspect of the price is so-called
intrinsic price, that is a function of only its individual features like bedrooms, bathrooms and so
on. But also a one bedroom, one bathroom house in a poor locality is -- will be cheaper than a
similarly sized house in a very high-end locality. In other words, the price is also a function of the quality, or the desirability, of the neighborhood in which it lies. And this in turn is a function of the desirability of the other houses that make up that neighborhood, and this is where the relational aspect of the price comes in. And the second point is you really don't know this desirability, as in it is not given as part of the data; it is hidden, and so you need to infer that, as well. And so this is in line with the "location, location, location" mantra that most realtors have been using.
So keeping this in mind we model the price as a product of two quantities, namely its intrinsic price and the desirability of its location. Or, thinking in terms of the energy-based setting, what you have is an energy function E1 that captures the dependency of the price on the house-specific features, and an energy function E2 that captures the dependency of the price on the desirability, and you combine the two. Or more formally, these relationships between these variables can be captured in so-called energy-based factor graphs.
And so just to give you a short introduction of what a factor graph in an energy-based setting would look like. You have a bunch of variables for your problem. Some of them are observed. Others are unobserved. And you define an energy function, E, over all your variables. One way to do it is to define a global energy function over all the variables, but that can result in complications, for instance if each of your variables is very high-dimensional. If you were doing inference you would end up searching inside a very huge space, trying to search over all the possible combinations of variables.
However, suppose you know something about the structure of this underlying energy function, in the sense that only subsets of variables interact with each other. What you could do then is split this energy function into a sum of smaller energy functions, where each of them takes only a subset of variables into account. And then the final energy is nothing but the sum of these smaller functions. Each of these functions is called a factor, and it captures the dependency between the variables that it takes. So it's very similar to what a probabilistic factor graph would look like, where you have a huge joint distribution and you're factorizing it over subsets of variables to make it more manageable.
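As a toy illustration of that factorization (my own example, not the speaker's code), the global energy below is just the sum of two smaller factors, each of which touches only a subset of the variables:

```python
def factor_price(y, x, d):
    # Non-relational factor: compatibility of price y with features x and desirability d.
    return (y - (sum(x) + d)) ** 2

def factor_desirability(d, neighbor_ds):
    # Relational factor: desirability d should agree with the neighbors' desirabilities.
    avg = sum(neighbor_ds) / len(neighbor_ds)
    return (d - avg) ** 2

def total_energy(y, x, d, neighbor_ds):
    # Global energy = sum of the factors.
    return factor_price(y, x, d) + factor_desirability(d, neighbor_ds)

print(total_energy(y=12.0, x=[3.0, 4.0], d=5.0, neighbor_ds=[4.5, 5.5, 5.0]))
```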
So -- uh-huh?
>> Question: So this -- this bottom line --
>> Sumit Chopra: Uh-huh.
>> Question: -- you know, is this like a theorem or something that you can always represent?
>> Sumit Chopra: No. No. It's like --
>> Question: (Inaudible) -- the sum?
>> Sumit Chopra: No, it's not really a theorem. What it says is if you know something about the
structure of your energy function then you can factorize it. For example, in the case of real
estate price prediction we know that features of a house don't really interact directly with the
desirability of the location. Rather they interact with the price. So then you can split it into two
halves.
I mean, it's very similar to what you have in a probabilistic setting, right? You know the
relationship, when you know the dependencies between variables then one way to represent it is
through the entire joint distribution. However, if you know the link structure of the variables then
you can break it into a bunch of parts. That is provided you know the link structure or some sort
of dependencies between variables.
>> Question: We have probability theory and we know the rules that govern it and we can prove
that the two things are equivalent, I probably don't understand because I don't know the
underlying laws that govern these energy (inaudible) --
>> Sumit Chopra: Yeah. I mean, all you need to -- well in that case all you need to know is what
the dependencies look like, right? And then yeah, of course you can prove it. But here we define such an energy function, one that is the sum of the smaller energy functions.
>> Question: This is just a choice that you make.
>> Sumit Chopra: Yeah, yeah. It's essentially just a choice that we make. Yes.
>> Question: What functions (inaudible) what kind of constraints --
>> Sumit Chopra: No constraints.
>> Question: No constraints at all?
>> Sumit Chopra: Sorry? Positivity? Not really. Yeah, I mean there are a bunch of loss functions where you don't need to have a positivity constraint on the energy. So yeah, it's like a choice that we make, splitting this huge energy function into a sum of smaller energy functions using the prior knowledge of the problem at hand.
>> Question: So why are they (inaudible) -- function?
>> Sumit Chopra: Uh, yeah. Yeah. So...
>> Question: Well, it can't be arbitrary because on the left-hand side --
>> Sumit Chopra: Uh-huh.
>> Question: -- you have to look at all the possible values of X, Y and Z.
>> Sumit Chopra: Uh-huh.
>> Question: And you can calculate how many values that function can take. If it is an arbitrary function it can take that many values, but on the right-hand side we have much fewer.
>> Question: It's not an equation, it's just a definition of the left-hand side.
>> Sumit Chopra: Yes, yes, that's exactly.
>> Question: The right-hand side (indiscernible) --
>> Sumit Chopra: Yes.
>> Question: So that left-hand side can't be --
>> Sumit Chopra: Okay, okay. Yeah. So in the case of house price prediction what we have now is a factor graph for a single house that looks something like this, that takes features, price and desirability into account, and basically the sum of the two is your energy over all the variables. But here the desirability of the location in turn depends on the desirability of other houses. Right? Or in other words these variables interact with the factor graphs of other houses. So more formally this sort of thing is represented by what we call a relational factor graph, where the idea is to have a single factor graph that captures the dependency among all the training samples, and not just have one factor graph for every house.
So in particular this is how we define a factor graph for the house price prediction problem. For every house we assign a single factor, E(X, Y, Z). This is non-relational and parametric in nature, and it captures the dependency between the price, the individual features and the estimated desirability. So that's the non-relational factor, and this estimated desirability in turn depends on the actual desirability of the locations of the neighboring training samples. So to encode this dependency we define another factor and associate that with the house, E(Z, Z). This is the relational factor and is nonparametric in nature. We repeat this process for all the other houses to get this huge factor graph that captures both the individual dependencies and the dependencies between the desirabilities.
And as I said, now the energy over the entire set of variables is basically the sum of the energies of the factors. And yeah, so assuming that you've learned these -- so yeah, one more thing that I wanted to point out is that the EI(X, Y, Z) are parametric factors with parameters W, and they share the parameters among each other. So now for a new test house X0, the inference involves creating two new factors, building the links with its neighbors, and doing the following minimization over the unknown variables D0 and Y0 with respect to that house.
>> Question: W and the Zs.
>> Sumit Chopra: The Ws and the Zs. So yeah, the Zs can be viewed somewhat as parameters and sort of as hidden variables. Yeah. I mean, we use the Zs to compute this YI. So clearly for a test house this is like an approximation, because ideally what we would have wanted was a Z0 that interacts with the desirabilities of the training samples, but that would have led to a minimization over the entire set of Zs of the training samples to come up with the proper answer, Y0. And that obviously is infeasible if you do it with respect to the training set for every test point.
So we remove that dependency with respect to Z, and of course it is an approximation, but it makes sense from the point of view of house price prediction, because this training data is essentially historic data to us. Yes?
>> Question: (Inaudible) -- W is?
>> Sumit Chopra: The parameters of these factors.
So yeah, so the Zs are basically from the training data, which is historic data to us, and the test point would be some point in the future, maybe the distant future. So clearly the desirability of that point will not have an effect on the past desirabilities. So that makes --
>> Question: So here the dot samples are observed --
>> Sumit Chopra: Yes.
>> Question: And the (inaudible) samples are unobserved?
>> Sumit Chopra: Yes, yes, yes. So in particular the training involves minimizing an energy loss over the three sets of variables, including the unobserved variables, where the loss is nothing but the sum of the factors. And here we have a theorem that says that if both the factors are a quadratic function of D, then the second factor can be merged into the first and treated as a single factor. So what you have is that the energy loss now reduces to a minimization only with respect to the W's and Z's, with each house having only a single factor, EI bar, and I'll go into detail about what EI bar looks like in a moment.
The learning algorithm then is basically nothing but a generalized EM type of algorithm. In the E phase you fix W and minimize with respect to the Z's, and in the M phase you fix the Z's and minimize with respect to the W's.
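Here is a minimal coordinate-descent sketch of that alternation on a toy quadratic model (the model, loss, and step sizes are illustrative assumptions, not the actual house-price energy):

```python
import numpy as np

# Fix W and descend on the hidden Z's (E phase), then fix Z and descend on W (M phase).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                # observed house features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)    # observed log prices
W = np.zeros(3)                                             # parameters
Z = np.zeros(50)                                            # hidden per-house desirabilities

def loss(W, Z):
    # Energy loss: squared error of (intrinsic price + desirability), plus a
    # regularization term on Z standing in for the relational factor.
    return np.sum((y - (X @ W + Z)) ** 2) + 10.0 * np.sum(Z ** 2)

for _ in range(200):
    # E phase: gradient step on Z with W fixed.
    Z -= 0.005 * (-2 * (y - (X @ W + Z)) + 20.0 * Z)
    # M phase: gradient step on W with Z fixed.
    W -= 0.005 * (-2 * X.T @ (y - (X @ W + Z)))

print(round(loss(W, Z), 2))
```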
>> Question: What was D again?
>> Sumit Chopra: So D was like an estimated desirability of the house from its training samples.
>> Question: Okay.
>> Sumit Chopra: From its nearby training samples.
So the M phase, as I said, since the parameters are shared among the factors, is somewhat easier to compute and you can do it using (inaudible) descent over the W's. In the E phase, again, since the two factors are merged into one and you have a single factor, we show that learning again reduces to back-propagating gradients with respect to Z. But note here that the gradients are back-propagated over a bunch of samples and not just over a single sample. Yeah?
>> Question: (Inaudible) -- expectation make up the exterior? I mean --
>> Sumit Chopra: Yeah, it's like a proper E phase, but it is more like a coordinate descent kind of thing.
>> Question: (Inaudible).
>> Sumit Chopra: Yeah, yeah. Not really computing distribution as such. Yeah.
Um, so yeah, in particular the non-relational factor, E(X, Y, Z), is basically a squared difference -- we work in the log domain -- so the squared difference between the predicted log price and the actual log price. So this is the predicted price, where G is the parametric function, where the W comes into play, and this is sort of measuring the intrinsic price by taking into account only the house-specific variables, XH. And DI is the estimated desirability from its neighbors. The relational factor now is again a squared difference, between DI and this nonparametric function that basically takes as input the observed neighborhood features of the house, coming from the census tract, like median household income and so on, and also the learned Z's of the neighboring training samples, and does the (indiscernible).
So this is what a single factor now looks like. It takes the house variables into G to get the log of the intrinsic price, takes the neighborhood features and the Z's of the neighboring training samples into the nonparametric function to get the log of the desirability, and sums the two to get the log of the predicted price. And then the energy is simply the squared difference between the true answer and the predicted one.
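As a rough sketch of the single-house factor just described (the linear G, the Gaussian-kernel interpolation of the neighbors' Z's, and all the variable names here are my own illustrative stand-ins, not the actual parametric and nonparametric functions):

```python
import numpy as np

def G(W, x_house):
    # Parametric "intrinsic price" part (an illustrative linear choice).
    return float(W @ x_house)

def estimated_desirability(x_nbhd, neighbor_feats, neighbor_Z, bandwidth=1.0):
    # Nonparametric part: a kernel-weighted average of the neighbors' learned Z's,
    # based on observed neighborhood features (an illustrative smoother).
    d2 = np.sum((neighbor_feats - x_nbhd) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return float(w @ neighbor_Z / w.sum())

def factor_energy(W, x_house, x_nbhd, neighbor_feats, neighbor_Z, log_price):
    # Predicted log price = log intrinsic price + log desirability;
    # the factor's energy is the squared difference from the true log price.
    pred = G(W, x_house) + estimated_desirability(x_nbhd, neighbor_feats, neighbor_Z)
    return (log_price - pred) ** 2

# Toy usage with made-up numbers (bedrooms, bathrooms, log area; income, commute).
W = np.array([0.3, 0.2, 0.9])
x_house = np.array([2.0, 1.0, 6.8])
x_nbhd = np.array([0.8, 0.4])
nbr_feats = np.array([[0.7, 0.5], [0.9, 0.3], [0.8, 0.45]])
nbr_Z = np.array([0.6, 0.9, 0.75])
print(factor_energy(W, x_house, x_nbhd, nbr_feats, nbr_Z, log_price=13.0))
```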
Or to give you a little more intuition about what's happening in the relational factor: you have a bunch of training samples, each associated with a ZI. And now when a new sample comes you compute its neighbors and, using the Z's of its neighbors, you are doing this smooth interpolation. So what you are effectively doing here is learning this smooth desirability manifold over the entire geographic area. Yeah?
>> Question: (Inaudible) -- hard boundary like a railroad track or something?
>> Sumit Chopra: Yeah. So our algorithm doesn't take that into account. It essentially only takes the (inaudible) distance. But yeah, that's a part of the future work that we're working on -- not only incorporating boundaries, but also, for instance, right now the number of neighbors is fixed, which we compute using cross-validation, but for a bunch of houses how can you sort of incorporate a variable number of neighbors?
>> Question: (Inaudible) -- like in condo and some other place like a farm out in the middle of
nowhere (inaudible)?
>> Sumit Chopra: Uh, yeah, but for the moment we're only working with single-family residences
so that sort of removes.
>> Question: (Inaudible).
>> Sumit Chopra: Yeah. Yeah.
>> Question: (Inaudible) -- are a dime a dozen with lots of examples of the same thing and then
there's others which is very unique.
>> Sumit Chopra: Yeah, yeah.
>> Question: How can you work with those (inaudible)?
>> Sumit Chopra: Hmmm. That's a good question that --
>> Question: (Inaudible) across --
>> Question: It looks very smooth.
>> Sumit Chopra: No, this is not the actual manifold that we've learned.
>> Question: What is it?
>> Sumit Chopra: This is just to show you that it's a manifold, just a cartoon here basically.
>> Question: Oh.
>> Sumit Chopra: So yeah.
>> Question: So you could have very abrupt changes this street is (inaudible).
>> Sumit Chopra: Yeah.
>> Question: -- capture that kind of --
>> Sumit Chopra: For the moment, no. But yeah, as I said, that is a part of -- we'll see. So yeah, so now as you learn this manifold, for a new house you have its house-specific features. You plug them into the G function to get the intrinsic price. You plug in the location into this manifold to get the desirability.
So yes -- so learning now minimizes a simple energy loss with some regularization that ensures smoothness over the manifold, and the E phase, which is the minimization with respect to Z, now reduces to a quadratic program that we solve using conjugate gradient.
And yeah, so coming to your point -- well, not really your point -- but essentially what we are doing here is maximizing the conditional likelihood of the unobserved variables given the observed variables, where the likelihood is defined through the Gibbs distribution marginalized over the hidden variables. And this is equivalent to the usual distribution where the energy now incorporates the marginalization -- this is like the free energy, if you want -- with MAP estimation with respect to the hidden variables.
>> Question: (Inaudible).
>> Sumit Chopra: Uh, yeah. Here they are. Yeah, yeah. But yeah in a general sense they
might not be. Well it's square distance, though. Yeah.
So this is done by minimizing the negative log-likelihood loss, which is obviously difficult because of this log of the partition function. But here we note that since the energy is (inaudible), the contrastive term vanishes when you are computing gradients. So what you have is a simple energy loss along with a MAP estimation.
So coming to the experiments. We tried it on the real-world data set provided to us by FirstAmerican.com, I think. So it included transactions from Los Angeles County in the year 2004, and since it's real-world data it's fairly diverse: it spans 1,754 census tracts and 28 school districts. And minimal preprocessing was done -- for example the price, area and income variables were mapped into the log domain, one-of-N encoding was used for the non-numeric discrete variables, and we used only single-family residences -- and we sorted the data according to the sale dates and took the first 90% as the training set and the remaining 10% as the test set.
And the house-specific variables that we include were the usual stuff -- living area, bedrooms, bathrooms and so on -- and the neighborhood variables came from the census tract and school district information, like median household income and average time to commute to work.
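To make the preprocessing and the split concrete, here is a small sketch (the column names and toy rows are made up for illustration; the actual schema of the FirstAmerican.com data isn't shown in the talk):

```python
import numpy as np
import pandas as pd

# Log-transform price/area/income, one-of-N encode discrete variables,
# sort by sale date, and split 90/10 in time.
df = pd.DataFrame({
    "price": [500_000, 750_000, 320_000, 1_200_000],
    "living_area": [1400, 2100, 900, 3800],
    "median_income": [62_000, 80_000, 45_000, 120_000],
    "property_type": ["sfr", "sfr", "sfr", "sfr"],
    "sale_date": pd.to_datetime(["2004-01-10", "2004-03-02", "2004-06-15", "2004-11-30"]),
})
for col in ("price", "living_area", "median_income"):
    df[col] = np.log(df[col])
df = pd.get_dummies(df, columns=["property_type"])   # one-of-N encoding
df = df.sort_values("sale_date")
split = int(0.9 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
```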
>> Question: (Inaudible).
>> Question: Uh, yes.
>> Question: (Inaudible) -- because I thought there was a factoring between house specific
features and then location.
>> Sumit Chopra: Yeah, yeah, yeah, yeah.
>> Question: -- that's a little strange.
>> Sumit Chopra: No, yeah, maybe I'm wrong here. It does not. Yeah. It does not.
>> Question: (Inaudible) --
>> Sumit Chopra: One here. I mean, the data set was spanning just the one year.
>> Question: Okay.
>> Sumit Chopra: So we take the first 90% which boils down to around 42 or 43 weeks. Yeah.
>> Question: Did you use the previous set price?
>> Sumit Chopra: Yes, that's we used that.
>> Question: Did you have the data for previous sale?
>> Sumit Chopra: Yes.
>> Question: So this variable list doesn't exactly say that; it says previous sale price but not when.
>> Sumit Chopra: What didn't say that?
>> Question: The (inaudible) sale was.
>> Sumit Chopra: You mean the date?
>> Question: Yeah.
>> Sumit Chopra: Yeah. We don't use the date.
>> Question: Well that's important, isn't it?
>> Sumit Chopra: Um, yes.
>> Question: You know the original owner and someone lived there 50 years and died in the
house, the previous set price is going to be different than if it sold last year.
>> Sumit Chopra: Yeah, yeah, I agree. We should, yeah. In fact --
>> Question: Sort of be a sampling issue, too. Didn't you say you took the first 46 weeks as training and the rest as test data?
>> Sumit Chopra: Roughly, it boiled down to that; the first 90% of the houses as training.
>> Question: In region where prices are (inaudible) higher, lower, mid-stream, might be
(inaudible)?
>> Sumit Chopra: Ummm, you mean because of seasonal changes?
>> Question: (Inaudible) --
>> Sumit Chopra: Yeah, like I mean if you do it the other way, as in you just randomly pick, then you are not really doing prediction in that case. It will be like doing a (inaudible) prediction kind of a thing. Right? So yeah, I mean one sort of drawback in this is that we only have a single year of data, so you can't do much, such as encode inflation or seasonal changes. But right now that is again future work, where we are in the process of gathering data from the past 30 years. We obviously will be encoding features like time, like inflation and seasonal changes. Yeah.
So and yeah, you are right, I mean the previous sale price should somehow be weighted by when the thing was sold. Yeah. Yeah. So yeah, and the bunch of baseline methods that we compare to are those that have normally been used in the past for this particular problem, namely nearest neighbor -- you pick the nearest training samples and average the price -- linear regression --
>> Question: (Inaudible).
>> Sumit Chopra: In location.
>> Question: The location --
>> Sumit Chopra: Just physical location.
>> Question: (Inaudible) --
>> Sumit Chopra: Actually we tried both and I think location does a much better job than -- yeah.
Locally weighted linear regression, where you fit a locally linear model over the space, which is globally nonlinear, and a fully connected network. And what we report here is, for every house, the absolute forecasting error, which is the absolute error divided by the actual value, so that takes into account if there are any outliers in price. And in every column we report the percentage of houses with less than 5%, less than 10%, less than 15% error and so on. So clearly you would want these numbers to be higher, as in more houses should have a smaller percentage error. And we see that we do a fairly better job as compared to the other algorithms. And ours is hybrid in the sense that it's a combination of two things: the nonparametric model that computes the desirability and the parametric model. Uh-huh?
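For reference, a tiny sketch of the reported metric -- the absolute forecasting error and the per-threshold columns just described -- using made-up prices:

```python
import numpy as np

# Absolute forecasting error = |pred - actual| / actual, and the fraction of
# houses whose error falls under each threshold (toy numbers only).
actual = np.array([500_000, 750_000, 320_000, 1_200_000], dtype=float)
pred   = np.array([520_000, 700_000, 300_000, 1_000_000], dtype=float)

err = np.abs(pred - actual) / actual
for thresh in (0.05, 0.10, 0.15):
    frac = np.mean(err < thresh)
    print(f"houses with error < {int(thresh * 100)}%: {frac:.0%}")
```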
>> Question: (Indiscernible) a baseline of what the list price is for the sale? I guess the question is how good are the appraisers? How good --
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: Also, what models do real estate companies use? They must have models they
trust because typically they might set the price high so that (inaudible) --
>> Sumit Chopra: Yeah.
>> Question: But I'd be surprised if this list price is off by 15% in 80% of the cases.
>> Question: Okay. Let's -- companies must have -- is it all local expertise or do they have models that they use (inaudible)?
>> Sumit Chopra: I think it's a -- well, I think it's a bit of both. But I'm not sure about the models
they use, because obviously there is no way to have access to them, other than those which were traditionally published in the literature that the economists have used.
Yeah, of course, yeah, yeah.
So here what we show is the learnt desirability on the test houses. Each point is a test house, and it's color coded according to the value of its estimated desirability. So red means high desirability, blue means low desirability. And if you're familiar with the Los Angeles area, it's doing something really reasonable. Areas like Beverly Hills, Santa Monica and Malibu along the coast, and Pasadena, are all red, indicating they are highly desirable. Areas like downtown and down in the desert are all blue, while the valley is like moderately desirable. So that's something interesting we thought was happening with this model.
And another thing that we did with this was try to answer a typical seller's dilemma, like whether making a particular modification to a house will increase its value or not, and if so then by how much. So what we do is, once we've trained the model, we bump up the value of that attribute by one unit and ask the model to predict the value of the (inaudible) house. And we also have the original predicted price, and we compute the sensitivity ratio, which sort of measures the expected gain in price per unit change in that attribute.
So what we show here is the bedroom sensitivity manifold. Again each point is a house, color coded according to its sensitivity. So you see that in the downtown area, which is fairly congested, adding one more bedroom to a house is much more valuable than adding another bedroom to a five bedroom mansion out in the desert or out in the valley. So yeah, again we thought that something interesting was going on.
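As a rough illustration of the bump-one-attribute sensitivity computation described above (the predict() function and its weights are my own stand-ins, not the learned model):

```python
# Bump one attribute by one unit, re-predict, and report the relative gain.
def predict(features):
    # Stand-in price model: weighted sum of a few attributes (illustrative only).
    w = {"bedrooms": 40_000, "bathrooms": 25_000, "living_area": 150}
    return sum(w[k] * v for k, v in features.items())

def sensitivity(features, attribute):
    base = predict(features)
    bumped = dict(features, **{attribute: features[attribute] + 1})
    return (predict(bumped) - base) / base  # expected relative gain per unit change

house = {"bedrooms": 2, "bathrooms": 1, "living_area": 900}
print(f"bedroom sensitivity: {sensitivity(house, 'bedrooms'):.2%}")
```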
Yeah, and that ends my first part. So as part of the future work, obviously one straightforward extension is to include the time variable, and second, as we discussed, to incorporate the hard boundaries and have sort of a dynamic neighborhood for every house rather than a static one. And yeah, we've been planning to extend this technique to other domains, like, as I said, the Slashdot data we have in collaboration with the Stern School of Business. So the idea there is, given a whole bunch of comments by different users and the source article, you want to come up with a prediction of the rating that would be given to each comment by different moderators. So we model this problem in the following way: you have a comment whose rating would depend not only on the preceding comment
and the original article, but also on the so-called mood of the user or his or her intellectual ability,
which is hidden. As in some users tend to generally write funny comments. Some users tend to
write generally stupid comments, so on and so on. So hopefully we plan to extend this to sort of
capture that mood of the user.
Yes. So the second part involves learning a similarity metric discriminatively, where we designed a technique called DrLIM, which stands for Dimensionality Reduction by Learning an Invariant Mapping. Hopefully I'll be able to convince you that this is a rather intelligent DR. So yeah.
So as I said, given a bunch of images, can you generate a lower dimensional mapping so that similar objects are closer to each other and dissimilar objects are further apart? And also have some out-of-sample guarantee for this problem.
Well, so you might say that okay, there are a whole lot of previous algorithms, and you pick one and provide it with the similarity and you get the answer. Okay. Fair enough. So I pick my favorite algorithm, LLE. I provide to LLE the explicit information that two planes are similar if they differ by one azimuth angle or one elevation angle. And this is what I get as output. It has completely ignored the azimuth and elevation information and rather clustered the points according to the lighting conditions. And same here. I mean, it is a highly degenerate manifold that LLE constructs.
So the question is, what went wrong here? The trouble with LLE and most of the other algorithms here is that they rely on a computable distance metric in the input space. In the case of LLE it's the Euclidean distance, and hence you see the lighting being the major factor in the clustering.
Well, there are those that don't really depend on a distance measure, but they do not generate an explicit mapping for you, so you don't have any out-of-sample guarantees for such things. And just to convince you that these requirements are important and not really just for generating pretty pictures: you can have certain classification or verification problems where the number of classes is very large, the training samples per class have large variability among them, and you also have a bunch of unseen classes you are not trained on. For example, in face verification you train on a bunch of subjects and you test on a bunch of subjects you have not seen. You want to have that out-of-sample guarantee.
So yeah, just to summarize the objective once again. We want a mapping from the higher dimensional space to a lower dimensional space, which maps similar samples -- and the similarity could be anything -- to nearby points in the output space and dissimilar samples to faraway points. And it should not require an arbitrary computable distance metric in the input space, and hence should be invariant to irrelevant transformations, and it should have some out-of-sample guarantee.
So what we propose is a simple three-step algorithm. The first step involves building a neighborhood graph: based on whatever similarity you choose, you create similarity links among the samples, and all the other pairs of samples are considered dissimilar to each other. Step two involves choosing a parametric function GW that maps the higher dimensional points to the lower dimensional output space. Step three involves training the parameters W so that similar points are mapped together and dissimilar points are mapped far apart.
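A minimal sketch of step one, building the neighborhood graph (my own illustration: the k value and the Euclidean criterion are assumptions, since in general the similarity links can come from any criterion you choose, such as temporal adjacency or known transformations):

```python
import numpy as np

def build_neighborhood_graph(X, k=5):
    # Link each sample to its k nearest neighbors under some chosen similarity
    # (Euclidean here, purely as an example); every unlinked pair is treated
    # as dissimilar during training.
    n = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    similar_pairs = set()
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            similar_pairs.add((min(i, j), max(i, j)))
    return similar_pairs  # all pairs not in this set are "dissimilar"

X = np.random.default_rng(0).normal(size=(20, 10))
print(len(build_neighborhood_graph(X, k=3)))
```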
So the question remains: what goes inside this GW and how do you train it? Pretty much anything can go inside GW; it could be either a linear function or a convolutional network. It completely depends on the problem that you have at hand, for example -- yeah. And as far as training is concerned, we use the so-called Siamese architecture that was first explored by Bromley. So what it does is it keeps two identical copies of the parametric function GW that share the same set of weights, and you have a pair of input images, two of them which are similar or dissimilar. You plug them in and generate the features in the output space, and the energy is given by any distance measure in the output space. So note that this measure is in the output space rather than the input space, and to learn these weights you minimize this contrastive loss. What this loss is doing is: if you have similar images, then the label YI associated with these images is 0 and this part of the loss function is activated, which is nothing but a quadratic loss. So minimizing this loss is equivalent to minimizing this energy, or this distance in the output space.
However, if the samples are dissimilar, YI is one, this part is activated, and minimizing the loss now increases the energy, or the distance in the output space, but only up to some margin M. That's because we are seeking a smooth manifold in a bunch of our experiments, so you don't want to push the dissimilar samples arbitrarily far apart and hence generate isolated clusters. Although we might want clusters in a bunch of situations, which I'll talk about.
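A compact sketch of the Siamese training just described, assuming a plain linear GW, a unit margin, and hand-rolled gradient steps (all illustrative choices, not the actual convolutional network): the loss pulls similar pairs together with a quadratic term and pushes dissimilar pairs apart up to the margin.

```python
import numpy as np

def contrastive_loss(z1, z2, y, margin=1.0):
    # y = 0 for similar pairs, y = 1 for dissimilar pairs; D is the output-space distance.
    D = np.linalg.norm(z1 - z2)
    if y == 0:
        return 0.5 * D ** 2                       # pull similar pairs together
    return 0.5 * max(0.0, margin - D) ** 2        # push dissimilar pairs apart, up to the margin

def train_step(W, x1, x2, y, lr=0.05, margin=1.0):
    # Two identical copies of G_W (here a plain linear map) share the same weights.
    z1, z2 = W @ x1, W @ x2
    diff = z1 - z2
    D = np.linalg.norm(diff) + 1e-9
    if y == 0:
        grad_z = diff                              # gradient of 0.5 * D^2 w.r.t. z1
    else:
        grad_z = -max(0.0, margin - D) * diff / D  # gradient of 0.5 * max(0, m - D)^2 w.r.t. z1
    # Gradients flow through both copies (with opposite sign for the second input).
    W -= lr * (np.outer(grad_z, x1) - np.outer(grad_z, x2))
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8)) * 0.1
a, b, c = rng.normal(size=(3, 8))
for _ in range(100):
    W = train_step(W, a, b, y=0)   # a and b labeled similar
    W = train_step(W, a, c, y=1)   # a and c labeled dissimilar
print(np.linalg.norm(W @ a - W @ b), np.linalg.norm(W @ a - W @ c))
```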
>> Question: (Inaudible) -- situations where it's not as similar or dissimilar (inaudible) too much?
I am moving, I mean, similar to the smooth thing.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: (Inaudible) points.
>> Sumit Chopra: Yes. Yeah, that is one thing: the algorithm assumes anything that is not labeled similar is dissimilar. Yeah, yeah. But hopefully that thing might be taken care of by this margin, because you're not really pushing all the points very far apart. So you'll be generating a smoother manifold, so hopefully you'll have -- you know, you are right. Yeah.
So yeah, so that's it. That's the algorithm. And we tried this thing on the 4's and 9's digits from the MNIST data set -- 4's and 9's because they are fairly similar to each other even when you see them, and hence pose a reasonably difficult task. So we took 3,000 randomly chosen samples for training and 1,000 samples for testing, and the GW was a four-layer convolutional network in our case.
So, and for a sanity check, we first computed the nearest neighbors in the input space by the Euclidean distance between them. So what you get is a smooth manifold that separates the 4's and 9's reasonably well. These are test samples, as in you don't know the relationship between any two dots in this manifold, nor do you know the relationship between any of the dots and the previous training set. So yeah, and besides the separation it has a smooth change from slanted to straight and so on.
>> Question: So you're putting data -- then I mean your training labels are which pairs are similar. So does this mean that all 3,000 by 3,000 pairs are labeled, or are you just using the five nearest neighbors as a proxy, where you are going in and saying that the five nearest neighbors are in fact similar?
>> Sumit Chopra: Yes, only the nearest neighbors are similar.
>> Question: (Inaudible) --
>> Sumit Chopra: It's sparse, but hopefully a connected graph over the digits, yeah. Yeah.
>> Question: Certain amount of -- well, I mean for instance if they were rotated 4's and things like that, right?
>> Sumit Chopra: Uh-huh.
>> Question: Automatically labeling the neighbors in that way wouldn't -- wouldn't really get you there, because those neighbors are chosen by Euclidean distance in the input space.
>> Sumit Chopra: Exactly. As I said, this is just a sanity check; the prior knowledge thing comes later on, and that will hopefully convince you.
So now what we did was we explicitly translated the images by minus 6, minus 3, plus 3 and plus 6 pixels. And again, for the purpose of a sanity check, we again computed the Euclidean neighbors, and what we get is these five clusters: each of the clusters corresponds to one of the four translations, and the center cluster to the original images. And furthermore, the images within each cluster are fairly well separated, and the order in which the clusters are arranged is exactly according to the way they are translated -- like this is minus 6, minus 3, 0, plus 3 and plus 6. So this is sort of reinforcing the fact that nearest neighbor in Euclidean space might not be a good idea if you have these sorts of complicated translations.
>> Question: (Inaudible) -- is an issue then with respect to learning invariance, because in fact you would want the 4's of the plus 3 to be very close to the 4's in the minus 3 if they are the same digit. And because that would be real invariance, right? You would in fact want those 4's to be almost on top of each other, because you would want whatever features are selected -- selected through (inaudible).
>> Sumit Chopra: That's my next slide.
>> Question: Oh, sorry.
>> Sumit Chopra: So now finally what we do is we inject prior knowledge, and what we say is each sample is a neighbor of its five Euclidean neighbors; in addition, each sample is also a neighbor of its own shifted versions and of the shifted versions of its five Euclidean neighbors. So what you have is exactly what we want, a well separated manifold, and if you zoom inside this you get identical digits that are shifted versions of one another, and that is exactly what we wanted. So yeah, and this is of course using the prior knowledge. And this is what we get if you use similar prior knowledge for LLE: a completely degenerate solution, for which it is difficult to put into words what it's doing.
Another experiment used a little more complicated data set, consisting of airplanes from the NORB data set, and we project into a 3D space. So the airplanes consisted of 972 images with 18 azimuth angles, nine elevation angles and six lighting conditions. And how we generate the neighborhood graph is by saying that two planes are similar if they differ by one step in azimuth or in elevation, and we explicitly don't give any lighting condition information. And what we get is a very nice 3D cylinder: along the rim of the cylinder the planes are arranged according to their azimuth, and along the height of the cylinder the planes are arranged according to their elevation, and it ignores the lighting conditions. It is effectively recovering the way we generated our data.
And just for reference, once again this is what LLE would have given if -- and the last application for this was face verification, where the task is to accept or reject the claimed identity of a person in an image. So given a pair of images, your machine would say yes or no, whether they are the same person or not. And of course it is a difficult problem because you could have very large variability in the data set. Like you could have artificial occlusion, like a face scarf or sunglasses, and rather animated expressions. And there are a large number of classes, and there are even unseen classes where you've not trained on those subjects.
And the training was very similar, other than this loss function now. So when you have a dissimilar pair you actually want discrete clusters for every subject in the feature space. So you are essentially pushing the dissimilar pairs apart as much as possible. That's the only difference between the two. And yeah, for similar pairs you have the usual quadratic loss.
So among the various data sets, namely AT&T, FERET and Purdue, I'll discuss the Purdue data set, which was the most challenging. It consists of around 136 subjects, and they have a very high degree of variability, as you can see, for every subject. And we picked 96 random subjects for training and 40 for testing. And this is what we get as far as the performance is concerned. At a 10% false accept rate you only falsely reject 11% of the pairs. And of course as you tighten this, the false reject rate increases, as well. But, I mean, to convince you a little more of what it's doing: it's correctly classifying this as a genuine pair, this as a genuine pair, which is difficult even for a human, and this as a genuine pair, and it is also able to classify this as an impostor pair, and these are fairly easy. Well, this one is not, maybe. So -- yeah.
So there are a whole bunch of extensions to this idea. For example, you could use it to do automatic object category detection, as in generating a bunch of invariant features for an object. So what you have is a moving camera that takes pictures of different objects at different angles, and you have a connected neighborhood graph, with two images being defined as neighbors if they are temporally adjacent to each other. And then what you would hope to see is a cluster for every such object, with each cluster encoding these invariant features. And other areas where it can be used beyond images are, for example, information retrieval, where you are doing semantic hashing for documents -- you just need to label whether two documents are similar, and that could be any arbitrary criterion. And natural language processing: in particular, very recently people from NEC research have used this for semantic role labeling. This is the work of Jason Weston and Ronan Collobert, appearing in this year's ICML. So what they do is they train a deep architecture for doing semantic role labeling, and in addition to the usual supervised learning of this deep architecture, for every layer they also have this DrLIM training, which they call the M(inaudible) layer. So when you back-propagate the gradients through both this part and this part, you hope to get features over here that are more meaningful or more consistent, both with respect to the supervision and also with respect to the similarity and dissimilarity. And they pretty much beat the state of the art for semantic role labeling using this technique.
And yeah, so finally I'd like to end my talk by discussing a bunch of things that I'll be interested in doing, which involve basically designing efficient inference and learning algorithms for large-scale data sets, primarily involving real-world data sets, and solving interesting questions -- as in (inaudible) classification and regression are the fundamental issues, but you can go beyond that, for instance, with respect to house prices, predicting how the neighborhoods change dynamically with respect to the demographic movements and such. And yeah, exploiting the underlying structure that's there in most real-world data sets and not really using the simple IID assumption, and yeah, using energy-based models, deep architectures, which I've been involved in as a side project, and probabilistic models. So yeah. And to show you a really nice demo that myself and (inaudible) had (inaudible). So these are like the images of planes, the neighborhood relationship is again the azimuthal angle, and what you are seeing here is, after every epoch, how the DrLIM training is going ahead.
So I guess we'll have a circle in the end that basically arranges the planes according to their angles. Just to fast-forward it: initially everything is random, now it's trying to sort of unwind the loops, now it has three loops remaining, and finally it -- oops, something is stuck I think here. So yeah, so finally in the end what you have is a circle, as expected, and then it's basically fine-tuning its parameters to really get the (inaudible), and the learning rate reduces, so what you have is a circle.
Yep. That's it. Thank you very much. (Applause)
>> Question: So this last feels kind of reminiscent of channel equalization. Do you know about
channel equalization?
>> Sumit Chopra: Uh-huh.
>> Question: Used in modems. The idea was we all once were looking for features that work across all the variations of this.
>> Sumit Chopra: Uh-huh.
>> Question: And in channel equalization what you try to do is to model the noise process.
>> Sumit Chopra: I see.
>> Question: Then you process then you know what's going on here and you can work out the
combination of the two and then you can sort of back infer what is going on.
>> Sumit Chopra: I see.
>> Question: And I think that maybe -- it's much easier to model the noise process than to find
what features would be invariant across it.
>> Sumit Chopra: Hmmm.
>> Question: So what you are doing in this is sort of giving the system a chance to learn the noise
process.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: Maybe it would make even more sense to just model the noise process; correct?
>> Sumit Chopra: And by noise here you will be --
>> Question: One case you had was the shifts.
>> Sumit Chopra: Yes.
>> Question: And the other case was the lighting.
>> Sumit Chopra: Uh-huh.
>> Question: And in general there would be a noise process dependent upon the application.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: In modems it was something else.
>> Sumit Chopra: Yeah. (Inaudible).
>> Question: Thank you very much.
>> Sumit Chopra: Thanks a lot.