
>> Chris Burges: Okay. We're going to get going. It's a pleasure to have Sumit Chopra here
today. He's visiting us for three days from the Courant Institute of Mathematical Sciences in New
York University. He's finishing up his PhD with Yann LeCun and he's going to talk about Energy
Based Models. And if you'd like to chat with him and you haven't had the chance yet there's still
a few open slots on the calendar. Just approach me after the talk and we'll set that up. Thank
you.
>> Sumit Chopra: Thank you very much, Chris. So yeah, I'll be talking about my work that I've been involved in over the last three-and-a-half, four years as part of my PhD under the
supervision of Professor Yann LeCun from NYU. And this is in collaboration with my colleagues I
had from the Computer Science Department and a part of it with our friends from the Economics
Department, namely Tumpy(phonetic), Professor Caplin and Professor Lee.
So yeah, the work was primarily in two parts. The first is learning in a relational setting. So we proposed a bunch of novel algorithms that can do regression in a relational setting in particular. And the second was learning a similarity metric discriminatively. And the underlying theme that connects these two works is the Energy Based Model framework, which
we've applied to both the examples. Hopefully I'll be able to communicate to you by the end that
we can do a bunch of cool things with such a framework.
So just to motivate you towards the two problems. In many real-world problems you can't assume that the data is independently and identically distributed from an underlying distribution D that you don't know. Examples include automatic fraud detection, viral marketing, collaborative filtering, web-page classification and many more -- real estate price prediction in particular. So what we have in these is samples related to each other in complex ways. And these relationships between samples influence each other's values of the unknown variables.
So for example, consider the web-page classification problem. You're given a bunch of web pages and their contents, and the problem is to label each web page, as in whether it is a commercial web page or a university web page and so on.
So consider a web page, along with its contents and its label, and suppose you also know the
links that this web page connects to. Right. So with the underlying assumption that linked web pages tend to discuss similar topics, with this link information you can say something about the labels of these other web pages, as well. Or in other words, there's a lot of information in this link structure that should be exploited, and not just an IID kind of thing. So the question is: can we exploit such information in addition to the individual features?
And as far as similarity metric is concerned. So suppose I give you a bunch of images and I ask
you the following question and that is: Give me a mapping that maps these images to a low
dimensional output space so that similar images in the input space are mapped to nearby points
in the output space and dissimilar images are mapped to far away points in the output space.
And note that the criterion of similarity and dissimilarity could be anything, as in this is something that I'll be giving you. For instance I could say that two airplanes are similar if they differ by one azimuthal angle or one elevation angle. Or I could say that two airplanes are similar if they have
the same lighting conditions.
So the bottom line is that the mapping should only be faithful to the similarity measure that I give
you and should ignore all the irrelevant transformations. And the third thing that I ask from you is
some sort of out-of-sample guarantee: given a new image whose relationship with respect to the training data you don't know, can you map this new image faithfully without retraining the system again? So that's a question that I'll be answering. And you can view this problem as equivalent to searching for a good feature space, whereby you would end up with similar objects clustered together, and hence classification or regression becomes easier in that space.
So very briefly, as I said, the underlying theme is Energy Based Models. So just a brief introduction of what they are. Suppose you're given a variable X and a variable to be predicted, Y. What the Energy Based Model says is that you assign an energy E to these two variables -- a scalar, un-normalized energy -- and that sort of captures the dependencies between these variables. And this energy function can be viewed as some sort of compatibility measure. So lower energy would imply high compatibility between the two values of the variables and high energy implies low compatibility. So in particular in this case you are given an image of an animal, which is observed, and you have the set of labels Y, which you want to infer, and your correct energy function should assign a low energy to the animal class and a high energy to all the others. And note that we don't really care about
whether the energy of an airplane is higher than the car or the car is higher than the airplane. All
we need to do is we need to ensure that the correct energy is lower than all the incorrect
answers.
So inference now for a new sample X would simply involve searching for a Y that produces the minimum energy. To learn such an energy function -- as far as learning is concerned -- it boils down to looking for an energy function that assigns low energies to the correct answers and high energies to the incorrect answers according to this inference algorithm.
So what this boils down to is the following. You have this observed variable XI, and suppose initially you start with an energy function where you have a higher energy given to YI, the correct answer, and a lower energy given to some other incorrect answer, YI bar, which we call the most offending incorrect answer, because this is like the most troublesome answer for your machine -- it is the incorrect answer with the least energy. And this is exactly what your machine would be producing right now as its inference.
So the learning should involve pushing down on the energies of the correct answers and pulling up on the energies of the incorrect answers to get this sort of desired energy surface. And this can be done by minimizing a loss functional with respect to the set of parameters, W, that define this energy function. That is the broad-based idea behind energy-based learning, and details can be found in the tutorial that we recently wrote on energy-based learning.
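To make that inference-and-learning picture concrete, here is a minimal sketch in Python (not the actual system: the linear energy, the toy data, and the update rule are illustrative assumptions). It scores each candidate label with an energy, predicts by taking the minimum-energy label, and then pushes down the energy of the correct answer while pulling up the energy of the most offending incorrect answer.

```python
import numpy as np

def energy(W, x, y):
    # E(W, X, Y): one linear scoring row per candidate label (an illustrative choice).
    return float(W[y] @ x)

def infer(W, x, num_labels):
    # Inference: pick the label Y with the minimum energy.
    return min(range(num_labels), key=lambda y: energy(W, x, y))

def update(W, x, y_correct, num_labels, lr=0.1):
    # Most offending incorrect answer: the lowest-energy label among the wrong ones.
    wrong = [y for y in range(num_labels) if y != y_correct]
    y_bar = min(wrong, key=lambda y: energy(W, x, y))
    # Push down the energy of the correct answer, pull up the most offending one.
    W[y_correct] -= lr * x
    W[y_bar] += lr * x
    return W

# Toy usage: 3 candidate labels, 4-dimensional inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x, y_true = rng.normal(size=4), 1
for _ in range(20):
    W = update(W, x, y_true, num_labels=3)
print(infer(W, x, num_labels=3))  # converges to the correct label, 1
```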
So yeah, coming to the first part of the talk, which involves relational factor graph models for doing relational regression. As I said before, samples are not assumed to be IID in such a setting; rather they are related in complex ways. Furthermore, these dependencies could either be direct, as in given as part of the data, as in the case of the link structure of web pages, or they could be hidden, as in not given to you as part of the data. So now it is like a two-phase problem. First you
need to infer these relationships, one. And second, use these relationships to do some form of
collective prediction.
So in particular we apply our framework to the real estate price prediction problem, and I'll talk in detail about this problem. In fact I'll present my framework in the light of this problem for easier understanding. But yeah, I'd like to say that this is a fairly general framework and can be applied to other data, as well. For example we are right now trying to extend it to the social network data from Slashdot, for instance.
So yeah, of course a lot of previous work has been done in this area, more recently, but the trouble with most of these algorithms is that they only cater to classification problems, as in the outputs are discrete. And it's not straightforward to generalize them to continuous variables and hence use them for regression. So to this end we propose a novel framework for relational regression using factor graphs, we propose efficient inference and learning algorithms for the same, and being in an energy-based setting we are able to handle non-exponential families of functions as well -- not necessarily log-linear -- and apply it to the problem of real estate.
So -- yeah, so the question is how is this real estate price prediction relational? Well, clearly my
poor one bedroom, one bathroom house will be much cheaper than for example Chris' five
bathroom, five bedroom house. So -- or in other words, this aspect of the price is so-called
intrinsic price, that is a function of only its individual features like bedrooms, bathrooms and so
on. But also a one bedroom, one bathroom house in a poor locality is -- will be cheaper than a
similarly sized house in a very high-end locality. In other words, the price is also a function of the quality, or the desirability, of the neighborhood in which it lies. And this in turn is a function of the desirability of the other houses that make up that neighborhood, and this is where the relational aspect of the price comes in. And the second point is you really don't know this desirability, as in it is not given as part of the data; it is hidden, and so you need to infer that, as well. And so this is in line with the "location, location, location" mantra that most realtors have been using.
So keeping this in mind we model the price as a product of two quantities, namely its intrinsic price and the desirability of its location. Or, thinking in terms of the energy-based setting, what you have is an energy function E1 that captures the dependency of the price on the house-specific features, and an energy function E2 that captures the dependency of the price on the desirability, and you combine the two. Or more formally, these relationships between these variables can be captured in so-called energy-based factor graphs.
And so just to give you a short introduction of what a factor graph in an energy-based setting would look like. You have a bunch of variables for your problem. Some of them are observed. Others are unobserved. And you define an energy function, E, over all your variables. One way to do it is to define a global energy function over all the variables, but that can result in complications, for instance if each of your variables is very high-dimensional. If you were doing inference you would end up searching inside a very huge space, trying to search over all the possible combinations of variables.
However, suppose you know something about the structure of this underlying energy function, in the sense that only subsets of variables interact with each other. What you could do then is split this energy function into a sum of smaller energy functions, where each of them takes only a subset of variables into account. And then the final energy is nothing but the sum of these smaller functions. Each of these functions is called a factor, and it captures the dependency between the variables that it takes. So it's very similar to what a probabilistic factor graph would look like, where you have a huge joint distribution and you're factorizing it over subsets of variables to make it more manageable.
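As a toy illustration of that factorization (my own example, not the speaker's code), the global energy below is just the sum of two smaller factors, each of which touches only a subset of the variables:

```python
def factor_price(y, x, d):
    # Non-relational factor: compatibility of price y with features x and desirability d.
    return (y - (sum(x) + d)) ** 2

def factor_desirability(d, neighbor_ds):
    # Relational factor: desirability d should agree with the neighbors' desirabilities.
    avg = sum(neighbor_ds) / len(neighbor_ds)
    return (d - avg) ** 2

def total_energy(y, x, d, neighbor_ds):
    # Global energy = sum of the factors.
    return factor_price(y, x, d) + factor_desirability(d, neighbor_ds)

print(total_energy(y=12.0, x=[3.0, 4.0], d=5.0, neighbor_ds=[4.5, 5.5, 5.0]))
```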
So -- uh-huh?
>> Question: So this -- this bottom line --
>> Sumit Chopra: Uh-huh.
>> Question: -- you know, is this like a theorem or something that you can always represent?
>> Sumit Chopra: No. No. It's like --
>> Question: (Inaudible) -- the sum?
>> Sumit Chopra: No, it's not really a theorem. What it says is if you know something about the
structure of your energy function then you can factorize it. For example, in the case of real
estate price prediction we know that features of a house don't really interact directly with the
desirability of the location. Rather they interact with the price. So then you can split it into two
halves.
I mean, it's very similar to what you have in a probabilistic setting, right? You know the
relationship, when you know the dependencies between variables then one way to represent it is
through the entire joint distribution. However, if you know the link structure of the variables then
you can break it into a bunch of parts. That is provided you know the link structure or some sort
of dependencies between variables.
>> Question: We have probability theory and we know the rules that govern it and we can prove
that the two things are equivalent, I probably don't understand because I don't know the
underlying laws that govern these energy (inaudible) --
>> Sumit Chopra: Yeah. I mean, all you need to -- well in that case all you need to know is what
the dependencies look like, right? And then yeah, of course you can prove it. But here we define such an energy function, one that is the sum of the smaller energy functions.
>> Question: This is just a choice that you make.
>> Sumit Chopra: Yeah, yeah. It's essentially just a choice that we make. Yes.
>> Question: What functions (inaudible) what kind of constraints --
>> Sumit Chopra: No constraints.
>> Question: No constraints at all?
>> Sumit Chopra: Sorry? Positivity? Not really. Yeah, I mean there are a bunch of loss functions where you don't need to have a positivity constraint on the energy. So yeah, it's like a choice that we make, splitting this huge energy function into a sum of smaller energy functions using the prior knowledge of the problem at hand.
>> Question: So why are they (inaudible) -- function?
>> Sumit Chopra: Uh, yeah. Yeah. So...
>> Question: Well, it can't be arbitrary because on the left-hand side --
>> Sumit Chopra: Uh-huh.
>> Question: -- you have to look at all the possible values of X, Y and Z.
>> Sumit Chopra: Uh-huh.
>> Question: And you can calculate how many values that function can take. If it is an arbitrary function it can take that many values, but on the right-hand side we have much fewer.
>> Question: It's not an equation, it's just a definition of the left-hand side.
>> Sumit Chopra: Yes, yes, that's exactly.
>> Question: The right-hand side (indiscernible) --
>> Sumit Chopra: Yes.
>> Question: So that left-hand side can't be --
>> Sumit Chopra: Okay, okay. Yeah. So in the case of house price prediction what we have now is a factor graph for a single house that looks something like this, that takes features, price and desirability into account, and basically the sum of the two is your energy over all the variables. But here the desirability of the location in turn depends on the desirability of other houses. Right? Or in other words these variables interact with the factor graphs of other houses. So more formally this sort of thing is represented by what we call a relational factor graph, where the idea is to have a single factor graph that captures the dependency among all the training samples, and not just have one factor graph for every house.
So in particular this is how we define a factor graph for the house price prediction problem. For every house we assign a single factor, E(X, Y, Z). This is non-relational and parametric in nature, and it captures the dependency between the price, the individual features and the estimated desirability. So that's the non-relational factor, and this estimated desirability in turn depends on the actual desirability of the locations of the neighboring training samples. So to encode this dependency we define another factor and associate that with the house, E(Z, Z). This is the relational factor and is nonparametric in nature. We repeat this process for all the other houses to get this huge factor graph that captures both the individual dependencies and the dependencies between the desirabilities.
And as I said, now the energy over the entire set of variables is basically the sum of the energies of the factors. And yeah, so assuming that you've learned these -- so yeah, one more thing that I wanted to point out is that the EI(X, Y, Z) are parametric factors with parameters W, and they share the parameters among each other. So now for a new test house X0, the inference involves creating two new factors, building the links with its neighbors, and doing the following minimization over the unknown variables D0 and Y0 with respect to that house.
>> Question: W and the Zs.
>> Sumit Chopra: The Ws and the Zs. So yeah, the Zs can be viewed somewhat as parameters and sort of as hidden variables. Yeah. I mean, we use the Zs to compute this YI. So clearly for a test house this is like an approximation, because ideally what we would have wanted was a Z0 that interacts with the desirabilities of the training samples, but that would have led to a minimization over the entire set of Zs of the training samples to come up with the proper answer, Y0. And that obviously is infeasible if you do it with respect to the training set for every test point.
So we remove that dependency with respect to Z, and of course it is an approximation, but it makes sense from the point of view of house price prediction, because this training data is essentially historic data to us. Yes?
>> Question: (Inaudible) -- W is?
>> Sumit Chopra: The parameters of these factors.
So yeah, so the Zs are basically from the training data, which is historic data to us, and the test point would be some point in the future, maybe the distant future. So clearly the desirability of that point will not have an effect on the past desirabilities. So that makes --
>> Question: So here the dot samples are observed --
>> Sumit Chopra: Yes.
>> Question: And the (inaudible) samples are unobserved?
>> Sumit Chopra: Yes, yes, yes. So in particular the training involves minimizing an energy loss over the three sets of variables, including the unobserved variables, where the loss is nothing but the sum of the factors. And here we have a theorem that says that if both the factors are a quadratic function of D, then the second factor can be merged into the first and treated as a single factor. So what you have is that the energy loss now reduces to a minimization only with respect to the W's and Z's, with each house having only a single factor, EI bar, and I'll go into detail about what EI bar looks like in a moment.
The learning algorithm then is basically nothing but a generalized EM type of algorithm. In the E phase you fix W and minimize with respect to the Z's, and in the M phase you fix the Z's and minimize with respect to the W's.
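Here is a minimal coordinate-descent sketch of that alternation on a toy quadratic model (the model, loss, and step sizes are illustrative assumptions, not the actual house-price energy):

```python
import numpy as np

# Fix W and descend on the hidden Z's (E phase), then fix Z and descend on W (M phase).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                                # observed house features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)    # observed log prices
W = np.zeros(3)                                             # parameters
Z = np.zeros(50)                                            # hidden per-house desirabilities

def loss(W, Z):
    # Energy loss: squared error of (intrinsic price + desirability), plus a
    # regularization term on Z standing in for the relational factor.
    return np.sum((y - (X @ W + Z)) ** 2) + 10.0 * np.sum(Z ** 2)

for _ in range(200):
    # E phase: gradient step on Z with W fixed.
    Z -= 0.005 * (-2 * (y - (X @ W + Z)) + 20.0 * Z)
    # M phase: gradient step on W with Z fixed.
    W -= 0.005 * (-2 * X.T @ (y - (X @ W + Z)))

print(round(loss(W, Z), 2))
```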
>> Question: What was D again?
>> Sumit Chopra: So D was like an estimated desirability of the house from its training samples.
>> Question: Okay.
>> Sumit Chopra: From its nearby training samples.
So the M phase, as I said, since the parameters are shared among the factors, is somewhat easier to compute and you can do it using (inaudible) descent over the W's. In the E phase, again, since the two factors are merged into one and you have a single factor, we show that learning again reduces to back-propagating gradients with respect to Z. But note here that the gradients are back-propagated over a bunch of samples and not just over a single sample. Yeah?
>> Question: (Inaudible) -- expectation make up the exterior? I mean --
>> Sumit Chopra: Yeah, it's like a proper E phase, but it is more like a coordinate descent kind of thing.
>> Question: (Inaudible).
>> Sumit Chopra: Yeah, yeah. Not really computing distribution as such. Yeah.
Um, so yeah, in particular the non-relational factor, E(X, Y, Z), is basically a squared difference -- we work in the log domain -- so the squared difference between the predicted log price and the actual log price. So this is the predicted price, where G is the parametric function, where the W comes into play, and this is sort of measuring the intrinsic price by taking into account only the house-specific variables, XH. And DI is the estimated desirability from its neighbors. The relational factor now is again a squared difference, between DI and this nonparametric function that basically takes as input the observed neighborhood features of the house, coming from the census tract, like median household income and so on, and also the learned Z's of the neighboring training samples, and does the (indiscernible).
So this is what a single factor now looks like. It takes the house variables into G to get the log of the intrinsic price, takes the neighborhood features and the Z's of the neighboring training samples into the nonparametric function to get the log of the desirability, and sums the two to get the log of the predicted price. And then the energy is simply the squared difference between the true answer and the predicted one.
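As a rough sketch of the single-house factor just described (the linear G, the Gaussian-kernel interpolation of the neighbors' Z's, and all the variable names here are my own illustrative stand-ins, not the actual parametric and nonparametric functions):

```python
import numpy as np

def G(W, x_house):
    # Parametric "intrinsic price" part (an illustrative linear choice).
    return float(W @ x_house)

def estimated_desirability(x_nbhd, neighbor_feats, neighbor_Z, bandwidth=1.0):
    # Nonparametric part: a kernel-weighted average of the neighbors' learned Z's,
    # based on observed neighborhood features (an illustrative smoother).
    d2 = np.sum((neighbor_feats - x_nbhd) ** 2, axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))
    return float(w @ neighbor_Z / w.sum())

def factor_energy(W, x_house, x_nbhd, neighbor_feats, neighbor_Z, log_price):
    # Predicted log price = log intrinsic price + log desirability;
    # the factor's energy is the squared difference from the true log price.
    pred = G(W, x_house) + estimated_desirability(x_nbhd, neighbor_feats, neighbor_Z)
    return (log_price - pred) ** 2

# Toy usage with made-up numbers (bedrooms, bathrooms, log area; income, commute).
W = np.array([0.3, 0.2, 0.9])
x_house = np.array([2.0, 1.0, 6.8])
x_nbhd = np.array([0.8, 0.4])
nbr_feats = np.array([[0.7, 0.5], [0.9, 0.3], [0.8, 0.45]])
nbr_Z = np.array([0.6, 0.9, 0.75])
print(factor_energy(W, x_house, x_nbhd, nbr_feats, nbr_Z, log_price=13.0))
```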
Or to give you a little more intuition about what's happening in the relational factor: you have a bunch of training samples, each associated with a ZI. And now when a new sample comes you compute its neighbors and, using the Z's of its neighbors, you are doing this smooth interpolation. So what you are effectively doing here is learning this smooth desirability manifold over the entire geographic area. Yeah?
>> Question: (Inaudible) -- hard boundary like a railroad track or something?
>> Sumit Chopra: Yeah. So our algorithm doesn't take that into account. It essentially only takes the (inaudible) distance. But yeah, that's a part of the future work that we're working on -- not only incorporating boundaries, but also, for instance, right now the number of neighbors is fixed, which we compute using cross-validation, but for a bunch of houses how can you sort of incorporate a variable number of neighbors?
>> Question: (Inaudible) -- like in condo and some other place like a farm out in the middle of
nowhere (inaudible)?
>> Sumit Chopra: Uh, yeah, but for the moment we're only working with single-family residences
so that sort of removes.
>> Question: (Inaudible).
>> Sumit Chopra: Yeah. Yeah.
>> Question: (Inaudible) -- are a dime a dozen with lots of examples of the same thing and then
there's others which is very unique.
>> Sumit Chopra: Yeah, yeah.
>> Question: How can you work with those (inaudible)?
>> Sumit Chopra: Hmmm. That's a good question that --
>> Question: (Inaudible) across --
>> Question: It looks very smooth.
>> Sumit Chopra: No, this is not the actual manifold that we've learned.
>> Question: What is it?
>> Sumit Chopra: This is just to show you that it's a manifold, just a cartoon here basically.
>> Question: Oh.
>> Sumit Chopra: So yeah.
>> Question: So you could have very abrupt changes this street is (inaudible).
>> Sumit Chopra: Yeah.
>> Question: -- capture that kind of --
>> Sumit Chopra: For the moment, no. But yeah, as I said, that is a part of -- we'll see. So yeah, so now as you learn this manifold, for a new house you have its house-specific features. You plug them into the G function to get the intrinsic price. You plug in the location into this manifold to get the desirability.
So yes -- so learning now minimizes a simple energy loss with some regularization that ensures smoothness over the manifold, and the E phase, which is the minimization with respect to Z, now reduces to a quadratic program that we solve using conjugate gradient.
And yeah, so coming to your point -- well, not really your point -- but essentially what we are doing here is maximizing the conditional likelihood of the unobserved variables given the observed variables, where the likelihood is defined through the Gibbs distribution marginalized over the hidden variables. And this is equivalent to the usual distribution where the energy now incorporates the marginalization -- this is like the free energy, if you want -- with MAP estimation with respect to the hidden variables.
>> Question: (Inaudible).
>> Sumit Chopra: Uh, yeah. Here they are. Yeah, yeah. But yeah in a general sense they
might not be. Well it's square distance, though. Yeah.
So this is done by minimizing the negative log-likelihood loss, which is obviously difficult because of this log of the partition function. But here we note that since the energy is (inaudible), the contrastive term vanishes when you are computing gradients. So what you have is a simple energy loss along with a MAP estimation.
So coming to the experiments. We tried it on the real-world data set provided to us by FirstAmerican.com, I think. So it included transactions from Los Angeles County in the year 2004, and since it's real-world data it's fairly diverse: it spans 1,754 census tracts and 28 school districts. And minimal preprocessing was done -- for example the price, area and income variables were mapped into the log domain, one-of-N encoding was used for the non-numeric discrete variables, and we used only single-family residences -- and we sorted the data according to the sale dates and took the first 90% as the training set and the remaining 10% as the test set.
And the house-specific variables that we include were the usual stuff -- living area, bedrooms, bathrooms and so on -- and the neighborhood variables came from the census tract and school district information, like median household income and average time to commute to work.
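To make the preprocessing and the split concrete, here is a small sketch (the column names and toy rows are made up for illustration; the actual schema of the FirstAmerican.com data isn't shown in the talk):

```python
import numpy as np
import pandas as pd

# Log-transform price/area/income, one-of-N encode discrete variables,
# sort by sale date, and split 90/10 in time.
df = pd.DataFrame({
    "price": [500_000, 750_000, 320_000, 1_200_000],
    "living_area": [1400, 2100, 900, 3800],
    "median_income": [62_000, 80_000, 45_000, 120_000],
    "property_type": ["sfr", "sfr", "sfr", "sfr"],
    "sale_date": pd.to_datetime(["2004-01-10", "2004-03-02", "2004-06-15", "2004-11-30"]),
})
for col in ("price", "living_area", "median_income"):
    df[col] = np.log(df[col])
df = pd.get_dummies(df, columns=["property_type"])   # one-of-N encoding
df = df.sort_values("sale_date")
split = int(0.9 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
```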
>> Question: (Inaudible).
>> Question: Uh, yes.
>> Question: (Inaudible) -- because I thought there was a factoring between house specific
features and then location.
>> Sumit Chopra: Yeah, yeah, yeah, yeah.
>> Question: -- that's a little strange.
>> Sumit Chopra: No, yeah, maybe I'm wrong here. It does not. Yeah. It does not.
>> Question: (Inaudible) --
>> Sumit Chopra: One here. I mean, the data set was spanning just the one year.
>> Question: Okay.
>> Sumit Chopra: So we take the first 90% which boils down to around 42 or 43 weeks. Yeah.
>> Question: Did you use the previous set price?
>> Sumit Chopra: Yes, that's we used that.
>> Question: Did you have the data for previous sale?
>> Sumit Chopra: Yes.
>> Question: So this variable list doesn't exactly say that; it says previous sale price but not when.
>> Sumit Chopra: What didn't say that?
>> Question: The (inaudible) sale was.
>> Sumit Chopra: You mean the date?
>> Question: Yeah.
>> Sumit Chopra: Yeah. We don't use the date.
>> Question: Well that's important, isn't it?
>> Sumit Chopra: Um, yes.
>> Question: You know the original owner and someone lived there 50 years and died in the
house, the previous set price is going to be different than if it sold last year.
>> Sumit Chopra: Yeah, yeah, I agree. We should, yeah. In fact --
>> Question: Sort of be a sampling issue, too. Didn't you say you took the first 46 weeks as training and the rest as test data?
>> Sumit Chopra: Roughly, it boiled down to that; the first 90% of the houses as training.
>> Question: In region where prices are (inaudible) higher, lower, mid-stream, might be
(inaudible)?
>> Sumit Chopra: Ummm, you mean because of seasonal changes?
>> Question: (Inaudible) --
>> Sumit Chopra: Yeah, like I mean if you do it the other way, as in you just randomly pick, then you are not really doing prediction in that case. It will be like doing a (inaudible) prediction kind of a thing. Right? So yeah, I mean one sort of drawback in this is that we only have a single year of data, so you can't do much, such as encode inflation or seasonal changes. But right now that is again future work, where we are in the process of gathering data from the past 30 years. We obviously will be encoding features like time, like inflation and seasonal changes. Yeah.
So and yeah, you are right, I mean the previous sale price should somehow be weighted by when the thing was sold. Yeah. Yeah. So yeah, and the bunch of baseline methods that we compare to are those that have normally been used in the past for this particular problem, namely nearest neighbor -- you pick the nearest training samples and average the price -- linear regression --
>> Question: (Inaudible).
>> Sumit Chopra: In location.
>> Question: The location --
>> Sumit Chopra: Just physical location.
>> Question: (Inaudible) --
>> Sumit Chopra: Actually we tried both and I think location does a much better job than -- yeah.
Locally weighted linear regression, where you fit a locally linear model over the space, which is globally nonlinear, and a fully connected network. And what we report here is, for every house, the absolute forecasting error, which is the absolute error divided by the actual value, so that takes into account if there are any outliers in price. And in every column we report the percentage of houses with less than 5%, less than 10%, less than 15% error and so on. So clearly you would want these numbers to be higher, as in more houses should have a smaller percentage error. And we see that we do a fairly better job as compared to the other algorithms. And ours is hybrid in the sense that it's a combination of two things: the nonparametric model that computes the desirability and the parametric model. Uh-huh?
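For reference, a tiny sketch of the reported metric -- the absolute forecasting error and the per-threshold columns just described -- using made-up prices:

```python
import numpy as np

# Absolute forecasting error = |pred - actual| / actual, and the fraction of
# houses whose error falls under each threshold (toy numbers only).
actual = np.array([500_000, 750_000, 320_000, 1_200_000], dtype=float)
pred   = np.array([520_000, 700_000, 300_000, 1_000_000], dtype=float)

err = np.abs(pred - actual) / actual
for thresh in (0.05, 0.10, 0.15):
    frac = np.mean(err < thresh)
    print(f"houses with error < {int(thresh * 100)}%: {frac:.0%}")
```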
>> Question: (Indiscernible) a baseline of what the list price is for the sale? I guess the question is how good are the appraisers? How good --
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: Also, what models do real estate companies use? They must have models they
trust because typically they might set the price high so that (inaudible) --
>> Sumit Chopra: Yeah.
>> Question: But I'd be surprised if this list price is off by 15% in 80% of the cases.
>> Question: Okay. Let's -- companies must have -- is it all local expertise or do they have models that they use (inaudible)?
>> Sumit Chopra: I think it's a -- well, I think it's a bit of both. But I'm not sure about the models
they use, because obviously there is no way to have access to them, other than those which were traditionally published in the literature that the economists have used.
Yeah, of course, yeah, yeah.
So here what we show is the learnt desirability on the test houses. Each point is a test house, and it's color coded according to the value of its estimated desirability. So red means high desirability, blue means low desirability. And if you're familiar with the Los Angeles area, it's doing something really reasonable. Areas like Beverly Hills, Santa Monica and Malibu along the coast, and Pasadena, are all red, indicating they are highly desirable. Areas like downtown and down in the desert are all blue, while the valley is like moderately desirable. So that's something interesting we thought was happening with this model.
And another thing that we did with this was try to answer a typical seller's dilemma, like whether making a particular modification to a house will increase its value or not, and if so then by how much. So what we do is, once we've trained the model, we bump up the value of that attribute by one unit and ask the model to predict the value of the (inaudible) house. And we also have the original predicted price, and we compute the sensitivity ratio, which sort of measures the expected gain in price per unit change in that attribute.
So what we show here is the bedroom sensitivity manifold. Again each point is a house, color coded according to its sensitivity. So you see that in the downtown area, which is fairly congested, adding one more bedroom to a house is much more valuable than adding another bedroom to a five bedroom mansion out in the desert or out in the valley. So yeah, again we thought that something interesting was going on.
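As a rough illustration of the bump-one-attribute sensitivity computation described above (the predict() function and its weights are my own stand-ins, not the learned model):

```python
# Bump one attribute by one unit, re-predict, and report the relative gain.
def predict(features):
    # Stand-in price model: weighted sum of a few attributes (illustrative only).
    w = {"bedrooms": 40_000, "bathrooms": 25_000, "living_area": 150}
    return sum(w[k] * v for k, v in features.items())

def sensitivity(features, attribute):
    base = predict(features)
    bumped = dict(features, **{attribute: features[attribute] + 1})
    return (predict(bumped) - base) / base  # expected relative gain per unit change

house = {"bedrooms": 2, "bathrooms": 1, "living_area": 900}
print(f"bedroom sensitivity: {sensitivity(house, 'bedrooms'):.2%}")
```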
Yeah, and that ends my first part. So as part of the future work, obviously one straightforward extension is to include the time variable, and second, as we discussed, to incorporate the hard boundaries and have sort of a dynamic neighborhood for every house rather than a static one. And yeah, we've been planning to extend this technique to other domains, like, as I said, the Slashdot data we have in collaboration with the Stern School of Business. So the idea there is, given a whole bunch of comments by different users and the source article, you want to come up with a prediction of the rating that would be given to each comment by different moderators. So we model this problem in the following way: you have a comment whose rating would depend not only on the preceding comment
and the original article, but also on the so-called mood of the user or his or her intellectual ability,
which is hidden. As in some users tend to generally write funny comments. Some users tend to
write generally stupid comments, so on and so on. So hopefully we plan to extend this to sort of
capture that mood of the user.
Yes. So the second part involves learning a similarity metric discriminatively, where we designed a technique called DrLIM, which stands for Dimensionality Reduction by Learning an Invariant Mapping. Hopefully I'll be able to convince you that this is a rather intelligent DR. So yeah.
So as I said, given a bunch of images, can you generate a lower dimensional mapping so that similar objects are closer to each other and dissimilar objects are further apart? And also have some out-of-sample guarantee for this problem.
Well, so you might say that okay, there are a whole lot of previous algorithms, and you pick one and provide it with the similarity and you get the answer. Okay. Fair enough. So I pick my favorite algorithm, LLE. I provide to LLE the explicit information that two planes are similar if they differ by one azimuth angle or one elevation angle. And this is what I get as output. It has completely ignored the azimuth and elevation information and rather clustered the points according to the lighting conditions. And same here. I mean, it is a highly degenerate manifold that LLE constructs.
So the question is, what went wrong here? The trouble with LLE and most of the other algorithms here is that they rely on a computable distance metric in the input space. In the case of LLE it's the Euclidean distance, and hence you see the lighting being the major factor in the clustering.
Well, there are those that don't really depend on a distance measure, but they do not generate an explicit mapping for you, so you don't have any out-of-sample guarantees for such things. And just to convince you that these requirements are important and not really just for generating pretty pictures: you can have certain classification or verification problems where the number of classes is very large, the training samples per class have large variability among them, and you also have a bunch of unseen classes you are not trained on. For example, in face verification you train on a bunch of subjects and you test on a bunch of subjects you have not seen. You want to have that out-of-sample guarantee.
So yeah, just to summarize the objective once again. We want a mapping from the higher dimensional space to a lower dimensional space, which maps similar samples -- and the similarity could be anything -- to nearby points in the output space and dissimilar samples to faraway points. And it should not require an arbitrary computable distance metric in the input space, and hence should be invariant to irrelevant transformations, and it should have some out-of-sample guarantee.
So what we propose is a simple three-step algorithm. The first step involves building a neighborhood graph: based on whatever similarity you choose, you create similarity links among the samples, and all the other pairs of samples are considered dissimilar to each other. Step two involves choosing a parametric function GW that maps the higher dimensional points to the lower dimensional output space. Step three involves training the parameters W so that similar points are mapped together and dissimilar points are mapped far apart.
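A minimal sketch of step one, building the neighborhood graph (my own illustration: the k value and the Euclidean criterion are assumptions, since in general the similarity links can come from any criterion you choose, such as temporal adjacency or known transformations):

```python
import numpy as np

def build_neighborhood_graph(X, k=5):
    # Link each sample to its k nearest neighbors under some chosen similarity
    # (Euclidean here, purely as an example); every unlinked pair is treated
    # as dissimilar during training.
    n = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    similar_pairs = set()
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            similar_pairs.add((min(i, j), max(i, j)))
    return similar_pairs  # all pairs not in this set are "dissimilar"

X = np.random.default_rng(0).normal(size=(20, 10))
print(len(build_neighborhood_graph(X, k=3)))
```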
So the question remains: what goes inside this GW and how do you train it? Pretty much anything can go inside GW; it could be either a linear function or a convolutional network. It completely depends on the problem that you have at hand, for example -- yeah. And as far as training is concerned, we use the so-called Siamese architecture that was first explored by Bromley. So what it does is it keeps two identical copies of the parametric function GW that share the same set of weights, and you have a pair of input images, two of them which are similar or dissimilar. You plug them in and generate the features in the output space, and the energy is given by any distance measure in the output space. So note that this measure is in the output space rather than the input space, and to learn these weights you minimize this contrastive loss. What this loss is doing is: if you have similar images, then the label YI associated with these images is 0 and this part of the loss function is activated, which is nothing but a quadratic loss. So minimizing this loss is equivalent to minimizing this energy, or this distance in the output space.
However, if the samples are dissimilar, YI is one, this part is activated, and minimizing the loss now increases the energy, or the distance in the output space, but only up to some margin M. That's because we are seeking a smooth manifold in a bunch of our experiments, so you don't want to push the dissimilar samples arbitrarily far apart and hence generate isolated clusters. Although we might want clusters in a bunch of situations, which I'll talk about.
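A compact sketch of the Siamese training just described, assuming a plain linear GW, a unit margin, and hand-rolled gradient steps (all illustrative choices, not the actual convolutional network): the loss pulls similar pairs together with a quadratic term and pushes dissimilar pairs apart up to the margin.

```python
import numpy as np

def contrastive_loss(z1, z2, y, margin=1.0):
    # y = 0 for similar pairs, y = 1 for dissimilar pairs; D is the output-space distance.
    D = np.linalg.norm(z1 - z2)
    if y == 0:
        return 0.5 * D ** 2                       # pull similar pairs together
    return 0.5 * max(0.0, margin - D) ** 2        # push dissimilar pairs apart, up to the margin

def train_step(W, x1, x2, y, lr=0.05, margin=1.0):
    # Two identical copies of G_W (here a plain linear map) share the same weights.
    z1, z2 = W @ x1, W @ x2
    diff = z1 - z2
    D = np.linalg.norm(diff) + 1e-9
    if y == 0:
        grad_z = diff                              # gradient of 0.5 * D^2 w.r.t. z1
    else:
        grad_z = -max(0.0, margin - D) * diff / D  # gradient of 0.5 * max(0, m - D)^2 w.r.t. z1
    # Gradients flow through both copies (with opposite sign for the second input).
    W -= lr * (np.outer(grad_z, x1) - np.outer(grad_z, x2))
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8)) * 0.1
a, b, c = rng.normal(size=(3, 8))
for _ in range(100):
    W = train_step(W, a, b, y=0)   # a and b labeled similar
    W = train_step(W, a, c, y=1)   # a and c labeled dissimilar
print(np.linalg.norm(W @ a - W @ b), np.linalg.norm(W @ a - W @ c))
```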
>> Question: (Inaudible) -- situations where it's not as similar or dissimilar (inaudible) too much?
I am moving, I mean, similar to the smooth thing.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: (Inaudible) points.
>> Sumit Chopra: Yes. Yeah, that is one thing: the algorithm assumes anything that is not labeled similar is dissimilar. Yeah, yeah. But hopefully that thing might be taken care of by this margin, because you're not really pushing all the points very far apart. So you'll be generating a smoother manifold, so hopefully you'll have -- you know, you are right. Yeah.
So yeah, so that's it. That's the algorithm. And we tried this thing on the 4's and 9's digits from the MNIST data set -- 4's and 9's because they are fairly similar to each other even when you see them, and hence pose a reasonably difficult task. So we took 3,000 randomly chosen samples for training and 1,000 samples for testing, and the GW was a four-layer convolutional network in our case.
So, and for a sanity check, we first computed the nearest neighbors in the input space by the Euclidean distance between them. So what you get is a smooth manifold that separates the 4's and 9's reasonably well. These are test samples, as in you don't know the relationship between any two dots in this manifold, nor do you know the relationship between any of the dots and the previous training set. So yeah, and besides the separation it has a smooth change from slanted to straight and so on.
>> Question: So you're putting data -- then I mean your training labels are which pairs are similar. So does this mean that all 3,000 by 3,000 pairs are labeled, or are you just using the five nearest neighbors as a proxy, where you are going in and saying that the five nearest neighbors are in fact similar?
>> Sumit Chopra: Yes, only the nearest neighbors are similar.
>> Question: (Inaudible) --
>> Sumit Chopra: It's sparse, but hopefully a connected graph over the digits, yeah. Yeah.
>> Question: Certain amount of -- well, I mean for instance if they were rotated 4's and things like that, right?
>> Sumit Chopra: Uh-huh.
>> Question: Automatically labeling the neighbors in that way wouldn't -- wouldn't really get you there, because those neighbors are chosen by Euclidean distance in the input space.
>> Sumit Chopra: Exactly. As I said, this is just a sanity check; the prior knowledge thing comes later on, and that will hopefully convince you.
So now what we did was we explicitly translated the images by minus 6, minus 3, plus 3 and plus 6 pixels. And again, for the purpose of a sanity check, we again computed the Euclidean neighbors, and what we get is these five clusters: each of the clusters corresponds to one of the four translations, and the center cluster to the original images. And furthermore, the images within each cluster are fairly well separated, and the order in which the clusters are arranged is exactly according to the way they are translated -- like this is minus 6, minus 3, 0, plus 3 and plus 6. So this is sort of reinforcing the fact that nearest neighbor in Euclidean space might not be a good idea if you have these sorts of complicated translations.
>> Question: (Inaudible) -- is an issue then with respect to learning invariance, because in fact you would want the 4's of the plus 3 to be very close to the 4's in the minus 3 if they are the same digit. And because that would be real invariance, right? You would in fact want those 4's to be almost on top of each other, because you would want whatever features are selected -- selected through (inaudible).
>> Sumit Chopra: That's my next slide.
>> Question: Oh, sorry.
>> Sumit Chopra: So now finally what we do is we inject prior knowledge, and what we say is each sample is a neighbor of its five Euclidean neighbors; in addition, each sample is also a neighbor of its own shifted versions and of the shifted versions of its five Euclidean neighbors. So what you have is exactly what we want, a well separated manifold, and if you zoom inside this you get identical digits that are shifted versions of one another, and that is exactly what we wanted. So yeah, and this is of course using the prior knowledge. And this is what we get if you use similar prior knowledge for LLE: a completely degenerate solution, for which it is difficult to put into words what it's doing.
Another experiment used a little more complicated data set, consisting of airplanes from the NORB data set, and we project into a 3D space. So the airplanes consisted of 972 images with 18 azimuth angles, nine elevation angles and six lighting conditions. And how we generate the neighborhood graph is by saying that two planes are similar if they differ by one step in azimuth or in elevation, and we explicitly don't give any lighting condition information. And what we get is a very nice 3D cylinder: along the rim of the cylinder the planes are arranged according to their azimuth, and along the height of the cylinder the planes are arranged according to their elevation, and it ignores the lighting conditions. It is effectively recovering the way we generated our data.
And just for reference, once again this is what LLE would have given if -- and the last application for this was face verification, where the task is to accept or reject the claimed identity of a person in an image. So given a pair of images, your machine would say yes or no, whether they are the same person or not. And of course it is a difficult problem because you could have very large variability in the data set. Like you could have artificial occlusion, like a face scarf or sunglasses, and rather animated expressions. And there are a large number of classes, and there are even unseen classes where you've not trained on those subjects.
And the training was very similar, other than this loss function now. So when you have a dissimilar pair you actually want discrete clusters for every subject in the feature space. So you are essentially pushing the dissimilar pairs apart as much as possible. That's the only difference between the two. And yeah, for similar pairs you have the usual quadratic loss.
So among the various data sets, namely AT&T, FERET and Purdue, I'll discuss the Purdue data set, which was the most challenging. It consists of around 136 subjects, and they have a very high degree of variability, as you can see, for every subject. And we picked 96 random subjects for training and 40 for testing. And this is what we get as far as the performance is concerned. At a 10% false accept rate you only falsely reject 11% of the pairs. And of course as you tighten this, the false reject rate increases, as well. But, I mean, to convince you a little more of what it's doing: it's correctly classifying this as a genuine pair, this as a genuine pair, which is difficult even for a human, and this as a genuine pair, and it is also able to classify this as an impostor pair, and these are fairly easy. Well, this one is not, maybe. So -- yeah.
So there are a whole bunch of extensions to this idea. For example, you could use it to do automatic object category detection, as in generating a bunch of invariant features for an object. So what you have is a moving camera that takes pictures of different objects at different angles, and you have a connected neighborhood graph, with two images being defined as neighbors if they are temporally adjacent to each other. And then what you would hope to see is a cluster for every such object, with each cluster encoding these invariant features. And other areas where it can be used beyond images are, for example, information retrieval, where you are doing semantic hashing for documents -- you just need to label whether two documents are similar, and that could be any arbitrary criterion. And natural language processing: in particular, very recently people from NEC research have used this for semantic role labeling. This is the work of Jason Weston and Ronan Collobert, appearing in this year's ICML. So what they do is they train a deep architecture for doing semantic role labeling, and in addition to the usual supervised learning of this deep architecture, for every layer they also have this DrLIM training, which they call the M(inaudible) layer. So when you back-propagate the gradients through both this part and this part, you hope to get features over here that are more meaningful or more consistent, both with respect to the supervision and also with respect to the similarity and dissimilarity. And they pretty much beat the state of the art for semantic role labeling using this technique.
And yeah, so finally I'd like to end my talk by discussing a bunch of things that I'll be interested in doing, which involve basically designing efficient inference and learning algorithms for large-scale data sets, primarily involving real-world data sets, and solving interesting questions -- as in (inaudible) classification and regression are the fundamental issues, but you can go beyond that, for instance, with respect to house prices, predicting how the neighborhoods change dynamically with respect to the demographic movements and such. And yeah, exploiting the underlying structure that's there in most real-world data sets and not really using the simple IID assumption, and yeah, using energy-based models, deep architectures, which I've been involved in as a side project, and probabilistic models. So yeah. And to show you a really nice demo that myself and (inaudible) had (inaudible). So these are like the images of planes, the neighborhood relationship is again the azimuthal angle, and what you are seeing here is, after every epoch, how the DrLIM training is going ahead.
So I guess we'll have a circle in the end that basically arranges the planes according to their angles. Just to fast-forward it: initially everything is random, now it's trying to sort of unwind the loops, now it has three loops remaining, and finally it -- oops, something is stuck I think here. So yeah, so finally in the end what you have is a circle, as expected, and then it's basically fine-tuning its parameters to really get the (inaudible), and the learning rate reduces, so what you have is a circle.
Yep. That's it. Thank you very much. (Applause)
>> Question: So this last feels kind of reminiscent of channel equalization. Do you know about
channel equalization?
>> Sumit Chopra: Uh-huh.
>> Question: Used in modems. The idea was we all once were looking for features that work across all the variations of this.
>> Sumit Chopra: Uh-huh.
>> Question: And in channel equalization what you try to do is to model the noise process.
>> Sumit Chopra: I see.
>> Question: Then you process then you know what's going on here and you can work out the
combination of the two and then you can sort of back infer what is going on.
>> Sumit Chopra: I see.
>> Question: And I think that maybe -- it's much easier to model the noise process than to find
what features would be invariant across it.
>> Sumit Chopra: Hmmm.
>> Question: So what you are doing in this is sort of giving the system a chance to learn the noise
process.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: Maybe it would make even more sense to just model the noise process; correct?
>> Sumit Chopra: And by noise here you will be --
>> Question: One case you had was the shifts.
>> Sumit Chopra: Yes.
>> Question: And the other case was the lighting.
>> Sumit Chopra: Uh-huh.
>> Question: And in general there would be a noise process dependent upon the application.
>> Sumit Chopra: Yeah, yeah, yeah.
>> Question: In modems it was something else.
>> Sumit Chopra: Yeah. (Inaudible).
>> Question: Thank you very much.
>> Sumit Chopra: Thanks a lot.