>> Ashish Kapoor: So it's my pleasure to introduce Andreas Damianou who is interning
with us this winter. I mean, he's a graduate student at the University of Sheffield with Neil
Lawrence, and he's going to talk about deep Gaussian process models. And this is, I
believe, joint work with Neil as well.
>>: Yeah.
>> Ashish Kapoor: All right.
>> AD: Thanks. Okay, so thanks for coming. So this is very recent work that we did
with Neil just before I left Sheffield to come here. But in this presentation I'm also
going to include parts that are a bit older but very relevant to the deep GP models, and
for this we worked with collaborators like Michalis Titsias and Carl Ek.
Okay. So first of all I want to give some motivation for why you would want to consider GPs in
deep models. And, well, a colleague of mine said that lately it's very cool to have the
word "deep" in the title of your paper. But apart from that, I think the main motivation
-- I'm trying to display it now. So I show you this picture here of this girl. If I tell you
that this little girl here shows the gesture in sign language for "I love you," and then I
show you this bunch of pictures, I think it's very easy for the human brain to recognize
immediately that, you know, the rest of the pictures depict the same sign.
For a computer that would be very difficult. It would be very difficult to [inaudible] this
hand or something. So objectively speaking we know that the human brain is
very good at one-shot learning and at learning in general because, you know, there is
some sort of hierarchical representation of knowledge in our brain and also very good
prior models for the data.
And if you try to find an analogy for computers, then we have deep belief networks that try
to represent knowledge in hierarchies. And as for good prior models, it's been known
that Bayesian non-parametric models are quite good and flexible models and have been
used very successfully as prior models for data.
And of course we can achieve such an effect if we have many, many training examples
as well, but that's not always possible. Right? In real-world applications sometimes we
have very scarce data. So in this talk I'm mostly going to focus on the first two. The
deep GPs are trying to combine the structural advantages of deep belief networks
and also the advantages that come from having a Bayesian non-parametric approach to
the data.
And obviously it would be nice to compile many training examples if possible. So there
are two things: one is that in the current version it is not very easy, because you're
still using GPs that scale [inaudible] with the data. And also in many real-world
applications, these are not available anyway.
I'm not a deep learning expert, but just a very quick slide here on what people are
doing traditionally in deep learning: so you assume that you have your observations, Y,
here. That's like the outputs. And you assume that these outputs come from a deep
hierarchy of latent units stacked on top of each other. And the most successful approach
is to stack RBMs, restricted Boltzmann machines, and treat all the layers above as latent
units.
So the fact that you're using RBMs means three things basically: one is that the outputs
are essentially modeled as a linear weighted sum of the inputs. Another
thing is that the units here are binary. And the inference that people usually do in these
models is based on sampling methods like contrastive divergence and so on, because you are trying
to marginalize these hidden units and this is intractable.
So intuitively, if we take this model and consider the analogue of using GPs for the
mappings here instead of stacking RBMs, then we'll be able to model continuous outputs;
that's one thing. We would be able to have non-linear mappings. And I'm also going to
show here how we can do variational inference for this model. And this is good because
you get a bound and you can do model selection and so on.
Of course, the good thing with the traditional approach is that it can handle huge data sets,
and for the moment this is not very easy with GPs. But I'm going to describe that
later.
So I'm going to talk about the GPs, just a very, very quick introduction to GPs. I
assume most of you are already experts in GPs, but just a very quick introduction.
So a GP can be thought of as an infinite-dimensional Gaussian distribution. So here, for
example, if I sample from a one-dimensional Gaussian, I get these samples. Here I get
two-dimensional samples.
So if you sample from a GP, every sample is basically an infinite object; it's a function.
So in order to define a GP prior on some function, the basic ingredients are a mean
function, which we usually take to be zero, and a covariance function. This covariance
function is evaluated on a finite set of inputs, but as I'm going to show, the posterior is
over the function space, so everything is infinite there.
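
[Illustration: the "covariance function evaluated on a finite set of inputs" idea can be sketched in a few lines of numpy; the function and parameter names below are only illustrative, not code from the talk.]

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') = s^2 exp(-|x - x'|^2 / (2 l^2))."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Evaluate the covariance on a finite grid of inputs and draw "functions" as
# samples from the resulting finite-dimensional Gaussian N(0, K).
X = np.linspace(-5, 5, 200).reshape(-1, 1)
K = rbf_kernel(X, X, lengthscale=1.0) + 1e-8 * np.eye(len(X))  # jitter for stability
samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
# A shorter lengthscale gives rougher sample functions, a longer one smoother ones.
```
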
And by using specific covariance functions, what you are doing is making
assumptions about the properties of the functions that you are trying to model and not
about their specific parametric form. So for example, here you say that, you know, the functions
are very smooth, here they are not that smooth, and so on. So that's what I talked about
in the previous slide. So you have a prior on the function and you evaluate it on a finite
set of inputs. But then, if you combine it with a Gaussian likelihood, you get a GP
posterior, which is over the function space, and also a predictive distribution. And this
means that you can evaluate it for any input.
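
[Illustration: a minimal sketch of the posterior predictive just mentioned, using the standard GP regression equations with a Gaussian likelihood of noise variance noise_var; it can be used with a kernel such as the rbf_kernel sketch above, and the names are again only illustrative.]

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """Posterior predictive mean and covariance of a zero-mean GP at X_test."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)              # cross-covariance, N x N*
    K_ss = kernel(X_test, X_test)              # test covariance,  N* x N*
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                       # predictive mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                       # predictive covariance
    return mean, cov
```

[Because the kernel can be evaluated at arbitrary inputs, the predictive distribution can be queried anywhere, which is the point being made.]
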
And this is the notation I'm going to use for the entire talk. So Y is the output, X is the
input and F is the mapping between the two spaces. And F has a GP prior or Gaussian
process prior. And a very quick demonstration. I guess this is already clear to you guys,
right?
So initially, before we see any points and you just have the Gaussian process prior, the
model says, "I haven't seen any points, so the function can be anywhere in the gray
area." But then when you see, let's say, two points, the model says, "Well, I know that in
the neighborhood around this point the function has to pass through here, and the function
has to pass through here." And because of the smoothness assumptions, you know that the
points in the neighborhood cannot be, you know, very far away. And as you observe more
points, you learn the function better and better. If you don't have a lot of points, you can
still, based on your assumptions, generalize as best as you can. And so on.
So, you know, that was a very short demonstration of GP regression, but people have
also used GPs for unsupervised learning. So what happens if you don't have inputs
but you want to, nevertheless, assume a generative model based on GPs? In that
case X, the inputs, are latent, unobserved.
And one approach is to say, "Well, they are unobserved --" So basically this is the GP-LVM framework, the Gaussian process latent variable model of Neil Lawrence. It's a very
successful model. So in the original approach you say, "Well, X is unobserved, so what
I'm going to do is just optimize it [inaudible] in a MAP way." And in the Bayesian
GP-LVM, which is much more recent, you are trying to learn the posterior distribution
over X. So you are computing the marginal likelihood here.
So this is intractable basically. F here, the mapping, is marginalized out. But, you
know, if you include X here and try to propagate the prior, that's infeasible. I'm not going
to expand on that; I'm just saying that this is not an easy thing to do. But in this paper
here, Titsias and Lawrence show a variational framework with some tricks that make
this possible. And basically that's also what we base the deep GP approach on, which I'm going
to show.
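
[For reference, the intractable quantity in the Bayesian GP-LVM, written in the talk's Y, X, F notation, has roughly this shape:]

```latex
p(Y) = \int p(Y \mid X)\, p(X)\, dX ,
\qquad
p(Y \mid X) = \int p(Y \mid F)\, p(F \mid X)\, dF .
```

[The inner integral over the mapping F is Gaussian and tractable; the outer integral over X is not, because X enters non-linearly through the covariance matrix. That is what the variational construction of Titsias and Lawrence addresses.]
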
Okay, so I've come to the deep GPs topic now. So that's what I was showing before. Right? So
you have your outputs and the inputs that can be latent. So what happens if you just
stack another GP on top of that? So you know, if you just do that and you don't want to
do any inference, you just want to sample, that's easy. I mean, you can do it. You just take
the inputs and you generate outputs based on some GP and then you use these
outputs as inputs for the next GP. And you take the final output. So that's easy to do.
And it will look something like this. So you take your inputs and the outputs look like this
because of the kernel you use here, so it's visibly non-linear. And if you then [inaudible]
from another GP and you take outputs in more dimensions, you get outputs like this. So
obviously you see that the overall model is something more than a GP. Actually the
overall mapping is not a GP, right, because you can see from the outputs here that you have very
long-term correlations and non-stationarities. For example, you see here you have the
same effect as having the length scale of a kernel be very small or very large
according to the layer above. So normally it would be very difficult to get samples like this
from a normal GP.
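
[Illustration: a self-contained numpy sketch of the sampling process just described, with one GP draw fed as input to the next; names and kernel settings are only illustrative.]

```python
import numpy as np

def sample_gp_layer(X_in, lengthscale, out_dim, jitter=1e-8):
    """Draw out_dim independent functions from a zero-mean GP with an RBF
    kernel, evaluated at the rows of X_in."""
    sq = np.sum(X_in**2, 1)
    sqdist = sq[:, None] + sq[None, :] - 2 * X_in @ X_in.T
    K = np.exp(-0.5 * sqdist / lengthscale**2) + jitter * np.eye(len(X_in))
    return np.random.multivariate_normal(np.zeros(len(X_in)), K, size=out_dim).T

X2 = np.linspace(-3, 3, 200).reshape(-1, 1)            # top-level inputs (e.g. time)
X1 = sample_gp_layer(X2, lengthscale=1.0, out_dim=2)   # intermediate layer
Y = sample_gp_layer(X1, lengthscale=0.5, out_dim=3)    # observed outputs
# Viewed as a function of X2, Y is a composition of two GP draws: it shows the
# non-stationarity and long-range structure that a single stationary GP with
# one lengthscale would struggle to reproduce.
```
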
So that's clear, right? If you just give a GP this input and this output, it's going to
struggle. All right. So that's for sampling. But if you want to use this as a model for
inference -- So if you just present the outputs to the model and you want to do inference for
the layers, that's actually very, very hard, because it's very hard to regularize and to train
such a model. And I'm going to explain why.
>>: So throughout this time you assume the kernel is given. Do you learn...
>> AD: No. But that's the difficult thing, right, to learn...
>>: Yeah.
>> AD: ...the kernel and also do the marginalization. So for the demonstration we can
just fix the kernels, give the inputs and then just sample. I mean, that's how I did that.
But in the real model you want to learn the kernels and to be able to marginalize this in your
expressions. That's the difficult thing. And that's what I'm going to discuss.
So just a bit more discussion about why this is very difficult. So that's the joint distribution
of the model. And if you are doing totally unsupervised learning, this is also going to be
latent. Right? So you would have another term here, p of X2. But let's assume it's [inaudible]
given as an input. So if you don't attempt to marginalize this guy here and you just try to
learn everything, then this -- basically this is the hierarchical GP-LVM of Lawrence and
Moore, 2007. And what they show there is that it's very difficult to regularize, because
for one thing the dimensionality of these guys, of the latent variables, has to be given a
priori. You don't know how big this [inaudible] should be, and this increases as
you go up, as you have more layers. And for this reason it's also prone to overfitting,
because now all of these are parameters of the model that we try to optimize. And most
importantly, they show that deep structures are not really supported by the model
evidence.
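
[In symbols, for one hidden layer with the top input X_2 given, the joint being discussed is roughly:]

```latex
p(Y, X_1 \mid X_2) = p(Y \mid X_1)\, p(X_1 \mid X_2),
\qquad
p(Y \mid X_1) = \int p(Y \mid F_1)\, p(F_1 \mid X_1)\, dF_1 .
```

[The hierarchical GP-LVM keeps X_1 as parameters and optimizes it, which is why its dimensionality must be fixed a priori and why overfitting is a risk; the approach described next instead targets the marginal p(Y | X_2) = \int p(Y | X_1) p(X_1 | X_2) dX_1, with X_1 integrated out.]
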
So if you try to do MAP inference here, and you also include white noise in your kernel
as people normally do, they find that at the top level the white noise just explodes. It's like
switching off the deep structure, so the model prefers shallow structures and they have
to force it.
And the cool thing about the stacked GPs in the Bayesian framework I'm going to show
is that the model actually supports the deep hierarchy even when you have very scarce
data. So that's the solution we proposed in this very recent paper. And we argue that the
proper thing to do is to marginalize out all intermediate layers, like people are doing with
traditional deep belief networks.
And to do that, basically you want to compute this marginal likelihood. So here you only
have one hidden layer, but it's the same if you have more hidden layers. And as I said,
this is intractable even for the simple stacking, and we consider stacking even more layers.
But we show that in our framework you can use a variational bound, which is
nonstandard actually; it's a bit more involved. And I'm not going to go into the details but
I'm going to give [inaudible]....
>>: So is the nature of this model similar to or different from the DBN model with contrastive
divergence?
>> AD: So I'm not an expert, as I said, in DBNs, but I think people there are based on
sampling techniques, right? So here it's not sampling; it's variational inference. So you
have a [inaudible] approximation and so you have a lower bound on the actual thing you
want to compute. And you just try to minimize the distance. So this is a constant
because it only depends on the data.
>>: Do you know where this similar thing can be applied for DBN at all?
>> AD: So I know that people have tried to do variational approaches, and I think Ian
Murray has recent papers. I don't know how successful it is, but I know that people
still use the more traditional techniques. So I don't know. But I know it's not an easy thing
to do. You know, some people might argue that maybe it's better to do sampling in the
first place. I don't know...
>>: [inaudible] so hopefully [inaudible] tighter bound [inaudible]...
>> AD: You don't have guarantees in the other case. You can just hope because, you
know, you do sampling. You don't know. Here it's a bit better because you have actually
the bound, yes, but...
>>: [inaudible] I mean what is the difference compared to -- So you do [inaudible] is
closer to what you would see when you apply [inaudible] right? Unless you want to
marginalize all the time, I mean, even at...
>> AD: But you're doing sampling also for the purpose of marginalizing. Right? So
they're both approximating techniques and, you know, I don't think people in machine
learning have settled on which is better for general models anyway.
>>: Nobody [inaudible] -- I don't know what your practice is. Usually you can use a very
small model with just a few nodes. And for that you can do exact sampling and then you
get real, you know, inference. And then you can compare how [inaudible]....
>> AD: Yeah, that's actually true. Yeah. But you have -- yeah, I haven't tried that, but
that might be a good idea actually. And, you know, we were discussing that because it's
good to see both approaches. But you know, sampling you can do any time, so
sampling can be applied everywhere. But the variational method is something that you have
to develop, and this one is not the standard variational method. It's a bit more
challenging actually.
And it's not only the inference; it's also the regularization thing. If you want to do
these things properly, you have to find a way to regularize this very large network of
layers and all these [inaudible] after fitting from one GP to another. So that's very hard to
learn.
And so we use a few tools. The first tool, as I said, is to find a way to marginalize all the
intermediate layers, which is unfortunately not [inaudible]. And to do that we extend the
variational methodology of the Bayesian GP-LVM, and naturally we based that on the
paper from last year at NIPS. That was only for a two-layer model. And
I'm also going to explain that in the most generic case, you can also learn additional
structure by allowing the hidden units to form conditionally independent sets by using the
Manifold Relevance Determination method. I'm going to expand on that in the following
slides; these are just methods to make this very big and difficult-to-regularize structure easier
to learn, and to automatically find the structure.
So first I just want to give some intuition as to why marginalizing the intermediate layers
and using this Bayesian framework works well for discovering the structure of the model.
So, you know, as I said before, a GP basically relies on the covariance function. So if you
don't use just any covariance function but this specific Automatic Relevance
Determination covariance function, then -- you know, it's a normal covariance function
that says, "If two inputs Xi and Xj are very close, then the function values should be very
similar for these two points." But this specific function also has a weight that weighs each
dimension of the latent points individually.
So you can imagine X, you know, the inputs, to be a matrix where rows are points, one
row is one point, and the columns are dimensions. So if you fit a GP on this, you would
have one weight per column. And when you optimize, a lot of these weights go to zero in
the Bayesian framework. And the model is saying, "You know, I don't need this
dimension." So what you are doing is that...
>>: So you are still talking about learning the weight from...
>> AD: Yeah.
>>: [inaudible] -- I thought it's very hard to learn [inaudible]...
>> AD: That's only hyperparameter learning for -- It's just hyperparameters.
>>: Oh, you don't...
>> AD: So these weights are not -- they're not like the traditional weights that you use for
DBNs. They're like weights of the covariance function; they're hyperparameters of the model.
>>: So you tie them all together but you don't [inaudible]. Okay.
>> AD: Yeah, exactly. So what you are doing is that you have this model -- now let's
just say you have these two guys here, right; it's not deep at all. It's not hierarchical. You
have your outputs, right, and you just initialize the -- so you have a latent node here, a
latent random variable; you don't know what it is. You just initialize it with, let's say, ten
dimensions and then you do the learning. And after you do the learning and [inaudible]
the weights, you see that the Bayesian procedure turns off all the unnecessary weights.
So here is the oil data demonstration. The model says, "You know, I just need two
dimensions, so I can represent my original data in just two dimensions. I don't need the
rest." And the one dimension that's doing all the work is basically this dimension that is
doing all the separation. So this is a very powerful way of learning the structure of this
model. And this is just a demonstration on a shallow architecture. But you can generalize
this as you add more layers. So is that clear? You know, is that -- you know, intuitively?
>>: But how does that kernel differ from -- What's it called? -- the radial basis function
kernel? [inaudible] it's the same [inaudible]...
>> AD: Yeah. So that's a radial basis function with...
>>: [inaudible] parameter.
>> AD: Yeah, with this additional thing here. So if you take this out and you make it
common for all dimensions, if you place it here as a common parameter then this is just
the RBF.
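
[Illustration: a minimal sketch of the ARD covariance being discussed, with one weight per input dimension; setting all weights equal recovers the plain RBF. Names are illustrative only.]

```python
import numpy as np

def ard_rbf_kernel(X1, X2, variance=1.0, weights=None):
    """k(x, x') = s^2 exp(-0.5 * sum_q w_q (x_q - x'_q)^2).
    A weight w_q optimized towards zero effectively switches dimension q off."""
    Q = X1.shape[1]
    w = np.ones(Q) if weights is None else np.asarray(weights)
    diff = X1[:, None, :] - X2[None, :, :]          # N1 x N2 x Q differences
    sqdist = np.einsum('ijq,q->ij', diff**2, w)     # weighted squared distances
    return variance * np.exp(-0.5 * sqdist)
```
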
So the other tool that we can use for eventually stacking the GPs is the Manifold Relevance
Determination method. And there we use again the same trick with ARD. So again I'm
talking about the case where you have only a very shallow -- not a deep architecture. It's
just the outputs and the inputs, and you have a GP here. So what happens if, you know,
instead of this case you have two output modalities? Let's say you have data from Kinect
and you have the RGB color and the depth images. If you try to model them in a
generative framework, you have to assume -- if it's generative, then the latent variable has
to have some commonality. Right? It has to be common, or some parts have to be common,
right, because if these guys are very, very relevant to each other and they're generated
by the same random variable, there have to be some commonalities. But if you only use a
single random variable for both, then you lose all the private information in the outputs.
Right? So if you have RGB here and depth here and you model them with the same
random variable and you have a video, then you can probably successfully model the
dynamics of the video, which is common in both views, but you lose all the private
information like the color or the depth or whatever.
So that's like the CCA model, right? So what we did, and what [inaudible] to use for the
generic version of the deep model, is that you can have your other modalities here and
you use two different GP mappings, one for each modality, from the initial latent space.
So initially you have a latent space bigger than the single variable, and you have two different
Gaussian processes which come with two different weight vectors, one per modality. And
when you fit the model, you get these different weight vectors, and these weight
vectors define the segmentation of the latent space.
I'm going to show it here a bit more clearly. So here you have one view and here you
have another view. You fit the GPs and then you have one weight per dimension of the
initial latent variable. So if, for example, this weight is switched off, we know that this
dimension doesn't matter for this guy here. If the first weight here and the first weight
here are switched on, are non-zero for both, then we know that this is the shared space.
So you can automatically learn a segmentation of the latent space in this way. Right?
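
[Illustration: once the per-view ARD weights are learned, reading off the segmentation is just a thresholding step; a hypothetical sketch with made-up names.]

```python
import numpy as np

def segment_latent_space(w_view1, w_view2, threshold=1e-2):
    """Label each latent dimension as shared, private to one view, or unused,
    given one ARD weight per dimension for each of the two views."""
    on1 = np.asarray(w_view1) > threshold
    on2 = np.asarray(w_view2) > threshold
    return {
        "shared": np.where(on1 & on2)[0],
        "private_view1": np.where(on1 & ~on2)[0],
        "private_view2": np.where(~on1 & on2)[0],
        "unused": np.where(~on1 & ~on2)[0],
    }
```
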
>>: How big is Q in your practice problem?
>> AD: So Q, you have to initialize it and then you let the -- I will explain. But you have to
initialize it and then the Bayesian procedure takes care of it. So it's difficult to know a priori,
but here's the thing: if you have a matrix of data that's N data points in Q dimensions, then the
rank of the matrix cannot exceed N or Q. So if you have 50 data points, Q cannot be more than
50. So at most you use 50.
>>: There are so many [inaudible] equation [inaudible]?
>> AD: Yeah, I mean...
>>: [inaudible] very small number, you'd probably use that.
>> AD: We have tried this with [inaudible] where the outputs are millions of dimensions and you have a
hundred data points, so we initialize with twenty dimensions. And eventually many are
switched off. So the thing is that, you know, if you don't have a lot of data, the [inaudible]
cannot be much. So usually I use twenty in practice.
>>: Yeah that [inaudible]...
>> AD: Just as a rule of thumb, because it's not about what the real dimensionality of the data
is -- it's not the real [inaudible] dimensionality of this --
it's about what your actual data set supports. Right? So maybe you have...
>>: So when you solve this problem Bayesian optimization works well?
>> AD: Yeah.
>>: [inaudible]...
>> AD: Yeah, I'm going to show you a demo right now, actually. So I think it's better if I
first describe the data set. So here's the data set that I'm going to use. I don't know if
you know the Yale Faces. It's a bunch of images of different people, and for each
individual you have images taken from different light angles. So what we did is that we
created one data set where we have all the different images of one individual and all the
different images of another individual, and three individuals in total. Right?
So here we have the images of three different persons under all possible illumination
conditions. In the other data set we have all the images of three different persons, again
under all illumination conditions. And then we align the images so that the only thing
they have in common is the lighting direction. So we have three different people in total
in this data set, and three different people in total in this data set. And we do the matching
so that, for example, any guy here can match any guy here as long as it's from the same
light position.
So artificially we created a data set where the commonality is the light direction and the
private information in each data set is the individual face characteristics of the persons.
And then, let's see the demo here.
Okay, so what I'm going to do here is that I trained the model on this kind of data, and I
got these weights. So these weights tell me -- what I initialized with was 14 dimensions.
Dimensions 1, 2 and 3 are common for both data sets, because the weights are
switched on for both data sets. Dimension 4, for example, is only relevant for the second
data set, and so on. So here I plot latent dimensions against each other. And every time I
move the mouse I sample from these points, just to see what happens and what
kind of information they encode.
So firstly I'm going to sample from Dimensions 1 and 3 which, as you see, are common.
And since the model learned that these two dimensions are common, I would expect to
see only light variations in the samples.
So you see the outputs here. And indeed that's the case. So I sample and you see that
what changes is only the direction of the light. Right? So wherever you have red crosses,
those are the actual training inputs that map to outputs, and wherever I move in the white
space it's just [inaudible] outputs, because it's a model of continuous variables. So here I
generate from [inaudible] lighting directions. If I go to the extremes -- I can also get
more extreme stuff that is not in the original data set.
So you see that the model successfully learns that Dimensions 1 and 3 are common for
these data sets. So now I'm going to fix, let's say, the light direction here. And I'm going
to sample from the private space to see if it indeed encodes private information. So
Dimensions 5 and 14, as you can see here, are private for the first model. And what
happens when I sample from these dimensions is that I get outputs that only vary in the
private information, which is the characteristics of the persons. So here when I sample I
get outputs for the first guy. When I sample here I get outputs for the face of the second
guy, and here for the third guy. And the cool thing is when I sample in between I get
[inaudible] outputs. So here I get this morphing effect, which is kind of cool. You
know, I want to try that with my own face as well. So you see that there is this morphing
effect. If you go in between, you get the face of a guy which is in between this guy and
this guy. And all this is with the direction of the light fixed to be there. You know, I
can just go back...
>>: So this dimension is the one that's non-zero?
>> AD: So this dimension is 5 and 14.
>>: 5, 14.
>> AD: So it's 5 and 14 which are the dominant dimensions that are also private for one
of the models. So...
>>: [inaudible] 8 and 10 which are the private...
>> AD: So if I take 8 and 10, nothing is going to happen because the model just -- So
that's a good question. So the model is just going to ignore. It says, "You know, nothing
happens." The model just doesn't care. You see, so now it's sampled but the output
doesn't change because the model says that these dimensions are irrelevant.
>>: Because you're showing the model for the first.
>> AD: Exactly. So you are here and -- oops. And it's like trying to get outputs in this
guy by sampling from the private space of that guy. It just isn't going to -- nothing
happens. So first I sampled from the shared space; that's when the light direction
changed. And then I sampled from the private space. And I could also do the same for
the other modality. But you get the feeling of how this works, right?
>>: Yeah.
>>: So I suppose you are going to show when you have one more layer up, things will
get better? Or not?
>> AD: Yeah, so that's a tool you can use to learn some additional structure in the latent
space. And now I will actually, yeah, start talking about stacking GPs eventually. But
first, just to mention, you know, since we are here at Microsoft, the potential
applications: so motion-capture data -- we have actually tried that and it works pretty
well, [inaudible] Lewis and motion capture data. Maybe another idea would be Kinect
data. I'm actually going to try that for the [inaudible] challenge; I think it would be a good
idea. Also for the deep model, but I never had the time to do that. But you know it's
something that I want to do in the future.
Okay. Now venturing back to stacking GPs. So for the moment -- so that's the most
general graphical model, simplified, of the stacked GPs; it's what I showed so far. But now
I just want to mention that you can also expand the model horizontally with the Manifold
Relevance Determination method that I showed previously, so that you can have more
modalities in the outputs. And in the same manner, as I said, you can have conditional
independencies in the hidden units.
So that was the intuition. But to make things simpler, let's forget that you can additionally
learn this kind of structure, and let's just focus on the simple case where you have one
modality and then you have stacked random variables. So just forget about this, and let's
just focus on the case where you have your outputs and then you stack single random
variables all the way up.
So as I said, we need the variational methodology to be able to marginalize all hidden
layers. And because it's a bit complicated, I'm not going to show exactly the math, but
just a demonstration, just to get a feeling for what kind of variables are involved, the
complexity, and how you can optimize this thing. I'm not going to explain it, obviously,
but just to get a feeling.
So if you want to marginalize all these guys, all the latent variables -- Well this can also
be observed if you have supervised learning. But anyway, if you want to marginalize the
latent points then you want to compute a bound on P of Y. And to compute this bound
you have to add all these terms here. And I just want to show graphically that the first
term depends on this, on the leaves. And you have terms like this on the intermediate
layers and this term for the top layer.
And so this guy has an expression like this and this guy has an expression like this. I'm
not going to describe it, but I just want you to see how this thing scales. Right? So what
you want to do -- if you make some more calculations -- is to introduce some
additional variational parameters that we call U here. And you want to take expectations
like this and like this for this term. And the reason I'm just showing this without explaining
is because I want to comment on the complexity of the model: what you have
is basically one set of variational parameters per GP, so per mapping. So again, if you
go to the simple case, you know, where you don't have all these modalities, you have
this case: you would have one set of variational parameters here and one set of given
parameters, and the same for this guy and the same as you go up.
So these things, as you can see, scale pretty much as L times the same amount as for the
sparse GPs. But, you know, you have to take these expectations, and these couple things
together, which makes things a bit more complicated. You have additional matrix multiplications, so
in practice it's a bit slower. And because of all these calculations, the problem becomes
even harder, so you have a non-convex optimization to do. And, you know, you have a
lot of local minima. I just want to show that to comment on the practical side of
[inaudible] this model.
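
[Without the sparse-GP details, the generic shape of the bound for one hidden layer, with the top input X_2 given, follows from Jensen's inequality; this is a paraphrase of the structure, not the exact expression from the paper:]

```latex
\log p(Y \mid X_2) \;\ge\;
\mathbb{E}_{q(X_1)}\!\left[ \log p(Y \mid X_1) \right]
+ \mathbb{E}_{q(X_1)}\!\left[ \log p(X_1 \mid X_2) \right]
+ \mathcal{H}\!\left[ q(X_1) \right] .
```

[Deeper hierarchies add one such group of terms per layer, an extra term for the prior on the top latent layer in the unsupervised case, and one set of variational (inducing) parameters U per GP mapping to make each expectation computable.]
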
>>: But in this case, I mean -- So this [inaudible]. So what would [inaudible]...
>> AD: Yeah.
>>: ...[inaudible] develop is that you [inaudible]?
>> AD: Yeah.
>>: Which is pretty [inaudible]. So do you have any way of removing this [inaudible]?
>> AD: So I think the strongest regularizer is the Bayesian framework itself. So when you
have the Bayesian framework and you have the prior, the model is somehow like an
automatic Occam's razor. It's not going to overfit as in the maximum a posteriori
approach, because you have, you know, this Occam's razor effect that tries to regularize
things. So if you had a maximum a posteriori approach and you were trying, let's say, to
do the same thing, you would get all weights switched on because the model says, "Oh,
more dimensions. Okay, I can use them, no problem, to fit the model better." But if you
have a Bayesian model, it tends to regularize itself better. And the model will say, "I don't
need these dimensions." And this generalizes as you add layers; it's the vital ingredient
that makes this thing possible...
>>: So you think is because of the unique property of GP it can do this rather than
[inaudible]...
>> AD: So it's not -- so it's the variational framework that allows a Bayesian treatment of
the whole thing. So if you just stack the GPs here and you don't marginalize this and you
try to optimize, then it's going to overfit. But if you marginalize this within a Bayesian
framework, then it's going to sort of auto-regularize. I'm not saying that this is ideal. But
for example, you don't get this effect that they noticed, that if you just optimize, the top
layer is like switched off because there's a lot of noise. And the model says, "Oh, I will
just try to learn this with a single node here with as many dimensions as possible and
that's it."
So the model is more robust in doing this abstract learning and this hierarchical learning.
>>: So [inaudible]...
>> AD: Yes.
>>: ...[inaudible].
>> AD: So it's more because also the Gaussian process is a non-parametric model, so I
think it's [inaudible]. It's like -- I think those are two more things for the recipe. And by
looking at what people are doing in the DBN literature, they also have these sorts of
problems, right? So they need to initialize very carefully. They do these [inaudible]
divergence tricks and they start from a point and then they sample. So these problems
exist here as well, obviously. You have to initialize very carefully. But what I'm saying is
that if you do all these things carefully, in theory and I would say in practice, you get
good results.
If you didn't have the Bayesian framework, then you would have a model that, you
know, by definition would not be...
>>: The whole point is that all the problems you mentioned about DBNs, that's all true.
You have all these problems. And that's why in practice [inaudible]. They use a DBN and
they use it as an initial parameter to do something else.
>> AD: Right.
>>: I don't know whether that [inaudible]...
>> AD: So here's the thing...
>>: ...[inaudible]...
>> AD: I don't want to claim that this is going to replace DBNs because I haven't made
the actual experiments to compare the two. And this is because I know that it's very
difficult to train the DBNs, and I didn't want to misrepresent them and say, "Oh, you
know, this is better." But I just want to show, you know, from a...
>>: But the point is having been [inaudible] DBN is not [inaudible] practical for
[inaudible].
>> AD: Right.
>>: So we want to see that whether we can extract information from this [inaudible]...
>> AD: So that's in...
>>: ...[inaudible]...
>> AD: ...the next slides here.
>>: Okay.
>> AD: But, okay, I see that you want the main points, and I should mention them. So I
think, if you try to get the gist of what I'm trying to intuitively convey here: firstly, I
just want to give, you know, maybe a fresh perspective on the general philosophy of
deep learning and say, you know, "People are always starting with RBMs; maybe we
should start thinking about better mappings between the layers and different models that
are still deep." And another thing is maybe to demonstrate that when data is scarce -- and
that's what I'm going to show in the experiments, and I know for sure that deep
models struggle when they have very, very few data points -- even when data is scarce,
you can use this model and learn useful representations. And, yeah, there's also the
[inaudible] Bayesian versus non-Bayesian thing. As I'm going to show, having the bound
here and having the variational approximation, you can see that the model actually
supports deep hierarchies.
So what I'm going to show in the experiments is that I tried models of different depths, so
I tried models with one layer, two layers, three, four, five. And the model and
the bound -- actually the model evidence -- prefers deep hierarchies. And this doesn't
happen with maximum a posteriori, if you take the maximum a posteriori approach. So the model
actually supports the deep structure.
So you know that it's a correct model to use. And of course you have the inference
problem and, as I said, the non-convexity. But, you know, it's a complicated model. So
okay let's now start seeing these things in practice. So to see this in practice I will just
take this simple case where I have the observed outputs. And I only have a two-level
architecture, so this is not deep but it's still stacked.
So in the normal GP you would have only this. Now in the stacked GP, only two levels
but still -- what you are doing is that you take the real inputs, you pass them through a GP
and then you take these latent nodes and you pass them through another GP to get the
outputs. So it's what I showed in the beginning, but now I'm going to comment on the
training that you mentioned.
So now what I'm going to do is sample -- I'm just going to take real data and
real inputs, and instead of modeling them with a single GP, I'm going to model them with this
stacked GP. So instead of taking the inputs and the outputs and training a supervised model, I'm
taking the inputs, I'm passing them through a GP that I'm going to learn -- I'm
marginalizing this level, so I'm learning a distribution -- and then I'm passing them through
another GP. And intuitively I would expect this to be a more robust model because it can
model data like this, for example.
And this is basically a warped model, right? So you take the inputs of the GP and you
warp them into something else before feeding them to the other GP. And this is basically
the model we presented last year at NIPS 2011. And there we just showed that only
for the case where the inputs are time, but it can be anything actually. So you have a
model that can model sequences, right? So each point here has its associated time
point.
Let's see where I have examples. So let's say you have a video sequence and, you
know, you have me [inaudible] and so on. So each frame is a picture of me. And you
give to the model the time points. So here is me at this time point; here is me at the next
time point. And you try to model it through this instead of through a regular GP. And I'm
going to show you maybe another demonstration now with this model.
So here I did exactly this. So give me a second. Here I took a video of this woman
talking. And this is very, very high dimensional, by the way. And the cool thing is that with
all these models we are able to model very high dimensional data, even if you have an
HD video with millions of dimensions.
So here we only have 150 training points, and each point is nearly a million dimensions.
We just fit it to the [inaudible] pixels. And we took this video, we removed some blocks of
frames and we tried to reconstruct these blocks. And to reconstruct these blocks, we gave
the time points for the missing blocks. So we set the training data to be the preserved
frames with their time stamps, and the test data to be the time stamps for which I'm asking it to
generate frames. So it's like, "At 30 minutes show me what the frame should be..."
>>: Interpolation.
>> AD: Interpolation. Exactly, right? Plus we can give some partial observations. So
here we gave -- But we can also not give partial observations. Here we just gave half....
>>: But you know something is missing? If you don't know there is something missing
here, can your technique [inaudible]...?
>> AD: So that's a good question. So...
>>: [inaudible]
>> AD: ...it's a generative model and you don't have to know, because you can even do
oversampling. So if you have -- simple example -- seconds 1, 2 up to 10, right, and you
try to generate the video at second 1.5, then, I mean, what does it mean to be missing?
It's a generative model. You can just generate as many samples as you want in
between. Right? If you happen to have this sample, then it's okay. But you don't really
care. You just sample as many as you want.
So here, whenever you see the green bar on top, it's a training point. And whenever
you see the red bar on top, it's something that was missing but we generated it. And then
we put everything in place just to show that the video is quite smooth. And let me play
the video here.
So you see that it manages to generate...
>>: Yeah, so dealing with the red point, which part of the video has been removed?
>> AD: So it's random blocks...
>>: [inaudible]
>> AD: But when you see the red point -- So for example if this is red, this means that
this was removed and it was reconstructed.
>>: I see.
>>: So you...
>>: The original...
>>: ...removed the entire frame? Entire frame or parts of the frame?
>> AD: So for this demo, only the part from the lower lip down was presented to the
model, along with a time stamp. And all the rest was missing and it was reconstructed.
>>: I see.
>> AD: But we get very similar results if everything is missing. And...
>>: So essentially it's [inaudible] gets removed from [inaudible]...
>> AD: Exactly. Yeah. So it's like a very, very sophisticated mean somehow. And
actually here I paused on the most challenging frame because it's where the head
moves, and the model doesn't have enough data so it's a bit blurry.
>>: It looks like the motion blurs [inaudible].
>>: Yeah, yeah, exactly.
>> AD: Yeah, but it's not the motion so it's actually the [inaudible] because the model is
like, "Oh, I'm going to take a bit of this and a bit of this and..."
>>: These all have a [inaudible] Kalman filter, right?
>> AD: Yeah, yeah....
>>: [inaudible]
>> AD: It's the same thing.
>>: But it has some, you know, probabilistic [inaudible] dictating the trajectory...
>> AD: Yeah.
>>: ...[inaudible]
>> AD: So I also have another video so...
>>: So this was interesting. So there was a case when she was blinking and it...
>> AD: Oh.
>>: ...[inaudible], right?
>> AD: Yeah, yeah. How did you notice that? Actually I wanted to comment on that and
I forgot. So the eye -- That's actually a funny thing. So...
>>: So how does it know?
>> AD: So the model doesn't actually know. So you see funny things happening with the
eyes because it was only given this and the time stamp. So the model tries to learn this
but it's not very easy. So you can indeed see very funny stuff. Not very fun but...
>>: So was the original [inaudible] blinking or not?
>> AD: We didn't compare frame by frame, but I think there's a point where she's blinking
and she shouldn't be, or something like this. But in my opinion this is actually a good
thing, because it's a generative model and you don't want to get stuck to -- it was only
trained on 100 data points. You don't want to get stuck on that. You want to generate new
things. And if...
>>: But I think if you use that word "Gaussian process" [inaudible].
>> AD: Yeah, yeah.
>>: In [inaudible]...
>> AD: Yeah, yeah, exactly.
>>: Somebody [inaudible]
>> AD: Yeah, exactly. So the GP is struggling to learn the smooth parts and the
non-smooth parts, so it's a bit -- so, yeah, that's why it's happening. So if we have time I'm
going to show another video later. But because I want to make sure that I'm going to
show also the deepest experiments, I'm going to keep this for the end if we have time.
So, you know, if I have time I will come back and also show another video. But first I just
want to make sure that I will present these final two slides, because they are actual deep
hierarchies.
So these things are actually in the paper. So here we use the MNIST data and we
consider a deep hierarchy. So now it's no longer just two layers; we trained models with
two, three, four or five layers in total. And here I'm presenting the results for the deepest
hierarchy we tried: five layers.
We took a very small subset of the MNIST data: we only took fifty samples of zeros,
fifty samples of sixes and fifty samples of ones. So [inaudible] classes, and in total 150
points only. We trained the model and we got these optimized weights. So you
remember that I said it's vital to have this Bayesian framework because you can learn
automatically the structure of the model. So that's what you see here. So you initialize
these guys, let's say in 20 dimensions or something. And then the model says, "Okay --"
You only give the outputs, you only give the images of the digits and the depth you like,
and the model automatically says, "Okay, I'm going to use 12 units in the first layer, 5
units in the second layer and, you know, 4 units in the top layer," and so on.
So the model automatically switches off units, basically. And that's what makes it robust.
And when I sample from these spaces to see what they encode, I get samples like this.
So I'm going to show that you learn features that go from generic features to local
features as you move down through the hierarchy, as one would expect. So if you
sample from the very top layer -- the very top layer is four-dimensional. If you sample
from the dominant dimension to see what's going on, you get samples that basically can
be any of the digits. So you have a zero, you have stuff in between, you have a six and then
you have something that looks between a six and a one, and you have ones. So the top layer
indeed encodes the most abstract information.
It encodes information like, you know, "I have three digits and this is what they look like."
And because it's a generative model, you can also sample stuff that is in between,
because it's a continuous variable.
>>: So this is after all the learning is done in [inaudible] parameter?
>> AD: Exactly. You do all the learning and then you try to sample from the model and
see what the model learned. So if you sample from this, from the top layer, you get
samples like this which are very generic. So the top layer says, "What I'm
encoding is just whether it's a one, a six, a zero," or stuff in between. And if you sample from this
layer it's, again, abstract but less abstract.
>>: Let me [inaudible]. What do you mean by sample from one layer? What happens
with the layer? I mean the whole generative process...
>> AD: Yeah. So you sample from here and then you propagate.
>>: And -- I see.
>> AD: So...
>>: So you're not marginalizing or anything?
>> AD: You do marginalize. So it's you take the posterior distribution of this even. You
know, you take the posterior distribution by having...
>>: And then just propagate that posterior on [inaudible].
>> AD: Yes. Yes. So here let's say that X is four dimensional. What you are doing is that
you initialize to a single X and then you change the values of one of the dimensions
continuously. And you get this output by...
>>: So those are a different sampling formula? You get a different number.
>> AD: So these are different samples. I actually have a demo here but I don't know if I
have time.
>>: [inaudible]....
>> Ashish Kapoor: Yeah, you should probably wrap up.
>>: Okay.
>> AD: Yeah, so...
>> Ashish Kapoor: I mean we can show the demo at the end [inaudible].
>>: Okay.
>> AD: So it's the same thing that I showed before. And I'm just going to show how this
was generated.
So here I'm just going to show how I generate samples from the bottom layer. So it's
exactly the same thing, right? I'm just sampling from two of the dimensions of the
bottom layer, and I obtain very local features. So you see, they encode whether the zero is
closed or not, whether it's a closed circle. This is one dimension and this is something different. And
if I sample from the second layer, again you have more local features. You see? More
local features. And if I sample from the top layer, I have very generic features. You get
ones, zeros, sixes; you get everything.
>>: This is a very different way of doing this than the DBN [inaudible], because with a DBN they
put a label as part of [inaudible]. And then they use this, what's called a sampling, you
know, in...
>>: [inaudible]
>>: [inaudible]
>> AD: So...
>>: But here you can do the same thing, right? You put that certain...
>> AD: But you still initialize -- Yeah, I mean you're right. You can just sample, but I
think you can do the same with the DBNs.
>>: [inaudible] specific label on. So if I say let's generate nines, for example, you're
going to show me what it looks like.
>> AD: To generate, let's say, nines. So...
>>: But you don't have the...
>> AD: ...you can do that if you figure out which dimensions are responsible for
generating nines. But if you want to...
>>: Can you do that if you have input and then you upend the [inaudible]...
>>: [inaudible]
>> AD: Yeah.
>>: [inaudible]
>>: Right.
>>: You can do it.
>>: And then you learn the [inaudible]. And then say that "All right, well this [inaudible]."
>>: Yeah...
>> AD: So...
>>: [inaudible] frees that unit as part of the...
>>: As part of [inaudible]...
>> AD: So actually when I show these models...
>>: You basically look at the conditional...
[simultaneous inaudible voices]
>>: [inaudible] to see the comparison.
>> AD: Actually...
[simultaneous inaudible voices]
>>: I actually tried that for [inaudible]. It actually quite well.
>> AD: Yeah. So actually with this model that I showed here you can say, "I have my
data and here I have my labels." We have tried that and it works pretty well with motion
capture. So here you have your actual data and here you have the labels. And then
when you give a new label, you can sample, go through the latent space and get
outputs here. So that's exactly what you asked, right? And you can do exactly the same
in the deep model.
So you can just give a label here, go through the deep hierarchy and then
obtain samples here.
>>: [inaudible] show the demo? That'd be interesting to see.
>> AD: No. I didn't think about that actually.
>>: [inaudible] you can see [inaudible]...
>> AD: Yeah. So...
>>: ...[inaudible] can learn [inaudible].
>> AD: So you are right. That would be quite interesting. I mean, you can still do it if you
figure out which single dimensions are responsible for generating sixes, like in the other demo
where you had a specific face. But, yeah, that's actually a good idea and straightforward to do.
>>: [inaudible]
>> AD: So that's the last slide. I don't know if I have time for this experiment.
>> Ashish Kapoor: Don't you have a demo you can show quickly?
>> AD: So I think it's better if I just describe this. It's quicker. So what I did here is,
again, with the two modalities. So you have two different data sets. The first data set is
motion capture of a guy walking. The second data set is of another guy walking. But
these two subjects are interacting. So they are approaching each other, they do a high-five and then they go apart. So you know that the dynamics have some commonality,
right, because they are doing very similar motions. But they also have some stuff that is
not common because, you know, there's a different walking style, and also a different direction
obviously for each one.
So we model this with a stacked GP model and we also use this Manifold Relevance
Determination trick to take into account the different modalities. So here we have all the
frames of the first guy walking. Here are all the frames of the second guy walking. And
that's all we give to the model. We train the model and then the model automatically
discovers this latent space that basically says, "I have a shared part and I have two
private parts. And I have a shared part on top of the hierarchy in this deep model." And
the shared part looks like this, which is pretty intuitive because it's like the motion of -- it's
like the motion of these people who go like this, do a high-five and then continue
walking.
And if you sample from here and get outputs in the first output space, you get this guy. If
you sample from here and you get outputs in the second output space, you get this guy.
So it's like saying this is indeed shared information, because it's at this point where
they're about to high-five.
If you sample from this point, it's immediately after or before high-fiving. If you sample
from the private spaces, you get very private information. If you sampled from here,
you would see this guy, I don't know, moving the hand or moving the ankle, because
that's specific to the person and to the walking style of each person.
And if you sample from here, you get samples of the different styles of the overall
motion. So I don't have a very good demo for that, but I think you get the intuition. And the cool
thing is that you learn all this automatically. So when Lawrence [inaudible] in 2007 tried
to do this with a stacked hierarchical GP-LVM, they had to specify that they want
the shared space, and that they want this space to be two-dimensional and this space to be
private, and one space here.
And they had to constrain the noise here. So they had to hard-code all the structure. In
the Bayesian model you just give the outputs and everything is learned automatically.
And just to wrap up: while you are seeing this, I'm going to talk about future work. So one
of the downsides of this model is that, you know, since it's based on GPs it cannot
handle a lot of data, obviously. But there is a nice idea for extending this with stochastic
variational inference. And there are two very, very recent papers that we think are
applicable to our case. And we think it would be really cool to be able to do deep
networks with GPs and also with millions of data points. And I think this is the next step.
And, yeah, that's pretty much it. So thanks.
>>: Thank you.
[applause]
>>: [inaudible] how big are the weight vectors matrix that you have in that one -- Yeah,
this one. The example. How big is it? A five? Is it...
>> AD: So here, because I had 150 points and because I have [inaudible] with these
models, I know that it's not going to -- I mean, you can do some ad hoc experiments and
see that most of them are switched off. So, you know, I started with 15 here, and
because I know that you learn more abstract information -- I mean, you could just put 20,
20, 20, and so on. They're going to be switched off anyway. Just don't...
>>: If you get very much more complex in your image, do you [inaudible]...?
>> AD: I don't think so, because for the video with the woman talking, which is very
complex -- it's like all the pixels are -- it switches off all dimensions but 12. And this is
because, as I said, maybe for this image of the woman talking, which is one million
dimensions, maybe the real dimensionality is -- I don't know -- one thousand. But the
point is, the data that you have -- can it support one thousand dimensions? No,
because you have only a hundred data points. So pretty much, for that amount of data -- I mean,
I have never seen any model using more than 20, to be honest.
So if you want to know, in practice I just use 20. And that's because the GPs are limited
to, I don't know, a few thousand data points at most. And usually the data cannot support
a lot of dimensions.
>>: So that was relevant to [inaudible] because here you talk about regression models
all the time. So do you ever explore discrimination? [inaudible] be useful in discrimination
[inaudible].
>> AD: So you mean [inaudible], for example? So here it was totally unsupervised. And
one test we did -- and thanks for reminding me to say that -- is that we took the latent
spaces here and we did nearest neighbor. So for each latent point here, we found its
nearest neighbor and we looked at the errors. So if you take a six and all its nearest
neighbors are sixes, then you have zero errors.
And what we saw was that the deepest model had the best error. So it had only one
error here. So the best separation comes from the model that is the deepest. I actually
have this in the demo but, you know, I don't have a lot of time.
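
[Illustration: the nearest-neighbour check just described is easy to reproduce; a small sketch, assuming scikit-learn is available, of leave-one-out 1-NN error in a learned latent space. Variable names are made up.]

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_error(latent_X, labels):
    """Leave-one-out 1-NN error: for each latent point, check whether its
    nearest other point carries the same class label."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=2).fit(latent_X)  # neighbour 0 is the point itself
    _, idx = nn.kneighbors(latent_X)
    predicted = labels[idx[:, 1]]
    return np.mean(predicted != labels)
```
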
>>: So the extent of using a generative model for discrimination is that you just compute
the likelihood for each class? So you fix class [inaudible] here...
>> AD: Okay, you can...
>>: ...[inaudible]
>> AD: Class conditional -- So this is possible for the model, and we have actually done
that with other demos. If you see the Bayesian GP-LVM, for example, you will see this
class conditional classification. But for the deep model it's possible; it's the same. But we
haven't implemented that yet.
>>: I see.
>> AD: So here is the separation of the digits with the deepest hierarchy. See, it's pretty
good. So you have all the sixes here, all the zeros. Here you have a six that looks a bit
like a one, so it's here. So you have good separation, though you can't quantify it like this. And
[inaudible] where you actually give the inputs and you get outputs -- I didn't show that here,
but we have a paper in submission and it's going to be there.
>>: What do you think -- If you run it for MNIST, what kind of error rate do you think you
might get?
>> AD: You cannot run it for the whole MNIST.
>>: No? Oh.
>> AD: Because you use GPs, the scale -- You cannot -- I think you cannot even use
GPs for the MNIST, right? MNIST is huge. I don’t know. How big is it?
>>: Sixty thousand.
>> AD: Has anyone used GPs, just GPs, not even deep GPs, for that?
>>: I think you can do GPs on [inaudible]. Sixty thousand only, right?
>> AD: Sixty.
>>: Sixty thousand, yeah.
>> AD: I mean GPs are scaled by N cubed, right, so it's going to be sixty thousand...
>>: [inaudible] not that hard to work.
>> AD: To the power of three...
>>: Well this simulation is kernel [inaudible].
>> AD: Yeah.
>>: Kernel regression [inaudible].
>> AD: Yeah.
>>: So [inaudible]...
>> AD: But you have to invert sixty thousand by sixty thousand.
>>: Yeah, that's easy.
>> AD: Is it?
>>: I think it was [inaudible].
[simultaneous inaudible voices]
>>: No my computer is [inaudible] -- I mean I could easily do sixty thousand [inaudible].
[simultaneous inaudible voices]
>> AD: Yeah but you have to do a lot of inverses because it's non-convex, so you have,
I don't know, a thousand inverses.
>>: Yeah, that's fine. Let it run for the night or even for two, three days.
>> AD: Yeah. Anyway, I haven't tried. For the deep GPs it would be even slower.
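
[For a rough sense of the numbers being debated, with N = 60,000 and double precision:]

```latex
\text{memory for } K \approx N^2 \times 8\ \text{bytes} \approx 28.8\ \text{GB},
\qquad
\text{one Cholesky factorization} \approx \tfrac{1}{3} N^3 \approx 7.2 \times 10^{13}\ \text{flops}.
```

[A non-convex optimization repeats that factorization many times, which is why sparse or stochastic variational approximations come up as the way to scale.]
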
>>: So [inaudible] is that if you compare it with a DBN, as I mentioned early on DBN is
very good for generation but for discrimination DBN is very bad.
>> AD: Yeah, so...
>>: This is the kind of thing that [inaudible].
>> AD: So this is very recent work...
>>: [inaudible] that DBN is useful [inaudible]. It's just, you know, take whatever you have
and put it in a neural network and [inaudible]. And so I wonder whether this one has ever
been used...
>>: Do you still do -- So do you fix the DBN? Or do you do...
>>: It's the same thing...
>>: ...[inaudible] propagation on the deep structure?
>>: No, no, no. It's just [inaudible] stacking.
>>: So you do the stacking and that'll fix it?
>>: Yes. And [inaudible]...
>>: And then...
>>: ...initialize it for...
>>: ...[inaudible] something on top of it?
>>: Yeah. Yeah.
[simultaneous inaudible voices]
>>: But you don’t do propagation on the...
>>: No, no, no. You don't need to do that, yeah.
>>: Yeah, it's interesting. Like they actually use DBN as a feature extraction...
>>: [inaudible], yeah.
>>: ...[inaudible].
>>: But for some places that I've read that they said that they still do at the end some
propagation on the entire...
>>: Yeah, yeah, yeah. People do that but that's not really important.
>>: It's not important?
>>: [inaudible]. Pretty much [inaudible] that's enough. But the [inaudible]...
>>: And the end stack some discriminative model at the top?
>>: No, not even the top. You just [inaudible] whatever weight [inaudible] just like you
[inaudible] here. [inaudible] neural net machine and use that to initialize the other
machine [inaudible]. [inaudible] everybody's doing that.
>>: Yeah. [inaudible]...
>>: [inaudible] you claim that this is better maybe to explore that weighted information to
see whether other machines...
>> AD: I don't want to claim that this is better, actually, because I haven't done the
experiments. It's just that we were very excited about extending this to large data sets.
So we are working towards that first, so that then we can properly compare it to DBNs
and don't have to [inaudible] for four or five days or whatever. But, you know, in more
extensive -- I mean, this is very, very recent actually. Right? In more extensive work, we
would like to not only compare to DBNs but, you know, maybe they are complementary.
You know, GPs are sometimes used for [inaudible] or vice versa. I don't know, maybe
they're complementary with DBNs. And that would be interesting to see. Yeah.
>>: Thank you very much.
[simultaneous inaudible voices]
>> AD: Okay. Thanks.