>> Ashish Kapoor: So it's my pleasure to introduce Andreas Damianou who is interning
with us this winter. I mean, he's a graduate student at the University of Sheffield with Neil
Lawrence, and he's going to talk about deep Gaussian process models. And this is, I
believe, joint work with Neil as well.
>>: Yeah.
>> Ashish Kapoor: All right.
>> AD: Thanks. Okay, so thanks for coming. So this is very recent work that we did
with Neil just before I left Sheffield to come here. But in this presentation I'm also
going to include parts that are a bit older but very relevant to the deep GP models, and
for this we worked with collaborators like Michalis Titsias and Carl Ek.
Okay. So first of all I want to give some motivation for why you would want to consider GPs in
deep models. And, well, a colleague of mine said that lately it's very cool to have the
word "deep" in the title of your paper. But apart from that, I think the main motivation
-- I'm trying to display it now. So I show you this picture here of this girl. If I tell you
that this little girl here shows the gesture in sign language for "I love you," and then I
show you this bunch of pictures, I think it's very easy for the human brain to recognize
immediately that, you know, the rest of the pictures depict the same sign.
For a computer that would be very difficult. It would be very difficult to [inaudible] this
hand or something. So objectively speaking we know that the human brain is
very good at one-shot learning and at learning in general because, you know, there is
some sort of hierarchical representation of knowledge in our brain and also very good
prior models for the data.
And if you try to find an analogy for computers, then we have deep belief networks that try
to represent knowledge in hierarchies. And as for good prior models, it's been known
that Bayesian non-parametric models are quite good and flexible models and have been
used very successfully as prior models for data.
And of course we can achieve such an effect if we have many, many training examples
as well, but that's not always possible. Right? In real-world applications sometimes we
have very scarce data. So in this talk I'm mostly going to focus on the first two. The
deep GPs are trying to combine the structural advantages of deep belief networks
and also the advantages that come from having a Bayesian non-parametric approach to
the data.
And obviously it would be nice to compile many training examples if possible. So there
are two things: one is that in the current version it is not very easy, because you're
still using GPs that scale [inaudible] with the data. And also in many real-world
applications, these are not available anyway.
I'm not a deep learning expert, but just a very quick slide here on what people are
doing traditionally in deep learning: so you assume that you have your observations, Y,
here. That's like the outputs. And you assume that these outputs come from a deep
hierarchy of latent units stacked on top of each other. And the most successful approach
is to stack RBMs, restricted Boltzmann machines, and treat all the layers above as latent
units.
So the fact that you're using RBMs means three things basically: one is that the outputs
are essentially modeled as a linear weighted sum of the inputs. Another
thing is that the units here are binary. And the inference that people usually do in these
models is based on sampling methods like contrastive divergence and so on, because you are trying
to marginalize these hidden units and this is intractable.
So intuitively, if we take this model and consider the analogue of using GPs for the
mappings here instead of stacking RBMs, then we'll be able to model continuous outputs;
that's one thing. We would be able to have non-linear mappings. And I'm also going to
show here how we can do variational inference for this model. And this is good because
you get a bound and you can do model selection and so on.
Of course, the good thing with the traditional approach is that it can handle huge data sets,
and for the moment this is not very easy with GPs. But I'm going to describe that
later.
So I'm going to talk about the GPs, just a very, very quick introduction to GPs. I
assume most of you are already experts in GPs, but just a very quick introduction.
So a GP can be thought of as an infinite-dimensional Gaussian distribution. So here, for
example, if I sample from a one-dimensional Gaussian, I get these samples. Here I get
two-dimensional samples.
So if you sample from a GP, every sample is basically an infinite object; it's a function.
So in order to define a GP prior on some function, the basic ingredients are a mean
function, which we usually take to be zero, and a covariance function. This covariance
function is evaluated on a finite set of inputs, but as I'm going to show, the posterior is
over the function space, so everything is infinite there.
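
[Illustration: the "covariance function evaluated on a finite set of inputs" idea can be sketched in a few lines of numpy; the function and parameter names below are only illustrative, not code from the talk.]

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') = s^2 exp(-|x - x'|^2 / (2 l^2))."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Evaluate the covariance on a finite grid of inputs and draw "functions" as
# samples from the resulting finite-dimensional Gaussian N(0, K).
X = np.linspace(-5, 5, 200).reshape(-1, 1)
K = rbf_kernel(X, X, lengthscale=1.0) + 1e-8 * np.eye(len(X))  # jitter for stability
samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
# A shorter lengthscale gives rougher sample functions, a longer one smoother ones.
```
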
And by using specific covariance functions, what you are doing is making
assumptions about the properties of the functions that you are trying to model and not
about their specific parametric form. So for example, here you say that, you know, the functions
are very smooth, here they are not that smooth, and so on. So that's what I talked about
in the previous slide. So you have a prior on the function and you evaluate it on a finite
set of inputs. But then, if you combine it with a Gaussian likelihood, you get a GP
posterior, which is over the function space, and also a predictive distribution. And this
means that you can evaluate it for any input.
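
[Illustration: a minimal sketch of the posterior predictive just mentioned, using the standard GP regression equations with a Gaussian likelihood of noise variance noise_var; it can be used with a kernel such as the rbf_kernel sketch above, and the names are again only illustrative.]

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """Posterior predictive mean and covariance of a zero-mean GP at X_test."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)              # cross-covariance, N x N*
    K_ss = kernel(X_test, X_test)              # test covariance,  N* x N*
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                       # predictive mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                       # predictive covariance
    return mean, cov
```

[Because the kernel can be evaluated at arbitrary inputs, the predictive distribution can be queried anywhere, which is the point being made.]
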
And this is the notation I'm going to use for the entire talk. So Y is the output, X is the
input and F is the mapping between the two spaces. And F has a GP prior or Gaussian
process prior. And a very quick demonstration. I guess this is already clear to you guys,
right?
So initially, before we see any points and you just have the Gaussian process prior, the
model says, "I haven't seen any points, so the function can be anywhere in the gray
area." But then when you see, let's say, two points, the model says, "Well, I know that in
the neighborhood around this point the function has to pass through here, and the function
has to pass through here." And because of the smoothness assumptions, you know that the
points in the neighborhood cannot be, you know, very far away. And as you observe more
points, you learn the function better and better. If you don't have a lot of points, you can
still, based on your assumptions, generalize as best as you can. And so on.
So, you know, that was a very short demonstration of GP regression, but people have
also used GPs for unsupervised learning. So what happens if you don't have inputs
but you want to, nevertheless, assume a generative model based on GPs? In that
case X, the inputs, are latent, unobserved.
And one approach is to say, "Well, they are unobserved --" So basically this is the GP-LVM framework, the Gaussian process latent variable model of Neil Lawrence. It's a very
successful model. So in the original approach you say, "Well, X is unobserved, so what
I'm going to do is just optimize it [inaudible] in a MAP way." And in the Bayesian
GP-LVM, which is much more recent, you are trying to learn the posterior distribution
over X. So you are computing the marginal likelihood here.
So this is intractable basically. F here, the mapping, is marginalized out. But, you
know, if you include X here and try to propagate the prior, that's infeasible. I'm not going
to expand on that; I'm just saying that this is not an easy thing to do. But in this paper
here, Titsias and Lawrence show a variational framework with some tricks that make
this possible. And basically that's also what we base the deep GP approach on, which I'm going
to show.
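
[For reference, the intractable quantity in the Bayesian GP-LVM, written in the talk's Y, X, F notation, has roughly this shape:]

```latex
p(Y) = \int p(Y \mid X)\, p(X)\, dX ,
\qquad
p(Y \mid X) = \int p(Y \mid F)\, p(F \mid X)\, dF .
```

[The inner integral over the mapping F is Gaussian and tractable; the outer integral over X is not, because X enters non-linearly through the covariance matrix. That is what the variational construction of Titsias and Lawrence addresses.]
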
Okay, so I've come to the deep GPs topic now. So that's what I was showing before. Right? So
you have your outputs and the inputs that can be latent. So what happens if you just
stack another GP on top of that? So you know, if you just do that and you don't want to
do any inference, you just want to sample, that's easy. I mean, you can do it. You just take
the inputs and you generate outputs based on some GP and then you use these
outputs as inputs for the next GP. And you take the final output. So that's easy to do.
And it will look something like this. So you take your inputs and the outputs look like this
because of the kernel you use here, so it's visibly non-linear. And if you then [inaudible]
from another GP and you take outputs in more dimensions, you get outputs like this. So
obviously you see that the overall model is something more than a GP. Actually the
overall mapping is not a GP, right, because you can see from the outputs here that you have very
long-term correlations and non-stationarities. For example, you see here you have the
same effect as having the length scale of a kernel be very small or very large
according to the layer above. So normally it would be very difficult to get samples like this
from a normal GP.
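
[Illustration: a self-contained numpy sketch of the sampling process just described, with one GP draw fed as input to the next; names and kernel settings are only illustrative.]

```python
import numpy as np

def sample_gp_layer(X_in, lengthscale, out_dim, jitter=1e-8):
    """Draw out_dim independent functions from a zero-mean GP with an RBF
    kernel, evaluated at the rows of X_in."""
    sq = np.sum(X_in**2, 1)
    sqdist = sq[:, None] + sq[None, :] - 2 * X_in @ X_in.T
    K = np.exp(-0.5 * sqdist / lengthscale**2) + jitter * np.eye(len(X_in))
    return np.random.multivariate_normal(np.zeros(len(X_in)), K, size=out_dim).T

X2 = np.linspace(-3, 3, 200).reshape(-1, 1)            # top-level inputs (e.g. time)
X1 = sample_gp_layer(X2, lengthscale=1.0, out_dim=2)   # intermediate layer
Y = sample_gp_layer(X1, lengthscale=0.5, out_dim=3)    # observed outputs
# Viewed as a function of X2, Y is a composition of two GP draws: it shows the
# non-stationarity and long-range structure that a single stationary GP with
# one lengthscale would struggle to reproduce.
```
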
So that's clear, right? If you just give a GP this input and this output, it's going to
struggle. All right. So that's for sampling. But if you want to use this as a model for
inference -- So if you just present the outputs to the model and you want to do inference for
the layers, that's actually very, very hard, because it's very hard to regularize and to train
such a model. And I'm going to explain why.
>>: So throughout this time you assume the kernel is given. Do you learn...
>> AD: No. But that's the difficult thing, right, to learn...
>>: Yeah.
>> AD: ...the kernel and also do the marginalization. So for the demonstration we can
just fix the kernels, give the inputs and then just sample. I mean, that's how I did that.
But in the real model you want to learn the kernels and to be able to marginalize this in your
expressions. That's the difficult thing. And that's what I'm going to discuss.
So just a bit more discussion about why this is very difficult. So that's the joint distribution
of the model. And if you are doing totally unsupervised learning, this is also going to be
latent. Right? So you would have another term here, p of X2. But let's assume it's [inaudible]
given as an input. So if you don't attempt to marginalize this guy here and you just try to
learn everything, then this -- basically this is the hierarchical GP-LVM of Lawrence and
Moore, 2007. And what they show there is that it's very difficult to regularize, because
for one thing the dimensionality of these guys, of the latent variables, has to be given a
priori. You don't know how big this [inaudible] should be, and this increases as
you go up, as you have more layers. And for this reason it's also prone to overfitting,
because now all of these are parameters of the model that we try to optimize. And most
importantly, they show that deep structures are not really supported by the model
evidence.
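
[In symbols, for one hidden layer with the top input X_2 given, the joint being discussed is roughly:]

```latex
p(Y, X_1 \mid X_2) = p(Y \mid X_1)\, p(X_1 \mid X_2),
\qquad
p(Y \mid X_1) = \int p(Y \mid F_1)\, p(F_1 \mid X_1)\, dF_1 .
```

[The hierarchical GP-LVM keeps X_1 as parameters and optimizes it, which is why its dimensionality must be fixed a priori and why overfitting is a risk; the approach described next instead targets the marginal p(Y | X_2) = \int p(Y | X_1) p(X_1 | X_2) dX_1, with X_1 integrated out.]
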
So if you try to do MAP inference here, and you also include white noise in your kernel
as people normally do, they find that at the top level the white noise just explodes. It's like
switching off the deep structure, so the model prefers shallow structures and they have
to force it.
And the cool thing about the stacked GPs in the Bayesian framework I'm going to show
is that the model actually supports the deep hierarchy even when you have very scarce
data. So that's the solution we proposed in this very recent paper. And we argue that the
proper thing to do is to marginalize out all intermediate layers, like people are doing with
traditional deep belief networks.
And to do that, basically you want to compute this marginal likelihood. So here you only
have one hidden layer, but it's the same if you have more hidden layers. And as I said,
this is intractable even for the simple stacking, and we consider stacking even more layers.
But we show that in our framework you can use a variational bound, which is
nonstandard actually; it's a bit more involved. And I'm not going to go into the details but
I'm going to give [inaudible]....
>>: So is the nature of this model similar to or different from the DBN model with contrastive
divergence?
>> AD: So I'm not an expert, as I said, in DBNs, but I think people there are based on
sampling techniques, right? So here it's not sampling; it's variational inference. So you
have a [inaudible] approximation and so you have a lower bound on the actual thing you
want to compute. And you just try to minimize the distance. So this is a constant
because it only depends on the data.
>>: Do you know where this similar thing can be applied for DBN at all?
>> AD: So I know that people have tried to do variational approaches, and I think Ian
Murray has recent papers. I don't know how successful it is, but I know that people
still use the more traditional techniques. So I don't know. But I know it's not an easy thing
to do. You know, some people might argue that maybe it's better to do sampling in the
first place. I don't know...
>>: [inaudible] so hopefully [inaudible] tighter bound [inaudible]...
>> AD: You don't have guarantees in the other case. You can just hope because, you
know, you do sampling. You don't know. Here it's a bit better because you have actually
the bound, yes, but...
>>: [inaudible] I mean what is the difference compared to -- So you do [inaudible] is
closer to what you would see when you apply [inaudible] right? Unless you want to
marginalize all the time, I mean, even at...
>> AD: But you're doing sampling also for the purpose of marginalizing. Right? So
they're both approximating techniques and, you know, I don't think people in machine
learning have settled on which is better for general models anyway.
>>: Nobody [inaudible] -- I don't know what your practice is. Usually you can use a very
small model with just a few nodes. And for that you can do exact sampling and then you
get real, you know, inference. And then you can compare how [inaudible]....
>> AD: Yeah, that's actually true. Yeah. But you have -- yeah, I haven't tried that, but
that might be a good idea actually. And, you know, we were discussing that because it's
good to see both approaches. But you know, sampling you can do any time, so
sampling can be applied everywhere. But the variational method is something that you have
to develop, and this one is not the standard variational method. It's a bit more
challenging actually.
And it's not only the inference; it's also the regularization thing. If you want to do
these things properly, you have to find a way to regularize this very large network of
layers and all these [inaudible] after fitting from one GP to another. So that's very hard to
learn.
And so we use a few tools. The first tool, as I said, is to find a way to marginalize all the
intermediate layers, which is unfortunately not [inaudible]. And to do that we extend the
variational methodology of the Bayesian GP-LVM, and naturally we based that on the
paper from last year at NIPS. That was only for a two-layer model. And
I'm also going to explain that in the most generic case, you can also learn additional
structure by allowing the hidden units to form conditionally independent sets by using the
Manifold Relevance Determination method. I'm going to expand on that in the following
slides; these are just methods to make this very big and difficult-to-regularize structure easier
to learn, and to automatically find the structure.
So first I just want to give some intuition as to why marginalizing the intermediate layers
and using this Bayesian framework works well for discovering the structure of the model.
So, you know, as I said before, a GP basically relies on the covariance function. So if you
don't use just any covariance function but this specific Automatic Relevance
Determination covariance function, then -- you know, it's a normal covariance function
that says, "If two inputs Xi and Xj are very close, then the function values should be very
similar for these two points." But this specific function also has a weight that weighs each
dimension of the latent points individually.
So you can imagine X, you know, the inputs, to be a matrix where rows are points, one
row is one point, and the columns are dimensions. So if you fit a GP on this, you would
have one weight per column. And when you optimize, a lot of these weights go to zero in
the Bayesian framework. And the model is saying, "You know, I don't need this
dimension." So what you are doing is that...
>>: So you are still talking about learning the weight from...
>> AD: Yeah.
>>: [inaudible] -- I thought it's very hard to learn [inaudible]...
>> AD: That's only hyperparameter learning for -- It's just hyperparameters.
>>: Oh, you don't...
>> AD: So these weights are not -- they're not like the traditional weights that you use for
DBNs. They're like weights of the covariance function; they're hyperparameters of the model.
>>: So you tie them all together but you don't [inaudible]. Okay.
>> AD: Yeah, exactly. So what you are doing is that you have this model -- now let's
just say you have these two guys here, right; it's not deep at all. It's not hierarchical. You
have your outputs, right, and you just initialize the -- so you have a latent node here, a
latent random variable; you don't know what it is. You just initialize it with, let's say, ten
dimensions and then you do the learning. And after you do the learning and [inaudible]
the weights, you see that the Bayesian procedure turns off all the unnecessary weights.
So here is the oil data demonstration. The model says, "You know, I just need two
dimensions, so I can represent my original data in just two dimensions. I don't need the
rest." And the one dimension that's doing all the work is basically this dimension that is
doing all the separation. So this is a very powerful way of learning the structure of this
model. And this is just a demonstration on a shallow architecture. But you can generalize
this as you add more layers. So is that clear? You know, is that -- you know, intuitively?
>>: But how does that kernel differ from -- What's it called? -- the radial basis function
kernel? [inaudible] it's the same [inaudible]...
>> AD: Yeah. So that's a radial basis function with...
>>: [inaudible] parameter.
>> AD: Yeah, with this additional thing here. So if you take this out and you make it
common for all dimensions, if you place it here as a common parameter then this is just
the RBF.
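
[Illustration: a minimal sketch of the ARD covariance being discussed, with one weight per input dimension; setting all weights equal recovers the plain RBF. Names are illustrative only.]

```python
import numpy as np

def ard_rbf_kernel(X1, X2, variance=1.0, weights=None):
    """k(x, x') = s^2 exp(-0.5 * sum_q w_q (x_q - x'_q)^2).
    A weight w_q optimized towards zero effectively switches dimension q off."""
    Q = X1.shape[1]
    w = np.ones(Q) if weights is None else np.asarray(weights)
    diff = X1[:, None, :] - X2[None, :, :]          # N1 x N2 x Q differences
    sqdist = np.einsum('ijq,q->ij', diff**2, w)     # weighted squared distances
    return variance * np.exp(-0.5 * sqdist)
```
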
So the other tool that we can use for eventually stacking the GPs is the Manifold Relevance
Determination method. And there we use again the same trick with ARD. So again I'm
talking about the case where you have only a very shallow -- not a deep architecture. It's
just the outputs and the inputs, and you have a GP here. So what happens if, you know,
instead of this case you have two output modalities? Let's say you have data from Kinect
and you have the RGB color and the depth images. If you try to model them in a
generative framework, you have to assume -- if it's generative, then the latent variable has
to have some commonality. Right? It has to be common, or some parts have to be common,
right, because if these guys are very, very relevant to each other and they're generated
by the same random variable, there have to be some commonalities. But if you only use a
single random variable for both, then you lose all the private information in the outputs.
Right? So if you have RGB here and depth here and you model them with the same
random variable and you have a video, then you can probably successfully model the
dynamics of the video, which is common in both views, but you lose all the private
information like the color or the depth or whatever.
So that's like the CCA model, right? So what we did, and what [inaudible] to use for the
generic version of the deep model, is that you can have your other modalities here and
you use two different GP mappings, one for each modality, from the initial latent space.
So initially you have a latent space bigger than the single variable, and you have two different
Gaussian processes which come with two different weight vectors, one per modality. And
when you fit the model, you get these different weight vectors, and these weight
vectors define the segmentation of the latent space.
I'm going to show it here a bit more clearly. So here you have one view and here you
have another view. You fit the GPs and then you have one weight per dimension of the
initial latent variable. So if, for example, this weight is switched off, we know that this
dimension doesn't matter for this guy here. If the first weight here and the first weight
here are switched on, are non-zero for both, then we know that this is the shared space.
So you can automatically learn a segmentation of the latent space in this way. Right?
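
[Illustration: once the per-view ARD weights are learned, reading off the segmentation is just a thresholding step; a hypothetical sketch with made-up names.]

```python
import numpy as np

def segment_latent_space(w_view1, w_view2, threshold=1e-2):
    """Label each latent dimension as shared, private to one view, or unused,
    given one ARD weight per dimension for each of the two views."""
    on1 = np.asarray(w_view1) > threshold
    on2 = np.asarray(w_view2) > threshold
    return {
        "shared": np.where(on1 & on2)[0],
        "private_view1": np.where(on1 & ~on2)[0],
        "private_view2": np.where(~on1 & on2)[0],
        "unused": np.where(~on1 & ~on2)[0],
    }
```
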
>>: How big is Q in your practice problem?
>> AD: So Q, you have to initialize it and then you let the -- I will explain. But you have to
initialize it and then the Bayesian procedure takes care of it. So it's difficult to know a priori,
but here's the thing: if you have a matrix of data that's N data points in Q dimensions, then the
rank of the matrix cannot exceed N or Q. So if you have 50 data points, Q cannot be more than
50. So at most you use 50.
>>: There are so many [inaudible] equation [inaudible]?
>> AD: Yeah, I mean...
>>: [inaudible] very small number, you'd probably use that.
>> AD: We have tried this with [inaudible] where the outputs are millions of dimensions and you have a
hundred data points, so we initialize with twenty dimensions. And eventually many are
switched off. So the thing is that, you know, if you don't have a lot of data, the [inaudible]
cannot be much. So usually I use twenty in practice.
>>: Yeah that [inaudible]...
>> AD: Just as a rule of thumb, because it's not about what the real dimensionality of the data
is -- it's not the real [inaudible] dimensionality of this --
it's about what your actual data set supports. Right? So maybe you have...
>>: So when you solve this problem Bayesian optimization works well?
>> AD: Yeah.
>>: [inaudible]...
>> AD: Yeah, I'm going to show you a demo right now, actually. So I think it's better if I
first describe the data set. So here's the data set that I'm going to use. I don't know if
you know the Yale Faces. It's a bunch of images of different people, and for each
individual you have images taken from different light angles. So what we did is that we
created one data set where we have all the different images of one individual and all the
different images of another individual, and three individuals in total. Right?
So here we have the images of three different persons under all possible illumination
conditions. In the other data set we have all the images of three different persons, again
under all illumination conditions. And then we align the images so that the only thing
they have in common is the lighting direction. So we have three different people in total
in this data set, and three different people in total in this data set. And we do the matching
so that, for example, any guy here can match any guy here as long as it's from the same
light position.
So artificially we created a data set where the commonality is the light direction and the
private information in each data set is the individual face characteristics of the persons.
And then, let's see the demo here.
Okay, so what I'm going to do here is that I trained the model on this kind of data, and I
got these weights. So these weights tell me -- what I initialized with was 14 dimensions.
Dimensions 1, 2 and 3 are common for both data sets, because the weights are
switched on for both data sets. Dimension 4, for example, is only relevant for the second
data set, and so on. So here I plot latent dimensions against each other. And every time I
move the mouse I sample from these points, just to see what happens and what
kind of information they encode.
So firstly I'm going to sample from Dimensions 1 and 3 which, as you see, are common.
And since the model learned that these two dimensions are common, I would expect to
see only light variations in the samples.
So you see the outputs here. And indeed that's the case. So I sample and you see that
what changes is only the direction of the light. Right? So wherever you have red crosses,
those are the actual training inputs that map to outputs, and wherever I move in the white
space it's just [inaudible] outputs, because it's a model of continuous variables. So here I
generate from [inaudible] lighting directions. If I go to the extremes -- I can also get
more extreme stuff that is not in the original data set.
So you see that the model successfully learns that Dimensions 1 and 3 are common for
these data sets. So now I'm going to fix, let's say, the light direction here. And I'm going
to sample from the private space to see if it indeed encodes private information. So
Dimensions 5 and 14, as you can see here, are private for the first model. And what
happens when I sample from these dimensions is that I get outputs that only vary in the
private information, which is the characteristics of the persons. So here when I sample I
get outputs for the first guy. When I sample here I get outputs for the face of the second
guy, and here for the third guy. And the cool thing is when I sample in between I get
[inaudible] outputs. So here I get this morphing effect, which is kind of cool. You
know, I want to try that with my own face as well. So you see that there is this morphing
effect. If you go in between, you get the face of a guy which is in between this guy and
this guy. And all this is with the direction of the light fixed to be there. You know, I
can just go back...
>>: So this dimension is the one that's non-zero?
>> AD: So this dimension is 5 and 14.
>>: 5, 14.
>> AD: So it's 5 and 14 which are the dominant dimensions that are also private for one
of the models. So...
>>: [inaudible] 8 and 10 which are the private...
>> AD: So if I take 8 and 10, nothing is going to happen because the model just -- So
that's a good question. So the model is just going to ignore. It says, "You know, nothing
happens." The model just doesn't care. You see, so now it's sampled but the output
doesn't change because the model says that these dimensions are irrelevant.
>>: Because you're showing the model for the first.
>> AD: Exactly. So you are here and -- oops. And it's like trying to get outputs in this
guy by sampling from the private space of that guy. It just isn't going to -- nothing
happens. So first I sampled from the shared space; that's when the light direction
changed. And then I sampled from the private space. And I could also do the same for
the other modality. But you get the feeling of how this works, right?
>>: Yeah.
>>: So I suppose you are going to show when you have one more layer up, things will
get better? Or not?
>> AD: Yeah, so that's a tool you can use to learn some additional structure in the latent
space. And now I will actually, yeah, start talking about stacking GPs eventually. But
first, just to mention, you know, since we are here at Microsoft, the potential
applications: so motion-capture data -- we have actually tried that and it works pretty
well, [inaudible] Lewis and motion capture data. Maybe another idea would be Kinect
data. I'm actually going to try that for the [inaudible] challenge; I think it would be a good
idea. Also for the deep model, but I never had the time to do that. But you know it's
something that I want to do in the future.
Okay. Now venturing back to stacking GPs. So for the moment -- so that's the most
general graphical model, simplified, of the stacked GPs; it's what I showed so far. But now
I just want to mention that you can also expand the model horizontally with the Manifold
Relevance Determination method that I showed previously, so that you can have more
modalities in the outputs. And in the same manner, as I said, you can have conditional
independencies in the hidden units.
So that was the intuition. But to make things simpler, let's forget that you can additionally
learn this kind of structure, and let's just focus on the simple case where you have one
modality and then you have stacked random variables. So just forget about this, and let's
just focus on the case where you have your outputs and then you stack single random
variables all the way up.
So as I said, we need the variational methodology to be able to marginalize all hidden
layers. And because it's a bit complicated, I'm not going to show exactly the math, but
just a demonstration, just to get a feeling for what kind of variables are involved, the
complexity, and how you can optimize this thing. I'm not going to explain it, obviously,
but just to get a feeling.
So if you want to marginalize all these guys, all the latent variables -- Well this can also
be observed if you have supervised learning. But anyway, if you want to marginalize the
latent points then you want to compute a bound on P of Y. And to compute this bound
you have to add all these terms here. And I just want to show graphically that the first
term depends on this, on the leaves. And you have terms like this on the intermediate
layers and this term for the top layer.
And so this guy has an expression like this and this guy has an expression like this. I'm
not going to describe it, but I just want you to see how this thing scales. Right? So what
you want to do -- if you make some more calculations -- is to introduce some
additional variational parameters that we call U here. And you want to take expectations
like this and like this for this term. And the reason I'm just showing this without explaining
is because I want to comment on the complexity of the model: what you have
is basically one set of variational parameters per GP, so per mapping. So again, if you
go to the simple case, you know, where you don't have all these modalities, you have
this case: you would have one set of variational parameters here and one set of given
parameters, and the same for this guy and the same as you go up.
So these things, as you can see, scale pretty much as L times the same amount as for the
sparse GPs. But, you know, you have to take these expectations, and these couple things
together, which makes things a bit more complicated. You have additional matrix multiplications, so
in practice it's a bit slower. And because of all these calculations, the problem becomes
even harder, so you have a non-convex optimization to do. And, you know, you have a
lot of local minima. I just want to show that to comment on the practical side of
[inaudible] this model.
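
[Without the sparse-GP details, the generic shape of the bound for one hidden layer, with the top input X_2 given, follows from Jensen's inequality; this is a paraphrase of the structure, not the exact expression from the paper:]

```latex
\log p(Y \mid X_2) \;\ge\;
\mathbb{E}_{q(X_1)}\!\left[ \log p(Y \mid X_1) \right]
+ \mathbb{E}_{q(X_1)}\!\left[ \log p(X_1 \mid X_2) \right]
+ \mathcal{H}\!\left[ q(X_1) \right] .
```

[Deeper hierarchies add one such group of terms per layer, an extra term for the prior on the top latent layer in the unsupervised case, and one set of variational (inducing) parameters U per GP mapping to make each expectation computable.]
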
>>: But in this case, I mean -- So this [inaudible]. So what would [inaudible]...
>> AD: Yeah.
>>: ...[inaudible] develop is that you [inaudible]?
>> AD: Yeah.
>>: Which is pretty [inaudible]. So do you have any way of removing this [inaudible]?
>> AD: So I think the strongest regularizer is the Bayesian framework itself. So when you
have the Bayesian framework and you have the prior, the model is somehow like an
automatic Occam's razor. It's not going to overfit as in the maximum a posteriori
approach, because you have, you know, this Occam's razor effect that tries to regularize
things. So if you had a maximum a posteriori approach and you were trying, let's say, to
do the same thing, you would get all weights switched on because the model says, "Oh,
more dimensions. Okay, I can use them, no problem, to fit the model better." But if you
have a Bayesian model, it tends to regularize itself better. And the model will say, "I don't
need these dimensions." And this generalizes as you add layers; it's the vital ingredient
that makes this thing possible...
>>: So you think is because of the unique property of GP it can do this rather than
[inaudible]...
>> AD: So it's not -- so it's the variational framework that allows a Bayesian treatment of
the whole thing. So if you just stack the GPs here and you don't marginalize this and you
try to optimize, then it's going to overfit. But if you marginalize this within a Bayesian
framework, then it's going to sort of auto-regularize. I'm not saying that this is ideal. But
for example, you don't get this effect that they noticed, that if you just optimize, the top
layer is like switched off because there's a lot of noise. And the model says, "Oh, I will
just try to learn this with a single node here with as many dimensions as possible and
that's it."
So the model is more robust in doing this abstract learning and this hierarchical learning.
>>: So [inaudible]...
>> AD: Yes.
>>: ...[inaudible].
>> AD: So it's more because also the Gaussian process is a non-parametric model, so I
think it's [inaudible]. It's like -- I think those are two more things for the recipe. And by
looking at what people are doing in the DBN literature, they also have these sorts of
problems, right? So they need to initialize very carefully. They do these [inaudible]
divergence tricks and they start from a point and then they sample. So these problems
exist here as well, obviously. You have to initialize very carefully. But what I'm saying is
that if you do all these things carefully, in theory and I would say in practice, you get
good results.
If you didn't have the Bayesian framework, then you would have a model that, you
know, by definition would not be...
>>: The whole point is that all the problems you mentioned about DBNs, that's all true.
You have all these problems. And that's why in practice [inaudible]. They use a DBN and
they use it as an initial parameter to do something else.
>> AD: Right.
>>: I don't know whether that [inaudible]...
>> AD: So here's the thing...
>>: ...[inaudible]...
>> AD: I don't want to claim that this is going to replace DBNs because I haven't made
the actual experiments to compare the two. And this is because I know that it's very
difficult to train the DBNs, and I didn't want to misrepresent them and say, "Oh, you
know, this is better." But I just want to show, you know, from a...
>>: But the point is having been [inaudible] DBN is not [inaudible] practical for
[inaudible].
>> AD: Right.
>>: So we want to see that whether we can extract information from this [inaudible]...
>> AD: So that's in...
>>: ...[inaudible]...
>> AD: ...the next slides here.
>>: Okay.
>> AD: But, okay, I see that you want the main points, and I should mention them. So I
think, if you try to get the gist of what I'm trying to intuitively convey here: firstly, I
just want to give, you know, maybe a fresh perspective on the general philosophy of
deep learning and say, you know, "People are always starting with RBMs; maybe we
should start thinking about better mappings between the layers and different models that
are still deep." And another thing is maybe to demonstrate that when data is scarce -- and
that's what I'm going to show in the experiments, and I know for sure that deep
models struggle when they have very, very few data points -- even when data is scarce,
you can use this model and learn useful representations. And, yeah, there's also the
[inaudible] Bayesian versus non-Bayesian thing. As I'm going to show, having the bound
here and having the variational approximation, you can see that the model actually
supports deep hierarchies.
So what I'm going to show in the experiments is that I tried models of different depths, so
I tried models with one layer, two layers, three, four, five. And the model and
the bound -- actually the model evidence -- prefers deep hierarchies. And this doesn't
happen with maximum a posteriori, if you take the maximum a posteriori approach. So the model
actually supports the deep structure.
So you know that it's a correct model to use. And of course you have the inference
problem and, as I said, the non-convexity. But, you know, it's a complicated model. So
okay let's now start seeing these things in practice. So to see this in practice I will just
take this simple case where I have the observed outputs. And I only have a two-level
architecture, so this is not deep but it's still stacked.
So in the normal GP you would have only this. Now in the stacked GP, only two levels
but still -- what you are doing is that you take the real inputs, you pass them through a GP
and then you take these latent nodes and you pass them through another GP to get the
outputs. So it's what I showed in the beginning, but now I'm going to comment on the
training that you mentioned.
So now what I'm going to do is sample -- I'm just going to take real data and
real inputs, and instead of modeling them with a single GP, I'm going to model them with this
stacked GP. So instead of taking the inputs and the outputs and training a supervised model, I'm
taking the inputs, I'm passing them through a GP that I'm going to learn -- I'm
marginalizing this level, so I'm learning a distribution -- and then I'm passing them through
another GP. And intuitively I would expect this to be a more robust model because it can
model data like this, for example.
And this is basically a warped model, right? So you take the inputs of the GP and you
warp them into something else before feeding them to the other GP. And this is basically
the model we presented last year at NIPS 2011. And there we just showed that only
for the case where the inputs are time, but it can be anything actually. So you have a
model that can model sequences, right? So each point here has its associated time
point.
Let's see where I have examples. So let's say you have a video sequence and, you
know, you have me [inaudible] and so on. So each frame is a picture of me. And you
give to the model the time points. So here is me at this time point; here is me at the next
time point. And you try to model it through this instead of through a regular GP. And I'm
going to show you maybe another demonstration now with this model.
So here I did exactly this. So give me a second. Here I took a video of this woman
talking. And this is very, very high dimensional, by the way. And the cool thing is that with
all these models we are able to model very high dimensional data, even if you have an
HD video with millions of dimensions.
So here we only have 150 training points, and each point is nearly a million dimensions.
We just fit it to the [inaudible] pixels. And we took this video, we removed some blocks of
frames and we tried to reconstruct these blocks. And to reconstruct these blocks, we gave
the time points for the missing blocks. So we set the training data to be the preserved
frames with their time stamps, and the test data to be the time stamps for which I'm asking it to
generate frames. So it's like, "At 30 minutes show me what the frame should be..."
>>: Interpolation.
>> AD: Interpolation. Exactly, right? Plus we can give some partial observations. So
here we gave -- But we can also not give partial observations. Here we just gave half....
>>: But you know something is missing? If you don't know there is something missing
here, can your technique [inaudible]...?
>> AD: So that's a good question. So...
>>: [inaudible]
>> AD: ...it's a generative model and you don't have to know, because you can even do
oversampling. So if you have -- simple example -- seconds 1, 2 up to 10, right, and you
try to generate the video at second 1.5, then, I mean, what does it mean to be missing?
It's a generative model. You can just generate as many samples as you want in
between. Right? If you happen to have this sample, then it's okay. But you don't really
care. You just sample as many as you want.
So here, whenever you see the green bar on top, it's a training point. And whenever
you see the red bar on top, it's something that was missing but we generated it. And then
we put everything in place just to show that the video is quite smooth. And let me play
the video here.
So you see that it manages to generate...
>>: Yeah, so dealing with the red point, which part of the video has been removed?
>> AD: So it's random blocks...
>>: [inaudible]
>> AD: But when you see the red point -- So for example if this is red, this means that
this was removed and it was reconstructed.
>>: I see.
>>: So you...
>>: The original...
>>: ...removed the entire frame? Entire frame or parts of the frame?
>> AD: So for this demo, only the part from the lower lip down was presented to the
model, along with a time stamp. And all the rest was missing and it was reconstructed.
>>: I see.
>> AD: But we get very similar results if everything is missing. And...
>>: So essentially it's [inaudible] gets removed from [inaudible]...
>> AD: Exactly. Yeah. So it's like a very, very sophisticated mean somehow. And
actually here I paused on the most challenging frame because it's where the head
moves, and the model doesn't have enough data so it's a bit blurry.
>>: It looks like the motion blurs [inaudible].
>>: Yeah, yeah, exactly.
>> AD: Yeah, but it's not the motion so it's actually the [inaudible] because the model is
like, "Oh, I'm going to take a bit of this and a bit of this and..."
>>: These all have a [inaudible] Kalman filter, right?
>> AD: Yeah, yeah....
>>: [inaudible]
>> AD: It's the same thing.
>>: But it has some, you know, probabilistic [inaudible] dictating the trajectory...
>> AD: Yeah.
>>: ...[inaudible]
>> AD: So I also have another video so...
>>: So this was interesting. So there was a case when she was blinking and it...
>> AD: Oh.
>>: ...[inaudible], right?
>> AD: Yeah, yeah. How did you notice that? Actually I wanted to comment on that and
I forgot. So the eye -- That's actually a funny thing. So...
>>: So how does it know?
>> AD: So the model doesn't actually know. So you see funny things happening with the
eyes because it was only given this and the time stamp. So the model tries to learn this
but it's not very easy. So you can indeed see very funny stuff. Not very fun but...
>>: So was the original [inaudible] blinking or not?
>> AD: We didn't compare frame by frame, but I think there's a point where she's blinking
and she shouldn't be, or something like this. But in my opinion this is actually a good
thing, because it's a generative model and you don't want to get stuck to -- it was only
trained on 100 data points. You don't want to get stuck on that. You want to generate new
things. And if...
>>: But I think if you use that word "Gaussian process" [inaudible].
>> AD: Yeah, yeah.
>>: In [inaudible]...
>> AD: Yeah, yeah, exactly.
>>: Somebody [inaudible]
>> AD: Yeah, exactly. So the GP is struggling to learn the smooth parts and the
non-smooth parts, so it's a bit -- so, yeah, that's why it's happening. So if we have time I'm
going to show another video later. But because I want to make sure that I'm going to
show also the deepest experiments, I'm going to keep this for the end if we have time.
So, you know, if I have time I will come back and also show another video. But first I just
want to make sure that I will present these final two slides, because they are actual deep
hierarchies.
So these things are actually in the paper. So here we use the MNIST data and we
consider a deep hierarchy. So now it's no longer just two layers; we trained models with
two, three, four or five layers in total. And here I'm presenting the results for the deepest
hierarchy we tried: five layers.
We took a very small subset of the MNIST data: we only took fifty samples of zeros,
fifty samples of sixes and fifty samples of ones. So [inaudible] classes, and in total 150
points only. We trained the model and we got these optimized weights. So you
remember that I said it's vital to have this Bayesian framework because you can learn
automatically the structure of the model. So that's what you see here. So you initialize
these guys, let's say in 20 dimensions or something. And then the model says, "Okay --"
You only give the outputs, you only give the images of the digits and the depth you like,
and the model automatically says, "Okay, I'm going to use 12 units in the first layer, 5
units in the second layer and, you know, 4 units in the top layer," and so on.
So the model automatically switches off units, basically. And that's what makes it robust.
And when I sample from these spaces to see what they encode, I get samples like this.
So I'm going to show that you learn features that go from generic features to local
features as you move down through the hierarchy, as one would expect. So if you
sample from the very top layer -- the very top layer is four-dimensional. If you sample
from the dominant dimension to see what's going on, you get samples that basically can
be any of the digits. So you have a zero, you have stuff in between, you have a six and then
you have something that looks between a six and a one, and you have ones. So the top layer
indeed encodes the most abstract information.
It encodes information like, you know, "I have three digits and this is what they look like."
And because it's a generative model, you can also sample stuff that is in between,
because it's a continuous variable.
>>: So this is after all the learning is done in [inaudible] parameter?
>> AD: Exactly. You do all the learning and then you try to sample from the model and
see what the model learned. So if you sample from this, from the top layer, you get
samples like this which are very generic. So the top layer says, "What I'm
encoding is just whether it's a one, a six, a zero," or stuff in between. And if you sample from this
layer it's, again, abstract but less abstract.
>>: Let me [inaudible]. What do you mean by sample from one layer? What happens
with the layer? I mean the whole generative process...
>> AD: Yeah. So you sample from here and then you propagate.
>>: And -- I see.
>> AD: So...
>>: So you're not marginalizing or anything?
>> AD: You do marginalize. So it's you take the posterior distribution of this even. You
know, you take the posterior distribution by having...
>>: And then just propagate that posterior on [inaudible].
>> AD: Yes. Yes. So here let's say that X is four dimensional. What you are doing is that
you initialize to a single X and then you change the values of one of the dimensions
continuously. And you get this output by...
>>: So those are a different sampling formula? You get a different number.
>> AD: So these are different samples. I actually have a demo here but I don't know if I
have time.
>>: [inaudible]....
>> Ashish Kapoor: Yeah, you should probably wrap up.
>>: Okay.
>> AD: Yeah, so...
>> Ashish Kapoor: I mean we can show the demo at the end [inaudible].
>>: Okay.
>> AD: So it's the same thing that I showed before. And I'm just going to show how this
was generated.
So here I'm just going to show how I generate samples from the bottom layer. So it's
exactly the same thing, right? I'm just sampling from two of the dimensions of the
bottom layer, and I obtain very local features. So you see, they encode whether the zero is
closed or not, whether it's a closed circle. This is one dimension and this is something different. And
if I sample from the second layer, again you have more local features. You see? More
local features. And if I sample from the top layer, I have very generic features. You get
ones, zeros, sixes; you get everything.
>>: This is a very different way of doing this than the DBN [inaudible], because with a DBN they
put a label as part of [inaudible]. And then they use this, what's called a sampling, you
know, in...
>>: [inaudible]
>>: [inaudible]
>> AD: So...
>>: But here you can do the same thing, right? You put that certain...
>> AD: But you still initialize -- Yeah, I mean you're right. You can just sample, but I
think you can do the same with the DBNs.
>>: [inaudible] specific label on. So if I say let's generate nines, for example, you're
going to show me what it looks like.
>> AD: To generate, let's say, nines. So...
>>: But you don't have the...
>> AD: ...you can do that if you figure out which dimensions are responsible for
generating nines. But if you want to...
>>: Can you do that if you have input and then you upend the [inaudible]...
>>: [inaudible]
>> AD: Yeah.
>>: [inaudible]
>>: Right.
>>: You can do it.
>>: And then you learn the [inaudible]. And then say that "All right, well this [inaudible]."
>>: Yeah...
>> AD: So...
>>: [inaudible] frees that unit as part of the...
>>: As part of [inaudible]...
>> AD: So actually when I show these models...
>>: You basically look at the conditional...
[simultaneous inaudible voices]
>>: [inaudible] to see the comparison.
>> AD: Actually...
[simultaneous inaudible voices]
>>: I actually tried that for [inaudible]. It actually quite well.
>> AD: Yeah. So actually with this model that I showed here you can say, "I have my
data and here I have my labels." We have tried that and it works pretty well with motion
capture. So here you have your actual data and here you have the labels. And then
when you give a new label, you can sample, go through the latent space and get
outputs here. So that's exactly what you asked, right? And you can do exactly the same
in the deep model.
So you can just give a label here, go through the deep hierarchy and then
obtain samples here.
>>: [inaudible] show the demo? That'd be interesting to see.
>> AD: No. I didn't think about that actually.
>>: [inaudible] you can see [inaudible]...
>> AD: Yeah. So...
>>: ...[inaudible] can learn [inaudible].
>> AD: So you are right. That would be quite interesting. I mean, you can still do it if you
figure out which single dimensions are responsible for generating sixes, like in the other demo
where you had a specific face. But, yeah, that's actually a good idea and straightforward to do.
>>: [inaudible]
>> AD: So that's the last slide. I don't know if I have time for this experiment.
>> Ashish Kapoor: Don't you have a demo you can show quickly?
>> AD: So I think it's better if I just describe this. It's quicker. So what I did here is,
again, with the two modalities. So you have two different data sets. The first data set is
motion capture of a guy walking. The second data set is of another guy walking. But
these two subjects are interacting. So they are approaching each other, they do a high-five and then they go apart. So you know that the dynamics have some commonality,
right, because they are doing very similar motions. But they also have some stuff that is
not common because, you know, there's a different walking style, and also a different direction
obviously for each one.
So we model this with a stacked GP model and we also use this Manifold Relevance
Determination trick to take into account the different modalities. So here we have all the
frames of the first guy walking. Here are all the frames of the second guy walking. And
that's all we give to the model. We train the model and then the model automatically
discovers this latent space that basically says, "I have a shared part and I have two
private parts. And I have a shared part on top of the hierarchy in this deep model." And
the shared part looks like this, which is pretty intuitive because it's like the motion of -- it's
like the motion of these people who go like this, do a high-five and then continue
walking.
And if you sample from here and get outputs in the first output space, you get this guy. If
you sample from here and you get outputs in the second output space, you get this guy.
So it's like saying this is indeed shared information, because it's at this point where
they're about to high-five.
If you sample from this point, it's immediately after or before high-fiving. If you sample
from the private spaces, you get very private information. If you sampled from here,
you would see this guy, I don't know, moving the hand or moving the ankle, because
that's specific to the person and to the walking style of each person.
And if you sample from here, you get samples of the different styles of the overall
motion. So I don't have a very good demo for that, but I think you get the intuition. And the cool
thing is that you learn all this automatically. So when Lawrence [inaudible] in 2007 tried
to do this with a stacked hierarchical GP-LVM, they had to specify that they want
the shared space, and that they want this space to be two-dimensional and this space to be
private, and one space here.
And they had to constrain the noise here. So they had to hard-code all the structure. In
the Bayesian model you just give the outputs and everything is learned automatically.
And just to wrap up: while you are seeing this, I'm going to talk about future work. So one
of the downsides of this model is that, you know, since it's based on GPs it cannot
handle a lot of data, obviously. But there is a nice idea for extending this with stochastic
variational inference. And there are two very, very recent papers that we think are
applicable to our case. And we think it would be really cool to be able to do deep
networks with GPs and also with millions of data points. And I think this is the next step.
And, yeah, that's pretty much it. So thanks.
>>: Thank you.
[applause]
>>: [inaudible] how big are the weight vectors matrix that you have in that one -- Yeah,
this one. The example. How big is it? A five? Is it...
>> AD: So here, because I had 150 points and because I have [inaudible] with these
models, I know that it's not going to -- I mean, you can do some ad hoc experiments and
see that most of them are switched off. So, you know, I started with 15 here, and
because I know that you learn more abstract information -- I mean, you could just put 20,
20, 20, and so on. They're going to be switched off anyway. Just don't...
>>: If you get very much more complex in your image, do you [inaudible]...?
>> AD: I don't think so, because for the video with the woman talking, which is very
complex -- it's like all the pixels are -- it switches off all dimensions but 12. And this is
because, as I said, maybe for this image of the woman talking, which is one million
dimensions, maybe the real dimensionality is -- I don't know -- one thousand. But the
point is, the data that you have -- can it support one thousand dimensions? No,
because you have only a hundred data points. So pretty much, for that amount of data -- I mean,
I have never seen any model using more than 20, to be honest.
So if you want to know, in practice I just use 20. And that's because the GPs are limited
to, I don't know, a few thousand data points at most. And usually the data cannot support
a lot of dimensions.
>>: So that was relevant to [inaudible] because here you talk about regression models
all the time. So do you ever explore discrimination? [inaudible] be useful in discrimination
[inaudible].
>> AD: So you mean [inaudible], for example? So here it was totally unsupervised. And
one test we did -- and thanks for reminding me to say that -- is that we took the latent
spaces here and we did nearest neighbor. So for each latent point here, we found its
nearest neighbor and we looked at the errors. So if you take a six and all its nearest
neighbors are sixes, then you have zero errors.
And what we saw was that the deepest model had the best error. So it had only one
error here. So the best separation comes from the model that is the deepest. I actually
have this in the demo but, you know, I don't have a lot of time.
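
[Illustration: the nearest-neighbour check just described is easy to reproduce; a small sketch, assuming scikit-learn is available, of leave-one-out 1-NN error in a learned latent space. Variable names are made up.]

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def one_nn_error(latent_X, labels):
    """Leave-one-out 1-NN error: for each latent point, check whether its
    nearest other point carries the same class label."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=2).fit(latent_X)  # neighbour 0 is the point itself
    _, idx = nn.kneighbors(latent_X)
    predicted = labels[idx[:, 1]]
    return np.mean(predicted != labels)
```
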
>>: So the extent of using a generative model for discrimination is that you just compute
the likelihood for each class? So you fix class [inaudible] here...
>> AD: Okay, you can...
>>: ...[inaudible]
>> AD: Class conditional -- So this is possible for the model, and we have actually done
that with other demos. If you see the Bayesian GP-LVM, for example, you will see this
class conditional classification. But for the deep model it's possible; it's the same. But we
haven't implemented that yet.
>>: I see.
>> AD: So here is the separation of the digits with the deepest hierarchy. See, it's pretty
good. So you have all the sixes here, all the zeros. Here you have a six that looks a bit
like a one, so it's here. So you have good separation, though you can't quantify it like this. And
[inaudible] where you actually give the inputs and you get outputs -- I didn't show that here,
but we have a paper in submission and it's going to be there.
>>: What do you think -- If you run it for MNIST, what kind of error rate do you think you
might get?
>> AD: You cannot run it for the whole MNIST.
>>: No? Oh.
>> AD: Because you use GPs, the scale -- You cannot -- I think you cannot even use
GPs for the MNIST, right? MNIST is huge. I don’t know. How big is it?
>>: Sixty thousand.
>> AD: Has anyone used GPs, just GPs, not even deep GPs, for that?
>>: I think you can do GPs on [inaudible]. Sixty thousand only, right?
>> AD: Sixty.
>>: Sixty thousand, yeah.
>> AD: I mean GPs are scaled by N cubed, right, so it's going to be sixty thousand...
>>: [inaudible] not that hard to work.
>> AD: To the power of three...
>>: Well this simulation is kernel [inaudible].
>> AD: Yeah.
>>: Kernel regression [inaudible].
>> AD: Yeah.
>>: So [inaudible]...
>> AD: But you have to invert sixty thousand by sixty thousand.
>>: Yeah, that's easy.
>> AD: Is it?
>>: I think it was [inaudible].
[simultaneous inaudible voices]
>>: No my computer is [inaudible] -- I mean I could easily do sixty thousand [inaudible].
[simultaneous inaudible voices]
>> AD: Yeah but you have to do a lot of inverses because it's non-convex, so you have,
I don't know, a thousand inverses.
>>: Yeah, that's fine. Let it run for the night or even for two, three days.
>> AD: Yeah. Anyway, I haven't tried. For the deep GPs it would be even slower.
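
[For a rough sense of the numbers being debated, with N = 60,000 and double precision:]

```latex
\text{memory for } K \approx N^2 \times 8\ \text{bytes} \approx 28.8\ \text{GB},
\qquad
\text{one Cholesky factorization} \approx \tfrac{1}{3} N^3 \approx 7.2 \times 10^{13}\ \text{flops}.
```

[A non-convex optimization repeats that factorization many times, which is why sparse or stochastic variational approximations come up as the way to scale.]
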
>>: So [inaudible] is that if you compare it with a DBN, as I mentioned early on DBN is
very good for generation but for discrimination DBN is very bad.
>> AD: Yeah, so...
>>: This is the kind of thing that [inaudible].
>> AD: So this is very recent work...
>>: [inaudible] that DBN is useful [inaudible]. It's just, you know, take whatever you have
and put it in a neural network and [inaudible]. And so I wonder whether this one has ever
been used...
>>: Do you still do -- So do you fix the DBN? Or do you do...
>>: It's the same thing...
>>: ...[inaudible] propagation on the deep structure?
>>: No, no, no. It's just [inaudible] stacking.
>>: So you do the stacking and that'll fix it?
>>: Yes. And [inaudible]...
>>: And then...
>>: ...initialize it for...
>>: ...[inaudible] something on top of it?
>>: Yeah. Yeah.
[simultaneous inaudible voices]
>>: But you don’t do propagation on the...
>>: No, no, no. You don't need to do that, yeah.
>>: Yeah, it's interesting. Like they actually use DBN as a feature extraction...
>>: [inaudible], yeah.
>>: ...[inaudible].
>>: But for some places that I've read that they said that they still do at the end some
propagation on the entire...
>>: Yeah, yeah, yeah. People do that but that's not really important.
>>: It's not important?
>>: [inaudible]. Pretty much [inaudible] that's enough. But the [inaudible]...
>>: And the end stack some discriminative model at the top?
>>: No, not even the top. You just [inaudible] whatever weight [inaudible] just like you
[inaudible] here. [inaudible] neural net machine and use that to initialize the other
machine [inaudible]. [inaudible] everybody's doing that.
>>: Yeah. [inaudible]...
>>: [inaudible] you claim that this is better maybe to explore that weighted information to
see whether other machines...
>> AD: I don't want to claim that this is better, actually, because I haven't done the
experiments. It's just that we were very excited about extending this to large data sets.
So we are working towards that first, so that then we can properly compare it to DBNs
and don't have to [inaudible] for four or five days or whatever. But, you know, in more
extensive -- I mean, this is very, very recent actually. Right? In more extensive work, we
would like to not only compare to DBNs but, you know, maybe they are complementary.
You know, GPs are sometimes used for [inaudible] or vice versa. I don't know, maybe
they're complementary with DBNs. And that would be interesting to see. Yeah.
>>: Thank you very much.
[simultaneous inaudible voices]
>> AD: Okay. Thanks.