>> Dengyong Zhou: Okay. Today we're hosting Emti Khan from the University of British Columbia, where he's working with Ben Marlin and Kevin Murphy on really exciting stuff dealing with graphical models, making them more scalable, efficient, and accurate, all at the same time. Before that he was at IIS Bangalore where a few of us have friends and so on. So, Emti. >> Emtiyaz Khan: Thank you. Let me start by saying that I'm really excited to be here and thanks for having me. It's great to share my work. I'll be talking about some work that's part of my Ph.D. work that I did in the last few years. Okay. So I'll be talking about piecewise bounds for estimating discrete-data latent Gaussian models. And my full name is Mohammad Emtiyaz Khan, but people usually call me Emti because it's simpler and easier to remember. This work is joint work with Benjamin Marlin and Kevin Murphy from the University of British Columbia. Okay. So let me start by telling you about some sources of discrete data just to motivate you for this talk. So let's start with the data that everybody has been excited about, the Netflix dataset. So it's a user-rating dataset where a bunch of users rate a bunch of movies that they've seen, and given a new movie you'd like to predict how much a user is interested in that movie. Discrete data also arises in social science where you have survey datasets where you fill in questionnaires. So you have questions and you answer them. And then you have voting datasets where you vote for things, and then you have blogs, and you would like to take the data and process it for something, for example, maybe sentiment analysis just to know how people are feeling about democracy. You can have speeches that you could do sentiment analysis on as well. Then in econometrics you have this consumer choice data where a bunch of consumers choose products, and every product has these attributes, and you want to know how users are -- what is the utility of the user, which attribute do they like more, so that you can design new products. So that's another example where discrete data arises. Then the usual image and video datasets where you want to do object detection, you have many objects like cats and dogs that you want to find in the image, you want to do image classification, you have tags that are attached to videos and images, and then you want to find correlations between the tags so that you can put them in the search. So that's another example. Then health datasets where you observe lots of discrete data about patients, and you want to use that discrete information to be able to build good health decision systems. Finally, Xbox. So you can have some game and sports data, like the system that you guys have, TrueSkill, and from the data that people have -- they have matches against each other and then they win and lose and you can -- based on that record you want to be able to match players of equal skill together so they can enjoy the game. Okay. So discrete data is available in lots of real-world applications. That's what I wanted to say in this slide. And given this discrete data, what you would like to do is learn the correlations in the data, and then you want to use these correlations to be able to make useful predictions from the data. And you can use a lot of models: mixture models, you could do some graphical model structure learning, model the correlations between the observed variables directly, or you could use latent variable models.
So in this talk I'll be talking about this important class of models called latent Gaussian models. I'll show you in the next couple of slides how many popular models are actually subsumed under this latent Gaussian model. I'll be using a likelihood based on the logit function. This will become clear in subsequent slides. And I'll be focusing on binary, categorical, and ordinal types of discrete data. This is the most basic form of the data. And then for other datasets, you could take the methodology that I talk about and extend it easily. Okay. So that defines the scope of this talk. Let me now motivate you why you should care about latent Gaussian models. So here is a special class of classification models that fall under latent Gaussian models, for example, Bayesian logistic regression and Gaussian process classification. You might have heard of these models. Usually you have -- you have features, you have continuous features, and then you observe discrete information about these features. So in this example you have two classes, class 1 and class 2. And given this training data you would like to learn a predictive distribution so that if you observe a new point in the feature space you can classify it as the discrete 1 or the 0 class. So that's Bayesian logistic regression. >>: Well, okay, I guess [inaudible] Gaussian process classification, but I wondered if you could comment: Gaussian process regression, at least for classification, seems to work just as well as Gaussian process classification, so that kind of [inaudible] it's maybe not such a great example of why you want to model discrete data. In other words, if you're just discrete -- if you're trying to predict labels, you can embed the labels in continuous space and just treat them as Gaussians. >> Emtiyaz Khan: Yeah, but if you have multiclass, like if you have categorical data -- well, you could do a lot of hacks. You're right. You could convert the data into, you know, 1 of K encoding. And you could just learn binary -- >>: And it works just as well? >> Emtiyaz Khan: It does. But a lot of the time -- and, for example, in social science, you would actually like to make predictions and you would like to say, okay, I want to choose this one category out of everything. And when you use the binary encoding and you make a prediction, you actually are choosing everything, right? So let's say that you have five categories and you use 1 of M encoding for that and you just say that this data is continuous, I'm just going to code it as 1, minus 1. >>: Or 1, 0, 0, 0, whatever. >> Emtiyaz Khan: Yeah. And then you make a prediction, and your prediction would be for every bit you're making a prediction. Right? >>: That's correct. >> Emtiyaz Khan: Yeah. At the end you want to make a decision like, okay, which class should I -- >>: [inaudible]. >> Emtiyaz Khan: Sure. You could do -- >>: [inaudible]. >> Emtiyaz Khan: Yeah. So you're right. You could do a lot of hacks to be able to model this data in different ways. But you would like to have a principled approach. So you're right in being skeptical there. I agree with you. >>: [inaudible] this is opposite to the Gaussian version of the [inaudible] the observation is Gaussian conditioned on the discrete [inaudible]. >> Emtiyaz Khan: Yeah. So that's one version of RBMs. You're right. >>: But that doesn't fit into this. >> Emtiyaz Khan: That doesn't fit into this because that -- so it's basically exponential family [inaudible], right?
And the link function from the latent to the observed variables is undirected. >>: I see. So this is why -- >> Emtiyaz Khan: These factor models -- the model that I'm talking about, the next class of models that I'll show, the factor analysis model -- basically the difference is that the likelihood link is directed instead of undirected. So handling missing data is much easier with these models than with RBMs, which is one of the motivations for what we have been doing. So, yeah, you're right, you could do hacks, but having a good principled approach is good, I think, and I believe in it. And also, like, if you have text data and other kinds of stuff, right, like if you're doing topic modeling, for example, you could extend these models very easily to handle that kind of dataset. >>: Don't get me started on topic models. >> Emtiyaz Khan: Okay. [laughter]. >> Emtiyaz Khan: Well, I'm very sure I'll -- well, when I finish this talk I think we'll have a lot to talk about. So I've just started right now. Okay. So this is Bayesian logistic regression, and there's a lot of literature; despite all the hacks that are possible there, people work on this problem, and there are lots of papers, lots of -- tons and tons of papers on this problem, these models. And all the work that I'll explain today actually applies to all of that. Okay. So the second class of models that we're interested in is latent factor models. And these are also a special case of LGMs, for example, probabilistic PCA and factor analysis models. So in these models you have a dataset, a matrix of discrete data, and then you want to project it to a latent factor space which is continuous, so you want to have a low-dimensional representation of this big matrix there. And this area has been actually less -- there have been fewer papers in this area because it's a more difficult problem and people don't know how to do it properly. So there have been recent papers like this one, and then one of our papers last year at NIPS where we talked about factor analysis. So this is a harder problem and this is actually the main focus of our work. >>: [inaudible] because? >> Emtiyaz Khan: There are several reasons why you would -- so the first thing is, if you have missing data in your matrix and you want to predict those missing values, you go -- from the observed variables, you go back to the latent space, and then using the latent space you try to fill in the values. It's basically [inaudible]. You could also use it for visualization, like social scientists would like to do. So they have all this voting data and they want to know how Democrats and Republicans vote, right, so they want to project it into a smaller dimensional space. >>: [inaudible]. >> Emtiyaz Khan: I'm sorry? >>: What would be the evaluation metric for this [inaudible]? >> Emtiyaz Khan: Yeah. So missing data imputation is something that we have used in our experiments. So there are actually lots of existing models in many areas. Like in social science they have special models which are factor models. Actually, I think -- so I was trying to read this TrueSkill paper recently, and I think TrueSkill also -- falls in this category, where you have this mapping from the latent to the observed variables. You've specified the mapping already. You're trying to learn the latent space. >>: The latent space is [inaudible]. >> Emtiyaz Khan: Yeah. >>: [inaudible]. >> Emtiyaz Khan: So, yeah, in many cases the model structure is given to you, and then you just -- you have some way of interpreting these latent factors, right? Okay.
All these models are special cases of the latent Gaussian model, and that's why LGMs are important. But the problem with these models is that parameter learning is intractable. So the basic idea is that the likelihood that we use to model the discrete data -- in this talk we focus on the logit function -- is not conjugate to the Gaussian prior that you assume. So when you want to compute the marginal likelihood to learn the parameters of the model, you have to solve this sort of integral. So you have some latent variables that you want to integrate out, you have the likelihood and you have the prior. And these are not conjugate, so you cannot compute this in closed form. So it's just intractable. It will become clearer in a subsequent slide what the actual form is. >>: [inaudible] they're both exponential family, so do you multiply them together? >> Emtiyaz Khan: Yeah. But it's not a distribution. >>: Oh, I see. Okay. >> Emtiyaz Khan: If you have a distribution and you know the normalizing constant of that, then you can optimize that. It will become clearer in the slides, but this is just the general idea; that you have such an integral and we are trying to solve this integral. And our approach, which I'll talk about in this talk, is to bound this term by some piecewise bound, and then we get a lower bound which is a tractable lower bound. And we try to optimize that lower bound with respect to the parameters. It's standard EM stuff. Okay. So here's the outline of the talk. So I'll start with binary data LGMs. This is the work that we recently got published at ICML. So I'll talk about the difficulty in the parameter learning, and I'll show you how the standard variational method based on Jensen's inequality doesn't solve the problem. And then I'll show you that if you use bad bounds it could be really, really bad. So our solution is piecewise bounds, and then I'll present some results to convince you that it's a good idea. Then I'll go to categorical data LGMs. And this is work in progress. And this is the first time that I'm talking about this to anybody, actually. So I'll present the standard multinomial logit model that's used, and I'll show that some of the existing bounds can be bad. And then we have this new model where you can use the piecewise bounds that we came up with in the ICML paper. And then I'll show some preliminary results to convince you. And I'm not sure if I'll have time to actually get to ordinal data, but it's a straightforward application of piecewise bounds to model the ordinal data. So those are the three things, and then I'll conclude. Okay. So let's start with binary data LGMs and try to talk about why -- what is the problem. Okay. So first I'll explain the model to you in detail. So this is the latent Gaussian model. So you start by sampling latent Gaussians. Okay. So I denote that by zed here. Theta are the parameters, which are basically the mean and the covariance of the Gaussian. And you sample a Gaussian here. So in this graphical model I show that Gaussian. It's a length-L vector. And the zed is correlated because this sigma is correlated. Okay. So you sample one zed. And now you want to get observations using this latent model. So let's say that you have capital D observations. So each of these is a discrete observation. It could be like categorical data or binary data. So to model each of these observations, what we do is we take a linear combination of this latent vector. Okay. So you have zed and you take a linear transformation and you obtain this parameter eta.
So this eta is for this observation. Okay. It's a -- >>: [inaudible]. >> Emtiyaz Khan: Yes. There's nothing random. So that's why it's not there in the graphical model. >>: So why is your observation [inaudible]? >> Emtiyaz Khan: Yes. >>: [inaudible] operation don't guarantee that I don't [inaudible]. >>: Wait for it? [laughter]. >>: And then -- [laughter]. >> Emtiyaz Khan: You guys are really eager. That's great. Okay. So you could then use this eta, which is the linear predictor, to model different kinds of datasets. So examples of likelihoods are: you can have binary data and you could use this logistic function. Basically it's a logistic of eta, so it's just a sigmoid. For categorical data, you can have a softmax link, or what people call the multinomial logit, to model this categorical data. Okay. So those are the examples. So the basic idea is that you take this zed, you take the linear combination, you get the linear predictor, pass it through a link function and then generate the discrete observation. >>: [inaudible]. >> Emtiyaz Khan: Okay. Thank you. So -- so you're doing this. You do this multiple times, depending on what kind of dataset you have. So you do that. And, yeah, so your parameter set is these three things that you don't know. So you have mu, you have sigma, and you have W. So these are the parameters that you want to estimate, right? So if you want to compute the marginal likelihood, you want to get rid of this latent variable. So you could just do maximization if you want, but it doesn't work very well. So what we would like to do is to integrate this latent variable out. >>: [inaudible] W is just a parameter [inaudible] so what do you [inaudible] variable here in a comfortable [inaudible] fixed parameter. It's not a random variable. >>: Well, they're Bayesian. They're treating it like a random variable. >> Emtiyaz Khan: I'm actually not treating it as a random variable, but, yeah, I put it there just to show the dependency. So, you're right -- >>: [inaudible]. >> Emtiyaz Khan: Okay. So just to present the general model, I put it as a variable there. But, yeah, we are just doing point estimation of the parameters. Okay. So this is a half-Bayesian thing, right? Like I'm just trying to take care of the uncertainty in the latent variable and be [inaudible] about the parameters. So okay. So just to make connections to logistic regression: here zed is usually the weights and W is the features that we observe, so in regression, the problem is just to infer the posterior distribution over the weights. So it's a simpler problem. In factor analysis, you don't observe W, you don't observe zed. So you want to represent Y as a multiplication of W and zed. Right? So there are two matrices; you multiply them together, like matrix factorization, and then you get the output from that. Okay. So that's the connection. And I'll make it more clear later. >>: So the correlation between all the zeds and then the sigma, you know, being [inaudible] is it really [inaudible] in many applications? If you don't have that [inaudible]. >> Emtiyaz Khan: Yes. So special cases either have W or have sigma. So in factor analysis you assume that your Gaussian is spherical. And then you cut it through different directions using W. Like you transform it in a different way. >>: Well, I'm saying if you just remove the correlation among all the Z there, is the problem tractable? >> Emtiyaz Khan: No. So that doesn't matter.
So sigma and W there, they're both -- the way it works -- they're unidentifiable, actually, in the full model. So you either use W or you use sigma. But the problem -- >>: [inaudible] sigma to be diagonal. That's what he's saying. >> Emtiyaz Khan: Yeah. So if you use W, then you can restrict sigma to be diagonal. >>: I see. Okay. Can you make W to be a little matrix also? >> Emtiyaz Khan: Then you can make sigma a full matrix. So in a Gaussian process, I'll show this to you later, in a Gaussian process your W is identity and your sigma is a full prior covariance matrix that you get from using the features. It's a kernel matrix. In factor analysis, sigma is an identity matrix and W is the factor-loading matrix. So everything falls out as a special case, so this is why I've stated this as a general model where we would like to learn these parameters. But we don't want all of them in a specific application; we just want one or two. Okay. Okay. So I'm going unusually slow. So I think I'll have to cut down maybe later on. But I like this. This is going pretty good. Okay. So let's start with the binary version where you have a Bernoulli logistic link. So what you do basically, you get this eta. So I dropped the notation D and N from here just for simplicity. So you get that eta. W is just a row; it's a vector. And you take the linear combination, get the eta, and then you pass it through the sigmoid like this. You're modeling the probability of a one here. So if you look at the curve, the sigmoid is like that. And if eta is 2, then the probability you get is like .9 something, approximately, something like that. Okay. So I take the log of this and I get eta minus this thing. Now, I'll show you that this thing actually causes a problem in estimation. So let's -- let's go to the standard way -- okay. So let's talk about parameter estimation. So you have this likelihood and you want to estimate the parameters; you can write the likelihood given all the data points as this. The summation just comes across all the data points because the data points are all independent given the latent variables. Then you have this log and you have a big integral over zed that you want to solve, so you have the likelihood over all dimensions, and then you have a big Gaussian prior. So you want to do this integral in closed form to be able to compute this marginal likelihood. Okay. So this actually looks like that. So I'm just showing you the likelihood for one data vector which is of dimension D. So you have all these D sigmoids basically, and then you have a Gaussian, a big Gaussian, and you want to do this integral in closed form, but you cannot. So the standard variational -- well, you can do sampling and all that, but I don't want to go into that. And we can talk about it later why it's not a good idea to do that. Just to save time, I'll quickly jump to the variational method that we are using. Okay. So the standard variational trick is to get a lower bound using Jensen's inequality. Okay. And I'm very sure all of us have probably used this at some point in our lives. So the trick is to -- I assume a posterior distribution, which is Gaussian. The actual distribution is not going to be Gaussian because it's a logistic multiplied by a Gaussian. So I approximate the posterior distribution to be Gaussian and I divide and multiply by it, the standard trick. And then I try to push the log inside and use Jensen's inequality. So when I actually [inaudible], all this -- the product becomes a sum. So the problem is simpler now. It's not a multidimensional integral.
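Written out -- this is a reconstruction from the description above, with q(z) = N(z | m, V) denoting the assumed Gaussian posterior and eta_d the linear predictor -- the bound for one data vector is

\log p(\mathbf{y} \mid \theta) \;=\; \log \int \prod_{d=1}^{D} p(y_d \mid \eta_d)\, \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, d\mathbf{z} \;\ge\; \sum_{d=1}^{D} \mathbb{E}_{q}\!\big[\log p(y_d \mid \eta_d)\big] \;-\; \mathrm{KL}\!\big[\,\mathcal{N}(\mathbf{m}, \mathbf{V}) \,\|\, \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\,\big], \qquad \eta_d = \mathbf{w}_d^{\top}\mathbf{z},

and each remaining expectation only involves the scalar \eta_d \sim \mathcal{N}(\mathbf{w}_d^{\top}\mathbf{m},\; \mathbf{w}_d^{\top}\mathbf{V}\mathbf{w}_d).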
It's a simpler integration. Okay. So I do that. I did -- took the log inside, this product becomes a sum, I get the [inaudible] log of this. And all the other terms get pushed into this KL divergence term, so basically the KL divergence of the posterior versus the prior. Okay. So we all know, probably, that the KL divergence is available in closed form if you have two normal distributions. So this is tractable. Okay. So we have this term that we don't know how to integrate. This is the log of the likelihood, which is like a sigmoid thing, and then you want to take expectations with respect to a Gaussian. So you don't know how to do that yet. You could use numerical methods to actually approximate it, but as D increases and N increases you'll have to do a lot of function evaluations, right? There is another problem: because you have this lower bound and you want to get the best lower bound out of whatever lower bounds you have, you want to find the best M and V. So it's not actually a simple function evaluation. Normally you have to do maximization over M and V. So you want to get the best lower bound. So you're not able -- you not only want function evaluations, you also want gradients of this integral. Okay. So that makes it even more expensive if you use just numerical methods. So our aim is to avoid numerical methods and use some bounding method for this term, so that everything scales well. All right. So the standard -- okay. Let me explain a little bit more what happens: that log likelihood, when I plug that value in here, this is log of 1 plus E to the power eta. As you remember from a previous slide, I can then do a change of variable from zed to eta, and I get this expression. I've just done a simple change of variable. Okay. So you want to evaluate this. It's now a 1D integral. It's very simple for the binary case. But you don't want to do numerical methods because you want a functional form and you want the gradient. So you want to be able to evaluate this whole thing in closed form, if you could. Okay. And then you have some other tractable terms that you don't have to worry about. >>: [inaudible] push the gradient, because the gradient depends on [inaudible]. >> Emtiyaz Khan: Yeah, but then even that will be -- yeah. >>: [inaudible] complexity as -- >> Emtiyaz Khan: Yeah. So you could do a lot of numerical tricks, but as D increases, you have to do more and more of those, right? Because these M and V depend on D and N. So let's say you have a factor analysis problem where you have, you know, a thousand respondents and they vote on like 2,000 issues; then you have around 20,000 things to work on. Okay. >>: So eta is over what kind of range, roughly? >> Emtiyaz Khan: Sorry? >>: Eta [inaudible]. >> Emtiyaz Khan: Yes. It's a real [inaudible]. >>: [inaudible] really makes a difference [inaudible]. >>: [inaudible]. >> Emtiyaz Khan: Well, no, so the important thing is that eta depends on zed through W, right? >>: That's true. Oh, so we could scale arbitrarily. >> Emtiyaz Khan: Yeah. So -- >>: I see. >> Emtiyaz Khan: -- that's why, actually, if you try to implement it numerically, your quadrature inside has to be able to scale well and you have to -- so you will see actually that our method is really, really simple. It does exactly that. It just finds an approximation and it scales it arbitrarily, and we can control the error there. It's pretty simple. >>: So you're taking the piecewise and [inaudible] it to -- >> Emtiyaz Khan: [inaudible].
>>: Because if like -- if you hinge in logistic loss. >> Emtiyaz Khan: Okay. >>: We do that as well [inaudible] -- >> Emtiyaz Khan: Okay. >>: -- [inaudible]. >> Emtiyaz Khan: Cool. All right. So yeah. So our idea is basically exactly that. So I'll go to the problem. And you'll see why it's nice for this sort of variational framework. So okay. So if you actually look at it, pictorially it looks like this: the logistic-log-partition function, as people usually call it, that's this function. It's nice and convex. And then you just want to integrate it against a Gaussian. You can use numerical methods, but it's expensive, so we don't want that. Okay. So the usual way is to find an upper bound to this, and then -- because you have a negative here, you get a lower bound to the whole thing. Okay. So the first bound, this is a very simple bound actually. You probably studied it in, like, machine learning 101, if such a course exists. >>: 201. >> Emtiyaz Khan: Yeah. So it's like fixed curvature, and it's just -- >>: [inaudible] break that into three pieces. >> Emtiyaz Khan: Yeah. >>: So what's equal above that you can [inaudible] in the middle of the thing [inaudible]. >> Emtiyaz Khan: Yeah. But how do you find that piecewise fixed thing? That's -- >>: Well, that's why I was asking the range of eta. If you know roughly what it is, you know [inaudible]. >> Emtiyaz Khan: Yeah. So what I'll show you, then, is that you don't even have to know that. Like it's much easier than that. Okay. So you have this quadratic here, which is a fixed thing. But it makes unbounded error in the tails, which is bad. Right? So you can vary this contact point, but it has fixed curvature. So you cannot control the error. >>: [inaudible]. >> Emtiyaz Khan: Yeah. So Jaakkola actually in '96 -- Tommi Jaakkola had a better bound, which is actually really, really popular. His paper has been cited like a thousand times. And everybody seems to be using that. So this is -- it's a quadratic, but it's tied at two points, so it's actually nice. But as you move the contact points far out, you get huge error in the middle part, so if you have huge variances. >>: But [inaudible] be large [inaudible] errors. >> Emtiyaz Khan: Yeah. >>: [inaudible]. >> Emtiyaz Khan: Yeah. But if you have a huge variance, like in a Gaussian process, then you're screwed with this. Okay. So the problem is that both of these bounds have unbounded error, and people use them but they don't know -- they always -- everybody talks about them being bad, and they say, oh, maybe the KL divergence is not a good measure. But I'll show you that that's not the case. Okay. So if you do this more carefully and bound it properly, you could actually reduce the error. Okay. So let me first talk about the problem with these bounds. So it's demonstrated on this simple problem; it's a one-dimensional problem. So let's say that you have -- you observe only a scalar and you do it N number of times. Instead of a D-dimensional vector you just have one scalar. So you have this one scalar latent Gaussian here for each of those. And then it is generated from this mean mu and sigma, which is also one dimensional. So it's a simple problem. In the example, I fixed mu to 2 and sigma to 2. And because it's a simple problem, I can just compute the marginal likelihood using some numerical integration. So I have the ground truth. So I know what the marginal likelihood is here.
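As a rough sketch of that numerical ground truth (not the actual code from the talk -- the toy model is y_n ~ Bernoulli(sigmoid(z_n)) with z_n ~ N(mu, sigma^2), and the 1D integral is done here with Gauss-Hermite quadrature; the names and settings are illustrative):

import numpy as np

def toy_log_marginal_likelihood(y, mu, sigma, num_points=50):
    # log p(y | mu, sigma) = sum_n log of the integral of
    # sigmoid(z)^{y_n} (1 - sigmoid(z))^{1 - y_n} N(z | mu, sigma^2) dz,
    # approximated by Gauss-Hermite quadrature: z_i = mu + sqrt(2)*sigma*x_i, weight w_i / sqrt(pi).
    x, w = np.polynomial.hermite.hermgauss(num_points)
    z = mu + np.sqrt(2.0) * sigma * x
    p1 = 1.0 / (1.0 + np.exp(-z))                      # sigmoid at the quadrature nodes
    total = 0.0
    for y_n in y:
        lik = p1 if y_n == 1 else 1.0 - p1             # Bernoulli likelihood at the nodes
        total += np.log(np.dot(w, lik) / np.sqrt(np.pi))
    return total

# Generate data at the ground truth mu = 2, sigma = 2, fix mu, and scan the free parameter sigma;
# the resulting curve should peak near sigma = 2.
rng = np.random.default_rng(0)
z_true = rng.normal(2.0, 2.0, size=100)
y = (rng.random(100) < 1.0 / (1.0 + np.exp(-z_true))).astype(int)
for s in [0.5, 1.0, 2.0, 4.0]:
    print(s, toy_log_marginal_likelihood(y, 2.0, s))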
And the idea is to generate data, fix one of the parameters, and then compare the marginal likelihood and the lower bound with respect to the one parameter that is free, and see how much error you are making. >>: You're looking at the minimum of both for sigma? In other words, you're really sad if the minimum for both is [inaudible]. >> Emtiyaz Khan: Thank you. >>: Right. >> Emtiyaz Khan: Okay. There's more, but I'll skip this. Okay. So here's the plot. So if you have the Bohning bound -- well, the blue line is the true marginal likelihood. And it's peaked at the right place, which is sigma equal to 2. And the red line is the lower bound, which is peaked at zero. So actually when we -- when I wrote the code last year, I thought it was a bug for like 15 days. And we wrote for like -- my colleague wrote his code in Mathematica, and we tried to match everything, and the numbers matched. And this optimization just goes to zero, for some reason. And it just keeps running because it's trying to reduce sigma to zero. Okay. If you use the Jaakkola bound, it has the same thing. Now, the approximation is a little bit better, but it goes to zero. It shows the fundamental problem with these bounding methods; that, you know, if you have unbounded error, then you could do really, really badly. It's not the fault of Jensen's inequality. Okay. So our solution is to just bound it by pieces and have bounded errors. So we just want to have these pieces so that we know how much error we are making between these two functions. And if you do that, then everything works well. Okay. So the peaks match here. Okay. So now I'll show you how to obtain those piecewise bounds. Well, basically, what you want to do: you have this log of 1 plus E to the power X function, and in the red line you have this piecewise approximation to it. And what you want to find, you want to find the cut points, where you cut it, cut the pieces. You want to find the parameters of each piece, which could be linear or quadratic, because we know how to do the closed form -- we know how to take the expectation with respect to a Gaussian of a linear or a quadratic. So you want linear or quadratic pieces. Okay. So we want to do this. And it turns out that you could actually formulate this as a minimax problem. So you can have -- you want to minimize the maximum error incurred in these pieces -- >>: By error you mean error in the interval, not error in the approximation of the [inaudible]. >> Emtiyaz Khan: No. I mean error in the approximation. So this is a fixed piecewise bound. So I take log of 1 plus E to the power X, and I have this approximation, and I take the difference. And I want to minimize the difference between the two. And it turns out that you could actually write it as a convex problem. And the solution has been given in this paper by Boyd. So for linear pieces you could just write a simple algorithm. It scales very well to a hundred pieces. And they have a MATLAB file. You just run it. It takes five seconds. >>: So T1, T2 are [inaudible] or are they optimized as well? >> Emtiyaz Khan: They're also optimized. Everything is optimized. >>: [inaudible] if you have two it's easier [inaudible]. >> Emtiyaz Khan: Yeah. But it's actually not that hard a problem, because -- they show in their paper that it all can be formulated as a convex problem.
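To make that concrete, the fitting problem being described can be written roughly as follows (a reconstruction; R is the number of pieces, the t_r are the cut points, and linear pieces are the special case a_r = 0):

\min_{\{t_r\},\,\{a_r, b_r, c_r\}} \;\; \max_{1 \le r \le R} \;\; \sup_{t_{r-1} < x \le t_r} \Big[\big(a_r x^2 + b_r x + c_r\big) - f(x)\Big] \quad \text{subject to} \quad a_r x^2 + b_r x + c_r \ge f(x) \;\text{ for } t_{r-1} < x \le t_r,\; r = 1, \dots, R,

with f(x) = \log(1 + e^{x}), t_0 = -\infty, and t_R = +\infty.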
It turns out actually that the number of parameters can be reduced drastically because, you know, things are symmetric, so I'll -- so it happens that if you have an odd number of pieces -- so if you have -- let's say zero is somewhere here, then T1 is equal to minus T2. So you could reduce the parameters like that, actually. >>: That's specific for this particular function, or is it general for any [inaudible]. >> Emtiyaz Khan: No, it's specific for this particular function. >>: [inaudible] the general thing is [inaudible] that actually guaranteed that [inaudible]. >> Emtiyaz Khan: Okay. >>: But that's not a strict upper bound. He's looking for a strict upper bound, right? >> Emtiyaz Khan: Yeah. Or a good approximation. But here you get the bound for free. The way you solve this problem is very easy, actually. You have this function and you want to find the tangents where you touch it. So you find a lower bound initially. So you just find the tangent. >>: But here you don't guarantee that the [inaudible] you don't have that property -- >> Emtiyaz Khan: Of the function, no. But of the expectation, yes. So the expectation is just a truncated Gaussian. Right? Because you're taking the expectation and you have this piece which exists within that interval. >>: [inaudible] of that with respect to theta. >> Emtiyaz Khan: Yes. >>: And that could be messed up. >> Emtiyaz Khan: That -- no. >>: No, that's going to be -- that's going to [inaudible]? >> Emtiyaz Khan: Well, you can evaluate the gradient and the expectation. >>: But is the gradient continuous? >> Emtiyaz Khan: Is the gradient -- yes. >>: Oh, because you're integrating [inaudible]? >> Emtiyaz Khan: Yeah, yeah, yeah. Everything is continuous. You can -- >>: Right. Because if this is all under the interval [inaudible]. >> Emtiyaz Khan: It's all [inaudible], yes. >>: So [inaudible] your diagram [inaudible]. >>: Yeah, but this is all underneath the interval, because the interval of this times a Gaussian. >> Emtiyaz Khan: So this is first a function approximation, and then we pass it through the expectation. >>: I see. Okay. >> Emtiyaz Khan: Okay. Thanks a lot for asking all these questions, actually. Okay. So this was linear pieces, done by Boyd. And what we did is we extended it to quadratic pieces. It's a little bit harder problem. It's not convex. But you can still solve the problem up to 20 pieces for quadratic using the Nelder-Mead method. Details are there in our ICML paper. >>: [inaudible]. >> Emtiyaz Khan: Nelder-Mead? >>: Yeah. It's [inaudible]. >> Emtiyaz Khan: Yeah. Yes. Okay. It worked fine for this problem, from our experience. Okay. So the advantage of this -- >>: So one thing is, why do you want to have an absolute upper bound or lower bound? All you need to do is get as little error as possible. >> Emtiyaz Khan: You're right. >>: [inaudible]. >> Emtiyaz Khan: You're right. >>: This is minimal error. >> Emtiyaz Khan: Yeah, yeah, yeah. >>: Are you freer with some other choices [inaudible]? >> Emtiyaz Khan: That's true. It's -- yeah, it's good to have an upper bound, a lower bound, just for -- you know, you get monotonic convergence. You have a bound using Jensen's, and you get another bound, right? If you -- you could do an approximation, it doesn't hurt, actually. So, yeah, it's not necessary that it is a bound. But the important point is that you want something whose error is bounded. If you have bounded error, then you can do parameter estimation in a better way, as my results will show you. Okay. So the important thing is that this is a fixed piecewise bound.
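To show what that looks like end to end, here is a minimal sketch in Python -- not the optimized pieces from the ICML paper, just the simplest hand-made three-piece linear upper bound on log(1 + e^x) with cut points at plus and minus t -- together with its closed-form expectation under a Gaussian via truncated-Gaussian moments:

import math

def f(x):
    # logistic-log-partition function log(1 + e^x), computed in a numerically stable way
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def three_piece_bound(t=2.0):
    # A hand-made upper bound with cut points at -t and t:
    #   x <= -t : the constant f(-t)                           (valid because f is increasing)
    #   -t to t : the chord through (-t, f(-t)) and (t, f(t))  (valid because f is convex)
    #   x >= t  : the line x + f(t) - t                        (valid because f(x) - x is decreasing)
    lo, hi = f(-t), f(t)
    return [(-math.inf, -t, 0.0, lo),                          # each piece is (left, right, slope, intercept)
            (-t, t, (hi - lo) / (2.0 * t), (hi + lo) / 2.0),
            (t, math.inf, 1.0, hi - t)]

def _phi(x):
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_bound(m, v, pieces):
    # E_{N(x|m,v)}[piecewise bound(x)] in closed form, from truncated-Gaussian moments:
    #   E[1{l<x<=u}]   = Phi(b) - Phi(a)
    #   E[x 1{l<x<=u}] = m*(Phi(b) - Phi(a)) + s*(phi(a) - phi(b)),  a = (l-m)/s, b = (u-m)/s
    s = math.sqrt(v)
    total = 0.0
    for left, right, slope, intercept in pieces:
        a, b = (left - m) / s, (right - m) / s
        z = _Phi(b) - _Phi(a)
        ex = m * z + s * (_phi(a) - _phi(b))
        total += slope * ex + intercept * z
    return total

def bound_value(x, pieces):
    for left, right, slope, intercept in pieces:
        if left < x <= right:
            return slope * x + intercept

# Sanity check on one illustrative (m, v): closed form vs. brute-force integration on a grid.
m, v, pieces = 1.0, 4.0, three_piece_bound()
s = math.sqrt(v)
grid = [m + s * (-8.0 + 16.0 * i / 20000) for i in range(20001)]
step = grid[1] - grid[0]
brute = sum(bound_value(x, pieces) * _phi((x - m) / s) / s * step for x in grid)
truth = sum(f(x) * _phi((x - m) / s) / s * step for x in grid)
print(expected_bound(m, v, pieces), brute, truth)   # closed form matches brute force and stays above E[f]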
So I do this once offline and I store these parameters, and I don't have to actually run it inside my algorithm. So I don't have to do this optimization again and again inside my algorithm. So it's a nice feature. >>: [inaudible] the function you're trying to approximate is a function of your sigma, right [inaudible]? >> Emtiyaz Khan: Well, this function, log of 1 plus E to the power X, is not a function of that. >>: Oh, I see. >> Emtiyaz Khan: So you make a fixed thing. And then when your eta is extended, let's say you stretch it, the approximation just stretches. So the maximum error is still bounded. And -- just a minute -- another good thing is that the maximum error -- the program actually returns you how much error you're making. So you know a priori how much error you have in that. >>: [inaudible] piecewise commonly [inaudible] it's a global [inaudible]. >> Emtiyaz Khan: Global, yes. >>: So [inaudible] variational process [inaudible] specific upper bound or lower bound [inaudible]. Have you -- >> Emtiyaz Khan: [inaudible] when was it published? >>: At least -- years ago. >> Emtiyaz Khan: Oh. Okay. Okay. >>: They have several specific parameter [inaudible] upper bound, lower bound. >> Emtiyaz Khan: For the logistic function or -- >>: Yeah, for logistic. >> Emtiyaz Khan: Okay. I'm not aware of that. I'll definitely check that out. >>: It's variational because Jaakkola's bound is [inaudible] the quadratic sits in different places. >> Emtiyaz Khan: Yeah. So you have to optimize that according to your data. You have to find the variational parameters, the touch points, according to the data. Because it's not a fixed bound. But the beautiful thing here is that this is a fixed bound. So okay. So you can increase the accuracy by increasing the number of pieces. As you increase the number of pieces, you get less error. Okay. So here is the plot where I show maximum error with respect to the number of pieces. So you have -- as you increase the number of pieces, the error goes down. That's basically what it is. Red is quadratic, blue is linear. So quadratic actually makes almost 10 times less error than linear. So you can use fewer pieces than with linear. And your algorithm depends linearly on the number of pieces. So quadratic actually reduces the time by about 10 times. Okay. Okay. So with this I can go to the results, and I'll show you why this actually helps us. Okay. So the first model that I'm going to show it on is binary factor analysis. So just to repeat how factor analysis relates to the model that I talked about before: now the zed is totally uncorrelated, it's a zero-mean [inaudible] isotropic Gaussian, and you have the factor loading matrix here that generates the data basically. So you have the data, which is an N by D matrix, basically, and you want to embed it into a latent space, project it to a smaller, lower-dimensional space. So we use the UCI voting dataset for this, where D is 15 and N is 435. And we split it into an 80-20 train-test split. And it's the standard thing: we train the model on the training data, and then we compare the cross-entropy error of missing values on the test data. Okay. So we punch holes in the test data and then try to impute them. Okay. So here's a result where we try -- we compare error versus time. So error is on the Y axis, and lower is better here. Okay. And the X axis is time. Okay. So start with the Bohning bound. This was the first bound that had fixed curvature. And this is the first line. It goes to some value. It's much, much faster because it has fixed curvature.
And if you look at the algorithm, the covariance matrix you don't have to compute at all; you just do it once. This was actually our NIPS paper last year where we showed that you could do categorical factor analysis with this bound. It's very fast, but it's very inaccurate. If you do Jaakkola's bound, you become a lot slower, but you improve a little bit. Okay. So now if you apply our bound, which is piecewise linear with three pieces only, only three pieces, the simplest one that you could do with your hand and just code it, you get a little bit of improvement. The speed is about the same. Okay. And now if you increase the piece -- sorry. If you go to quadratic from linear, you get a drastic improvement. You take about the same time because you still have just three pieces. If you go to ten pieces, you do a little bit better. >>: [inaudible] vote, interesting in this space would be down to the left. I'll bet you $10. >> Emtiyaz Khan: Okay. >>: Come on, don't you want to win $10? >> Emtiyaz Khan: I really want -- I'm a grad student. I'm fine with $10. Really. I'll do anything for $10. [laughter]. >> Emtiyaz Khan: Okay. So this is just across different train and test splits. And I show that if you're below the line, then -- below this line, then we are winning and others are losing. So we're always below the line. And each of these points is just a different train and test split. So we win across the different splits -- okay. So now I want to show some results for GP, and the -- this is not there in our paper, but I wanted to add this because everybody wants to know how this thing compares to EP. You definitely know about expectation propagation. So Tom Minka has this method, EP, that is -- people have done it for Gaussian process classification a lot. So I just wanted to compare. So this is like an extensive -- I have actually not presented this anywhere in this detail, so I'm excited to talk about this. Okay. So let me first show you the model, how it fits in the latent Gaussian model as a special case of that model. So basically you don't -- you just have a D-dimensional vector. You don't sample it again and again. So N is just one vector. You have one latent Gaussian per observation. So W is just identity. So you don't have to learn that. But you have this zed and the correlations are given by mu and sigma, which depend on the features. And there are two hyperparameters, S and sigma, so sigma is parameterized as a squared exponential kernel. Right? So sigma is the shape and then S is the -- I don't know, actually -- the length parameter. Okay. So the number of hyperparameters is small. It's an easier problem for that reason, right? You just want to estimate these two parameters. And you want to infer the posterior distribution. So it's actually an easier problem, and it's interesting to show some of the plots that I'm going to show you and compare with EP later on. Okay. So we run experiments on the design [inaudible] dataset. The dataset has 200 data points basically, and we do cross-entropy prediction error on the test data to check it. Okay. So first let me show you only the variational methods, how they compare, and then I'll go to EP. Okay. So what I'm plotting: each of these plots is actually with respect to the two parameters. Okay. So the sigma and S of the kernel parameters are on X and Y. The top plot is the approximation to the marginal likelihood that you want to optimize if you want to learn these two parameters. The bottom plot is prediction error. Okay.
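For reference, a squared-exponential kernel with two hyperparameters like the ones mentioned above can be built roughly as follows (a sketch; the exact parameterization used in the talk may differ):

import numpy as np

def squared_exponential_kernel(X, signal_var, length_scale):
    # K[i, j] = signal_var * exp(-||x_i - x_j||^2 / (2 * length_scale^2));
    # this full matrix plays the role of sigma in the LGM, with W set to the identity.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale ** 2)

# Example: one (signal_var, length_scale) grid point for a dataset of 200 feature vectors.
X = np.random.default_rng(0).normal(size=(200, 2))
Sigma = squared_exponential_kernel(X, signal_var=1.0, length_scale=0.5)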
So when you are actually doing the learning, what you would like to do, you would like to look -- the approximation to the marginal likelihood should look very similar to the prediction error. Because you're trying to optimize the marginal likelihood, learn the parameters, and then use them for prediction. So you would want the shape of these two plots to look very similar. Okay. So for the Bohning bound you see that they look incredibly different. And I've put the star here for the minimum of the negative log likelihood. And you see that the corresponding point here is here. And you would want to be somewhere here, right? But you're here because the shapes are very different, and this happens because of the bounding error. In Jaakkola it has a little bit less bounding error, so you get a little bit better. You get here. And in piecewise you have even less bounding error. All the error is just due to Jensen's inequality. So the error that you have here is because of Jensen's inequality, and you get almost to the optimal point. >>: Now, why are there three graphs on the bottom? >> Emtiyaz Khan: This? >>: Yeah, the three -- >> Emtiyaz Khan: This is prediction error. And this is the marginal likelihood approximation. >>: Right. But you're -- >> Emtiyaz Khan: This is test error. Sorry. >>: Right. That's test error. But the argument -- are you drawing the same test set? If you're using the same test set, [inaudible] be the same picture? >> Emtiyaz Khan: No. They are three different bounds, right? [multiple people speaking at once] >>: Oh, this is the prediction error plot. Oh, got it. This is prediction error [inaudible]. >> Emtiyaz Khan: So this has been noted before also, that even if these methods don't perform well for the negative likelihood -- for the marginal likelihood approximation, the prediction is actually pretty okay. So for a Gaussian process, you can just do a grid search. Right? You won't have to actually -- there are not too many parameters. So you could just do a grid search and find the prediction error, the optimum prediction error. So it actually doesn't matter. You could just use the Bohning bound, which is actually really, really fast. And, believe me, it scales really well to huge covariance matrices even for a Gaussian process. >>: So my question is that for piecewise in the column there, the error really depends upon how many pieces you put in there. >> Emtiyaz Khan: Yes. >>: So if you put lots of pieces, you probably get zero error. >> Emtiyaz Khan: Yeah. >>: So the question is that, I mean, when you compare that with others, I mean, you get to have [inaudible]. >> Emtiyaz Khan: Yes. So if I use three, then it looks like that. >>: [inaudible]. >> Emtiyaz Khan: Yeah. >>: That's three. So this is -- >> Emtiyaz Khan: So this is like 20. So for all the experiments that we did, we just had to use 20 quadratic bounds. That's like -- we use quadratic pieces with 20 -- >>: So they're all optimized. >> Emtiyaz Khan: Yeah. They're all optimized beforehand, and then you just push it through there, all the -- >>: So you would think probably would not be as good as that. >> Emtiyaz Khan: Yes. Okay. So this is all right. But how does it compare with EP is the question that I get asked all the time. So I'm going to be very careful about this and go really slow on this. So start with the -- let's start looking at the posterior distributions that these methods obtain. Okay. So the first case is a simpler case.
You have a likelihood, a sigmoid likelihood, and you have a prior, and if you multiply these two together you'll get an approximately Gaussian-looking posterior. This is an easy scenario. Okay. Let's say that your parameters are such that you're -- you know, your step function and your likelihood -- your likelihood and prior are kind of overlapped. It's not a very non-Gaussian posterior. So if you take that and plot the mean and the covariance obtained by EP and our method, you'll see they exactly match. So everything is good if the posterior is very Gaussian. Now, the second case is when you have this likelihood which is more like a hard thresholding, and it cuts the Gaussian in the middle or somewhere so that the posterior is actually a skewed Gaussian. So this is the likelihood multiplied by the prior, and the posterior is this skewed Gaussian. It's not very Gaussian. So the two approximations differ in this case -- >>: [inaudible] EP uses [inaudible] so it smashed into a Gaussian, right? >> Emtiyaz Khan: Yes. So EP is trying to match the moments. And variational is trying to minimize the KL divergence QP. So I think it's QP. So it has this property, if you know about this, and I can explain this in more detail later. It tries to avoid the areas where there are zeros, basically. So it tries to reduce the variance, so the posterior is squished a little bit. So you will see this effect right here. So you see that EP and piecewise, their posterior distributions differ a little bit. Right? And the variance is actually shrunk in piecewise. So what it does is, it's actually trying to get rid of -- so there are -- so the posterior here looks like that, right? So it's trying to get rid of these areas; the piecewise variational method doesn't want to be in this area. So it takes the mean, it shifts it a little bit to that side, and it shrinks the variance because it doesn't want to be in the zero area. So you see exactly the same thing here: your variances shrink and your mean is shifted towards that side. So you see the mean increased and shifted on the negative side, and for the variance, the red one is the diagonal and the blue one is the off-diagonal. Okay. So now the question is which posterior distribution is better, right? That's the question we want to answer. Because these are two different approximations and you don't know which approximation is better. And everybody says that for the posterior distribution EP is better because the marginal likelihood approximation is better. But I found it a little bit different. Okay. So which one is better? I'll let you decide. Okay. So here is a plot comparing EP and the piecewise bound. The bottom one is EP. The top one is piecewise. Okay. So the first column is the KL lower bound. Okay. So if I do the lower bound -- well, they look sort of very similar, but the variational method is trying to optimize the lower bound, so it does a little bit better here than EP, which is trying to do moment matching. It's not optimizing this function, the marginal likelihood basically. So they both look about the same. The second one is an approximation to the marginal likelihood. This is given by Tom Minka in his 2001 paper. And I compute that for both piecewise and EP, and you see that both of them look very similar. So even though the lower bound actually performs badly in this area, which is like the highly non-Gaussian area, this side, because sigma is very high, this is the non-Gaussian area, the lower bound is -- behaves like a lower bound.
It does the right thing because it says my posterior is not Gaussian, it is very non-Gaussian, so I'm going to underestimate it, right? It gives a much lower value. But when it comes to finding the posterior distribution, it doesn't matter how far off you are; it matters how well you're approximating the surface that you're optimizing, right? So you're running on the surface and you want to reach the peak at the same place. So if you make less bounding error, the approximation in the posterior distribution is actually pretty good, because it matches with EP. And Rasmussen has shown that this is very similar to what you obtain by MCMC. So they both give the same answer. Okay. If you look at prediction error, there's no difference. So my point is that actually the question is not KL PQ or KL QP, which one is better; the thing is which one is more principled and which one is faster, right? And my -- what I believe is that variational is more principled, because now I can just take this and then optimize the hyperparameters, because I know that I'm optimizing this objective function. While EP doesn't optimize this function or that function either. It optimizes some other free energy, you know, other objectives, which I'm not -- I don't know how easy it is to extend to the factor analysis model when you have a larger number of parameters. So that's my argument there. And I would like to talk to people more to know about how EP does, you know, if you guys know more than me for sure. So my conclusion is that both methods give very similar results, but the variational EM algorithm, if you do parameter learning when your parameter space is large, not like GP, it is guaranteed to converge because you are optimizing one function in both the E step and the M step. So when I learn the posterior distribution I optimize a function, and then I optimize the same function in the M step, while if you're doing EP, I don't know. >>: You're guaranteed to converge, but you're not guaranteed to converge to the optimum of the likelihood. >> Emtiyaz Khan: Yes. Because I'm using an approximation. >>: I mean, should we follow the argument of why this is better than EP, right? >> Emtiyaz Khan: No, you care about convergence, right? Because -- >>: [inaudible] I stop for some -- you stop at an optimal -- >> Emtiyaz Khan: No, think of it -- think of it as a practitioner, if you're going to implement an algorithm. Are you sure that every time you implement it and run on a different dataset you will actually converge? So if you are optimizing two different functions in E and M -- >>: [inaudible] converge guarantee? It just won't go anywhere. >> Emtiyaz Khan: So, yeah, both of the things are important. It's important that you converge to the right place. It's also important that you converge, right? >>: I assume [inaudible]. >>: Well, you only care about uncertainty over nonconvergence if you know that it can -- if it held something good. But if you don't have a guarantee that you had something good, then you will cross-validate, and that's how you decide when to stop. And so the discussion becomes academic. [laughter]. >> Emtiyaz Khan: Plus how weird is real discussion? [laughter]. >> Emtiyaz Khan: The thing is that you have -- yeah, the kind of toolset you would like to make, at least when you're doing your Ph.D., is something that -- [laughter]. >> Emtiyaz Khan: -- something that applies to a lot of people, right?
Like it's easier to code and then it applies to, like, social science; people can be sure that their models will work on it or, like, you know -- >> Dengyong Zhou: We can talk offline. >> Emtiyaz Khan: Okay. Sure. All right. So I'm already going over time. So Nickisch and Rasmussen have this paper in 2008 where they compared all the approaches. Very nice paper. Very nicely written. And it reduced half of my work actually, because they have very good conclusions. And they actually say that the variational approach is a more principled approach than EP, but EP is actually faster than variational because you don't have to do numerical integration. Well, we have -- we have fixed that part in our paper. Okay. So now I'm at the second part, but it's a long thing. So I'll just give a five-minute summary of these two parts and I'll wrap up. >> Dengyong Zhou: Yeah, yeah. Show us the model. I mean, we can probably take it as read that the bounds are bad. >> Emtiyaz Khan: Okay. So the model is -- okay. I have a really -- this is really nice, but I'll skip this. It's basically -- so you want to model categorical data, and you model it through the softmax link, which is the usual thing. And if you take the log of this likelihood, you get this log-sum-exp here. As you remember, you're going to take the integral of this log-sum-exp and you don't know how to. So then you do some bounds on this. And my whole thesis was based on this, like how good the bounds are, and I have theoretical conditions on which bound is better and why and which one scales and how you choose a particular bound given the kind of application that you have. Okay. So you can do all that. But there's no hammer like the piecewise bound in this case -- >>: It's multivariate now. >> Emtiyaz Khan: Yeah, because it's multivariate. But I'm very sure we'll find something. There are some new bounds where you try to bound this using sigmoids, which is actually an interesting idea, because it has the same asymptotic properties, right; as you go to infinity, this sort of -- they have the same kind of structure. Okay. The behavior is similar. Okay. So that's kind of a bad thing. So what we do, we have this stick-breaking latent Gaussian model. And what you do -- okay. So to generate the probability of the first category, you take the stick and you break it using a sigmoid. And then to generate the second category, you take the rest of the portion and you cut it again. And you keep doing this. Okay. So you can basically write the log likelihood and you see that it actually comes in the form of this logistic log partition. You can also interpret this as K minus 1 binary classifiers put together. And we know that when you're taking the expectation -- now I know how to bound this properly, so inference is better. Okay. So the problem is that the interpretation of the weights is not easy, because the ordering of the categories will affect the solution, right? So it's not good for regression kinds of models, but if you're just doing Gaussian process or factor analysis, where you don't want to interpret the latent variables and you just want prediction, then you could still use this model. And this model was told to me by Guillaume. Nobody has published it for GPs or for other LGMs. But some people have used it for language models. It's not something that I came up with completely on my own. Okay. So the same thing. I do experiments. This is a preliminary result on GPs. But this is now multiclass. So this data is categorical and you have, you know, K of those things. So it's just bigger.
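Concretely, the stick-breaking construction just described gives (a reconstruction, with \sigma the logistic function and \eta_1, \dots, \eta_{K-1} the linear predictors):

p(y = k \mid \boldsymbol{\eta}) = \sigma(\eta_k) \prod_{j < k} \big(1 - \sigma(\eta_j)\big) \;\; \text{for } k = 1, \dots, K-1, \qquad p(y = K \mid \boldsymbol{\eta}) = \prod_{j=1}^{K-1} \big(1 - \sigma(\eta_j)\big),

so \log p(y = k \mid \boldsymbol{\eta}) = -\log\big(1 + e^{-\eta_k}\big) - \sum_{j < k} \log\big(1 + e^{\eta_j}\big) (the first term is absent for k = K), which is a sum of logistic-log-partition terms that the binary piecewise bounds apply to directly.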
It's a harder problem. Okay. So if you use the Bohning bound, you get this curve. Use Blei's bound, you get that curve. If you use our model and our bound, you get this curve. And the shape should look very similar to the prediction accuracy in our case. So yeah. That's basically the idea. So you're fitting the model more accurately than the other guys. And I'm working with Shakir, who was also Girolami's student. He's a postdoc at UBC now. And we're going to compare to MCMC. He's working on that. And we're going to compare to the paper -- the only paper that actually I could find in recent times on GP multiclass. We're going to compare on factor analysis as well, for categorical factor analysis. >>: Compared to using the [inaudible] on a Gaussian process regression, and then I'll even let you use GPC as [inaudible]. But I bet you if you choose the hyperparameters with Gaussian process regression -- >> Emtiyaz Khan: Oh. Okay. That's an interesting -- >>: Which I never thought of. That's -- >> Emtiyaz Khan: I never thought of that. >>: Yeah. Because that's yet another one of these things where you're trying to get [inaudible]. >> Emtiyaz Khan: Okay. So for categorical it's kind of a little bit tricky. But, yeah, you could still do 1 of K encoding. >>: It's called indicator crunching, by the way, and it's not multinomial, but even when we have binary [inaudible] there's some crazy [inaudible] name for all this stuff. [laughter]. >> Emtiyaz Khan: Okay. Thank you. Well, so the last part is ordinal data, and ordinal data, okay, you just have categories which are ordered, basically. So, you know, you get these etas, and then you pass them through the sigmoids and you get this kind of [inaudible]. >>: So you'd like to get it. >> Emtiyaz Khan: Yeah. So you have this ordering. So basically to generate this, what you do, you can use the proportional-odds model, and it's basically that every probability is just a difference of these sigmoids. So it's pretty simple afterwards, right? Like I just do some math and write the log likelihood as this. And then everything else follows, right? Okay. So that was my talk, so I'll just summarize it. So okay. So the main conclusion is that variational inference can actually perform badly if the bounds that you use have unbounded error. So it's very, very -- it's really, really important that you know how much error you're making when you're bounding these terms. And it's not that variational inference is bad, but it's the bound that might actually ruin your performance. It's not Jensen's inequality which is the killer. So in the piecewise bound we take this and we can drive the error in the bound to 0 by increasing the number of pieces, and this leads to improved performance, as I showed you. And we also get fine control over speed versus accuracy tradeoffs. So if you don't have time, you could just take three pieces and put them, you know, around the algorithm and get some coarse result. Okay. So we take the idea to categorical data LGMs, and it's kind of hard to come up with similar bounds for the softmax. But we have this new stick-breaking LGM which is easier to fit than the multinomial, and it seems that it might be useful for some cases. Okay. For ordinal data you could just apply -- I showed you the piecewise bound application. Some of the other work that I've done, if you want to know more about it, you can talk to me. It's about variational bounds and approximations in general for these three data types and also for text and doing topic modeling and other stuff.
This is part of my thesis, which is going to be out probably soon. And I do a theoretical analysis of the errors made by various bounds. It turns out that that's the most important thing. And I do a derivation of sufficient conditions for some of the bounds, to prove that one of the bounds is always better than the other bound. So I give conditions like: if you have a number of categories that's more than four, and if your matrix is diagonally dominant, and blah, blah, blah, then you can do -- you should use one bound over the other. And I also try to discuss some guidelines on how to choose one particular approximation technique if you want a certain error on your [inaudible] real dataset, and that's part of my Ph.D. work. Okay. And that's it. [applause]