>> Scott Yih: I am Scott Yih. It is my pleasure to welcome Andre Martins today as our speaker. Andre is currently a PhD student in a special joint CMU–Portugal program in language technologies, working with [inaudible], right? Andre has done a lot of work in both machine learning and natural language processing, and his paper entitled Concise Integer Linear Programming Formulations for Dependency Parsing won the best paper award at ACL 2009. Today Andre is going to tell us how to solve structured output prediction problems using his methods based on dual decomposition. >> Andre Martins: Okay. Thanks Scott for introducing me. This is joint work with my four advisors, Noah Smith, Mario Figueiredo, Pedro Aguiar and Eric Xing. This talk is divided into two parts. In the first part I am going to talk a little bit about inference algorithms, in particular a new dual decomposition method for doing inference with many overlapping components. In the second half I am going to talk about the learning problem, and particularly how to do structured sparse modeling for structured prediction in NLP. Let's start with the inference part. This is based on two papers: one presented this year at EMNLP, and another one at ICML, which was more focused on general graphical models. The EMNLP one was particularly about parsing. The basic idea, and I think this is closer to the EMNLP talk so it is more directed at an NLP kind of audience, is that we have been dealing with statistical models for many different tasks in NLP, and usually we can improve accuracy as we keep breaking locality assumptions in all these models with richer features. The downside of this is that in many cases we can no longer do exact inference in a practical way and we need to use approximate decoding methods. There are many different things that have been proposed, like sampling-based methods, using search, using LP relaxations. Here I am going to focus on methods based on LP relaxations. And one method that has gained prominence in NLP quite recently, which was actually introduced in vision problems by Komodakis, is dual decomposition, which gives us a principled way of combining models. It works by relaxing the original problem and then optimizing the dual with the subgradient method. I am going to talk about this in some detail soon. It has shown very good performance in NLP problems, in parsing, in MT and other sorts of problems. But all these past works have focused on combining just a few models, and here I will be concerned with massive decompositions in which you have many overlapping pieces in your model. So we are going to present an alternative dual decomposition algorithm based on the alternating direction method of multipliers that is particularly suitable when you have to deal with these kinds of massive decompositions. There are many applications of this; we are going to focus here on parsing, but we can use this virtually for any kind of constrained conditional model with first-order logic constraints on top of it. This is quite general and can be used in many different applications. Just to provide a basic idea of the task that we are going to address here, which is dependency parsing, this is what a dependency tree looks like. So we have a tree like this, and we could actually replace Edinburgh with Seattle here and it would make sense as well. [laughter]. So this is a dependency tree for this sentence. 
It is a rooted tree and this is the root node. It spans all of the words in the sentence, and you want to predict a structure like this. Just to try to get some intuition about what this means: there is a subject here, which is 'it'; there is the predicate 'rain'; there is a prepositional phrase, 'in the city of Edinburgh'; and so these arrows are essentially representing the grammatical functions of the words. So you want to predict something like this, and we are going to consider models that have a bunch of scores for different pieces of this tree. You will have scores for the arcs, which look at two words and consider an arc between these two words. We will have scores for consecutive siblings, so these are essentially words that... so these words 'it' and 'rain' are both children of 'does', they are on the same side, and they are consecutive. We are also going to consider scores for grandparent structures like these, and also for arbitrary siblings, not necessarily consecutive. And also for directed paths, so we have a score that reflects how likely two words are to be in the same path. So in this case I guess it would be a high score, because 'rain' and 'Edinburgh' (or Seattle) are likely to be in the same path. We also have scores for two consecutive words looking at the heads, the parents, of these two words. And—oh, and you also have these. I am going to skip some details here, but essentially non-projective arcs are arcs that cross each other, and we also have scores for pairs of arcs that cross, or for arcs that have other arcs crossing them. So this is useful in actual parsing. So let's talk a little bit about dual decomposition; this is our setup. We have some input X, which in this case is a sentence. We have some structured output Y, which is a parse tree, and we are going to represent Y as a binary vector that is essentially an indicator vector of the basic parts that are active in the output. So in the parsing example, each of these basic parts is going to be a candidate arc, and if the entry is one that means that the arc belongs to the tree, and if it is zero that means that we are not using that arc. Does that make sense? >>: [inaudible]. >> Andre Martins: Yes, exactly. So we start with a complete graph connecting all the words, and you will have a linear number of ones here. Suppose that you have some model that gives a score to each parse tree, and the goal is to predict by finding the parse tree that maximizes this score. The scores that I am going to talk about here are going to be linear scores, just a linear function of parameters and features. A technique that is used quite often to increase the expressive power of models like these is to combine two models. I am being very informal here, but suppose that I have two views of the output Y that I am calling Z1 and Z2 here, and I have a different score function for each of the views, and I just define the score of the parse tree as the sum of these two individual scores. But I assume that Z1 and Z2 are two overlapping views, in a sense that we are going to see what this means later. In this case the prediction problem becomes that of maximizing the sum of the two scores with respect to Z1 and Z2, which belong to their own spaces, and we have this constraint that they have to agree on their overlaps. So this notation is just an informal way of saying that these two views agree with each other. 
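For concreteness, here is a minimal sketch of the representation just described, before generalizing to multiple views: a parse tree as a 0/1 indicator vector over candidate arcs, with a linear score. All data structures and numbers below are made up for illustration; they are not the talk's actual implementation.

```python
# Minimal sketch (illustrative only): a parse as a 0/1 indicator vector over
# candidate arcs, scored by a linear function of weights and features.
import numpy as np

n_words = 4                                   # words 1..4, plus root 0
candidate_arcs = [(h, m) for h in range(n_words + 1)
                  for m in range(1, n_words + 1) if h != m]
arc_index = {arc: i for i, arc in enumerate(candidate_arcs)}

def tree_to_indicator(heads):
    """heads[m] = head of word m (1-based words, 0 = root)."""
    y = np.zeros(len(candidate_arcs))
    for m, h in enumerate(heads, start=1):
        y[arc_index[(h, m)]] = 1.0            # a linear number of ones
    return y

# Linear score theta . Phi(x, y); here Phi is a toy random per-arc feature map.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
phi_arc = rng.normal(size=(len(candidate_arcs), 8))

def score(y):
    return float(theta @ (phi_arc.T @ y))     # sum of per-arc scores

y = tree_to_indicator([0, 1, 1, 3])           # an example tree over 4 words
print(score(y))
```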
This is for two components, but you can imagine that you combine more than two, and so we use a very simplified view of the problem. We suppose that if you have some complex object Y, we represent it as a vector of basic parts—arcs in the dependency tree, for example. And we are going to consider several views of this object. These could be, for example, the siblings or grandparents and the other score functions that I described in the beginning. Those will induce other parts, which are going to be these other things which are not necessarily arcs but conjunctions of arcs, for example. And there is some overlap between all of these views, and we have the constraint that they must be consistent on the overlaps. So the goal is to be able to reconstruct Y by gluing together all of these views Z1 to ZS, and I am arguing that this is a recurrent problem in NLP, in which we want to use some local evidence and we want to glue everything together and form a global meaning that is consistent and collects all of this local evidence. Yes? >>: I am trying to understand the [inaudible]. So one thing you say is that they agree on the overlap. It is different to say that there exists a Y such that if you view it from the Z1 direction you will see that, and from Z2 you will see that; so are you explicitly saying just that the overlaps agree, or do you mean also that there exists a Y such that the projections will be exactly Z1, Z2, Z3? >> Andre Martins: Yes. That is the second thing that you mentioned. I am assuming that all of the Zs will be consistent in a way that there exists some well-defined Y that collects all of this information from the different views. In the parsing example, Z1 to ZS could be, for example, grandparent structures, like two arcs, and of course there will be some overlap between different structures like these. I am requiring that the arcs that these two structures share have to be consistent, right? Let's make this more formal. So we make the score be the sum over individual scores, and this is the problem that we want to solve, but to make all of the Zs consistent, we are going to add a witness vector that we call U, and we are going to make sure that all of these components match this witness. This witness is indexed by the basic parts, by the arcs of the dependency tree, and this constraint makes sure that all of the arcs in these structures have to match some common witness, and this is going to force agreement between all of the overlaps in the different components. Does that make sense? So this is an [inaudible] problem. This is what we want to solve. This problem can be intractable—and actually for non-projective dependency parsing it becomes NP-hard to solve this with the features that I have described—and so a technique, an approximate way of solving the problem, is to consider a linear relaxation of this problem in which we replace these sets YS, the set corresponding to each view, by its convex hull. I am skipping some details here, but if you think in terms of graphical models, this is exactly the same thing as approximating the marginal polytope by the local polytope, which is an outer bound of the set that contains all the marginals for the graphical model. So this is something that is commonly done. And in this problem and many other problems the relaxation gap is not going to be too large, and so we can still do something useful by solving this relaxation. Essentially we have replaced Y of X by its convex hull. 
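One way to write out the decomposed problem and its relaxation just described; the notation below (component scores f_s, views z_s, witness u indexed by basic parts r, output sets Y_s) follows the talk informally and is partly an assumption of this sketch.

```latex
% Decomposed problem with a witness vector, and its LP relaxation (notation assumed).
\begin{aligned}
\max_{z_1,\dots,z_S,\;u}\quad & \sum_{s=1}^{S} f_s(z_s)\\
\text{s.t.}\quad & z_s \in \mathcal{Y}_s, \qquad s = 1,\dots,S,\\
& z_s(r) = u_r \quad \text{for every basic part } r \text{ covered by component } s,
\end{aligned}
\qquad \text{relaxed by replacing each } \mathcal{Y}_s \text{ with } \mathrm{conv}(\mathcal{Y}_s).
```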
And so the next thing to do is to dualize this problem by introducing Lagrange variables, lambda sub s of r, for these constraints; we are going to dualize these out. Then we are going to solve the dual by using the projected subgradient method. >>: So the relaxation that you're doing is for the arcs as binary variables? You relax zero-one… >> Andre Martins: Yes. >>: [inaudible] 0-1, you could… >> Andre Martins: Yes, yes, exactly. So for this parsing problem that is exactly what I am doing, essentially. Any further questions at this point? So if you skip all of the mess that is implied by the projected subgradient method applied to this problem, we end up with an algorithm that looks like this. We first initialize all of the Lagrange multipliers to 0, and then you have a loop in which at each point we are going to look at each component s and we are going to make a z step, which essentially involves maximizing the score of that subproblem plus a linear term that involves the Lagrange variables. This is easy because this term is linear, and if the score functions are also linear—if f sub s is linear—this is still linear, and so this is as hard as computing the MAP for that particular component. Then we aggregate all of the zs that we obtained in the z step and we compute the average; here delta of r is essentially the number of components where the basic part r appears, so this is an average. So we aggregate everything, we average, and then we do an update of the Lagrange multipliers that essentially subtracts out the average, and there is a step size which has to be diminishing and all of that for this to converge. This algorithm is guaranteed to converge to the solution of the LP relaxation, and it can happen that you find a certificate: if at some point in the algorithm all of these zs agree with each other, then u is also going to agree with everyone, and we know that we are done. That can only happen if there is no relaxation gap, so if the solution is actually integer; or you can just define a maximum number of iterations and stop after that. >>: [inaudible]. >> Andre Martins: This is dual decomposition using the [inaudible] method. >>: It looks kind of like a non-augmented [inaudible]; you update your Lagrange multipliers by the negative, the positive gradient [inaudible], so this looks very much like [inaudible]. >> Andre Martins: I am maximizing, not minimizing, so this is actually standard [inaudible], but this lambda step is going to be the same for ADMM, and actually you're going to see later how to take this algorithm and change it slightly to become ADMM. So this is what I just said. The problem here is that if you have many components, this is going to become very slow, because I am trying to reach a consensus among many things that are overlapping and it is going to be slow to get that consensus. The only thing that is pushing for a consensus here is this term where the Lagrange multipliers intervene, so the Lagrange multipliers are going to be set to try to get agreement between all of the zs. But in ADMM we are going to get something stronger that pushes for agreement. Let's talk about the ADMM algorithm. The goal is to accelerate dual decomposition. 
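Before the ADMM variant, here is a minimal sketch of the projected-subgradient loop just described. `components` (each an array of the basic-part indices it covers) and `map_oracle` (the per-component MAP solver) are illustrative placeholders, as is the toy usage at the end; this is a sketch, not the talk's actual implementation.

```python
# Sketch of the projected-subgradient dual decomposition loop.
import numpy as np

def subgradient_dd(components, map_oracle, n_parts, n_iters=100, eta0=1.0):
    lam = [np.zeros(len(c)) for c in components]   # one multiplier per (s, r)
    delta = np.zeros(n_parts)                      # how many components cover part r
    for c in components:
        delta[c] += 1.0
    for t in range(1, n_iters + 1):
        # z step: MAP in each component with the score shifted by the multipliers
        z = [map_oracle(s, lam[s]) for s in range(len(components))]
        # aggregate: u_r is the average of the votes z_s(r)
        u = np.zeros(n_parts)
        for s, c in enumerate(components):
            u[c] += z[s]
        u /= np.maximum(delta, 1.0)
        # lambda step with a diminishing step size
        eta = eta0 / t
        for s, c in enumerate(components):
            lam[s] -= eta * (z[s] - u[c])
        if all(np.allclose(z[s], u[c]) for s, c in enumerate(components)):
            break                                  # certificate: all views agree
    return u, lam

# Toy usage: two components overlapping on part 1, each over independent 0/1 parts.
w = [np.array([1.0, -0.5]), np.array([2.0, 0.3])]
comps = [np.array([0, 1]), np.array([1, 2])]
oracle = lambda s, lam_s: (w[s] + lam_s > 0).astype(float)   # trivial per-part MAP
print(subgradient_dd(comps, oracle, n_parts=3))
```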
There have been several approaches that have tried to do that, one by Vladimir Jojic, [inaudible] and some others, where they essentially smoothed the objective function by adding an entropic term, and then because the objective becomes differentiable they can just use gradient methods; in this case I think they used a [inaudible] of fast gradient techniques, and they managed to accelerate dual decomposition by doing that. We are going to do something different, which is to consider the augmented Lagrangian of this problem and use the alternating direction method of multipliers. This was proposed in the ICML paper. This is an old method in optimization; it goes back to the '70s. There seems to be a recent surge of interest—there was a recent monograph by Stephen Boyd and others on this method—and it has been applied to many different problems. In the ICML paper we describe how to apply this to general graphical models, so here the focus is more on parsing but this is quite general. How can we start with this subgradient algorithm and obtain the ADMM algorithm? There are just a couple of changes that you need to make. First of all, we need to initialize the u variables, and we are just going to initialize them to uniform; all of these parts are binary variables, so we are just saying that we don't have any evidence about the parts being 1 or 0. We are going to add this additional quadratic term in the z step, which is essentially keeping track of the residual between the zs variables and the u variable that is the average of everyone. So essentially it is like in the previous round we have computed the average of these votes—we can regard these zs as votes—we computed the average of the votes, and now we are regularizing towards this average. So this is the reason why this is pushing for a faster consensus, because in some sense it gives some basic information about what everyone has predicted in the previous round. >>: [inaudible] to make sure convergence [inaudible] a non-zero [inaudible] which ensures convergence when your other terms [inaudible]. I mean, that is what I have seen. Is it really faster? I mean do you have proof that it is faster? >> Andre Martins: I think that it is an open problem to actually show the convergence rate of ADMM. I am skipping some details here, but pure [inaudible] methods would do something different. They would do joint optimization of u and z, and then they would just use this multiplier update here. The problem is that you cannot do that efficiently, because this term is coupling the two variables together, and so ADMM, instead of doing this joint optimization, does coordinate optimization: it optimizes first with respect to z and then with respect to u. That is the basic difference between ADMM and the [inaudible]. Does that make sense? >>: [inaudible] objective function [inaudible]. >> Andre Martins: The objective function is the same. This is just an artifact of the method. So this quadratic term is something that… >>: It just appears that [inaudible]. >> Andre Martins: Yes. Exactly. I forgot to mention that, so that is why this is a residual. Even in the LP relaxation, for any primal feasible solution of the relaxed problem this thing will be zero, and so this term should vanish. So yeah. So the second small difference is that you don't need to [inaudible] this learning rate here like in the subgradient method. We can just keep it fixed. 
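The same consensus loop with the ADMM modifications just listed: u initialized to one half, a quadratic penalty toward the previous round's average in the z step, and a fixed step size. `qp_oracle` stands in for whatever solves the per-component quadratic subproblem, and the toy box-constrained oracle is an assumption of this sketch, not the talk's actual solver.

```python
# Sketch of the ADMM variant of the consensus loop.
import numpy as np

def admm_dd(components, qp_oracle, n_parts, n_iters=100, rho=1.0, tol=1e-6):
    lam = [np.zeros(len(c)) for c in components]
    delta = np.zeros(n_parts)
    for c in components:
        delta[c] += 1.0
    u = np.full(n_parts, 0.5)                      # uniform: no evidence for 0 or 1 yet
    for _ in range(n_iters):
        # z step: argmax f_s(z) + lam_s.z - (rho/2)||z - u||^2 over the component set
        z = [qp_oracle(s, lam[s], u[c], rho) for s, c in enumerate(components)]
        u_old = u.copy()
        u = np.zeros(n_parts)
        for s, c in enumerate(components):
            u[c] += z[s]
        u /= np.maximum(delta, 1.0)                # consensus: average the votes
        primal_res, dual_res = 0.0, rho * np.linalg.norm(u - u_old)
        for s, c in enumerate(components):
            lam[s] -= rho * (z[s] - u[c])          # same lambda step, fixed step size rho
            primal_res += np.linalg.norm(z[s] - u[c])
        if primal_res < tol and dual_res < tol:    # stop on small primal/dual residuals
            break
    return u, lam

# Toy usage: same two overlapping components; the QP has a closed form when the
# component set is just the box [0,1]^k (an assumption for illustration).
w = [np.array([1.0, -0.5]), np.array([2.0, 0.3])]
comps = [np.array([0, 1]), np.array([1, 2])]
qp = lambda s, lam_s, u_s, rho: np.clip(u_s + (w[s] + lam_s) / rho, 0.0, 1.0)
print(admm_dd(comps, qp, n_parts=3))
```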
So in ADMM… >>: [inaudible] problem in terms of [inaudible]. >> Andre Martins: Oh, you mean the… >>: [inaudible] dual function [inaudible]. >> Andre Martins: Okay, I see. Dual function… >>: [inaudible] you added the… >> Andre Martins: I see what you mean. We can also modify our primal problem by adding the residual term, which does not change anything because it is going to be zero at the optimum, and we can derive the dual of that. That is what you are saying. And we get a smooth dual, which is—you get this, yeah. And so this can be regarded as a method. So there is this useful thing, that you don't need to anneal the step size like in the subgradient method. This is one of the reasons that make the subgradient method slow: you need to reduce these step sizes, which means that your updates to the dual variables are going to be smaller and smaller, and it is going to take more time to make progress. In addition, there are some better stopping conditions, so even if you have a relaxation gap, we can keep track of the primal and the dual residuals—this residual term is giving that information—and we can stop our algorithm if we are below some threshold, and so this gives us a way to stop the algorithm even if we don't have a certificate for the exact problem. So any questions so far? The only thing that is remaining here is how to solve this z step, and this is a complication that the ADMM algorithm introduces with respect to the subgradient one. We are going to assume that the score of each component is linear, and in that case this becomes a quadratic problem because of the residual term. In the subgradient method you didn't have this term, so the entire thing was linear, which means that if you have a component that is a combinatorial piece of your problem you can use combinatorial algorithms to solve it; for example, if we have a sequence model as a component, you can use Viterbi to determine the most likely path in that sequence, or in the dependency parsing case we can use [inaudible] algorithms for determining spanning trees. It depends on what your problem is, but all of these nice combinatorial algorithms cannot be used if you have this quadratic term. So it becomes harder to solve the z step if you have a complex component. >>: [inaudible]. >> Andre Martins: It is a QP so you can use any QP solver, but the point is that if your component is very large, some of these combinatorial algorithms are very efficient at solving it exactly, and it is harder to solve a QP, so if you are really concerned about speed and you need to solve it several times in ADMM, this could be a drawback. >>: [inaudible]. >> Andre Martins: Yes, exactly, and that is something that we do, and there are all sorts of tricks that you can do and that is all very helpful. Also, you don't need to solve the QP exactly. There is a result by Eckstein and Bertsekas in the '90s showing that ADMM converges if you just approximately solve the QP, as long as something like the sum of the errors is summable. I don't remember the details, but as long as across iterations you solve the QPs more and more exactly, then you should be fine. But there are some cases that we can address, and we are going to talk about those; in particular, if you have any component that expresses a first-order logical constraint, we can still solve the QP exactly and efficiently. 
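Written out, the z step just discussed looks roughly like this; the notation (linear component scores w_s, Lagrange multipliers lambda_s, penalty parameter rho, and u as the previous round's average) is assumed rather than taken verbatim from the slides.

```latex
% The quadratic z step of the ADMM variant (notation assumed):
z_s^{(t+1)} \;=\; \arg\max_{z_s \,\in\, \mathrm{conv}(\mathcal{Y}_s)}
\; w_s^{\top} z_s \;+\; \lambda_s^{\top} z_s \;-\; \frac{\rho}{2}\,\bigl\| z_s - u^{(t)} \bigr\|^2
```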
This is quite useful in NLP models, because often you want to inject some constraints into the model with some prior knowledge that you have about the problem at hand. Here is a simple example. Suppose that the component is composed of three variables: two variables Z1 and Z2 that we had in the model, and we just add a third variable that is the conjunction of these two. This can be useful if you want to make a model richer by including a potential for the simultaneous inclusion of these two parts, and this happens in parsing in these cases. So in this case Z12 is like the conjunction of these two arcs, and we have the same thing for the siblings, and you have bigram features. In that case the convex hull of the set corresponding to this component is going to be [inaudible], just like this essentially: it is saying that if Z1 and Z2 are both 1, Z12 has to be 1—that would be this vertex here—otherwise you have these three vertices; and after taking the convex hull it becomes this, and solving the problem in this case has a closed-form solution, so it is quite efficient to compute, and so this is a [inaudible] that you can do. And with these you can essentially tackle any [inaudible] model in a graphical model, because this would be like exponentials. The next example is uniqueness quantification. Suppose you have a group of n binary variables and you require that exactly one of those variables is one and all the others have to be zero. So this is like a selector over the variables. So this is what the output set for this sort of problem looks like: it is any binary vector such that the sum is one. If you take the convex hull of that, it is just the probability simplex, and the z step is just projecting onto the simplex, and so this problem can be solved efficiently and exactly by just sorting all of the entries in this vector and applying a soft thresholding operation. And so this can be done in n log n time, and in practice it is faster if we cache the previous solution, or if you warm start—if you store the last sort that we did for this problem. So this is a nice case. And it turns out that we also have a similar result for some other logical constraints. In this case, we take a group of n variables and require that at least one of the variables is one, but there can be more than one variable with value one. So in that case, essentially we require that the disjunction of the variables is one, and the output set in this case is just the whole set of binary vectors with the origin removed, because the origin is the only thing that does not satisfy this constraint. And so projecting onto this set can also be done efficiently by sorting; the ICML paper describes this in more detail, but it is also very efficient to do. So there are other examples involving first-order logical constraints. For example, we can introduce a new variable to the model which is the disjunction of existing variables. This is something useful, again, to make our model richer by adding a specific potential for this new variable. And in that case you get a nice [inaudible] as well, and you can also solve this with a [inaudible]. In this case it's a little more tricky, but you can do it. So this can be extended if we negate inputs, and we can essentially deal with any first-order logic constraint by doing this, so everything is just solvable with sort operations. 
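As a concrete illustration of the sort-and-threshold solution for the uniqueness ("exactly one") constraint, here is the standard Euclidean projection onto the probability simplex. This is a generic sketch of that well-known routine, not necessarily the exact implementation (or caching scheme) used in the talk.

```python
# Sort-and-threshold projection onto the probability simplex {x >= 0, sum(x) = 1}.
import numpy as np

def project_onto_simplex(v):
    n = v.size
    u = np.sort(v)[::-1]                         # sort entries in decreasing order
    css = np.cumsum(u)
    # largest index k such that u_k stays positive after soft thresholding
    k = np.nonzero(u - (css - 1.0) / np.arange(1, n + 1) > 0)[0][-1]
    tau = (css[k] - 1.0) / (k + 1.0)             # soft-threshold level
    return np.maximum(v - tau, 0.0)

print(project_onto_simplex(np.array([0.9, 0.6, -0.1])))   # -> [0.65, 0.35, 0.0]
```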
I am just going to show some experiments here. The experiments are on non-projective dependency parsing. I also did some experiments on synthetic graphical models in the ICML paper, but I am going to skip those. Here I just used data sets from the CoNLL shared tasks, which are common data sets for different languages that have annotations for these dependency trees, and the goal is to train the parser and then evaluate it on a test set. To train, we just use the MIRA algorithm, which is an online algorithm; you can think of it as an online version of the structured SVM algorithm. I'm going to skip details on that. Finally, because this is an approximate method, if you really want an exact solution we need to do some rounding to obtain a valid tree, and it turns out that you can come up with a very simple rounding strategy that just looks at the arc entries in the vector, takes those as scores, and uses a combinatorial algorithm to obtain the closest valid tree to that vector. We did that, and actually we usually obtain exact solutions. The percentage of fractional solutions that we get is quite small, and I think that this deserves further study, but I think the reason for that is that we actually train also with the LP relaxation, and this is pushing the model to favor exact solutions and avoid hitting fractional vertices. These are the results that we have obtained. We actually tried two kinds of models with different feature sets. This is a model that just uses arc features and some other features for consecutive siblings and grandparents. And these are the results with the full set of features, so there is some improvement over that baseline. And this is what the best published results are. This compares a bunch of different methods that have been used for parsing. We don't win for all languages, but we get state of the art for eight of these 14, and we are pretty close for the others. So these are the ones where we are winning. This is just giving a flavor of how much faster the ADMM algorithm is compared with the subgradient method. This is plotting, as a function of the number of iterations, the accuracy that we get on the evaluation sets—there are different runs here, one for each instance in the test set—and we are keeping track of the accuracies that we get if we fix the maximum number of iterations for the subgradient algorithm and for the ADMM algorithm. You can see that ADMM usually gives a more accurate response sooner, earlier than the subgradient method. Here we are using a considerable number of components, but still not that many. If we increase this further—so this is a second-order model; if you do this for the—this is another plot showing the percentage of certificates that we get with the subgradient method and the ADMM algorithm, so we also get more certificates, which means that we can actually stop our algorithm sooner. And this is showing that if we define a threshold on the primal and dual residuals, this is the percentage of instances where we can safely stop. This is for the full model, and here the results are much more impressive. So here we have many more components, on average thousands of components and in the worst case hundreds of thousands, so there are many overlapping pieces here. 
And this is the result that we get with the subgradient method and with the ADMM algorithm; eventually this red line will catch the blue line, but it will just take too many iterations to get there. And so again ADMM does a good job: after say 200 or 300 iterations it is already at the best performance that you see. >>: [inaudible] updates in the two cases based on [inaudible]. >> Andre Martins: That is a good point. It's almost, but not exactly, the same, because the subgradient method, instead of n log n, is linear for those components, so there is like a log n difference here, but that sort of gets washed away if we cache the subproblems, and I am going to show that in the next slide. So you can consider this approximately the same. >>: [inaudible] sweeps over all of the components? >> Andre Martins: Yes, so I am also going to—it turns out that it doesn't need to revisit some components. I am going to talk about that next. But yes, it's [inaudible]. >>: So [inaudible] a sense of how long—this is like one parse, right? >> Andre Martins: Yes. This is very fast. I think the average parsing time is less than one second; it is less than half a second, so it is fast on average. >>: I am wondering that [inaudible] upgrading the [inaudible] linear [inaudible] then probably you can have a bigger component. So basically when we do this kind of comparison we use [inaudible] and then you are forced to use slower components, and I'm not sure that this is [inaudible] it is okay, but I am not sure that it is… >> Andre Martins: That is a good point. I skipped some slides, but if you go to a previous one, this is a fair comparison for the second-order models, because there is one constraint in the parse tree, a tree constraint, where you want to enforce the whole thing to be a well-defined tree, and so there is a large component that the subgradient method can use to enforce that constraint. Here we are using that: the subgradient method is using one large component for that tree, and the ADMM method, because it cannot do anything useful with it because of the quadratic term, is splitting that tree constraint into many small pieces, and that is why it has more components on average with respect to the subgradient method. So this comparison is fair. Then the next one is also fair, yes. So here we did the same thing—sorry, not here but here. Here we tried, for the subgradient method, using a tree component, but because some of our features require examining some things that are needed anyway, if we split that tree constraint into smaller pieces it didn't give any advantage to use the tree, so it was actually better to split it also in the subgradient method, because we need to reuse some of the—it is kind of hard to explain; maybe we can talk off-line about that. So this is a fair comparison. Now there are weaker models that do not include features as complex as these, for which you could come up with coarse decompositions in the subgradient method. Those are, for example, what Michael Collins and Terry Koo and Sasha Rush used. And in that case the subgradient method is sufficient. So there are some scenarios where the subgradient method is more suitable than ADMM. ADMM is better if you don't want to be concerned about how to find a good decomposition, or if no such good decomposition exists. 
So I guess I had that in some slides, but I… >>: You said for the ADMM you had to do the breakup into the small [inaudible] because of the quadratic term, but you could use that same breakdown for the subgradient method as well if you wanted to, right? And then couldn't you reuse the cached results… >> Andre Martins: Yes, but it is much, much slower. I didn't show that here, but it was slow. Actually, here I did that, because as I said, here you aren't gaining anything by using the tree constraint, so in this full model—not the second-order model, but the full model—this is actually using all of the small pieces and caching and all that. So again, we can stop early if we fall below some threshold, because of the primal and dual residual information. And this is the impact of caching. It turns out that caching can give a substantial speedup. This is essentially showing the percentage of subproblems that we need to solve. You can see that after say 300 iterations, we only need to touch like 20% of the factors—actually less than that. >>: [inaudible] caching [inaudible]? >> Andre Martins: It means that as we keep iterating with the ADMM algorithm, many of these subproblems become exactly the same as in the previous round, and so we don't need to solve them again; we can just cache the solution. But we need to do some sort of smart caching, because we don't want to examine the inputs of those problems—that is a linear lookup as well—but you can do that, and it saves a lot of time. So here are the running times. This is like 0.34 seconds per sentence. >>: [inaudible]. >> Andre Martins: I don't really know, but it is probably something like 20 words per sentence, and it varies a lot. The average length doesn't give you much information, because the algorithms are not linear in the sentence length, so if you get a long sentence that is going to dominate. But these are sentences that appear in newspaper text; we did not filter out any sentences. So this concludes the first part of the talk. We have presented a new variant of dual decomposition which is faster to reach a consensus. It is suitable for problems with many overlapping components. There are some advantages and disadvantages with respect to subgradient methods. Unlike the subgradient method, this doesn't allow us, at least in an obvious way, to use combinatorial machinery to solve the subproblems, because you have this quadratic term—although there are some ideas for doing something smart here to reuse that machinery, but that is something that we can discuss afterwards. On the other hand, in many cases—for example, any time we have first-order logical constraints—we can actually compute the z step in closed form in an efficient way, so this is a good thing to use if you have these kinds of models, like constrained conditional models and models in which you have to inject these kinds of constraints. And there is a lot of future work; this ADMM algorithm is quite general and could be used in many different problems. There are factors that I didn't talk about here, but suppose you had something like a budget constraint; those types of things appear in summarization. 
For example, you want to summarize documents and you have a budget of say 100 words that you can use; we can actually have a subproblem that restricts the number of words to a particular number, and we can still solve the quadratic problem in an efficient way. There are also ways of tightening this towards exact decoding, and I think this will be very nice—if it works; we still need to work out the theoretical details. Hybridizing subgradient and ADMM would give the benefit of allowing us to reuse the combinatorial machinery: the idea here is to let some components not have the quadratic penalty, and use it only for other components. So essentially if you have something like a sequence model or a tree model, we don't use the quadratic penalty there, and you can use combinatorial algorithms to solve those subproblems; if you have first-order logic constraints, we put the penalty there, and that is going to [inaudible] consensus. So it is an open question whether this hybridization will work. I believe it would, and in practice it seems to do something decent, so it would offer a good way of solving real-world problems. Okay. So are there any questions regarding this first part? >>: So you talked about some of the situations in which you could use a closed form in the z steps. So in the process of running through your data sets, if you had cases where you could use this, would you just take those and treat those outside the optimization and say I have the same [inaudible] for these guys, and then let everything else go through whichever method [inaudible]—so you treated those guys separately? >> Andre Martins: The model that I am using is such that any feature in that model can be framed as a first-order logic constraint like these, and so I could always solve the z step efficiently in closed form using these techniques, basically sorting. >>: [inaudible] not all trees that you had would break down into—not all of the components would turn out to have closed forms [inaudible]. >> Andre Martins: I skipped some details, but it turns out that if you take this dependency parsing task, in which you have a component that constrains all arcs together to define a tree, you can enforce the constraint by a sequence of formulas in first-order logic. We need to do some lifting—I skip the details because it is very complicated—but you can use a multi-commodity flow formulation for that problem, which essentially imposes that the arcs that we get define a connected graph, and that is going to imply that we have a tree; and to impose connectedness, you can use path variables and flow variables. It is a little complicated, but you can do everything, and that is what we did. So that is what I meant by splitting the large piece into smaller pieces: it is essentially splitting that tree constraint into a set of first-order logic constraints. >>: So you could in general split all of the constraints [inaudible] into this [inaudible]. >> Andre Martins: Yes. So this is quite general. Suppose that you have a pairwise graphical model where the variables are not binary, but can take multiple labels—if you have a [inaudible] model or something like that. You can also transform that graph into a binary graph that uses some of these XOR factors that I showed there, and then you can apply these. 
So it is usually easy to take a big component and split it into small components, but it might not be the best way of solving the problem, because it might turn out that if you don't have many overlaps, subgradient is already a good approach. It really depends on your model. And so it is kind of an open problem; I think that there is some hope that you can still tackle these large factors with the quadratic penalty there. For example, there are ways of solving the QP that involve solving a sequence of LPs, so by using that, you could solve our QPs by repeatedly calling these combinatorial algorithms. Okay. Any further questions before I proceed to the second part? So now I am going to talk about structured sparsity in structured prediction. This is based on another EMNLP paper, and it is vaguely related to the paper that we presented at AISTATS this year with kernels, which is not the setting that we are going to consider here. The basic idea is that in many cases we care about sparsity in an NLP model, for several reasons. We may want a compact model, or we care about runtime—it turns out that sparsity can also affect runtime, and you are going to see this in more detail later—or we cannot afford a large memory footprint, or we actually want to interpret the model and try to understand what is relevant in it. And so there is a lot of previous work that addresses sparsity, but usually it just focuses on penalizing cardinality through an L1 regularization term and ignores the structure of the feature space. So here the idea is to take this structure of the feature space into account, and that is what this part of the talk is about. Our setup is again the same setup that we had in the previous part. You have an input set X; for each X we have a set of candidate outputs Y of X. So X could be sentences and Y of X could be parse trees or whatever. We are assuming that this is a structured set, so this is again structured prediction. And we are going to be concerned with linear models where the score is a linear function of the parameter vector theta and a joint input-output feature map phi. To learn the parameters theta, we minimize a regularized empirical risk functional—something that looks like this, where you have a regularizer Omega of theta and an empirical loss term, for whatever loss we prefer. In this talk we are going to focus on the regularizer. We think that people have played many times with different losses, but there are still a lot of things to do regarding regularization. >>: So can you say that in training you want to find the [inaudible], but to generalize it is actually not enough that this term be small. We want it to be smaller, the smallest one over any other alternative Y, right? Because you have this input-output encoding? >> Andre Martins: So this entire thing is going to be the score of output Y, this entire thing, this product here. And I am just picking the output that maximizes this score. >>: Yes, but in training it is not enough just to say that you want to minimize it for the [inaudible]; this should be small, but you actually also need, for the alternative Y, this value to be large. >> Andre Martins: This is inside the definition of the loss function. This loss function could be a hinge loss, where it will have a max inside of it saying something like that. 
Or it could be a logistic loss, where we model the conditional probability of Y given X—does that make sense? So you could define this L as like the margin, where the margin uses… >>: [inaudible] all of the alternatives. >> Andre Martins: Yes. So this is all wrapped inside this loss function. Let's see what people have done in regularization. The most trivial choice is using L2 regularization; the second most trivial is not using any regularization at all. So this has been done for quite a long time. It works really well: even if some features are irrelevant, sometimes some combination of the features is relevant and you can capture that phenomenon very well, but it doesn't give you a sparse solution, so if you care about sparsity this is not solving the problem. Another option is to use L1 regularization, which in regression is called the lasso; this encourages sparsity, and we can then look at the components of the weight vector that were zeroed out and discard those features. So this essentially allows us to do feature selection. There are extensions, for example elastic nets, that combine L1 with L2, and people have played with those things as well in NLP, but all of these options treat each dimension of the feature space equally, so they all ignore the structure of the feature space. Here we want to promote structural patterns rather than just penalizing cardinality. Using a simple example, suppose that we are just doing multi-class classification—no structured prediction here. We have an input feature vector that we call psi of X, we have an indicator vector for the label Y, and we can construct features that are a conjunction of these two things, like an input feature conjoined with a label. This is something that people do commonly. So you can represent the weight vector as a matrix which is labels times input features. If you use L2 you will get something which is dense, like this; these colors are essentially showing the magnitude of the weight for each feature. If you use L1 you get something which is sparse—the white squares mean zeros—but it is random sparsity; it doesn't have any pattern. We may care about something like this, which is group sparse: for example, we may want to discard entire input features if they are not relevant for any label. And so we are going to focus now on group sparsity. Essentially we allow density inside each group, but we want sparsity with respect to the groups that are selected. So in this case each group is going to be a column in the matrix. So how can we choose the groups? Here we have the opportunity to use our prior knowledge about what kind of sparsity patterns we want in our model. Here is a general formulation for that. Suppose we have D features and we group them into M groups, G1 to GM, where each group is a subset of the features. We can then form parameter sub vectors theta 1 to theta M, where each of them is a sub vector corresponding to its own group. People have proposed using this group lasso regularizer, which essentially penalizes the sum of the L2 norms of the sub vectors: for each sub vector m, we take the L2 norm—not the squared norm but the actual L2 norm—and then we take the sum. So you can regard this as the L1 norm of the L2 norms. It turns out that this is also a norm, which people call the mixed L1/L2 norm. 
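The learning objective and the group regularizer just described, written out; the group weights d_m (penalizing some groups more than others, which comes up next) are an assumption of this sketch rather than something shown on the slide.

```latex
% Regularized empirical risk and the group-lasso (mixed L1/L2) regularizer (notation assumed):
\min_{\theta}\;\; \Omega(\theta) \;+\; \frac{1}{N}\sum_{n=1}^{N} L(\theta;\, x_n, y_n),
\qquad
\Omega(\theta) \;=\; \sum_{m=1}^{M} d_m \,\bigl\|\theta_{G_m}\bigr\|_2
\quad\text{(the mixed } \ell_1/\ell_2 \text{ norm when } d_m \equiv 1\text{).}
```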
So if you use this regularizer, because this is the L1 of something, it is going to promote sparsity at the group level. It is going to attempt to shrink some of these norms to zero, which will discard all the features inside that group. People have used these in statistics under different names, like composite absolute penalties. They have played with different norms, not just L1 and L2; for example, the L-infinity norm is something which is used here as well. Yes? >>: [inaudible] larger than one another [inaudible] are the… >> Andre Martins: Yes. I'm going to get—it matters a lot. I'd say that this is one of the main problems that needs to be solved. So this leads me to this. [laughter]. In general, we want to penalize some groups more than others, taking the number of features in the group into account. >>: [inaudible] specify some prior knowledge… >> Andre Martins: Yes. In this talk I am assuming that this is given, but it would be extremely interesting if someone came up with a way of setting these weights automatically. >>: Is a group a class label? >> Andre Martins: No. It could be in the earlier example, but here I am going to consider the general case—groups can be different things, and I am going to give several examples of groups. But in the previous slide each group was an input feature, including all of the labels conjoined with that feature. >>: [inaudible] want the groups to overlap? >> Andre Martins: Yes, and I am going to talk about that. That is a good point. There are three cases that we are going to consider. The first one is non-overlapping groups, then we are going to talk about tree-structured groups, and finally the general case of graph-structured groups. We start with the non-overlapping case, which is the simplest. In this case the groups are all disjoint, which means that we require each feature to belong to exactly one group. We can recover the well-known regularizers that we already know by choosing some trivial groups: for example, we get L2 regularization back if you have one large group that contains all of the features, and we can recover L1 regularization if we have D singleton groups, one group per feature. We can also have nontrivial groups and get something interesting, like label-based groups, where the groups are the columns in that matrix, or template-based groups, which is what I am going to describe next. So let me go to feature template selection. Suppose that you have a task like sequence labeling—this is actually chunking. Here we have a sentence, and we have part-of-speech tags for each word in the sentence. This is the input, and we want to predict phrase boundaries: this means the beginning of a noun phrase, which is 'we'; this is the beginning of a verb phrase, this is inside the verb phrase, and so forth. So this is a useful task in NLP. Typically people define feature templates when they want to use a linear model to address this task. Examples of feature templates are these: for example, for each position in the sentence, look at word bigrams and conjoin that with the label. In that case that would generate things like these, depending on which position you are at. So these are just word bigrams. Another example of a template would be part-of-speech trigrams, and depending on which position you are at, you get different things like this. 
And so you may want to select feature templates, and to do that we make each group correspond to a feature template: we have a group for word bigrams and a group for part-of-speech trigrams. This notation means word at position zero, word at position one—this is relative to where we are—and the same thing for the parts of speech. And we can have some other group there which is going to be zeroed out if we apply these penalties. So this is a choice of groups that allows us to do feature template selection. Any questions so far about this? >>: So presumably those templates are driven by some modeling knowledge that you have about [inaudible] groups that you have [inaudible]? >> Andre Martins: Typically at this point people select feature templates by hand. They specify a bunch of feature templates, and we want a method that allows you to specify or construct conjunctions of templates and all sorts of crazy things, and then select the ones that are going to be useful for the task automatically. Let's look now at tree-structured groups. Here we are going to allow some overlaps, but we are going to constrain the kind of overlaps that we get: if two groups overlap, we are going to require that one contains the other. In other words, we want groups to be nested, so you have this hierarchical structure. Using a diagram, what we want is something like this. So we define a new group inside the blue one, which is the green one, which contains a subset of the features that are there, and you can represent that by drawing an arrow that goes from the blue one to the green one. Then we can assume that there is a violet group that subsumes these two, and so you draw these, and so forth, and you get a tree at the end—or a forest; it could be a collection of trees. Can anyone tell me what kind of sparsity this is promoting? What is going to happen if I use group lasso with this definition of groups? Can anyone guess? >>: Well, you won't end up with singletons. You won't be able to kill the purple guy without also killing the green and the brown guy. >> Andre Martins: Exactly. Essentially the sparsity pattern that is being promoted here is that if a group is discarded, all of the descendant groups are also going to be discarded. >>: And that is okay as long as the regularizer uses the sum of the separate regularizers [inaudible]? >> Andre Martins: Yes. This was first proposed by—actually the citations aren't here; I think Jenatton and some people working with Francis Bach proposed using these hierarchical groups, and they first showed these diagrams that provide a nice graphical representation of what is going on. Okay. So let's go to the general case where we allow arbitrary overlaps. In that case we get a DAG. This DAG is essentially the Hasse diagram of the poset structure of the feature space. Essentially, by defining a partial order based on set inclusion—if one group is included in another group, we say that that group is less than or equal to the other—we endow our feature space with the structure of a poset, and we can represent that structure by drawing the Hasse diagram: if a node is a descendant of another node, that means that its group is included in the ancestor group. And so the sparsity patterns are given by the poset, just like in the tree case. 
So if we discard say the red node, this is going to throw away the violet one and the orange one and the black one, but not the green one. This could be useful if you want to do something like coarse-to-fine regularization. Here we are going to focus on the partial order that we want to define over our feature space, and then define a regularizer that behaves according to that partial order. So we are going to say that part-of-speech features are coarser than word features; that makes sense. And then, because we have features that are going to be bigrams of parts of speech or n-grams of words, we are going to need to extend this partial order to n-grams of parts of speech and n-grams of words. So essentially we are saying that a bigram of parts of speech is finer than just a single part of speech, and at each position a word is finer than a part of speech; so in this case a word bigram is finer than a part-of-speech bigram but also finer than just a word in a smaller context. So we get a poset like this, and the regularizer is going to promote selecting finer features only if the coarser ones are also selected. >>: Is it a bit of an issue if the model could be influenced by something that is two steps away and not one step away? Do you know what I am saying? >> Andre Martins: The number of steps away doesn't matter much, because, for example, a word trigram is still finer than a part-of-speech bigram. >>: [inaudible] trigram unless you—I don't know. I think you have to watch how your steps are constructed sequentially. >>: If it was the case that you had support for the finer feature but not the [inaudible] connection, and that was supported by your data, you would miss that. >>: Yeah, if the mid-level data was a mixture of interesting spaces but it was kind of a [inaudible] mixture, it might be that at the coarser level it didn't look like there was any information in it, so you could… >> Andre Martins: Yes. It is conceivable that you would want something like that, but I think that in most cases you usually want something like this. >>: You might be able to structure your tree differently then. There would be separate alternatives then. >>: So if in this part the most important feature was the trigram of the word and only [inaudible], then basically you're going to say take the other feature out because of [inaudible]. >> Andre Martins: Yes. That is true, but that is very unusual, because for most [inaudible] problems people need to include backoff features, so it is not common that you care about the trigram but you don't want to back off to a bigram or a unigram. So this sort of matches what people want. It maybe doesn't solve all problems, but I am arguing that it is something reasonable that we may want. >>: [inaudible] if you have to [inaudible] a tree. >> Andre Martins: This is with prior knowledge, and what happens if your prior knowledge is totally wrong and you should specify a different one—yeah, that makes sense. Essentially all we need to specify is a partial order for the features; it depends on what partial order you want. >>: I wonder, I mean you could consider searching over different partial orderings, but then that might result in a different kind of overfitting from your [inaudible]. >> Andre Martins: That is a good point. I guess you could use some regularizers corresponding to different partial orders, and that would still be a group lasso regularizer with overlaps. That's good. 
>>: [inaudible]. >> Andre Martins: Okay. >> Scott Yih: You only have 15 minutes left. >> Andre Martins: So I should get moving. Let's talk about algorithms, wrapping up. Recall that this is the problem that we want to solve. We are going to solve it using an online proximal gradient method. This is similar to things like stochastic gradient descent, but with a small twist: we are only taking gradients with respect to the loss function, and we are doing proximal steps with respect to the regularizer. So at each point we take a training pair and do a gradient step looking at the loss for that example, just like stochastic gradient descent, and then we do a proximal step that looks at the regularizer. So I need to say what the proximal step is. [laughter]. Okay. So this is what the proximal operator is; we can regard it as a generalization of a projection. If your regularizer is the indicator function of a convex set—in other words, if it is zero when the point belongs to the set and infinity otherwise—this is exactly the same thing as projecting onto that set. But this is more general, because if you have something that is not an indicator function, like an L1 regularizer or something else, this is something different. Essentially we want the result to be close to the point that we started from, theta, but we also want to penalize the value of the regularizer at that point. It turns out that for many regularizers it's pretty easy to compute these proximal steps. For example, for L2 regularization this becomes just a scaling operation: there is some lambda that you can compute in closed form such that the solution is just scaling theta. For L1 this is soft thresholding. In other words, if your regularizer is a regularization constant tau times the L1 norm, then the proximal step is just going to subtract: if your weight is positive at some dimension, it subtracts a fixed amount tau; if it crosses zero, it is clipped to zero; and the reverse if it's negative. So essentially it pushes everything towards zero by subtracting or adding a fixed amount, and that is what is attached to the L1 norm. What about group L1? In the non-overlapping case, the solution to this problem is what is called vector soft thresholding. This is a generalization of soft thresholding that works at the group level: for group number m, we take the L2 norm of that group and we shrink the norm—we subtract a fixed amount from that norm by scaling everything inside that group to achieve the new norm. And if by doing that we cross zero, then we just set everything inside that group to zero, so this essentially discards the entire group. In the tree-structured case, we can still compute the proximal step recursively; there is this very nice paper by Jenatton that shows how to do it. But in the general case, if you have something that is not a hierarchy, there is no efficient procedure known to compute the proximal step. >>: [inaudible] proximal step, then why go—and this proximal step thing is cool. Why not just take the gradient step… >> Andre Martins: The main reason is that that is not going to give you sparsity as you keep iterating your algorithm. Even with L1, it is going to oscillate towards zero, but because the L1 norm is not differentiable, you actually don't get to zero. And so here the proximal step… >>: [inaudible] in practice [laughter]? >> Andre Martins: No. 
Because you [inaudible] a lot. >>: Create a [inaudible], because, I mean, if you play some tricks they won't be exactly zero, but they'll be… >> Andre Martins: But as you keep doing iterations it never quite gets there, so it takes a while until it figures out that there should be a zero there. >>: Yeah, I agree with that, yeah. >> Andre Martins: So you see, if you want to have a low memory footprint… >>: [inaudible]. >> Andre Martins: Yes. And this is something important because we care about the memory footprint here. >>: [inaudible] for L1, it's not a [inaudible], so you go down [inaudible] worse coordinates, but by doing the proximal step the [inaudible] that's a [inaudible] function. >>: You don't have to worry about the space where it's [inaudible] right around it where it is actual… >> Andre Martins: But that [inaudible] all the way matters, that is [inaudible]. So if you use proximal gradient in the batch setting, we can actually get the same convergence as natural methods if the loss is at least continuous. But in the online setting you don't get that. >>: [inaudible] proximal stuff, but how do you know how far to step for the proximal step part? Is there a… >> Andre Martins: That is going to be--let me go back. So that is going to depend on the learning rate, on the [inaudible] which is [inaudible]. >>: [inaudible] schedule [inaudible] is that schedule as well. Okay. Same schedule as the gradient step [inaudible] I see. >> Andre Martins: You can also use a different [inaudible] I think, but they just… Right, so okay. In the graph-structured case we cannot compute the proximal step exactly, as we show in the [inaudible] paper, but you can still guarantee convergence even if you don't compute the proximal step exactly: as long as you can rewrite the regularizer as a sum of non-overlapping regularizers, you can apply sequential proximal steps. This is not going to be the same as the proximal step with respect to [inaudible], but the algorithm will still be convergent. So I am just going to say a couple of things about practical issues. Each gradient step is linear in the number of feature templates, because you are assuming that for each data point you have as many features firing as the number of templates. Each proximal step is linear in the number of groups, because there are some smart things we can do: we can keep track of the norm of each group and use that for the templates, so everything is independent of the dimension of the feature space, and we can actually play with very large feature spaces here. There are some tricks that I used, like keeping a budget on the number of groups that I want to keep instead of having a regularization constant, and then I just perceptronize the loss to avoid having to [inaudible] learning rate. And there is this very important final step of debiasing: I run this for a few iterations, identify the templates that are not zero, and then I just run a standard unregularized learner in a second stage that does not consider the ones that were discarded. In practice this is quite important. >>: [inaudible] suppressing, I mean why do you have to do the… >> Andre Martins: The problem is that these kinds of L1 penalties introduce a bias into the model, and if you have strong regularization, which I am doing here, it biases the model more and more, so you are better off handling it in a second stage.
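Putting the pieces together, here is a schematic sketch, again hypothetical rather than the speaker's actual code, of the online loop just described: a stochastic gradient step on the loss for one example, followed by a group-lasso proximal step (reusing the prox_group_l1 sketched earlier). The loss gradient, step-size schedule, and group structure are stand-ins for illustration; the list of surviving groups at the end is what would feed the unregularized retraining, i.e. the debiasing stage mentioned above.

```python
import numpy as np

def online_proximal_gradient(data, loss_grad, groups, taus, dim,
                             eta0=0.1, epochs=5):
    """Online proximal gradient sketch: alternate a gradient step on the loss
    with a proximal step for the group-lasso regularizer.

    data      -- iterable of (x, y) training pairs
    loss_grad -- function (theta, x, y) -> gradient of the loss at theta
    groups    -- list of index arrays, one per feature template
    taus      -- per-group regularization weights
    """
    theta = np.zeros(dim)
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = eta0 / np.sqrt(t)                  # decaying step size (one common schedule)
            theta -= eta * loss_grad(theta, x, y)    # gradient step w.r.t. the loss
            theta = prox_group_l1(theta, groups,     # proximal step w.r.t. the regularizer
                                  etas=[eta * tau for tau in taus])
    selected = [m for m, idx in enumerate(groups)    # nonzero templates, kept for the
                if np.any(theta[idx] != 0.0)]        # debiasing/retraining stage
    return theta, selected
```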
>>: Even if you reach the point of the sparsity that you care about, are these proximal steps still affecting things enough that you're not getting the best solution you could without [inaudible]? >> Andre Martins: Yes. So this is showing the memory efficiency, and this is why it is important to use the proximal step here. Each of these drops is a different proximal step being applied; it guarantees that you are discarding a lot of groups and keeping the memory footprint quite low. >>: [inaudible] so when you're doing the soft thresholding thing, once some weight gets under the tau and you knock it down to zero, is it gone forever, or can it get pushed back up? >> Andre Martins: The group is not gone forever. You could throw away the features and all of the weights, but in the next round you still need to compute the templates and all that. There are some heuristics that you could use, like discarding a group for good if you are really sure that it is never going to come back; it is dangerous, but in practice it could be useful. I have never tried it. Yes, so the first set of tasks are chunking and similar sequence labeling tasks. I am not going to spend a lot of time here. Essentially, we are achieving the same results, or actually slightly better, than L2 regularization with many fewer features and many fewer feature templates. This also has an impact on runtime, because computing the scores is usually linear in the number of feature templates, so if you have fewer feature templates this is going to speed up the model. The second set of experiments was for dependency parsing. We have a crazy number of feature templates, 684--no one uses so many--but the goal here is exactly to try to select a smaller number out of this many templates. And so we compared several things, like just using the standard lasso and just an information gain score for selecting templates, and we got something like this. So essentially, the blue line here is group lasso without overlaps. The light blue one is the coarse-to-fine regularization, and except for Spanish--this didn't work for Spanish--for the other languages it did a good job. But the disappointing thing is that the coarse-to-fine regularization did not outperform just considering the non-overlapping groups. So we were expecting that having that prior knowledge about the poset would help, but it didn't. But there are a lot of things that we could still play with here. There are some claims that you can make about [inaudible] looking at the different languages where we applied this, but you should be a little skeptical, because it may happen that some patterns we are identifying are properties of the data sets and not of the languages. Essentially, for languages that have rich morphology and small data sets, this seems to avoid using lexical features, which sort of makes sense because in this case there is a danger of overfitting, so in some sense it seems to be avoiding overfitting. >>: [inaudible] overfitting of those weights across the different data sets [inaudible] maintain your weights but actual [inaudible]. >> Andre Martins: Actually, that is a good idea. >>: Unless you really feel that there are language-dependent changes in what's important, which might be the case. >> Andre Martins: That is actually a good idea, because you cannot do that with the actual features, because they are going to be different, but with the templates you can.
So yes, to conclude, we have two levels of structure here: in the output space and in the feature space. We can promote structured sparsity by using a group lasso regularizer. The algorithm that we propose is able to explore very large feature spaces, and there is a lot of future work to do regarding this. I would like to emphasize the prior group weights, which are quite important. Here we just define the weight as the log of the number of features in the group, which you can think of as the number of bits that you need to [inaudible] a feature in the group, but it is a heuristic, right, so there is a chance that you could do better by doing something smarter here. So if anybody has ideas on how to do that I will be happy to hear about it. That is all that I have. [applause]. >>: [inaudible] results for debiasing? >> Andre Martins: For debiasing? >>: Yes. So [inaudible] did you get? >> Andre Martins: Oh, very bad results, because I am doing very strong regularization. >>: [inaudible]. >> Andre Martins: Yes, it is a huge drop, like five percentage points or something like that. But it is also because of the algorithm that I am using. If I used batch methods it would probably be better; it has to do with the technique that I am using to define the regularization constant. >> Scott Yih: Thanks.