>> Scott Yih: I am Scott Yih. It is my pleasure to welcome Andre Martins today as our
speaker. Andre is currently a PhD student in a special joint program between CMU and
the CMU Portugal program in Language Technologies, working with [inaudible], right?
Andre has done a lot of work in both machine learning and natural language processing,
and his paper entitled Concise Integer Linear Programming Formulations for Dependency
Parsing won the best paper award at ACL 2009. Today Andre is going to tell us how to
solve structured output prediction problems using his methods based on dual
decomposition.
>> Andre Martins: Okay. Thanks Scott for introducing me. This is joint work with my
four advisors, Noah Smith, Mario Figueiredo, Pedro Aguiar and Eric Xing. This talk is
divided into two parts. In the first part I am going to talk a little bit about inference
algorithms, in particular a new dual decomposition method for doing inference when
there are many overlapping components. In the second half I am going to talk about the
learning problem, and in particular how to do structured sparse modeling for structured
prediction in NLP.
Let's start with the inference part. This is based on two papers: one presented this year
at EMNLP, and another one at ICML, which was more focused on general graphical
models. The EMNLP one was particularly about parsing, and I think this talk is closer to
the EMNLP one, so it is more directed to an NLP kind of audience. So the basic idea is
that we have been dealing with statistical models for many different tasks in NLP, and we
can usually improve accuracy as we keep breaking locality assumptions and endowing
these models with richer features.
The downside of this is that in many cases we can no longer do exact inference in a
practical way and we need to use approximate decoding methods. Many different things
have been proposed, like sampling-based methods, using search, or using LP relaxations.
Here I am going to focus on LP relaxation-based methods. One method that has gained
prominence in NLP quite recently, which was actually introduced for vision problems by
Komodakis, is dual decomposition, which gives us a principled way of combining
models. It works by relaxing the original problem and then optimizing the dual with a
subgradient method; I am going to talk about this in some detail soon. It has shown very
good performance in NLP problems, in parsing, in MT and other sorts of problems. But
all of these past works have focused on combining just a few models, and here I will be
concerned with massive decompositions in which you have many overlapping pieces in
your model. So we are going to present an alternative dual decomposition algorithm,
based on the alternating direction method of multipliers, that is particularly suitable when
you have to deal with these kinds of massive decompositions.
There are many applications of this, and we are going to focus here on parsing, but we
can use this for virtually any kind of constrained conditional model with first-order logic
constraints on top of it. This is quite general and can be used in many different
applications. Just to give a basic idea of the task that we are going to address here, which
is dependency parsing, this is what a dependency tree looks like. So we have a tree like
this, and we could actually replace Edinburgh with Seattle here and it would make sense
as well. [laughter]. So this is a dependency tree for this sentence. It is a rooted tree, and
this is the root node. It spans all of the words in the sentence, and so you want to predict
a structure like this. Just to try to get some intuition about what this
means: there is a subject here, which is it; there is an arc from rain into the predicate;
there is a prepositional phrase, in the city of Edinburgh; and so these arrows are
essentially representing the grammatical functions of the words. So you want to predict
something like this, and we are going to consider models that have a bunch of scores for
different pieces of this tree. So you will have scores for the arcs, which look at two words
and consider an arc between those two words. We will have scores for consecutive
siblings, so these are essentially words that -- these words it and rain are both children of
does, they are on the same side, and they are consecutive.
We are also going to consider scores for grandparent structures like these, and also for
arbitrary siblings, not necessarily consecutive. And also for directed paths, so we have a
score that reflects how likely two words are to be in the same path. So in this case I guess
it would be a high score, because rain and Seattle, or Edinburgh, are likely to be in the
same path. We also have scores for these: two consecutive words, looking at the heads,
at the parents, of these two words. And you want to -- oh, and you also have these. I am
going to skip some details here, but essentially non-projective arcs are arcs that cross
each other, and we also have scores for pairs of arcs that cross, or for arcs that have other
arcs crossing them. So this is useful in actual NLP parsing.
So let's talk a little bit about dual decomposition. This is our setup. We have some
input X, which in this case is a sentence, and some structured output Y, which is a parse
tree, and we are going to represent Y as a binary vector that is essentially an indicator
vector of the basic parts that are active in the output. So in the parsing example,
each of these basic parts are going to be a candidate arc and if the entry is one that means
that the arc belongs to the tree and if it is zero that means that we are not using that arc.
Does that make sense?
>>: [inaudible].
>> Andre Martins: Yes, exactly. So we start with a complete graph connecting all the
words, and you will have a linear number of ones here. Suppose that you have some
model that gives a score to each parse tree, and the goal is to predict by finding the parse
tree that maximizes this score. The scores that I am going to talk about here are going to
be linear scores, just linear functions of parameters and features. A technique that is used
quite often to increase the expressive power of models like these is to combine two
models -- I am being very informal here, but suppose that I have two views of the output
Y that I am calling Z1 and Z2, and I have two different score functions for each of the
views, and I just define the score of the parse tree as the sum of these two individual
scores. But I assume that Z1 and Z2 are two overlapping views, in a sense that we are
going to see later.
In this case the prediction problem becomes that of maximizing the sum of the two
scores with respect to Z1 and Z2, which belong to their own spaces, and we have this
constraint that they have to agree on their overlaps. This notation is just an informal
notation that says that the two views have to agree with each other. This is for two
components, but you can imagine that you combine more than two, and so we use a very
simplified view of the problem. We suppose that if you have some complex object Y,
we represent it as a vector of basic parts -- arcs in the dependency tree, for example.
And we are going to consider several views of this object. These could be, for example,
the siblings or grandparents and the other score functions that I described in the
beginning. So those will induce other parts, which are going to be these other things
which are not necessarily arcs but conjunctions of arcs, for example. And there is some
overlap between all of these views, and we have the constraint that they must be
consistent on the overlaps.
So the goal is to be able to reconstruct Y by gluing together all of these views Z1 to ZS,
and I am arguing that this is a recurrent problem in NLP, in which we want to use some
local evidence and we want to glue everything together and form a global meaning that
is consistent and collects all of this local evidence. Yes?
>>: I am trying to understand the [inaudible]. So one thing you say is that they agree on
the overlap. It is different to say that there exists a Y such that if you view it from the Z1
direction you will see that, and from Z2 you will see that. So are you explicitly saying
just that the overlaps agree, or do you mean that there also exists a Y such that the
projections will be exactly Z1, Z2, Z3?
>> Andre Martins: Yes. That is the second thing that you mentioned. I am assuming
that all of the Zs will be consistent in a way that there exists some well-defined Y that
collects all of this information from the different views. In the parsing example, Z1 to
ZS could be, for example, grandparent structures, like two arcs, and of course there will
be some overlap between different structures like these. I am requiring that the arcs that
these two structures share have to be consistent, right? Let's make this more formal. We
make the score the sum of the individual scores, and this is the problem that we want to
solve, but to make all of the Zs consistent, we are going to add a witness vector that we
call U, and we are going to make sure that all of these components match this witness
vector. The witness vector is indexed by the basic parts, by the arcs of the dependency
tree, and this constraint makes sure that all of the arcs in these structures have to match
some common witness vector, and this is going to force agreement between all of the
overlaps in the different components. Does that make sense?
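To make this concrete, the consensus problem he has just set up can be written roughly as follows; the symbols f_s, z_s, u and the index sets are my paraphrase of the slides, not necessarily the papers' exact notation:

    \begin{aligned}
    \text{maximize}\quad & \sum_{s=1}^{S} f_s(\mathbf{z}_s)\\
    \text{w.r.t.}\quad & \mathbf{z}_s \in \mathcal{Y}_s \;\;(s = 1,\dots,S), \quad \mathbf{u}\\
    \text{subject to}\quad & z_s(r) = u(r) \quad \text{for every component } s \text{ and every basic part } r \text{ that } s \text{ covers,}
    \end{aligned}

where u is the witness vector indexed by the basic parts (the candidate arcs, in the parsing case).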
So this is an [inaudible] problem. This is what we want to solve. This problem can be
intractable -- actually, for non-projective dependency parsing it becomes NP-hard to
solve this with the features that I have described -- and so an approximate way of solving
the problem is to consider a linear relaxation of this problem, in which we replace each
set YS, the set corresponding to each view, by its convex hull. I am skipping some
details here, but if you think in terms of graphical models, this is exactly the same thing
as approximating the marginal polytope by the local polytope, which is an outer bound of
the set that contains all realizable marginals for the graphical model. So this is something
that is commonly done. And in this problem and many other problems the relaxation gap
is not going to be too large, and so we can still do something useful by solving this
relaxation.
Essentially we have replaced Y of X by its convex hull. And so the next thing to do is to
dualize this problem by introducing Lagrange variables, lambda sub s of r, for these
constraints, so we are going to dualize these out. Then we are going to solve the dual
using the projected subgradient method.
>>: So the relaxation that you're doing is for the arcs, which are binary variables? You
relax the zero-one…
>> Andre Martins: Yes.
>>: [inaudible] 01, you could…
>> Andre Martins: Yes, yes, exactly. So for this parsing problem that is exactly what I
am doing, essentially. Any further questions at this point? So if you skip all of the mess
that is implied by applying the projected subgradient method to this problem, we end up
with an algorithm that looks like this. We first initialize all of the Lagrange multipliers
to 0, and then you have a loop in which at each iteration we look at each component s
and make a z step, which essentially involves maximizing the score of that subproblem
plus a linear term that involves the Lagrange variables. Because this extra term is linear,
if the score function f sub s is linear this whole objective is still linear, and so this step is
as hard as computing the MAP for that particular component. Then we aggregate all of
the zs that we obtained in the z step and compute the average -- here delta of r is
essentially the number of components where the basic part r appears, so this is an
average. So you aggregate everything, we average, and then you do an update of the
Lagrange multipliers that essentially subtracts out the average; and there is a step size,
this eta t, which has to be diminishing and all of that for this to converge.
This algorithm is guaranteed to converge to the solution of the LP relaxation, and it can
happen that you find a certificate: if at some point in the algorithm all of these zs agree
with each other, then u(r) is also going to agree with everyone, and we know that we are
done. That can only happen if there is no relaxation gap, so if the solution is actually
integer. Or you can just define a maximum number of iterations and stop after that.
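A minimal sketch of the projected subgradient loop just described, assuming each component exposes a MAP oracle; the toy components, the trivial solve_map oracle and every name below are illustrative, not the actual parser code:

    import numpy as np

    # Toy setup: 4 basic parts (e.g. candidate arcs) and two overlapping components.
    components = [
        {"parts": [0, 1, 2], "scores": np.array([1.0, -0.5, 0.3])},
        {"parts": [1, 2, 3], "scores": np.array([0.8, -0.2, 0.6])},
    ]
    num_parts = 4
    delta = np.zeros(num_parts)              # number of components covering each part
    for c in components:
        delta[c["parts"]] += 1.0

    def solve_map(scores):
        # Toy MAP oracle: pick each part independently if its score is positive.
        # In the talk this would be Viterbi, a spanning-tree algorithm, etc.
        return (scores > 0).astype(float)

    lam = [np.zeros(len(c["parts"])) for c in components]   # Lagrange multipliers
    for t in range(1, 101):
        eta = 1.0 / t                        # diminishing step size
        # z step: MAP for each component with its scores shifted by the multipliers.
        z = [solve_map(c["scores"] + lam[s]) for s, c in enumerate(components)]
        # u step: average the votes of all components on each basic part.
        u = np.zeros(num_parts)
        for s, c in enumerate(components):
            u[c["parts"]] += z[s]
        u /= delta
        # lambda step: subtract out the average; check for a certificate.
        agree = True
        for s, c in enumerate(components):
            residual = z[s] - u[c["parts"]]
            lam[s] -= eta * residual
            agree = agree and np.allclose(residual, 0.0)
        if agree:                            # all components agree: we are done
            break

    print("consensus u:", u)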
>>: [inaudible].
>> Andre Martins: This is the dual decomposition using the [inaudible] method.
>>: It looks kind of like a non-augmented [inaudible]: you update your Lagrange
multipliers by the negative, the positive gradient [inaudible], so this looks very much like
[inaudible].
>> Andre Martins: I am maximizing, not minimizing, so this is actually standard
[inaudible], but this lambda step is going to be the same for ADMM, and actually you are
going to see later how to take this algorithm and change it slightly to become ADMM.
So this is what I just said. The problem here is that if you have many components, this is
going to become very slow, because I am trying to reach a consensus among many things
that are overlapping, and it is going to be slow to get that consensus. The only thing that
is pushing for a consensus here is this term where the Lagrange multipliers intervene, so
the Lagrange multipliers are going to be set to try to get agreement between all of the zs.
But in ADMM we are going to have something stronger that pushes for agreement.
Let's talk about the ADMM algorithm. The goal is to accelerate dual decomposition.
There have been several approaches that have tried to do that, one by Vladimir Jojic,
[inaudible] and some others, where they essentially smooth the objective function by
adding an entropic term, and then, because the objective becomes differentiable, they can
just use gradient methods -- in this case I think they used a [inaudible] of fast gradient
techniques -- and they managed to accelerate dual decomposition by doing that. We are
going to do something different, which is considering the augmented Lagrangian of this
problem and using the alternating direction method of multipliers.
This was proposed in the ICML paper. It is an old optimization method; it goes back to
the '70s. There seems to be a recent surge of interest -- there was a recent monograph by
Stephen Boyd and others on this method, applying it to many different problems. In the
ICML paper we describe how to apply it to general graphical models, so here the focus is
more on parsing, but this is quite general. How can we start with the subgradient
algorithm and obtain the ADMM algorithm? There are just a couple of changes that you
need to make. First of all, we need to initialize the u variables, and we are just going to
initialize them to uniform values; all of these parts are binary variables, so we are just
saying that we don't have any evidence about the parts being 1 or 0. Then we are going
to add an additional quadratic term in the z step, which is essentially keeping track of the
residual between the zs variables and the u variable, which is the average of everyone.
So essentially, in the previous round we have computed the average of these votes -- we
can regard these zs as votes -- we computed the average of the votes, and now we are
regularizing towards this average. This is the reason why this pushes for a faster
consensus: in some sense it gives some basic information about what everyone has
predicted in the previous round.
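For contrast, here is how the loop above changes under the ADMM variant being described: u starts uniform, the z step gains a quadratic penalty toward the previous average, the step size rho stays fixed, and the lambda step is unchanged. Again a toy sketch; solve_quadratic handles only this trivial independent-parts factor, not a real combinatorial one:

    import numpy as np

    components = [
        {"parts": [0, 1, 2], "scores": np.array([1.0, -0.5, 0.3])},
        {"parts": [1, 2, 3], "scores": np.array([0.8, -0.2, 0.6])},
    ]
    num_parts = 4
    delta = np.zeros(num_parts)
    for c in components:
        delta[c["parts"]] += 1.0

    def solve_quadratic(scores, u_local, rho):
        # z step: maximize scores.z - (rho/2)||z - u_local||^2 over [0,1]^k.
        # For independent parts this is just a clipped unconstrained maximizer;
        # real factors need the closed forms or QP solvers discussed later.
        return np.clip(u_local + scores / rho, 0.0, 1.0)

    rho = 1.0                                 # fixed step size, no annealing
    u = np.full(num_parts, 0.5)               # no evidence yet about any part
    lam = [np.zeros(len(c["parts"])) for c in components]
    for t in range(100):
        z = [solve_quadratic(c["scores"] + lam[s], u[c["parts"]], rho)
             for s, c in enumerate(components)]
        u = np.zeros(num_parts)                # u step: average the votes
        for s, c in enumerate(components):
            u[c["parts"]] += z[s]
        u /= delta
        residual = max(np.abs(z[s] - u[c["parts"]]).max()
                       for s, c in enumerate(components))
        for s, c in enumerate(components):     # lambda step, same as before
            lam[s] -= rho * (z[s] - u[c["parts"]])
        if residual < 1e-6:                    # residual-based stopping condition
            break

    print("relaxed solution u:", u)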
>>: [inaudible] to make sure convergence [inaudible] a non-zero [inaudible] which
ensures convergence when your other terms [inaudible]. I mean, that is what I have seen.
Is it really faster? I mean, do you have proof that it is faster?
>> Andre Martins: I think that it is an open problem to actually show the convergence
rate of ADMM. I am skipping some details here, but pure [inaudible] methods would do
something different: they would do joint optimization of u and z, and then they would
just use this multiplier update here. The problem is that you cannot do that efficiently,
because this term is coupling the two variables together, and so ADMM, instead of doing
this joint optimization, does coordinate descent: it optimizes first with respect to z and
then with respect to u. That is the basic difference between ADMM and the [inaudible].
Does that make sense?
>>: [inaudible] objective function [inaudible].
>> Andre Martins: The objective function is the same. This is just an artifact of the
method. So this quadratic term is something that…
>>: It just appears that [inaudible].
>> Andre Martins: Yes. Exactly. I forgot to mention that; that is why this is a residual.
Even in the LP relaxation, for any primal feasible solution of the relaxed problem this
thing will be zero, and so this term should vanish. So yeah. The second small difference
is that you don't need to decrease this learning rate here like in the subgradient method.
We can just keep it fixed. So in ADMM…
>>: [inaudible] problem in terms of [inaudible].
>> Andre Martins: Oh, you mean the…
>>: [inaudible] dual function [inaudible].
>> Andre Martins: Okay, I see. Dual function…
>>: [inaudible] you added the…
>> Andre Martins: I see what you mean. We can also modify our primal problem by
adding the residual term, which is a harmless term because it is going to be zero at the
optimum, and we can derive the dual of that. That is what you are saying. And we get a
smoothed dual, which is -- you get this, yeah. So this can be regarded as a smoothing
method. And there is this useful thing that you don't need to anneal the step size like in
the subgradient method.
So this is one of the reasons that makes the subgradient method slow: you need to
decrease these step sizes, which means that your updates to the dual variables are going
to be smaller and smaller, and it is going to take more time to make progress.
In addition, there are some better stopping conditions, so even if you have a relaxation
gap, we can keep track of the primal and the dual residuals -- this residual term is giving
that information -- and we can stop our algorithm if we are below some threshold. So
this gives us a way to stop the algorithm even if we don't have a certificate for the exact
problem. So any questions so far?
The only thing that is remaining here is how to solve this problem, and this is a
complication that the ADMM algorithm introduces with respect to the subgradient one.
This is the z step. We are going to assume that the score of each component is linear,
and in that case this becomes a quadratic problem because of the residual term. In the
subgradient method you didn't have this term, so the entire thing was linear, which means
that if you have a component that is a combinatorial piece of your problem, you can use
combinatorial algorithms to solve it -- for example, if you have a sequence model as a
component, you can use Viterbi to determine the most likely path in that sequence, or in
the dependency parsing case you can use [inaudible] algorithms for determining spanning
trees. It depends on what your problem is, but all of these nice combinatorial algorithms
cannot be used if you have this quadratic term. So it becomes harder to solve the z step if
you have a complex subproblem.
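In symbols, the z step he is referring to is roughly the following quadratic problem (my notation; rho is the fixed ADMM parameter, lambda_s the multipliers, and u^t is restricted to the parts that component s covers):

    \mathbf{z}_s^{t+1} \;=\; \arg\max_{\mathbf{z}_s \in \mathrm{conv}(\mathcal{Y}_s)}
    \;(\mathbf{f}_s + \boldsymbol{\lambda}_s)^{\top}\mathbf{z}_s
    \;-\; \frac{\rho}{2}\,\bigl\|\mathbf{z}_s - \mathbf{u}^{t}\bigr\|^{2}.

The subgradient version is the same problem without the quadratic term, which is why MAP algorithms apply there but not directly here.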
>>: [inaudible].
>> Andre Martins: It is a QP, so you can use any QP solver, but the point is that if your
component is very large, some of these combinatorial algorithms are very efficient at
solving it exactly, and it is harder to solve a QP, so if you are really concerned about
speed and you need to solve this several times inside ADMM, this could be a drawback.
>>: [inaudible].
>> Andre Martins: Yes, exactly, and that is something that we do; there are all sorts of
tricks you can play, and that is all very helpful. Also, you don't need to solve the QP
exactly. There is a result by Eckstein and [inaudible] in the '90s: they show that ADMM
converges if you solve the QP only approximately, as long as something like the sum of
the residuals is summable. I don't remember the details, but as long as over the iterations
you solve the QPs more and more exactly, then you should be fine.
But there are some cases that we can address, and we are going to talk about those; in
particular, if you have a component that expresses a first-order logical constraint, we can
still solve the QP exactly and efficiently. This is quite useful in NLP models, because
often you want to inject some constraints into the model with some prior knowledge that
you have about the problem at hand.
Here is a simple example. Suppose that the component is composed of just three
variables: two variables Z1 and Z2 that we had in the model, and a third variable that is
the conjunction of these two. This can be useful if you want to make a model richer by
including a potential for the simultaneous inclusion of two parts, and this happens in
parsing in these cases: Z12 is the conjunction of these two arcs, we have the same thing
for the siblings, and you have bigram features. In that case the convex hull of the set
corresponding to this component is going to be [inaudible], just like this: essentially it is
saying that if Z1 and Z2 are both 1, then Z12 has to be 1, so that would be a vertex here;
otherwise you have these three vertices, and after taking the convex hull it becomes this.
Solving the problem in this case has a closed-form solution, and it is quite efficient to
compute. And so with these you can essentially tackle any pairwise factor in a binary
graphical model.
The next example is uniqueness quantification. Suppose you have a group of n binary
variables and you require that exactly one of those variables is one and all the others are
zero, so this is like a selector over the variables. In that case, this is what the output set
for this sort of problem looks like: it is any binary vector such that the sum of the entries
is one. If you take the convex hull of that, it is just the probability simplex, and the z
step is just projecting onto the simplex. This problem can be solved efficiently and
exactly by sorting all of the entries in the vector and applying a soft thresholding
operation, so it can be done in n log n time, and in practice it is faster if we cache the
previous solution, or if you warm start, like if you store the last sort that we did for this
problem. So this is a nice case. And it turns out that we also have a similar result for
some of the other logical constraints.
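Since this projection shows up repeatedly, here is the standard sort-and-threshold routine for projecting onto the probability simplex; this is a generic implementation of the well-known O(n log n) algorithm, not code from the parser:

    import numpy as np

    def project_onto_simplex(v):
        # Euclidean projection of v onto {z : z >= 0, sum(z) = 1}.
        v = np.asarray(v, dtype=float)
        u = np.sort(v)[::-1]                  # sort in decreasing order
        cssv = np.cumsum(u) - 1.0             # cumulative sums minus the target sum
        k = np.arange(1, len(v) + 1)
        rho = k[u - cssv / k > 0][-1]         # largest prefix kept positive
        tau = cssv[rho - 1] / rho             # soft-threshold amount
        return np.maximum(v - tau, 0.0)       # shift and clip at zero

    print(project_onto_simplex([0.9, 0.4, -0.3]))   # -> [0.75, 0.25, 0.]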
So in this case we take a group of n variables and require that at least one of the
variables is one, but there can be more than one variable with value one. So in that case
essentially we require that the disjunction of the variables is one, and the output set is
just the set of binary vectors with the origin removed, because the origin is the only point
that does not satisfy this constraint. Projecting onto the convex hull of this thing can also
be done efficiently by sorting; the ICML paper describes this in more detail, but it is also
very efficient. There are other examples involving first-order logical constraints. For
example, we can introduce a new variable into the model which is the disjunction of
existing variables. This is something useful, again, to make our model richer by adding a
specific potential for this new variable. In that case you get a nice [inaudible] as well,
and you can also solve this with a [inaudible]. In this case it is a little more tricky, but
you can do it.
This can be extended if we negate inputs, and we can essentially deal with any first-order
logic constraint by doing this, so everything is solvable with sort operations. I am just
going to show some experiments here. The experiments are in non-projective
dependency parsing; I also did some experiments on synthetic graphical models in the
ICML paper, but I am going to skip those. Here I just used data sets from the CoNLL
shared tasks, which are common data sets for different languages with annotations for
these dependency trees, and the goal is to train the parser and then evaluate it on a test
set. To train, we just use the MIRA algorithm, which is an online algorithm -- you can
think of it as an online version of the structured SVM algorithm. I am going to skip the
details on that.
Finally, because this is an approximate method, if you really want an exact solution we
need to do some rounding to obtain a valid tree, and it turns out that you can come up
with a very simple rounding strategy that just looks at the arc entries in the vector, takes
those values as scores, and uses a combinatorial algorithm to obtain the closest valid tree
to that vector. We did that, and actually we usually obtain exact solutions: the percentage
of fractional solutions that we get is quite small. I think this deserves further study, but I
think the reason is that we actually train with the LP relaxation as well, and this pushes
the model to favor exact solutions and avoid hitting fractional vertices.
These are the results that we have obtained. We actually tried two kinds of models with
different feature sets. This is a model that just uses arc features and second-order
features for consecutive siblings and grandparents, and these are the results with the full
set of features, so there is some improvement, and we are beating that baseline. And
these are the best published results; this compares a bunch of different methods that have
been used for parsing. We don't win for all languages, but we get state of the art for
eight of these 14, and we are pretty close for the others. So these are the ones where we
are winning.
So this is just giving a flavor of how much faster the ADMM algorithm is compared
with the subgradient method. This is plotting, as a function of the number of iterations,
the accuracy that we get on the evaluation sets -- there are different runs here, one for
each instance in the test set, and we are keeping track of the accuracy that we get if we
set a maximum number of iterations for the subgradient algorithm and for the ADMM
algorithm. You can see that ADMM usually gives a more accurate response sooner,
earlier, than the subgradient method. Here we are using a considerable number of
components, but still not that many -- we will increase this further in a moment; this is a
second-order model. And this is another plot showing the percentage of certificates that
we get with the subgradient method and with the ADMM algorithm, so we also get more
certificates, which means that we can actually stop our algorithm sooner. And this is
showing that if we define a threshold on the primal and dual residuals, this is the
percentage of instances where we can safely stop.
This is for the full model, and here the results are much more impressive. Here we have
many more components -- on average thousands of components, and in the worst case
hundreds of thousands -- so there are many overlapping pieces. This is the result that we
get with the subgradient method and with the ADMM algorithm; eventually this red line
will catch the blue line, but it will just take too many iterations to get there. So again
ADMM does a good job: after say 200 or 300 iterations it is already at the best
performance that you see.
>>: [inaudible] updates in the two cases based on [inaudible].
>> Andre Martins: That is a good point. It is almost, though not exactly, the same,
because the subgradient method, instead of n log n, is linear for those components, so
there is like a log n difference here, but that sort of gets washed away if we cache the
problems, and I am going to show that in the next slide. So you can consider this
approximately the same.
>>: [inaudible] sweeps for all of the components?
>> Andre Martins: Yes, so I am also going to--it turns out that it doesn't need to revisit
some components. I am going to talk about that next. But yes, it's [inaudible].
>>: So [inaudible] a sense of how long--this is like one parse, right?
>> Andre Martins: Yes. This is very fast. I think the average parsing time is less than
one second -- it is less than half a second -- so it is fast on average.
>>: I am wondering that [inaudible] upgrading the [inaudible] linear [inaudible], then
probably you can have a bigger component. So basically, when we do this kind of
comparison we use [inaudible], and then you are forced to use slower components, and I
am not sure that this is [inaudible]. It is okay, but I am not sure that it is…
>> Andre Martins: That is a good point. I kept some slides about this, but if you go to a
previous one, this is a fair comparison for the second-order models, because there is one
constraint in the parse tree, which is a tree constraint, where you want to enforce the
whole thing to be a well-defined tree, and so there is one large component that the
subgradient method can use to enforce that constraint. Here we are using that: the
subgradient method is using one large component for that tree, and the ADMM method,
because it cannot do anything useful with it due to the quadratic term, is splitting that
tree constraint into many small pieces, and that is why it has more components on
average than the subgradient method. So this comparison is fair.
Then the next one is also fair, yes. So here we did the same thing -- sorry, not here but
here. Here we tried, for the subgradient method, using a tree, but because some of our
features require examining some things that are needed anyway, if we split that tree
constraint into smaller pieces it didn't give any advantage to use the tree, so it was
actually better to split it also in the subgradient method, because we need to reuse some
of the -- it is kind of hard to explain; maybe we can talk offline about that. So this is a
fair comparison. Now, there are weaker models, which do not include features as
complex as these, for which you could come up with coarse decompositions for the
subgradient method. Those are, for example, what Michael Collins, Terry Koo and
Sasha Rush used, and in that case the subgradient method is sufficient. So there are
some scenarios where the subgradient method is more suitable than ADMM. ADMM is
better if you don't want to worry about how to find a good decomposition, or if no such
good decomposition exists. So I guess I had that in some slides, but I…
>>: You said that for the ADMM you had to do the breakup into the small [inaudible]
because of the quadratic term, but you could use that same breakdown for the
subgradient method as well if you wanted to, right? And then couldn't you reuse the
cached results…
>> Andre Martins: Yes, but it is much, much slower. I didn't show that here, but it was
slow. Actually, here I did that, because as I said, here you aren't gaining anything by
using the tree constraint, so in this full model -- not the second-order model, but the full
model -- this is actually using all of the small pieces and caching and all that. So again,
we can stop early if we fall below some threshold, because of the primal and dual
residual information. And this is the impact of caching: it turns out that caching can give
a substantial speedup. This is essentially showing the percentage of subproblems that we
need to solve. You can see that after say 300 iterations, we only need to touch like 20%
of the factors, actually less than that.
>>: [inaudible] caching [inaudible]?
>> Andre Martins: It means that as we keep iterating with the ADMM algorithm, many
of these subproblems become exactly the same as in the previous round, and so we don't
need to solve them again; we can just cache the solution. But we need to do some sort of
smart caching, because we don't want to examine the inputs of those problems -- that is
actually a linear lookup as well -- but you can do that, and it saves a lot of time. So here
are the running times. This is like 0.34 seconds per sentence.
>>: [inaudible].
>> Andre Martins: I don't really know, but it is probably something like 20 words per
sentence, and it varies a lot. The average length doesn't give you much information,
because the algorithms are not linear in the sentence length, so a long sentence is going
to dominate. But these are sentences that appear in newspaper copy; we did not filter out
any sentences. So this concludes the first part of the talk. We have presented a new
variant of dual decomposition which is faster at reaching a consensus. It is suitable for
problems with many overlapping components. There are some advantages and
disadvantages with respect to subgradient methods. Unlike the subgradient method, this
doesn't allow us, at least in an obvious way, to use combinatorial machinery to solve
these subproblems, because you have this quadratic term -- although there are some
ideas for doing something smart here to reuse that machinery, which is something we
can discuss afterwards. On the other hand, in many cases -- for example, any time we
have first-order logical constraints -- we can actually compute the z step in closed form
in an efficient way, so this is a good thing to use if you have models like constrained
conditional models, models in which you have to inject these kinds of constraints.
And there is a lot of future work. This ADMM algorithm is quite general and could be
used for many different problems. There are factors that I didn't talk about here:
suppose you had something like a budget constraint; those types of things appear in
summarization. For example, you want to summarize documents and you have a budget
of, say, 100 words that you can use; we can actually have a subproblem that restricts the
number of words to a particular number, and we can still solve the quadratic problem in
an efficient way. There are also ways of tightening this towards exact decoding, and I
think this would be very nice if it works -- we still need to work out the theoretical
details. Hybridizing subgradient and ADMM would give the benefit of allowing us to
reuse the combinatorial machinery. The idea here is to let some components not have
the quadratic penalty and just use it for the other components: if you have something
like a sequence model or a tree model, we don't use the quadratic penalty there, so you
can use combinatorial algorithms to solve those subproblems, and if you have first-order
logic constraints we put the penalty there, and that is going to [inaudible] consensus. It
is an open question whether this hybridization will work. I believe it would, and in
practice it seems to be something decent, so it would offer a good way of solving
real-world problems. Okay. So are there any questions regarding this first part?
>>: So you talked about some of the situations in which you could use a closed form in
the z step. So in the process of running through your data sets, if you had cases where
you could use this, would you just take those and treat them outside the optimization and
say, I have the same [inaudible] for these guys, and then let everything else go through
whichever method [inaudible] -- so you treated those guys separately?
>> Andre Martins: The model that I am using is such that any feature in that model can
be framed as a first-order logic constraint like these, and so I could always solve the z
step efficiently in closed form using these techniques, basically sorting.
>>: [inaudible] not all trees that you had would break down into, not all of the
components would turn out to be closed forms [inaudible].
>> Andre Martins: I skipped some details, but it turns out that if you take this
dependency parsing task, in which you have a component that constrains all arcs
together to define a tree, you can enforce that constraint with a set of formulas in
first-order logic. We need to do some lifting -- I skip the details because it is very
complicated -- but you can use a multi-commodity flow formulation for that problem,
which essentially imposes that the arcs that we get define a connected graph, and that is
going to imply that we have a tree; to imply connectedness, you can use path variables
and flow variables. It is a little complicated, but you can do everything, and that is what
we did. So that is what I meant by splitting the large piece into smaller pieces: it is
essentially splitting that tree constraint into a set of first-order logic constraints.
>>: So you could in general split all of the constraints [inaudible] into this [inaudible].
>> Andre Martins: Yes. So this is quite general. Suppose that you have a pairwise
graphical model where the variables are not binary but can take multiple labels -- if you
have a [inaudible] model or something like that. You can also transform that graph into
a binary graph that uses some of these XOR factors that I showed there, and then you
can apply this. So it is usually easy to take a big component and split it into small
components, but it might not be the best way of solving the problem, because it might
turn out that, if you don't have many overlaps, subgradient is already a good approach.
It really depends on your model. And so it is kind of an open problem; I think that there
is some hope that you can still tackle these large factors with the quadratic penalty there.
For example, there are ways of solving the QP that involve solving a sequence of LPs,
so by using that, you could solve our QPs by repeatedly calling these combinatorial
algorithms. Okay. Any further questions before I proceed to the second part?
So now I am going to talk about structured sparsity in structured prediction. This is
based on another EMNLP paper, and it is vaguely related to the paper that we presented
at AISTATS this year with kernels, which is not the setting that we are going to consider
here. The basic idea is that in many cases we care about sparsity in an NLP model, for
several reasons. We may want a compact model; or we care about runtime, and it turns
out that sparsity can also affect runtime, as you are going to see in more detail later; or
we cannot afford a large memory footprint; or we actually want to interpret the model
and try to understand what is relevant in it. There is a lot of previous work that
addresses sparsity, but it usually just focuses on penalizing cardinality through an L1
regularization term and ignores the structure of the feature space. So here the idea is to
take the structure of the feature space into account, and that is what this talk is about.
So our setup is again the same setup that we had in the previous part. You have an
input set X; for each X we have a set of candidate outputs Y of X. X could be sentences
and Y of X could be parse trees or whatever. We are assuming that this is a structured
set, so this is again structured prediction. And we are going to be concerned with linear
models where the scores are a linear function of the parameter vector theta and a joint
input-output feature map phi. To learn the parameters theta, we minimize a regularized
empirical risk functional: something that looks like this, where you have a regularizer
Omega of theta and an empirical loss term, for whatever loss we prefer here. In this talk
we are going to focus on the regularizer. We think that people have already played a lot
with different losses, and there are still a lot of things to do regarding regularization.
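Written out, the learning problem on the slide has the form below (my transcription; a regularization constant can be folded into Omega):

    \hat{\boldsymbol{\theta}} \;=\; \arg\min_{\boldsymbol{\theta}}\;
    \Omega(\boldsymbol{\theta}) \;+\; \frac{1}{N}\sum_{i=1}^{N} L(\boldsymbol{\theta};\, x_i, y_i),
    \qquad \text{score}_{\boldsymbol{\theta}}(x, y) = \boldsymbol{\theta}^{\top}\boldsymbol{\phi}(x, y),

where L is, for example, a hinge or logistic loss and Omega is the regularizer that the rest of the talk focuses on.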
>>: So can you say that in training you want to find the other [inaudible] but [inaudible]
to generalize, it is actually not enough that this term is small; we want it to be smaller,
the smallest one over any other alternative Y, right? Because you have this input-output
encoding?
>> Andre Martins: So this entire thing is going to be the score of output Y -- this entire
thing, this product here -- and I am just picking the output that maximizes this score.
>>: Yes, but in training it is not enough just to say that you want to minimize it for the
[inaudible]; this should be small, but you actually also need this value for the alternative
Ys to be large.
>> Andre Martins: That is inside the definition of the loss function. This loss function
could be a hinge loss, which will have a max inside of it saying something like that, or it
could be a logistic loss, where we model the conditional probability of Y given X. Does
that make sense? So you could define this L as like the margin, where the margin uses…
>>: [inaudible] all of the alternatives.
>> Andre Martins: Yes. So this is wrapped inside the loss function. Let's see what
people have done in regularization. The most trivial choice is using L2 regularization --
well, the second most trivial; the most trivial is not using any regularization at all. This
has been done for quite a long time and it works really well: even if some features are
irrelevant, sometimes some combination of the features is relevant and you can capture
that phenomenon very well. But it doesn't give you a sparse solution, so if you care
about sparsity this is not solving the problem. Another option is to use L1 regularization,
which in regression is called the lasso. This encourages sparsity, and we can then look
at the components of the weight vector that were zeroed out and discard those features,
so this essentially allows us to do feature selection.
There are variants, for example elastic nets, that combine L1 with L2; people have
played with those things as well in NLP. But all of these options treat each dimension
of the feature space equally, so they all ignore the structure of the feature space. So here
we want to promote structural patterns rather than just penalizing cardinality.
So using a simple example: suppose that we are just doing multi-class classification, no
structured prediction here. We have an input feature vector that we call psi of X, we
have an indicator vector for the label Y, and we can construct features that are
conjunctions of these two things, an input feature conjoined with a label. This is
something that people do commonly. So you can represent the weight vector as a
matrix, which is labels times input features. If you use L2 you will get something which
is dense, like this; these colors are essentially showing the magnitude of the weight for
each feature.
If you use L1 you get something which is sparse -- the white squares mean zeros -- but
it is random sparsity; it doesn't have any pattern. We may care about something like
this, which is group sparse: for example, we may want to discard entire input features if
they are not relevant for any label. So we are going to focus now on group sparsity.
Essentially we allow density inside each group, but we want sparsity with respect to the
groups that are selected. In this case each group is going to be a column of the matrix.
So how can we choose the groups? Here we have the opportunity to use our prior
knowledge about what kind of sparsity patterns we want in our model. Here is a general
formulation for that. Suppose we have D features and we group them into M groups,
G1 to GM, where each group is a subset of the features. We can then form parameter
sub-vectors theta 1 to theta M, where each of these is the sub-vector corresponding to
its own group.
People have proposed using this group lasso regularizer, which essentially penalizes the
sum of the L2 norms of the sub-vectors: for each group m, we take the L2 norm of its
sub-vector -- not the squared norm but the actual L2 norm -- and then we take the sum of
those. So you can regard this as the L1 norm of the L2 norms. It turns out that this is
also a norm, which people call the mixed L1-L2 norm. If you use this regularizer,
because this is the L1 norm of something, it is going to promote sparsity at the group
level: it is going to try to shrink some of these norms to zero, which will discard all of
the features inside that group. People have used this in statistics under different names,
like composite absolute penalties, and they have played with different norms, not just L1
and L2; for example, the L-infinity norm is also used here. Yes?
>>: [inaudible] larger than one another [inaudible] are the…
>> Andre Martins: Yes. I'm going to get to that -- it matters a lot, and I would say that
this is one of the main problems that need to be solved. So this leads me to this.
[laughter]. In general, we want to penalize some groups more than others, taking the
number of features in the group into account.
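Concretely, the regularizer under discussion is the weighted mixed L1/L2 norm, with the per-group weights d_m playing exactly the role he just mentioned (notation mine):

    \Omega(\boldsymbol{\theta}) \;=\; \sum_{m=1}^{M} d_m\,\bigl\|\boldsymbol{\theta}_{G_m}\bigr\|_{2},

where theta_{G_m} is the sub-vector of weights in group G_m; singleton groups recover L1 regularization, and one all-encompassing group recovers (unsquared) L2.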
>>: [inaudible] specify some prior knowledge…
>> Andre Martins: Yes. So in this talk I am assuming that these are given, but it would
be extremely interesting if someone came up with a way of setting them up
automatically.
>>: Is a group a class label?
>> Andre Martins: Group -- no. It could be in the other example, but here I am going to
consider it in general, so groups can be different things, and I am going to give several
examples of groups. In the previous slide each group was an input feature, including all
of the labels conjoined with that feature.
>>: [inaudible] want the groups to overlap?
>> Andre Martins: Yes, and I am going to talk about that. That is a good point. There
are three cases that we are going to consider. The first one is non-overlapping groups,
then we are going to talk about tree-structured groups, and finally the general case of
graph-structured groups. We are going to start with the non-overlapping case, which is
the simplest. In this case the groups are all disjoint, which means that we require each
feature to belong to exactly one group. We can recover the well-known regularizers that
we already know by choosing some trivial groups: we get L2 regularization back if we
have one large group that contains all of the features, and we recover L1 regularization
if we have D singleton groups, one group per feature. We can also have nontrivial
groups and get something interesting, like label-based groups, where the groups are the
columns of that matrix, or template-based groups, which is what I am going to describe
next.
So let me go to feature template selection. Suppose that you have a task like sequence
labeling -- this is actually chunking. Here we have a sentence, and we have a
part-of-speech tag for each word in the sentence; this is the input, and we want to predict
phrase boundaries: this means the beginning of a noun phrase, which is we; this is the
beginning of a verb phrase; and so forth. So this is a useful task in NLP. Typically
people define feature templates when they want to use a linear model to address this
task. Examples of feature templates are these: for example, for each position in the
sentence, look at word bigrams and conjoin them with the label. In that case the
template would generate things like these, depending on which position you are at. So
these are just word bigrams. Another example of a template would be part-of-speech
trigrams, and depending on which position you are at, you get different things like this.
So you may want to select feature templates, and to do that we make each group
correspond to a feature template, so we have a group for word bigrams and a group for
part-of-speech trigrams. This notation means word at position zero, word at position
one -- this is relative to where we are -- and the same thing for the parts of speech. And
we can have some other group there which is going to be zeroed out if we apply these
penalties. So this is a choice of groups that allows us to do feature template selection.
Any questions so far about this?
>>: So presumably those templates are driven by some modeling knowledge that you
have about [inaudible] groups that you have [inaudible]?
>> Andre Martins: Typically, at this point, people select feature templates by hand: they
specify a bunch of feature templates. We want a method that allows you to specify or
construct conjunctions of templates and all sorts of crazy things, and that selects the
ones that are going to be useful for the task automatically.
Let's look now at tree-structured groups. Here we are going to allow some overlaps,
but we are going to constrain the kind of overlaps that we get: if two groups overlap, we
require that one contains the other. In other words, we want groups to be nested, so you
have this hierarchical structure. Using a diagram to illustrate, you want something like
this. We define a new group inside the blue one, which is the green one, which contains
a subset of the features that are there, and you can represent that by drawing an arrow
that goes from the blue one to the green one. Then we can assume that there is a violet
group that subsumes these two, and so you draw these, and so forth, and you get a tree
at the end, or a forest -- it could be a disjoint set of trees. Can anyone tell me what kind
of sparsity this is promoting? What is going to happen if I use the group lasso with this
definition of groups? Can anyone guess?
>>: Well you won't end up with singletons. You won't be able to kill the purple guy
without also killing the green and the brown guy.
>> Andre Martins: Exactly. Essentially the sparsity pattern that is being promoted here
is that if a group is discarded, all of the descendant groups are also going to be
discarded.
>>: And that is okay as long as the regularizer uses the sum of the separate regularizers
[inaudible]?
>> Andre Martins: Yes. This was first proposed by -- the citations aren't here -- I think
Jenatton and some people working with Francis Bach, who proposed using these sorts
of hierarchical groups, and they first showed these diagrams that provide a nice
graphical representation of what is going on.
Okay. So let's go to the general case where we allow arbitrary overlaps. In that case
we get a DAG. This DAG is essentially the Hasse diagram of the poset structure of the
feature space: by defining a partial order based on set inclusion -- if one group is
included in another group, we say that that group is less than or equal to the other -- we
endow our feature space with the structure of a poset, and we can represent that
structure by drawing the Hasse diagram. Essentially, if a node is a descendant of
another node, that means that its group is included in the ancestor's group. And the
sparsity patterns are given by the poset, just like in the tree case. So if we discard, say,
the red node, this is going to throw away the violet one and the orange one and the
black one, but not the green one.
This could be useful if you want to do something like coarse-to-fine regularization.
Here we are going to focus on the partial order that we want to define on our feature
space, and then define a regularizer that behaves according to that partial order. So we
are going to say that part-of-speech features are coarser than word features; that makes
sense. Then, because we have features that are going to be bigrams of parts of speech
or n-grams of words, we are going to extend this partial order to n-grams of parts of
speech and n-grams of words. Essentially we are saying that a bigram of parts of
speech is finer than a single part of speech, and that a word at a position is finer than a
part of speech, so a word bigram is finer than a part-of-speech bigram but also finer
than just a word in a smaller context. So we get a poset like this, and the regularizer is
going to promote selecting finer features only if the coarser ones are also selected.
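As an illustration of the kind of partial order he means, the template poset could be encoded as an edge list from coarser to finer templates; the template names here are hypothetical and only meant to make the picture concrete:

    # Each template is (source, n-gram order): 'p' = part of speech, 'w' = word.
    # An edge (coarse, fine) says: keep the fine template only if the coarse one is kept.
    coarser_than = [
        (("p", 1), ("p", 2)),   # POS unigram  ->  POS bigram
        (("p", 1), ("w", 1)),   # POS unigram  ->  word unigram
        (("p", 2), ("w", 2)),   # POS bigram   ->  word bigram
        (("w", 1), ("w", 2)),   # word unigram ->  word bigram
    ]

    def ancestors(template):
        """All coarser templates that must stay active for `template` to be kept."""
        result, frontier = set(), [template]
        while frontier:
            t = frontier.pop()
            for coarse, fine in coarser_than:
                if fine == t and coarse not in result:
                    result.add(coarse)
                    frontier.append(coarse)
        return result

    print(ancestors(("w", 2)))   # -> {('p', 1), ('p', 2), ('w', 1)}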
>>: A bit of an issue if the model could be influenced by something that is two steps
away and not one step away? Do you know what I am saying?
>> Andre Martins: The number of steps away doesn't matter much because--for
example, a word trigram is still finer than a part of speech bigram.
>>: [inaudible] trigram unless you--I don't know. I think you have to watch how your
steps are constructed sequentially.
>>: If it was the case that you didn't have the finer feature connection but not the
[inaudible] connection and that was supported by your data you would miss that.
>>: Yeah, if the mid-level data was a mixture of interesting spaces, but it was kind of a
[inaudible] mixture, it might be that at the coarser level it didn't look like there was any
information in it, so you could…
>> Andre Martins: Yes. It is conceivable that you would want something like that, but I
think that in most cases you usually want something like this.
>>: You might be able to structure your tree differently then. There would be separate
alternatives then.
>>: So if in this part the most important feature was the trigram of the word and only
[inaudible], then basically you're going to say take the other feature out because of
[inaudible].
>> Andre Martins: Yes. That is true, but that is very unusual, because in most
[inaudible] problems people need to include backoff features, so it is not common that
you care about the trigram but don't want to back off to a bigram or a unigram. So this
sort of matches what people want. It maybe doesn't solve all problems, but I am arguing
that it is something reasonable that we may want.
>>: [inaudible] if you have to [inaudible] a tree.
>> Andre Martins: This is with prior knowledge -- and what happens if your prior
knowledge is totally wrong, can you specify a different one -- yeah, that makes sense.
Essentially all that we need to specify is a partial order on your features; then it depends
on what partial order you want.
>>: I wonder, I mean, you could consider searching over different partial orderings, but
then that might result in a different kind of overfitting from your [inaudible].
>> Andre Martins: That is a good point. I guess you could use some regularizer
corresponding to different partial orders, and that would still be a group lasso regularizer
with overlaps. That's good.
>>: [inaudible].
>> Andre Martins: Okay.
>> Scott Yih: You only have 15 minutes left.
>> Andre Martins: So I should get moving. Let's talk about algorithms, wrapping up.
Recall that this is the problem that we want to solve. We are going to solve it using an
online proximal gradient method. This is similar to things like [inaudible] gradient
descent but with a small twist: we only take gradients with respect to the loss function,
and we do proximal steps with respect to the regularizer. So at each point we take a
training pair and do a gradient step looking at the loss for that example, just like
[inaudible] gradient descent, and then we do a proximal step that looks at the regularizer.
So I need to say what the proximal step is. [laughter]. Okay. This is what the proximal
operator is; we can regard it as a generalization of [inaudible] projection. If your
regularizer is the indicator function of a convex set -- in other words, if it is zero when
the point belongs to the set and infinity otherwise -- this is exactly the same thing as
projecting onto that set. But this is more general, because if you have something that is
not an indicator function, like an L1 regularizer or something else, this is something
different. Essentially we want the new point to be close to the point that we started
from, theta, but we also want to penalize the value of the regularizer at that point.
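In formulas, the proximal operator being defined is the following (my notation; a step-size factor multiplying Omega is absorbed into Omega here):

    \mathrm{prox}_{\Omega}(\boldsymbol{\theta}) \;=\; \arg\min_{\boldsymbol{\xi}}\;
    \tfrac{1}{2}\,\bigl\|\boldsymbol{\xi} - \boldsymbol{\theta}\bigr\|^{2} + \Omega(\boldsymbol{\xi}),

and when Omega is the indicator function of a convex set this reduces to Euclidean projection onto that set, as he says.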
And so it turns out that for many regularizers it is pretty easy to compute these proximal steps. For example, for L2 regularization this becomes just a scaling operation: there is some lambda that you can compute in closed form such that the solution is just a scaling of theta. For L1, this is soft thresholding. In other words, if your regularizer is a regularization constant tau times the L1 norm, then the proximal step is just going to subtract: if your weight is positive at some dimension, it subtracts a fixed amount tau; if it crosses zero, it is clipped to zero; and the reverse if it is negative. So essentially it is pushing everything towards zero by subtracting or adding a fixed amount, which is essentially what is attached to the L1 norm.
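As a sketch of those two closed-form cases, assuming the squared L2 penalty (tau/2)*||x||^2 and the L1 penalty tau*||x||_1, with any step size already folded into tau:

```python
import numpy as np

def prox_l2(theta, tau):
    # Proximal step for (tau/2) * ||x||^2: a closed-form scaling of theta.
    return theta / (1.0 + tau)

def prox_l1(theta, tau):
    # Soft thresholding: move every coordinate towards zero by tau,
    # clipping at zero (this is what produces exact zeros).
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)
```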
So what about group L1? In the non-overlapping case, the solution to this problem is what is called vector soft thresholding. This is a generalization of soft thresholding that works at the group level: for group number m we take the L2 norm of that group and shrink that norm, subtracting a fixed amount from it, by scaling everything inside the group to achieve the new norm. And if by doing that we cross zero, then we just set everything inside that group to zero, so this essentially discards the entire group. In the tree-structured case, we can still compute the proximal step recursively; there is a very nice paper by Jenatton that shows how to do it. But in the general case, if you have something that is not a hierarchy, there is no efficient procedure known to compute the proximal step.
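Here is a sketch of the vector soft thresholding step for the non-overlapping case; `groups` (a list of index arrays) and the per-group thresholds `taus` are names I am assuming, and the hierarchical and general overlapping cases discussed above are not handled here.

```python
import numpy as np

def prox_group_l1(theta, groups, taus):
    # Vector soft thresholding: shrink each group's L2 norm by a fixed amount,
    # rescaling the weights inside the group; if the norm crosses zero,
    # the whole group is zeroed out (discarded).
    out = theta.copy()
    for idx, tau_m in zip(groups, taus):
        norm = np.linalg.norm(theta[idx])
        if norm <= tau_m:
            out[idx] = 0.0
        else:
            out[idx] = theta[idx] * (1.0 - tau_m / norm)
    return out
```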
>>: [inaudible] proximal step, then why go--and this proximal step thing is cool. Why
not just take the gradient step…
>> Andre Martins: The main reason is that that is not going to give you sparsity as you keep iterating your algorithm. Even with L1, it is going to oscillate towards zero, but because the L1 norm is not differentiable, you actually don't get to zero. And so here the proximal step…
>>: [inaudible] in practice [laughter]?
>> Andre Martins: No. Because you [inaudible] a lot.
>>: Create a [inaudible], because you, I mean if you, if you play some tricks, I mean
they won't be exactly, but they'll be…
>> Andre Martins: But as you keep doing iterations you never--so it takes a while until it figures out that this should be a zero there.
>>: Yeah, I agree with that yeah.
>> Andre Martins: So you see if you want to have a low memory footprint…
>>: [inaudible].
>> Andre Martins: Yes. And so this is something important because we care about the
memory footprint here.
>>: [inaudible] for L1, it's not a [inaudible], so you go down [inaudible] worse
coordinates, but by doing the proximal step the [inaudible] that's a [inaudible] function.
>>: You don't have to worry about the space where it's [inaudible] right around it where
it is actual…
>> Andre Martins: But that [inaudible] all the way matters, that is [inaudible]. So if you use proximal gradient in the batch setting, we can actually get the same convergence as natural methods if the loss is at least continuous. But in the online setting you don't get that.
>>: [inaudible] proximal stuff it's like, but how do you know how far to step for the
proximal step part? Is there a…
>> Andre Martins: That is going to be--let me go back. So that is going to depend on the learning rate on the [inaudible], which is [inaudible].
>>: [inaudible] schedule [inaudible] is that schedule as well. Okay. Same schedule as
the gradient step [inaudible] I see.
>> Andre Martins: You can also use a different [inaudible] I think, but they just… Right, so okay. In the graph-structured case we cannot compute the proximal step exactly--we show this in the [inaudible] paper--but you can still guarantee convergence even if you don't compute the proximal step exactly: as long as you can rewrite the regularizer as a sum of non-overlapping regularizers, you can apply sequential proximal steps. This is not going to be the same as the proximal step with respect to [inaudible], but the algorithm will still be convergent.
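A sketch of that sequential variant, reusing the non-overlapping group prox from the earlier sketch: the overlapping regularizer is split into partitions that are each non-overlapping, and their exact proximal steps are applied one after another. As noted, this is an approximation to the full proximal step; the partitioning scheme here is my assumption.

```python
def prox_sequential(theta, partitions, taus_per_partition, prox_group_l1):
    # Apply the exact group prox for each non-overlapping partition in turn.
    # This is not the proximal step of the full overlapping regularizer,
    # but the online algorithm is said to remain convergent.
    x = theta.copy()
    for groups, taus in zip(partitions, taus_per_partition):
        x = prox_group_l1(x, groups, taus)
    return x
```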
So I am just going to say a couple of things about practical issues. Each gradient step is linear in the number of feature templates, because we are assuming that for each data point you have as many features firing as the number of templates that you have. Each proximal step is linear in the number of groups, because there are some smart things we can do: we can keep track of the norm of each group and use that for the templates, so everything is independent of the dimension of the feature space, which means we can actually play with very large feature spaces here. There are some tricks that I used, like keeping a budget on the number of groups that I want to keep instead of having a regularization constant, and then I just perceptronize the loss to avoid having to [inaudible] learning rate. And there is this very important final step of debiasing: I run this for a few iterations, identify the templates that are not zero, and then I just run a standard unregularized learner at a second stage that does not consider the ones that were discarded. In practice this is quite important.
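A sketch of that two-stage debiasing idea, where `template_groups` (a mapping from template to its weight indices) and `retrain_without_regularization` are hypothetical stand-ins for whatever second-stage learner is actually used:

```python
import numpy as np

def debias(theta_sparse, template_groups, retrain_without_regularization):
    # Keep only the templates whose group of weights survived the sparse stage,
    # then retrain a standard (unregularized) model restricted to those templates.
    kept = [name for name, idx in template_groups.items()
            if np.linalg.norm(theta_sparse[idx]) > 0.0]
    return retrain_without_regularization(kept)
```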
>>: [inaudible] suppressing, I mean why do you have to do the…
>> Andre Martins: The problem is that these kinds of L1 penalties introduce a bias into the model, and if you have strong regularization, which I am doing here, it is biasing it more and more, and you do better by having the second stage.
>>: Even if you reach the point of the sparsity that you care about then still these
proximal steps are affecting things enough that you're not getting the best solution you
could without [inaudible]?
>> Andre Martins: Yes. So this is showing the memory efficiency, and this is why it is important to use the proximal step here. Each of these things is like a different proximal step that is applied, and it guarantees that you are discarding a lot of groups and keeping the memory footprint quite low.
>>: [inaudible] so when you're doing the soft thresholding thing, once some guy gets under the tau and you knock him down to zero, is he gone forever, or can he get pushed back up?
>> Andre Martins: The group is not gone forever. You can throw away the features and all of the weights, but in the next round you still need to compute the templates and all that. There are some heuristics that you could use, like for instance if you are really sure that the group is never going to come back--it is dangerous, but in practice it can be useful. I have never tried it. Yes, so the tasks here are things like chunking, sequence labeling tasks. I am not going to spend a lot of time here. Essentially, we are achieving the same results, or actually slightly better, than L2 regularization with many fewer features and many fewer feature templates. This also has an impact on runtime, because computing the scores is usually linear in the number of feature templates, so if you have fewer feature templates, this is going to speed up the model.
So the second set of experiments was on dependency parsing. We have a crazy number of feature templates, 684--no one uses so many--but the goal here is exactly to try to select a smaller number out of these many templates. So we compared several things, like just using standard lasso and just an information gain score for selecting templates, and we got something like this. Essentially, the blue line here is group lasso without overlaps. The light blue one is the coarse-to-fine regularization, and except for Spanish--this didn't work for Spanish--but for the other languages, it did a good job. The disappointing thing is that the coarse-to-fine regularization did not outperform just considering the non-overlapping groups. So we were expecting that having that prior knowledge about the poset would help, but it didn't. But there are a lot of things that we could still play with here.
There are some claims that you can make about [inaudible] by looking at the different languages where we applied this, but you should be a little skeptical, because it may happen that some patterns that we are identifying are properties of the data sets and not of the languages. Essentially, for languages that are morphologically rich and whose data sets are small, this seems to avoid using lexical features, which sort of makes sense, because in this case there is a danger of overfitting, so in some sense it seems to be avoiding overfitting.
>>: [inaudible] overfitting of those weights across the different data sets [inaudible]
maintain your weights but actual [inaudible].
>> Andre Martins: Actually, that is a good idea.
>>: Unless you really feel that there are language dependent changes in what's important
which might be the case.
>> Andre Martins: That is actually a good idea, because you cannot do that with the actual features, since they are going to be different, but with the templates you can. So yes, to conclude: we have two levels of structure here, in the output space and in the feature space. We can promote structured sparsity by using a group lasso regularizer. The algorithm that we propose is able to explore very large feature spaces, and there is a lot of future work to do regarding this. I would like to emphasize the prior group weights thing, which is quite important. Here we just define that as the log of the number of features in the group, which you can read as the number of bits that you need to [inaudible] a feature in the group, but it is a heuristic, right, so there is a chance that you could do better by doing something smarter here. So if anybody has ideas on how to do that, I will be happy to hear about them. So that is all that I have.
[applause].
>>: [inaudible] results for debiasing?
>> Andre Martins: For debiasing?
>>: Yes. So [inaudible] did you get?
>> Andre Martins: Oh, very bad results because I am doing very strong regularization.
>>: [inaudible].
>> Andre Martins: Yes, it is like a huge drop, like five percentage points or something like that. But it is also because of the algorithm that I am using. If I used batch methods it would probably be better; it has to do with the technique that I am using to define the regularization constant.
>> Scott Yih: Thanks.