Dual Decomposition Applications
Presenters: Ben & Red
Section(s) 1, 2 and 3:
Introduction to Dual Composition
Abstract
Dual decomposition, and more generally
Lagrangian relaxation, are classical optimization
techniques that have recently been applied to
inference problems in natural language processing (NLP).
A central theme of this tutorial is that Lagrangian
relaxation is naturally applied in conjunction with
a broad class of combinatorial algorithms.
Introduction
In many problems in statistical natural language
processing, the task is to map some input x (a
string) to some structured output y (a parse
tree). This mapping is often defined as
y* = arg max_{y ∈ γ} h(y)
where γ is a finite set of possible structures for
the input x, and h: γ -> R is a function that
assigns a score h(y) to each y in γ.
Introduction
The problem of finding y* is referred to as the
decoding problem. The size of γ typically grows
exponentially with respect to the size of the
input x, making exhaustive search for y*
intractable.
Introduction
Dual decomposition leverages the observation
that many decoding problems can be
decomposed into two or more sub-problems,
together with linear constraints that enforce
some notion of agreement between solutions to
the different problems.
The sub-problems are chosen such that they can
be solved efficiently using exact combinatorial
algorithms.
Properties
Dual decomposition algorithms are typically simple and efficient.
They have well-understood formal properties, in
particular through connections to linear
programming relaxations.
In cases where the underlying LP relaxation is
tight, they produce an exact solution to the
original decoding problem, with a certificate of
optimality.
Dual Decomposition
Some finite set γ Rd,Z Rd’ .Each vector y∈γ and
z∈Z has a associated score
f(y)=y·θ(1) , g(z)=z·θ(2)
where θ(1) is a vector in Rd
The decoding problem is then to find
arg max y  (1)  z  ( 2)
y , zZ
Such that Ay+Cz=b
Where A∈Rp ×d ,C∈Rp ×d’ ,b∈Rp
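To make the general recipe concrete, the following is a brief sketch (a reconstruction consistent with the definitions above, not a verbatim slide) of how the linear constraints are relaxed with multipliers u ∈ R^p:

```latex
% Lagrangian for the general constrained decoding problem
% (reconstructed from the definitions above):
L(u, y, z) = y \cdot \theta^{(1)} + z \cdot \theta^{(2)} + u \cdot (Ay + Cz - b)

% Dual objective: the maximization splits into separate y and z sub-problems
L(u) = \max_{y \in \mathcal{Y}} \, y \cdot \bigl( \theta^{(1)} + A^{\top} u \bigr)
     + \max_{z \in \mathcal{Z}} \, z \cdot \bigl( \theta^{(2)} + C^{\top} u \bigr)
     - u \cdot b

% The dual L(u) is then minimized over u, e.g. with subgradient steps.
```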
Section 4:
Optimization of the Finite State Tagger
x: an input sentence
y: a parse tree
γ : the set of all parse trees for x
h(y): the score for any parse tree y ∈ γ
We want to find: y* = arg max_{y ∈ γ} h(y)
How to decide h(y)
First, find the score for y under a weighted context-free grammar.
Second, find the score for the part-of-speech (POS) sequence in y under a finite-state part-of-speech tagging model.
h(y)=f(y)+g(l(y))
f(y) is the score for y under a weighted context-free grammar (WCFG).
A WCFG consists of a context-free grammar with a set of rules G,
and a scoring function θ: G -> R that assigns a real-valued score to each rule in G.
f(y) = θ(S->NP VP) + θ(NP->N) + θ(N->United) + θ(VP->V NP) + …
The precise definition of θ is not important here:
θ(a->b) = log p(a->b) in a probabilistic context-free grammar
θ(a->b) = w·Φ(a->b) in a conditional random field
h(y)=f(y)+g(l(y))
l(y) is a function that maps a parse tree y
to the sequence of part-of-speech tags in y
l(y)= N V D A N
h(y)=f(y)+g(l(y))
g(z) is the score for the part-of-speech tag sequence
z under a k’th-order finite-state tagging model
Under this model, g(z) = Σ_{i=1}^{n} θ(i, z_{i-k}, z_{i-k+1}, …, z_i),
where n is the length of the sentence and θ(i, z_{i-k}, …, z_i) is the score for the
sub-sequence of tags ending at position i in the sentence.
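As a concrete illustration (not from the slides), here is a minimal Python sketch of how g(z) could be computed for a first-order (k = 1) tagging model, assuming a hypothetical scoring function theta(i, prev_tag, tag):

```python
def g(z, theta):
    """Score of a tag sequence z under a first-order (bigram) tagging model.

    z     : list of tags, e.g. ["N", "V", "D", "A", "N"]
    theta : hypothetical scoring function theta(i, prev_tag, tag) -> float,
            the score of the tag sub-sequence ending at position i.
    """
    total = 0.0
    for i, tag in enumerate(z):
        prev_tag = z[i - 1] if i > 0 else "<START>"  # boundary tag for position 0
        total += theta(i, prev_tag, tag)
    return total

# Toy scoring function that only rewards N -> V transitions.
toy_theta = lambda i, prev, cur: 1.0 if (prev, cur) == ("N", "V") else 0.0
print(g(["N", "V", "D", "A", "N"], toy_theta))  # 1.0
```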
Motivation
The scoring function h(y) = f(y) + g(l(y)) combines information
from both the parsing model and the tagging model.
The POS tagger captures information about adjacent POS
tags that will be missing under f(y).
This information may improve both parsing and tagging
performance, in comparison to using f(y) alone.
The dual decomposition algorithm
Define T to be the set of all POS tags.
Assume the input sentence has n words.
y(i, t) = 1 if parse tree y has tag t at position i, 0 otherwise.
z(i, t) = 1 if the tag sequence has tag t at position i, 0 otherwise.
Find
arg max_{y ∈ γ, z ∈ Z} f(y) + g(z)
such that for all i ∈ {1…n}, for all t ∈ T, y(i, t) = z(i, t).
Assumption 1: Assume that we introduce variables u(i, t) ∈ R for
i ∈ {1…n} and t ∈ T. We assume that for any value of these variables,
we can efficiently find
arg max_{y ∈ γ} ( f(y) + Σ_{i,t} u(i, t) y(i, t) )
Assumption 2: Similarly, we assume that we can efficiently find
arg max_{z ∈ Z} ( g(z) - Σ_{i,t} u(i, t) z(i, t) )
The dual decomposition algorithm
for integrated parsing and tagging
Initialization: Set u(0)(i, t) = 0 for all i ∈ {1…n}, t ∈ T
For k = 1 to K:
  y(k) = arg max_{y ∈ γ} ( f(y) + Σ_{i,t} u(k-1)(i, t) y(i, t) )   [Parsing]
  z(k) = arg max_{z ∈ Z} ( g(z) - Σ_{i,t} u(k-1)(i, t) z(i, t) )   [Tagging]
  If y(k)(i, t) = z(k)(i, t) for all i, t: Return (y(k), z(k))
  Else: u(k)(i, t) = u(k-1)(i, t) - δk ( y(k)(i, t) - z(k)(i, t) )
δk for k = 1…K is the step size at the k'th iteration.
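The loop above is simple enough to sketch in code. The following is a minimal, hypothetical Python implementation assuming oracle functions best_parse(u) and best_tag_sequence(u) that solve the two sub-problems (Assumptions 1 and 2); it illustrates the update rule and is not the presenters' implementation:

```python
def dual_decomposition(best_parse, best_tag_sequence, positions, tags,
                       max_iter=100, step_size=1.0):
    """Subgradient loop for integrated parsing and tagging (sketch).

    best_parse(u)        : returns y(k) as a dict (i, t) -> 0/1,
                           maximizing f(y) + sum u[i,t] * y(i,t)   (Assumption 1)
    best_tag_sequence(u) : returns z(k) as a dict (i, t) -> 0/1,
                           maximizing g(z) - sum u[i,t] * z(i,t)   (Assumption 2)
    positions, tags      : iterables defining the index set {1..n} x T
    """
    u = {(i, t): 0.0 for i in positions for t in tags}   # u(0) = 0
    for k in range(1, max_iter + 1):
        y = best_parse(u)          # [Parsing]
        z = best_tag_sequence(u)   # [Tagging]
        if all(y[(i, t)] == z[(i, t)] for i in positions for t in tags):
            return y, z            # agreement: certificate of optimality
        delta = step_size / k      # one simple choice of decreasing step size
        for i in positions:
            for t in tags:
                u[(i, t)] -= delta * (y[(i, t)] - z[(i, t)])
    return y, z                    # no agreement within K iterations
```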
Relationship to decoding problem
The set γ is the set of all parses for the input sentence
The set Z is the set of all POS sequences for the input sentence.
Each parse tree y is represented as a vector y ∈ R^d
such that f(y) = y·θ(1) for some θ(1) ∈ R^d.
Each tag sequence z is represented as a vector z ∈ R^d'
such that g(z) = z·θ(2) for some θ(2) ∈ R^d'.
The constraints
y(i, t) = z(i, t)
for all (i, t) can be encoded through linear constraints
Ay + Cz = b
for suitable choices of A, C and b,
assuming that the vectors y and z include components
y(i, t) and z(i, t) respectively.
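One concrete choice of A, C and b (an illustration consistent with the definitions above; the slide leaves the choice unspecified):

```latex
% One possible encoding of  y(i,t) = z(i,t)  for all (i,t)  as  Ay + Cz = b.
% Index the p = n|T| constraint rows by pairs (i,t):
%   row (i,t) of A selects the component y(i,t),
%   row (i,t) of C selects -z(i,t),  and  b = 0.
(Ay)_{(i,t)} = y(i,t), \qquad (Cz)_{(i,t)} = -z(i,t), \qquad b = 0
\quad\Longrightarrow\quad
Ay + Cz = 0 \;\Leftrightarrow\; y(i,t) = z(i,t) \ \ \forall (i,t)
```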
Section 5:
Formal Properties of the Finite State Tagger
Recall
The equation we are optimizing:
arg max_{y ∈ γ, z ∈ Z} f(y) + g(z)  such that y(i, t) = z(i, t) for all i ∈ {1…n}, t ∈ T
Lagrangian Relaxation and Optimization
We first define a Lagrangian for the optimization.
Next, from this we derive a dual objective.
Finally, we must minimize the dual objective.
(We do this with sub-gradient optimization.)
Creation of the Dual Objective
Find the Lagrangian: introduce a Lagrange multiplier u(i, t) for each constraint y(i, t) = z(i, t).
From the equality constraints we get the Lagrangian L(u, y, z).
Next we regroup based on y and z to get the dual objective L(u) (both are written out below).
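Written out explicitly (a standard reconstruction consistent with the sub-problems in Section 4):

```latex
% Lagrangian, regrouped into a y-part and a z-part:
L(u,y,z) = f(y) + g(z) + \sum_{i,t} u(i,t)\,\bigl(y(i,t) - z(i,t)\bigr)
         = \Bigl( f(y) + \sum_{i,t} u(i,t)\, y(i,t) \Bigr)
         + \Bigl( g(z) - \sum_{i,t} u(i,t)\, z(i,t) \Bigr)

% Dual objective: maximize each part separately
L(u) = \max_{y \in \mathcal{Y}} \Bigl( f(y) + \sum_{i,t} u(i,t)\, y(i,t) \Bigr)
     + \max_{z \in \mathcal{Z}} \Bigl( g(z) - \sum_{i,t} u(i,t)\, z(i,t) \Bigr)
```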
Minimization of the Lagrangian (1)
We need to minimize the dual objective L(u).
Why? We see that for any value of u:
L(u) >= f(y*) + g(z*)
The proof is elementary given the equality constraint: if y(i, t) = z(i, t) for all (i, t), we have
L(u, y, z) = f(y) + g(z)
Therefore, our goal is to minimize L(u), because it is an upper bound on the
optimal score f(y*) + g(z*), and minimizing it gives the tightest such bound.
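The chain of inequalities behind this bound, written out as a reconstruction from the definitions above:

```latex
% Why L(u) upper-bounds the optimal constrained score, for any u:
L(u) = \max_{y \in \mathcal{Y},\, z \in \mathcal{Z}} L(u, y, z)
     \;\ge\; \max_{\substack{y \in \mathcal{Y},\, z \in \mathcal{Z} \\ y(i,t) = z(i,t)\ \forall (i,t)}} L(u, y, z)
     \;=\; \max_{\substack{y \in \mathcal{Y},\, z \in \mathcal{Z} \\ y(i,t) = z(i,t)\ \forall (i,t)}} \bigl( f(y) + g(z) \bigr)
     \;=\; f(y^*) + g(z^*)
```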
Minimizing the Lagrangian (2)
Recall the algorithm introduced in Section 4 (the dual decomposition algorithm).
From this we see that the algorithm converges to the tightest possible upper bound
given by the dual. From this we derive the theorem: given appropriate step sizes δk,
as we reach infinite iterations,
lim_{K -> ∞} L(u(K)) = min_u L(u)
Given infinite iterations, the dual is therefore guaranteed to d-converge.
This is in fact a sub-gradient minimization method, which we will return to in the next section.
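Stated precisely, under the standard step-size conditions for subgradient methods (these conditions are an assumption made explicit here; the slide only says "given" appropriate step sizes):

```latex
% Convergence of the dual values: if the step sizes satisfy
\delta_k > 0, \qquad \lim_{k \to \infty} \delta_k = 0, \qquad \sum_{k=1}^{\infty} \delta_k = \infty,
% then
\lim_{K \to \infty} L\bigl(u^{(K)}\bigr) = \min_{u} L(u).
```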
Minimizing the Lagrangian (3)
From the previous slide, recall the two sub-problems solved at each iteration:
y(k) = arg max_{y ∈ γ} ( f(y) + Σ_{i,t} u(i, t) y(i, t) )
z(k) = arg max_{z ∈ Z} ( g(z) - Σ_{i,t} u(i, t) z(i, t) )
So, if there exists some u such that
y(k)(i, t) = z(k)(i, t)
for all (i, t),
then f(y(k)) + g(z(k)) = f(y*) + g(z*),
i.e., (y(k), z(k)) is an exact solution to the original decoding problem.
The proof of this is also relatively simple (see the chain of (in)equalities below).
From the definitions of y(k) and z(k), and because, once again, the agreement
constraints hold, we have L(u) = f(y(k)) + g(z(k)).
We also know that L(u) >= f(y*) + g(z*) for all u.
However, because y* and z* are optimal among all pairs that satisfy the constraints,
and (y(k), z(k)) satisfies them, we also see f(y(k)) + g(z(k)) <= f(y*) + g(z*).
Therefore we have proven f(y(k)) + g(z(k)) = f(y*) + g(z*), which implies that (y(k), z(k)) is optimal.
The algorithm is said to e-converge in this case, because we have found an iterate
whose score is bounded by the optimum from above and below, and hence solves the optimization exactly.
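The argument as one chain of (in)equalities (a reconstruction from the definitions above):

```latex
% If y^{(k)}(i,t) = z^{(k)}(i,t) for all (i,t), with u the current multipliers:
f\bigl(y^{(k)}\bigr) + g\bigl(z^{(k)}\bigr)
  \;=\; L\bigl(u, y^{(k)}, z^{(k)}\bigr)              % penalty term vanishes
  \;=\; L(u)                                           % y^{(k)}, z^{(k)} attain the two maxima
  \;\ge\; f(y^*) + g(z^*)                              % L(u) upper-bounds the optimum
  \;\ge\; f\bigl(y^{(k)}\bigr) + g\bigl(z^{(k)}\bigr)  % (y^{(k)}, z^{(k)}) is feasible
% Hence all terms are equal and (y^{(k)}, z^{(k)}) is optimal.
```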
D-convergence vs. E-convergence
D-convergence occurs in the first case.
D-convergence is guaranteed by dual decomposition, provided infinite iterations.
Short for 'dual convergence'.
It refers to the convergence of the algorithm to the minimum dual value, min_u L(u).
E-convergence occurs in the latter case.
E-convergence is not guaranteed by dual decomposition.
Short for 'exact convergence'.
It refers to the convergence to a solution satisfying ALL constraints.
(Which implies the discovery of a guaranteed optimal solution.)
Sub-gradient Optimization
What are sub-gradient algorithms?
Algorithms used to optimize convex functions.
They generalize gradient descent, which finds a local minimum (for a convex function, the global minimum).
Gradient descent takes iterative steps in the direction of the negative gradient at the current point
in order to find a local minimum of a function; sub-gradient methods do the same using a sub-gradient
when the function is not differentiable.
[Figure: gradient descent on an example function. Credit: http://en.wikipedia.org/wiki/Gradient_descent ]
The 'zig-zag' nature of the algorithm is obvious in the figure.
Sub-gradient Applications
Recall the dual objective L(u) (and the goal to minimize it).
We notice two things about this function:
It is a convex function. Formally defined:
L(λu + (1 - λ)u') <= λ L(u) + (1 - λ) L(u') for all u, u' and λ ∈ [0, 1].
It is non-differentiable (it is a piecewise linear function).
Because it cannot be differentiated everywhere, we cannot use plain gradient descent.
However, because it is a convex function, we can use a sub-gradient algorithm.
This is the property that allows us to minimize the dual efficiently.
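As a toy illustration of the idea (not taken from the slides), here is a minimal sub-gradient descent sketch for a simple piecewise-linear convex function f(u) = max_i (a_i·u + b_i); the slope of any maximizing piece is a valid sub-gradient at u:

```python
def subgradient_descent(pieces, u0=0.0, iters=200, c=1.0):
    """Minimize f(u) = max_i (a_i * u + b_i), a piecewise-linear convex function.

    pieces : list of (a, b) pairs defining the linear pieces.
    Uses the decreasing step size c / (k + 1); the slope of a maximizing
    piece is a valid subgradient at u.
    """
    u = u0
    best_u, best_f = u0, max(a * u0 + b for a, b in pieces)
    for k in range(iters):
        f_u, slope = max((a * u + b, a) for a, b in pieces)  # value and a subgradient
        if f_u < best_f:
            best_u, best_f = u, f_u   # track best point seen (f need not decrease monotonically)
        u -= (c / (k + 1)) * slope    # subgradient step
    return best_u, best_f

# Example: f(u) = max(-u, u - 2) has its minimum at u = 1 with value -1.
print(subgradient_descent([(-1.0, 0.0), (1.0, -2.0)]))
```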
Section 6:
Additional Applications of Dual Decomposition
Applications of Dual Decomposition
Where are dual decomposition and Lagrangian relaxation useful?
In various NP-hard problems, such as (as the paper suggests):
Language translation.
Routing problems such as the Traveling Salesman Problem.
Syntactic parsing based on complex CFGs.
The MAP problem.
Each of these will be covered in the following segment.
Other examples of NP-hard problems that can be attacked with dual decomposition:
Minimum degree/leaf spanning tree
Steiner tree problem
Chinese postman problem
Vehicle routing problem
Bipartite matching
Many other problems can be found on Wikipedia*:
http://en.wikipedia.org/wiki/List_of_NP-complete_problems
* Some of the problems on the list are not applicable (just be wary when researching; use your brain!)
Example 1: The MAP Problem
As this was covered by the previous group, we will not discuss it in detail.
The goal was to find the optimum of the MAP objective.
We break the problem up into easier sub-problems (into sub-trees of the main graph).
We then solve these 'easier' sub-problems using an iterative sub-gradient algorithm.
This is doable because the dual is convex.
The sub-gradient algorithm for any given iteration (k):
Example 2: The Traveling Salesman Problem
The Traveling Salesman Problem (TSP) is one of the most
famous NP-hard problems of all time.
The goal is to find the maximum-scoring tour that visits every node exactly once.
More formally stated, it is the optimization
arg max_{y ∈ Y} Σ_{e ∈ E} θ_e y_e
given that:
Y is the set of all possible tours.
E is the set of all edges.
θ_e is the weight of edge e.
y_e is 1 if edge e is in the tour, 0 otherwise.
Held & Karp made this an easy problem to solve (well, "easy").
(TSP) Held & Karp's idea
The idea of a 1-tree:
A 1-tree is a spanning tree over nodes {2, …, n} plus two edges incident to node 1, so it contains exactly one cycle.
Most importantly, notice that any possible tour of the TSP is a 1-tree.
How we can use this information:
Alter the problem to be
arg max_{y ∈ Y'} Σ_{e ∈ E} θ_e y_e
where Y' is the set of all possible 1-trees.
(Hence, Y is a subset of Y', because all tours are 1-trees.)
Solving over Y' is easy: the first step is to find the maximum-scoring spanning tree over nodes {2, …, n}.
And the second is to add the two highest-scoring edges incident to vertex 1.
Lastly, we add constraints on the vertices of the form
Σ_{e: e touches v} y_e = 2 for every vertex v.
This constrains the solution to 1-trees that have exactly 2 edges per vertex.
(This is equivalent to a tour and, when optimized, the best tour.)
(TSP) Getting to the Solution
Next, we introduce a Lagrangian for this problem, with one Lagrange multiplier u_v per
vertex attached to the '2 edges per vertex' constraints, so that the remaining
maximization is clearly constrained only to 1-trees.
From it we define the dual objective, and we finally end up with a sub-gradient
update on the multipliers (these equations are reconstructed below).
So, if the constraints hold the algorithm terminates and we have
d-convergence; however, for e-convergence we must
re-evaluate the weights of the graph to include the Lagrange
multipliers (not shown).
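Written out in the notation of the previous slides, a standard reconstruction of the Held-Karp relaxation described above (the u_v and the update rule are the usual textbook forms, not copied from the slide):

```latex
% Lagrangian: relax the degree-2 constraints with one multiplier u_v per vertex
L(u, y) = \sum_{e \in E} \theta_e\, y_e \;+\; \sum_{v} u_v \Bigl( \sum_{e : v \in e} y_e - 2 \Bigr)

% Dual objective: maximize over 1-trees y \in Y' (solvable with a maximum spanning tree)
L(u) = \max_{y \in Y'} L(u, y)

% Subgradient update for minimizing L(u), with step size \delta_k:
u_v \;\leftarrow\; u_v - \delta_k \Bigl( \sum_{e : v \in e} y^{(k)}_e - 2 \Bigr)
```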
Example 3: Phrase-Based Translation
Phrase-based translation is a scheme for translating sentences from one
language to another based on the scoring of certain phrases within a sentence.
It takes an input sentence in the source language.
It pairs sequences of words in the source language with
sequences of words in the target language.
For example, (1, 3, we must also) means that
words 1, 2 and 3 of the input could possibly mean "we must also".
Then, once we have a potential translation (all words mapped), we score it.
The scoring function f(y) for any set of potential phrase translations includes g(y),
the log-probability of y under an n-gram language model.
(PBT) Constraining the Equation
First, we must find a constraint for the scoring function.
The constraint is to ensure that any correct translation
translates each word in the input exactly one time in the output.
Defining this formally, we introduce y(i) to represent the number of
times word i is translated in the output.
So, intuitively, the constraint becomes
y(i) = 1 for all i ∈ {1, …, n},
where y ranges over P*, the set of all finite phrase sequences.
So, the goal becomes to maximize the score f(y) over all y ∈ P* subject to y(i) = 1 for all i.
(PBT) Relaxation and Solution
The relaxation for this algorithm: instead of solving the NP-hard problem directly, we find a set Y' such that Y is a subset of Y'.
The relaxation is based on replacing the constraints y(i) = 1 for all i with the single constraint Σ_i y(i) = n.
(In other words, the relaxation only checks that there are n translated words in the generated phrase sequence.)
So, a phrase sequence that translates n words in total, but not each word exactly once, is a member of Y' and not Y.
From this, our goal is to ensure that the constraint y(i) = 1 is met for all i.
Thus, we introduce a Lagrange multiplier u(i) for each constraint, giving us the Lagrangian (reconstructed below).
From this, we define u(i) = 0 initially and iterate, taking sub-gradient steps.
We can see that when we have y(i) = 1 for all i ∈ {1, …, n}, the sub-gradient step has found an optimal solution.
Then, like in the TSP problem, we must reweight the scores with the Lagrange
multipliers (if we want e-convergence).
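A reconstruction of the Lagrangian and sub-gradient step referenced above, in the notation of this slide (the overall translation score is written abstractly as f(y)):

```latex
% Lagrangian: relax y(i) = 1 with one multiplier u(i) per input word
L(u, y) = f(y) + \sum_{i=1}^{n} u(i)\,\bigl( y(i) - 1 \bigr)

% Dual objective over the relaxed set Y' = \{ y \in P^* : \sum_i y(i) = n \}
L(u) = \max_{y \in Y'} L(u, y)

% Subgradient update, starting from u(i) = 0:
u(i) \;\leftarrow\; u(i) - \delta_k \bigl( y^{(k)}(i) - 1 \bigr)
```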
Section 7:
Practical Issues of Dual Decomposition
Practical Issues
Each iteration of the algorithm produces a number
of useful terms:
The current solutions y(k) and z(k).
The current dual value L(u(k)).
When we have a function l: γ -> Z that maps
each structure y ∈ γ to a structure l(y) ∈ Z,
we also have:
A primal solution (y(k), l(y(k))).
A primal value f(y(k)) + g(l(y(k))).
An Example Run of the Algorithm
Because L(u) provides an upper bound on
f(y*) + g(z*) for any value of u, we have
L(u(k)) >= f(y(k)) + g(l(y(k)))
at every iteration.
We have e-convergence to an exact solution
when L(u(k)) = f(y(k)) + g(z(k)), with (y(k), z(k))
guaranteed to be optimal.
The dual values L(u(k)) are not monotonically decreasing.
The primal values f(y(k)) + g(l(y(k))) fluctuate.
Useful quantities
L(u(k)) - L(u(k-1)) is the change in the dual value from
one iteration to the next.
L*_k = min_{k' <= k} L(u(k')) is the best dual value found so far.
p*_k = max_{k' <= k} f(y(k')) + g(l(y(k'))) is the best primal value found so far.
L*_k - p*_k is the gap between the best dual and best
primal solutions found so far by the algorithm.
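These quantities are cheap to track inside the main loop; a minimal sketch (a hypothetical helper, not the presenters' code) might look like this:

```python
def track_progress(dual_values, primal_values):
    """Given the per-iteration dual values L(u(k)) and primal values
    f(y(k)) + g(l(y(k))), report the best-so-far values and the gap."""
    best_dual = min(dual_values)      # L*_k: dual values upper-bound the optimum
    best_primal = max(primal_values)  # p*_k: primal values lower-bound the optimum
    gap = best_dual - best_primal     # a zero gap certifies an exact solution
    return best_dual, best_primal, gap

# Example: dual values fluctuate downward, primal values fluctuate upward.
print(track_progress([10.2, 9.1, 9.6, 8.8], [7.0, 8.5, 8.1, 8.8]))  # (8.8, 8.8, 0.0)
```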
Choice of the step sizes δk
δk = c / (t + 1)
where c > 0 is a constant and t is the number of iterations
prior to k at which the dual value increased.
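A sketch of this adaptive rule in code (an illustration assuming the dual value L(u(k)) is recorded at every iteration):

```python
def adaptive_step_size(dual_history, c=1.0):
    """Step size delta_k = c / (t + 1), where t counts the iterations so far
    at which the dual value increased (a sign the current step is too large)."""
    t = sum(1 for prev, cur in zip(dual_history, dual_history[1:]) if cur > prev)
    return c / (t + 1)

# Example: the dual increased twice in this history, so t = 2 and delta = c / 3.
print(adaptive_step_size([10.0, 9.0, 9.5, 8.7, 9.0]))  # 0.333...
```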
The dual value L(u) versus the number of iterations k, for different fixed step sizes:
Small step size: convergence is smooth.
Large step size: the dual fluctuates erratically.
Recovering Approximate Solutions
Early Stopping
Section 8:
Linear Programming Relaxations
(A Basic Introduction & Summary)
Recall
The equations from Sections 3, 4 and 5: the constrained decoding problem
arg max_{y ∈ γ, z ∈ Z} f(y) + g(z) subject to y(i, t) = z(i, t) for all (i, t),
its Lagrangian L(u, y, z), and the dual objective L(u).
Defining the LP Relaxation
We first define two sets, one for y and the other for z. Both are simplexes,
corresponding to the sets of probability distributions over γ and over Z respectively.
From this, we can define the new optimization over these distributions.
This optimization is now a linear program.
Next, we would define a relaxed set for y (and one for z in a similar manner)
such that the original set is a subset of the relaxed set (sounds familiar, right?).
So, finally, we get the relaxed optimization problem (one standard way to write the LP is sketched below).
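One standard way to write this LP relaxation, consistent with the notation of the earlier sections (a reconstruction, since the slide gives only the outline):

```latex
% Distributions over structures, rather than single structures:
\Delta_{\mathcal{Y}} = \Bigl\{ \alpha \in \mathbb{R}^{|\mathcal{Y}|} : \alpha_y \ge 0,\ \sum_{y} \alpha_y = 1 \Bigr\},
\qquad
\Delta_{\mathcal{Z}} = \Bigl\{ \beta \in \mathbb{R}^{|\mathcal{Z}|} : \beta_z \ge 0,\ \sum_{z} \beta_z = 1 \Bigr\}

% The LP relaxation: maximize the expected score, with agreement in expectation:
\max_{\alpha \in \Delta_{\mathcal{Y}},\, \beta \in \Delta_{\mathcal{Z}}}
  \sum_{y} \alpha_y f(y) + \sum_{z} \beta_z g(z)
\quad \text{s.t.} \quad
  \sum_{y} \alpha_y\, y(i,t) = \sum_{z} \beta_z\, z(i,t) \ \ \forall (i,t)
```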
Defining the LP Relaxation (2)
Finally, we end up with a Lagrangian for the LP (note that it is identical in form to the one in Section 5).
The dual objective is also rather similar.
We notice that all of these linear programming principles are very close to the
Lagrangian relaxation that we see in dual decomposition.
However, linear programming techniques are often used to tighten the e-convergence of dual decomposition algorithms.
Because e-convergence is not guaranteed, linear programs can often 'fix' the
cases where dual decomposition only d-converges.
We can do this by constraining the linear program further.
Summary of LP Relaxation
The dual L(u) from Sections 4/5 is the same as the
dual for the LP relaxation, M(u, a, b).
LP relaxation is used as a tightening method.
These methods improve convergence.
They do this by selectively adding more constraints.
LP solvers exist, and therefore LP problems can
be solved automatically by software.
This can be useful for debugging hard dual decomposition problems.
Section 9:
Conclusion
Conclusion
Dual decomposition...
Is useful in solving various machine learning problems.
Is useful in solving various inference / statistical NLP problems.
Uses combinatorial algorithms combined with agreement constraints.
Lagrangian relaxation...
Introduces constraints using Lagrange multipliers.
Iterative (sub-gradient) methods are then used to solve the resulting relaxation.
The resulting algorithm is always bounded by the dual objective.
Is similar to linear programming relaxation.
(Most importantly) has the potential to produce the exact solution, with a certificate of optimality.
Therefore, dual decomposition and Lagrangian relaxation can produce optimal solutions
more efficiently than many exact methods or other approximation methodologies.
Questions?
Perhaps… Answers?