Incremental Integer Linear Programming for Non-projective Dependency Parsing Authors Sebastian Riedel and James Clarke Paper review by Anusha Buchireddygari Structured Prediction Cutting plane Algorithm • • • • Figure above shows a convex region of feasible solutions defined by several constraints. The grid indicates where inside the polygon the feasible integer solutions lie The dot represents the optimal solution (for the linear programming problem) gained from maximizing x1 + x2 Note that although it is not an integer solution it is the upper bound for an optimum one Why this paper? • Integer Linear Programming (ILP) was applied for inference of sequential conditional random fields(Roth and Yin,2004). • Exponential number of constraints required to prevent cycles occurring in the dependency graph. • Model these constraints using ILP is too large to solve efficiently. What does the paper solve? • Method which extends applicability of ILP to more complex set of problems. • Instead of adding all the constraints authors tried to solve with a fraction of constraints. • Solution is examined and additional constraints are added if required. • Procedure is repeated until all the constraints are satisfied. What’s the problem the authors picked? • Authors applied dependency parsing approach to Dutch due to language’s non-projective nature. • Took the parser of McDonald et al as starting point for our model. • “I will come at twelve and then you will get what you deserve”. What’s Dependency Parsing? • Task of attaching words to arguments. • Dependency graph “kom” attached to “ik”. • “kom” is head for “ik” and “ik” is child. • Dependency tree every token must be the child of exactly one other node, either another token or dummy root token. • It cannot have cycles like “en”-> “kom” -> “ik” >”en”. How the model looks like? • X is a sentence, y is a set of labelled dependencies • F(I,j,l) multi dimensional feature vector of the edge from token I to token j with label l Constraints • Decoding/Inference in this model is to maximize • T1 for every non-root token in x there exists exactly one head; the root token has no head • T2 There are no cycles • This corresponds to maximum spanning tree problem. Linguistic constraints • A1 Heads are not allowed to have more than one outgoing edge labelled l for all l in a set of labels U. • A1 tells that there can be only one “subject” in the sentence • C1 In a symmetric coordination there is exactly one argument to the right of the conjunction and at least one argument on the left. • C1 applies to “and”,”or”,”but” Some more linguistic constraints ● ● ● C2 In an asymmetric coordination there are no arguments to the left of the conjunction and at least two arguments to the right. Example is “both” having arguments to its left. There are other such constraints total of 8 constraints in the paper. Process ● ● ● ● The function to be maximized is called objective function Ox. Variables Vx, e(I,j,l) is 1 if edge exists else 0 Base constraints T1 Constraints representation • A1 • C1 Incremental Constraints • T2 Algorithm • For a sentence x, Bx is the set of constraints that we add in advance. • Ix are the constraints we add in incrementally • Ox is objective function and Vx is a set of variables including variable declaration. What happens in the algorithm? • Solve(C,O,V) maximizes the objective function O with respect to the set of constraints C and variables V. • Violated(y,I) inspects the proposed solution (y) and returns all constraints in I which are violated Interesting result The number of iterations is most polynomial with respect to number of variables. In practice this technique converges quickly i.e. less than 20 iterations in 99% approximately 12,000 sentences. Yielding average solve times of less than 0.5 seconds. Experiment • How much do our additional constraints help improve accuracy? • How fast is our generic inference method in comparison with Chu-LiuEdmonds algorithm? • Can approximations be used to increase the speed of our method while remaining accurate? • Data • Alpino Treebank 13,300 sentences with a average length of 14.6 tokens • Environment • Intel Xeon with 3.8 Ghz and 4 GB RAM. Mixed Integer Programming lib IP_Solve. Code ran on Java and called JNI-wrapper aroung IP_solve lib • Feature Sets • Along with POS tags there are additional attributes like gender,number and case. Combined attr of head to child Results • Accuracy • Nl is the number of tokens with correct head and label and Nt is the total number of tokens. • Unlabelled accuracy • Nu is the number of tokens with correct head Accuracy Results table • bl means baseline without any linguistic constraints which is compared it to a system with additional constraints (cnstr) • There are problems the system suffers from poor next best solution due to inaccurate local score distributions. Runtime Efficiency • Average solve time (ST) for sentences with respect to no of tokens in each sentence. • Approximation • Total runtime of the system is competitive to CLE. Discussion • Higher order features by using extended set of variables and a modified function which is likely to increase runtime. • Fast for real world applications. • Large time for first iteration after that algorithm uses last state to efficiently search for solutions in the presence od new constraints. • Exponential blow when the cycle constraints are added Conclusion • Novel approach to inference of ILP • Efficiently use ILP for dependency parsing • Slower than baseline approach but parses large sentences with more than 50 tokens. • Parsing time can be significantly reduced using a simple approximation like from q=10 to q=5 which will only marginally degrades the performance from 85% to 84% in the table we have seen in approximation. THANK YOU Questions?