The Traveling Salesman Problem in Theory & Practice
Lecture 11: Branch & Cut & Concorde
8 April 2014
David S. Johnson
dstiflerj@gmail.com
http://davidsjohnson.net
Seeley Mudd 523, Tuesdays and Fridays

Outline
1. Cutting Planes
2. Branch and Bound
3. Performance of Concorde
4. Student Presentation by Chun Ye: "Improving Christofides' Algorithm for the Metric s-t Path TSP"
(Presentation strongly reliant on [ABCC06].)

Branch & Cut: Combine Branch-&-Bound with Cutting Planes
• The term "branch & cut" was coined by Manfred Padberg and Giovanni Rinaldi in [Padberg & Rinaldi, "Optimization of a 532-city traveling salesman problem by branch-and-cut," Operations Res. Lett. 6 (1987), 1-7].
• The first known use of the approach was in Saman Hong's 1972 PhD thesis, A Linear Programming Approach for the Traveling Salesman Problem, Johns Hopkins University.
 – Handicapped by his LP solver, however, he could only solve 20-city problems. This was not all that impressive, since Dantzig et al. had already done 42 cities in 1954, and Held & Karp did 64 in 1971.
 – Even worse: our permutation-based B&B approach, using NN + iteration, solves such instances in 0.01 seconds with just 165 lines of unsophisticated code.
 – Machines were a bit slower in 1971 than our 3.5 GHz processor, but a 1968-era PDP-10 was probably about a 0.5 MHz machine (1 μs cycle time, 2.1 μs per addition), so roughly 7,000 times slower. Our code would thus still have taken only 70 seconds on such a machine.
 – Moral: Ideas can be much more important than (initial) performance.

Refresher: The Cutting Plane Approach
Solve the edge-based integer programming formulation for the TSP, as follows:
1. Start by solving a weak linear programming relaxation.
2. While the LP solution is not a tour:
 a. Identify a valid inequality that holds for all tours but is violated by the current solution (a "cutting plane," or "cut" for short).
 b. Add it to the formulation and re-solve.

Refresher: Branch & Bound for the TSP
Assume edge lengths are integers, and that we have an algorithm ALB that computes a lower bound on the TSP length when certain constraints are satisfied, such as a set of edges being "fixed" (forced to be in the tour, or forced not to be in the tour), and which, for some subproblems, may produce a tour as well.
• Start with an initial heuristic-created "champion" tour TUB, an upper bound UB = length(TUB) on the optimal tour length, and a single "live" subproblem in which no edge is fixed.
• While there is a live subproblem, pick one, say subproblem P, and apply algorithm ALB to it, obtaining a lower bound LB (and possibly a tour T).
 – If algorithm ALB produced a tour T with length(T) < UB:
  1. Set UB = length(T) and TUB = T.
  2. Delete all subproblems with current LB > UB-1, as well as their children and their ancestors that no longer have any live children.
 – If LB > UB-1, delete subproblem P and all its ancestors that no longer have live children. No improved tour is possible in this case (since tour length is an integer).
 – Otherwise, we have LB ≤ UB-1: pick an edge e that is unfixed in P and create two new subproblems as its children, one with e forced to be in the tour, and one with e forced not to be in the tour.
• Halt. Our current champion is an optimal tour.
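To make the loop concrete, here is a minimal best-bound branch-and-bound skeleton in Python. This is an illustrative sketch, not Concorde's code: `lower_bound` stands in for algorithm ALB (returning a bound and possibly a tour), and `pick_branch_edge` is a hypothetical edge-selection rule.

```python
import heapq
from itertools import count

def branch_and_bound(tour_ub, length, lower_bound, pick_branch_edge):
    """Best-bound B&B over subproblems (fixed-in edges, fixed-out edges)."""
    UB, champion = length(tour_ub), tour_ub
    tie = count()                                     # heap tie-breaker
    live = [(0, next(tie), frozenset(), frozenset())]
    while live:
        _, _, fixed_in, fixed_out = heapq.heappop(live)   # smallest LB first
        LB, tour = lower_bound(fixed_in, fixed_out)       # "algorithm ALB"
        if tour is not None and length(tour) < UB:        # new champion
            UB, champion = length(tour), tour
            live = [p for p in live if p[0] <= UB - 1]    # prune dead nodes
            heapq.heapify(live)
        if LB > UB - 1:
            continue            # integer lengths: no improvement possible here
        e = pick_branch_edge(fixed_in, fixed_out)
        if e is not None:       # branch: e in the tour vs. e not in the tour
            heapq.heappush(live, (LB, next(tie), fixed_in | {e}, fixed_out))
            heapq.heappush(live, (LB, next(tie), fixed_in, fixed_out | {e}))
    return champion, UB         # champion is now an optimal tour
```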
Some Key Implementation Issues for Branch & Cut (with coverage by [ABCC06])
• How do we find the initial tour (upper bound)? [Chapter 15, 64 pages]
• How do we find violated inequalities (cuts)? [Chapters 5-11, 216 pages]
• What are the best inequalities to add?
 – Based on speed of generating them.
 – Based on effectiveness in improving lower bounds.
• How do we decide when to split a case? [1 page, somewhere] Split when we (1) can't find any more cuts, or (2) reach the point of diminishing returns.
• How do we choose the variable on which to split a case? [Chapter 14, 14 pages]
• How do we pick the next subcase to work on? [Chapter 14, 1 paragraph] Subproblem with smallest lower bound.
• How do we manage the inequalities? [Chapter 12, 28 pages]
 – Solving the LP may take too long if there are too many inequalities.
 – Some inequalities may lose their effectiveness in later subcases.
• How do we cope with repeatedly solving LPs with millions of variables? What LP code do we use, and how do we apply it? [Chapter 13, 38 pages]

Finding Cuts: The Template Approach
Look for cuts (preferably facet-inducing) with structures from a (prioritized) predefined list:
• Subtour cuts
• Combs
• Clique-tree inequalities
• Path inequalities
• …
For each cut class, we can have an ordered sequence of stronger and stronger (possibly slower and slower) cut-finding heuristics, up to exact algorithms, should they exist. (These heuristics typically assume that our current LP solution satisfies all the degree-2 constraints.)

Heuristics for Finding Violated Subtour Constraints: Connected Component Test
• Construct a graph G whose edges are those e with xe > 0.
• Compute the connected components of G (running time linear in the number of e with xe > 0).
• If the graph is not connected, we get a subtour cut for each connected component.
• Solve the resulting LP, and repeat the above if the resulting graph is still not connected.
The cuts found in this way already get us very close to the Held-Karp bound. For random Euclidean instances with coordinates in [0, 10^6], [ABCC06] got
• within 0.409% of the HK bound for an instance with N = 1,000, and
• within 0.394% of the HK bound for an instance with N = 100,000.
But note that the graph may be connected only because of edges with very low values of xe. To deal with this, one can use a "parametric connectivity" test.

Heuristics for Finding Violated Subtour Constraints: Parametric Connectivity Test
• Start with a graph G with no edges, and with every city c being a connected component of size 1 with weight(c) = 0.
• For all edges e = {u,v} of our TSP instance, in non-increasing order by the value of xe:
 – Find the connected components Su and Sv containing u and v (using the "union-find" data structure).
 – If Su = Sv, increase weight(Su) by xe.
 – Otherwise, set Su = Su ∪ Sv and increase weight(Su) by weight(Sv) + xe; if now δx(Su) = 2|Su| - 2·weight(Su) < 2, add the corresponding subtour cut and continue.
 – If there are now just two connected components, quit the loop (which took time O(m log N), where m is the number of e with xe > 0).
• Solve the resulting LP. Repeat the above until the process yields no new cuts.

Gap below the HK bound:        1,000 Cities   100,000 Cities
Connectivity                   -0.409%        -0.394%
+ Parametric Connectivity      -0.005%        -0.029%
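The merge loop above is just union-find with a weight maintained per component. Here is a minimal one-pass sketch in Python (my own data layout, not Concorde's): x maps edges (u,v) to their LP values, and a set S is reported whenever δx(S) = 2|S| - 2·weight(S) drops below 2.

```python
def parametric_connectivity_cuts(n, x):
    """One pass of the parametric connectivity test; returns violated cuts."""
    parent = list(range(n))
    members = [{c} for c in range(n)]      # cities in each component
    weight = [0.0] * n                     # x-value of each component's internal edges
    cuts, components = [], n

    def find(c):                           # union-find with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (u, v), xe in sorted(x.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru == rv:
            weight[ru] += xe               # internal edge of a component
            continue
        if len(members[rv]) > len(members[ru]):
            ru, rv = rv, ru                # union by size
        parent[rv] = ru
        members[ru] |= members[rv]
        weight[ru] += weight[rv] + xe
        components -= 1
        if 2 * len(members[ru]) - 2 * weight[ru] < 2:
            cuts.append(set(members[ru]))  # violated subtour inequality
        if components == 2:
            break                          # per the slide: stop at 2 components
    return cuts
```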
Heuristics for Finding Violated Subtour Constraints: Interval Test
• Let (c0, c1, …, cN-1) be the current champion tour, in order. Note that every subinterval (ci, ci+1, …, ct) of this order induces a subtour constraint with S = {ci, ci+1, …, ct}.
• For each i, 0 ≤ i < N-1, consider the set {ci,…,ct} for the t that minimizes δx({ci,…,ct}). If this minimum is less than 2, we have a subtour cut, for a total of as many as N-2 cuts. Add them all.
• With the right algorithms and data structures, this can be done in O(m log N) time, where m is the number of edges e with xe > 0.

Gap below the HK bound:        1,000 Cities   100,000 Cities
Connectivity                   -0.409%        -0.394%
+ Parametric Connectivity      -0.005%        -0.029%
+ Interval Test                = HK           -0.0008%

General Speedup Trick: Safe Shrinking
Exploiting the "shrinking" of a set of cities S, given a current LP solution x:
• Replace S by a single city σ, and x by a new function x', where x'(c,c') = x(c,c') for all distinct c, c' in C-S, and x'(c,σ) = ∑v∈S x(c,v) for all c in C-S.
• Find a cut in the shrunken graph, then unshrink it back to a cut in the original graph.
• A set S is "safe" for shrinking if, whenever there is a violated TSP cut for x, there is also one for x'.
• It is "template safe" for a given type of cut (subtour, comb, etc.) if, whenever there is a violated cut in the given template for x, there is also one for x'.
• A natural candidate: edges e with xe = 1.
• Unfortunately, this is not always safe…

[Figure: an LP solution with edge values in {0.5, 1.0} containing a violated comb; after an xe = 1 edge is shrunk, the result is a convex combination of two tours, so no violated cut remains.]
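The shrinking operation itself is mechanical. Here is a minimal Python sketch of the definition above (the function name and dict layout are my own, not Concorde's): edges inside S disappear, and the x-values of edges from any outside city c into S are accumulated onto the single edge {c, σ}.

```python
def shrink(x, S, sigma="sigma"):
    """Shrink city set S to a single pseudo-city sigma.

    x: dict {frozenset({u, v}): value} over the support of the LP solution.
    Returns x' with x'(c, sigma) = sum over v in S of x(c, v); edges with
    both endpoints in S vanish; all other edge values are unchanged.
    """
    xp = {}
    for e, val in x.items():
        u, v = tuple(e)
        in_u, in_v = u in S, v in S
        if in_u and in_v:
            continue                           # internal edge: disappears
        if in_u or in_v:
            c = v if in_u else u               # the endpoint outside S
            key = frozenset({c, sigma})
            xp[key] = xp.get(key, 0.0) + val   # accumulate into x'(c, sigma)
        else:
            xp[e] = val                        # untouched edge
    return xp
```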
Edge Safety Theorem
[Padberg & Rinaldi, "Facet identification for the symmetric traveling salesman problem," Math. Programming 47 (1990), 219-257]

Theorem: If, in the solution to the current LP, x(u,v) = 1 and there exists a vertex w with x(w,u) + x(w,v) = 1, then it is safe to shrink edge e = {u,v}.

Proof: Suppose that x' has no violated TSP cuts. Then it must be a convex combination of tours through the shrunken graph.

Lemma: If x is a convex combination of tours in a shrunken (or unshrunken) graph and x(e) = 1, then every tour in the convex combination contains edge e.
Proof of Lemma: The tours containing e must have coefficients summing to 1 in the combination, meaning no other tours can have coefficients greater than 0, and hence none are part of the combination. QED

Let σ be the result of merging u and v. By our hypothesis we have x'(w,σ) = 1, so every tour T in our convex combination of tours uses edge {w,σ}. For each such T, let αT be its multiplier in the convex combination.

If we restrict attention to the edges e of these tours that do not involve σ, and hence have xe = x'e, then the same convex combination sums precisely to xe for all such e. Consider a particular tour T in the combination, and let z1 be the city (other than w) that is adjacent to σ in the tour, as in the figure below (where at least one of the two edges {z1,u} and {z1,v} must be present in the original graph).

[Figure: a tour T with multiplier αT passing through w, σ, z1 in the shrunken graph (other tours in the combination use other neighbors z2, …, zk of σ); in the original graph σ expands back into the edge {u,v}, with x(w,u) = β and x(w,v) = 1-β.]

Because T carries multiplier αT, we must have x'(w,σ) ≥ αT and x'(z1,σ) ≥ αT. Thus, in the original graph, the min cut between w and z1 in the graph induced by those two cities and {u,v} must be at least αT, and there must be a flow of size αT between w and z1. This flow can be partitioned among the four paths (w,u,z1), (w,v,z1), (w,u,v,z1), and (w,v,u,z1). So the tour T in the shrunken graph can be replaced by 4 (or fewer) tours in the original graph, with multipliers summing to αT. This holds for all tours in the convex combination for the shrunken graph, perhaps involving other tour neighbors zi of σ. We thus get a convex combination of tours in the original graph, and I claim its edge weights match those under x.

By the previous argument, we have a convex combination of tours in the original graph whose value ye for any edge e satisfies ye ≤ xe. I claim that in fact we must have ye = xe for all edges e. Note that, since x is the solution to a valid LP relaxation, we have ∑e xe·d(e) ≤ OPT. So if for any e we had ye < xe, we would have a convex combination y of tours with ∑e ye·d(e) < ∑e xe·d(e) ≤ OPT, and hence at least one tour of length less than OPT, a contradiction. Thus x must be a convex combination of tours, as desired. QED

Note that this theorem can be applied sequentially, leading to the merging of many edges, and in particular of long paths. This can greatly reduce the size of the graph and consequently speed up our cut-finding heuristics and algorithms.

Untangling Convex Combinations
The previous discussion reminds us of the problem of degeneracy, where the final LP (no more TSP cuts possible) still contains fractional values, and so represents a convex combination of tours rather than a single tour. This is not a problem if our champion tour already has length equal to the LP solution value, but what if not? [ABCC06] doesn't seem to address this issue, so it is probably a rare occurrence. The likely approach is to use heuristics, exploiting the fact that the graph is by now probably very sparse and can be made much smaller:
• Only edges e with xe > 0 can be in a tour.
• All edges e with xe = 1 must be in an optimal tour, so maximal paths of such edges can be collapsed into a single forced edge representing that path.
Such an approach can also be used even before we have an unimprovable LP, as a way to potentially find new champion tours.

Managing and Solving the LPs: Core Sets
Problem: Our LPs potentially involve billions of variables.
Solution: "Core sets."
• We observe that only a relatively small number of edges typically get non-zero values when solving our LPs.
• If we knew in advance which ones they would be, we could simply eliminate all the others from the formulation by fixing their values at 0.
• Since we do not know them in advance, Concorde uses a simple heuristic to define a "core set" of edges that are allowed non-zero values. Standard possibilities:
 – The edges in some collection of good tours.
 – The edges to the k nearest neighbors of each city, for some k.
 – A combination of the two.
• Concorde uses the union of the edges occurring in the tours resulting from 10 runs of Chained Lin-Kernighan as its initial core set.
• Thereafter, it will delete an edge e from the core set if its value remains below a tolerance ε1 for some constant L1 of consecutive LP solves. Typical values are L1 = 200 and ε1 = 0.0001.
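As a toy illustration of this deletion rule (names and data layout hypothetical, not Concorde's): track, for each core edge, how many consecutive LP solves its value has stayed below ε1, and evict it once the streak reaches L1.

```python
EPS1, L1 = 0.0001, 200   # tolerance and patience, as in the text

def update_core(core_edges, x, low_count):
    """Call after each LP solve; drops core edges that have stayed near zero.

    core_edges: set of edges; x: dict of current LP values;
    low_count: dict counting consecutive solves with x[e] < EPS1.
    """
    for e in list(core_edges):
        if x.get(e, 0.0) < EPS1:
            low_count[e] = low_count.get(e, 0) + 1
            if low_count[e] >= L1:
                core_edges.remove(e)   # now fixed to 0 outside the core LP
                del low_count[e]
        else:
            low_count[e] = 0           # reset the streak
    return core_edges
```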
Managing and Solving the LPs: Adding Edges to the Core Set
• In solving the LPs, Concorde computes both a primal and a dual solution.
• Consequently, it has access to the reduced costs cj - yTAj for all the edges, including those not in the core.
• Edges with negative reduced costs are candidates for addition to the core (just as non-basic core variables with negative reduced costs are candidates for the entering variable in a step of the primal simplex algorithm).
• These are added to a queue, and every so often the 100 with the most negative reduced costs are added to the LP. It is then re-solved, and the new reduced costs for the remaining edges in the queue are computed. Any edge with reduced cost greater than some ε2 < 0 is removed from the queue. Concorde uses ε2 = -0.0001.
• One difficulty: for very large instances, computing the reduced costs for all non-core edges can be very expensive, because there are so many of them (~500 billion for N = 10^6). Concorde's solutions:
 – Heuristics for approximating the reduced costs quickly (followed by exact pricing of the good candidates).
 – Only price a fraction of the non-core edges each time, cycling through all of them over many iterations.
 – Alternate this with cycling through just those non-core edges that are within the 50 nearest neighbors of some city.
 – For geometric instances, the list of edges to consider can be substantially pruned based on the city coordinates and values from the current dual solution.

Managing Cuts
• Start with just the degree-2 constraints.
 – Note that we do not need an integer solution (as I described last time, using b-matching).
 – A fractional solution, with edge values in {0, ½, 1}, will suffice for getting us started.
 – This can be accomplished without solving the LP, via a primal-dual algorithm.
• Subsequently, when a cut is found by a separation routine, it is appended to the end of a queue of cuts waiting to be added to the core LP. If the queue is small, we may call these routines many times, thus adding many cuts to the queue.
• When the cut-finding process stops, we repeat the following until the queue is empty or we have added 250 cuts to the core LP:
 – Take the first cut from the queue.
 – Check to confirm that it is still violated by the current x by at least some small tolerance (say 0.002). This is needed since the cut may have been found for some earlier value of x.
 – If the cut is still violated in this sense, add it to the core LP; otherwise, discard it.
• Cuts are deleted if their dual variables have been less than some fixed tolerance (say 0.001) for 10 consecutive LP solves.

Storing Cuts
• Problem: It is not necessarily efficient to store cuts as actual inequalities over the variables (or, worse yet, as vectors of length |C|), since this can be a very inefficient use of space:
 – 81 Gbytes for TSPLIB instance pla7397 in vector form.
 – This can be reduced to 96 Mbytes by more efficient representations:
  • Lists of sets.
  • Variable-length codes to point to sets (based on their frequency of occurrence).
  • Intervals represented by their endpoints, or better yet, their first endpoint and the interval length.
  • Etc.
• We need to do some computation to decode the representations (and their effect on the core set of variables), but this is a worthwhile tradeoff.
• Also, it pays to choose, among the many equivalent representations, the one that leads to the fewest non-zeros in the core-set LP formulation.
 – For instance, one can represent a subtour inequality by either S or C-S, and the smaller of the two sets is likely to be better in this regard.

Solving the LPs
• Use the Dual Steepest Edge variant of the Simplex algorithm.
• [ABCC06] used the CPLEX package when the book was being written; the current Concorde package includes its own LP solver (QSopt), tailored to the kinds of LPs encountered in Concorde.
• I am omitting loads of details (as I have in all the other Concorde-related issues I have discussed).
• One detail that should be discussed: round-off error and valid bounds.
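Returning to the pricing step above: for the TSP LP, the column of edge e = {u,v} has 1's in the rows of its two degree constraints and of every cut constraint that e crosses, so cj - yTAj works out to d(u,v) - yu - yv minus the duals of the crossed cuts. A minimal sketch of a pricing scan under that layout (the cut_duals representation is my own, not Concorde's):

```python
import heapq

def price_noncore(noncore_edges, d, y_city, cut_duals, batch=100):
    """Find the `batch` non-core edges with the most negative reduced costs.

    d(u, v): edge length; y_city: duals of the degree constraints;
    cut_duals: list of (set_S, dual_value) pairs for the cut constraints.
    """
    def reduced_cost(u, v):
        rc = d(u, v) - y_city[u] - y_city[v]
        for S, yS in cut_duals:
            if (u in S) != (v in S):   # edge {u,v} crosses the cut S
                rc -= yS
        return rc

    candidates = []
    for (u, v) in noncore_edges:
        rc = reduced_cost(u, v)
        if rc < 0:                     # attractive: could lower the LP value
            candidates.append((rc, u, v))
    return heapq.nsmallest(batch, candidates)   # most negative first
```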
Coping with Round-Off Error
Commercial linear programming codes use floating-point arithmetic. Their arithmetic routines presumably meet the IEEE standard, but this still means that one can only report results to within some fixed tolerance. Given this, our LP solutions only yield imprecise lower bounds, and they may even be in error: the reported solution may violate some of the LP cut constraints if it is very close to the boundary of the cut, and merely setting tolerances may not suffice.

One frequently used approach: use exact-arithmetic LP codes (or exact-arithmetic hand computations, in the case of [DFJ54]). Unfortunately, this is very slow.

Concorde's approach:
• Start with the solution to the dual (which it already computes).
• In the exact-arithmetic world, this equals the primal optimum, and any feasible solution to the dual is a lower bound on the primal optimum.
• Find a fixed-precision feasible solution to the dual by exploiting the fact that all the dual constraints have unique slack variables. (In Concorde's case that precision is 32 bits each to the right and left of the decimal point.)
• This remains quite close to the floating-point optimum.

Branching
There are actually two types of branching in Concorde:
1. Edge branching (xe = 1 or xe = 0).
2. Subtour branching [Clochard & Naddef, "Using path inequalities in a branch and cut code for the symmetric traveling salesman problem," Third IPCO Conference (1993), 291-311]: for some subset S of cities that does not yield a violated subtour inequality, break into cases depending on whether δx(S) ≤ 2 or δx(S) ≥ 4. This covers all possibilities, since in a tour we cannot have an odd value for δx(S).

Before we split the root subproblem, we first fix as many of the edge values at 0 or 1 as possible, using the reduced costs λe of the edges in the final LP and the "integrality gap" Δ between the length Length(T*) of our current best tour and the LP solution value: fix xe = 1 if λe < -(Δ-1), and fix xe = 0 if λe > Δ-1. (In each case, fixing the variable to the other choice from {0,1} would cause the LB to grow beyond Length(T*) - 1, allowing us to prune the subproblem.)

Choosing the Split
Edge Candidates: For each fractional variable xe, estimate the change (z0 or z1) in LP objective value that would result from setting xe to 0 or 1, respectively, in our current LP and making a single dual simplex pivot. Rank the choices by the formula
 p(z0, z1, γ) = (γ·min(z0, z1) + max(z0, z1)) / (γ + 1),
with γ = 10, saving the top 5 as candidates. (It is more important to improve the smaller bound than the larger.)

Subtour Candidates: For each of the 3,500 sets S involved in our cuts that have δx(S) closest to 3, rank each choice in an analogous way and save the top 25 candidates. Get another 25 candidates from a separate, more combinatorial approach.

Ranking the Candidates: Strong Branching
Each candidate consists of a pair of constraint sets which produce a disjoint partition of the possibilities, either
• (xe = 1, xe = 0), or
• (δx(S) ≤ 2, δx(S) ≥ 4).
Denote these pairs by (P0^1, P1^1), (P0^2, P1^2), …, (P0^k, P1^k). For each Pi^j, add the corresponding constraints to the LP and, starting with the current basic solution, perform a fixed (limited) number of dual-steepest-edge simplex pivots; let zi^j be the resulting dual objective value. (If the problem is infeasible, set zi^j = Length(T*).) Pick the candidate that maximizes p(z0^j, z1^j, 100).
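The ranking formula is simple enough to state directly in code. A small sketch (the probe function standing in for the limited dual simplex pivots is hypothetical):

```python
def p(z0, z1, gamma):
    """Branching score: a weighted combination favoring the weaker of the
    two child bounds (gamma > 1 puts more weight on min(z0, z1))."""
    return (gamma * min(z0, z1) + max(z0, z1)) / (gamma + 1.0)

def strong_branch(candidates, probe, tour_len):
    """Pick the candidate pair (P0, P1) maximizing p(z0, z1, 100).

    probe(P) returns the dual objective after a limited number of
    dual-steepest-edge pivots with P's constraints added, or None if
    the LP became infeasible (in which case we use Length(T*)).
    """
    def bound(P):
        z = probe(P)
        return tour_len if z is None else z

    return max(candidates,
               key=lambda pair: p(bound(pair[0]), bound(pair[1]), 100))
```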
For the most difficult instances, we can take this one step further ("tentative branching"), whose added cost may be justified by the resulting need for fewer subproblems:
• Take the top h candidates according to the above ranking.
• For each, use the full cutting-plane approach to get lower bounds ż0^j and ż1^j.
• Then rank these by p(ż0^j, ż1^j, 10) and take the best.

Branching Performance
Start with a root LP whose value is 0.00009% below the optimal tour length. (Tour found by Keld Helsgaun using a variant of his "LKH" algorithm. More to come about how the root LP was built.) Computations performed on a network of 250 2.6 GHz AMD Opteron processors; the CPU times are the sums of the times used on the individual processors.

pla85900                      # of Subproblems   CPU Time
Strong Branching              >> 3,000           >> 4 years
Tentative Branching, h = 4    1,149              1.8 years
Tentative Branching, h = 32   243                0.8 years

More on pla85900's Solution
Not solved with vanilla Concorde. Instead:
• Perform an initial run (without branching) to drive up the root solution. Result: bound 0.0802% below optimal, 22.6 CPU days.
• Perform a run with an upper bound of 142,320,000 (below optimal), starting with cuts from the previous run. Result: bound 0.0456% below optimal, 144.0 CPU days.
• Perform a run with an upper bound of 142,340,000 (still below optimal), starting with the cuts from previous runs. Result: bound 0.0395% below optimal, 227.3 CPU days.
• Perform a run with the true upper bound, starting with the cuts from all the previous runs and running until there were 1,000 active search nodes. Result: bound 0.0324% below optimal, 492.8 CPU days.
• Repeat the previous step, but including the new cuts it generated. Result: bound 0.0304% below optimal, 740.0 CPU days.
• Over the course of a year, repeatedly apply all of Concorde's separation routines to further drive up the root bound. Result: bound 0.00087% below optimal, lots of CPU days…

Cutting Planes in the pla85900 Root LP
Type                  Number
Subtours              1,030
Combs                 2,787
Paths and Stars       4,048
Bipartition           41
Domino Parity         164
Others (Local Cuts)   809

With this last root LP, Concorde finally solved pla85900 in just 2,719.5 CPU days. Adding the additional cuts found to the root LP led to the even better root gap of 0.00009% and the results quoted three slides back.

Vanilla Concorde on Random Euclidean Instances
Mean running times, with the number of sample instances per entry (2d/3d/7d = dimension of the instances; blank entries: no runs reported):

        ABCC06 2d           iMac14 2d          iMac14 3d         iMac14 7d
N       Samples  Mean       Samples  Mean      Samples  Mean     Samples  Mean
100     10,000   0.7
200     10,000   3.6
300     10,000   10.6
400     10,000   25.9
500     10,000   50.2       1,000    20.9      1,000    6.6      1,000    2.6
600     10,000   92.0       1,000    35.2      1,000    9.8      1,000    4.1
700     10,000   154.5      1,000    52.3      1,000    13.2     1,000    5.6
800     10,000   250.4      1,000    81.7      1,000    17.4     1,000    6.6
900     10,000   384.7      1,000    116.0     1,000    21.0     1,000    8.1
1000    10,000   601.6      277      158.6     1,000    29.4     1,000    9.1
1500    1,000    3,290.9                       1,000    61.3     1,000    10.6
2000    1,000    14,065.6                                        1,000    27.1
2500    1,000    53,737.9                                        100      45.7

7d requires a version of Concorde modified for higher fixed precision. (Modified by Jeffrey Stoltz with hints from Bill Cook.)

Next Time
More properties of Random Euclidean Instances (as revealed by Concorde)