Guiding Combinatorial Optimization with UCT
Ashish Sabharwal and Horst Samulowitz
IBM Watson Research Center
(presented by Raghuram Ramanujan)
MCTS Workshop at ICAPS-2011, June 12, 2011
© 2011 IBM Corporation

MCTS and Combinatorial Search
- Monte Carlo Tree Search (MCTS): widely used in a variety of domains in AI
- Upper Confidence bounds on Trees (UCT): a form of MCTS, especially successful in two-agent game tree search, e.g., Go, Kriegspiel, Mancala, General Game Playing
  - Based on single-agent tree search: one multi-armed bandit at each node of a tree
  - Goal: find the most “rewarding” root-to-leaf path in the tree
- Combinatorial Search
  - A discrete search space, e.g., {0,1}^N or {R, G, B}^N
  - A “feasible” subspace of interest: typically defined indirectly by a finite set of constraints
  - Goal: find a solution, i.e., an element of the discrete space that satisfies all constraints
  - If a utility function / objective function is given: find an optimal solution
  - E.g., Boolean Satisfiability (SAT), Graph Coloring (COL), Constraint Satisfaction Problems (CSPs), Constraint Optimization, Integer Programming (IP)
- Can MCTS/UCT-inspired techniques be used to improve the performance of combinatorial search algorithms?
[Figure: graph coloring example]

Mixed Integer Programming (MIP): A Challenging but Promising Opportunity
- MIP: linear inequality constraints, continuous & discrete variables
  - Typically with a linear (or quadratic) objective function
  - NP-hard; highly useful, with several academic and commercial solvers available
- MIP search appears much more suitable than, e.g., SAT for applying UCT!
Opportunity for applying UCT
- MIP solvers such as IBM ILOG’s CPLEX, Gurobi, etc.:
  - maintain a “frontier” of open nodes, exploring them with a combination of best-first search, “diving” to the bottom of the tree, etc.
  - rely on spending substantial effort per node, e.g., computing the LP relaxation to obtain a bound on the objective value in the subtree: an estimate of the true value
- In contrast, state-of-the-art SAT solvers are not easily adapted to UCT; they:
  - are based on enhancements to basic depth-first search traversal
  - rely on processing nodes extremely fast (~2000-5000 per second)
- Can we improve CPLEX by letting UCT decide the search tree exploration order?

Mixed Integer Programming (MIP): A Challenging but Promising Opportunity
Challenges and Differences from the “usual” setup for UCT
- Biggest success of UCT so far: two-agent game tree search, rather than single-agent
- Random playouts are costly to implement in MIP search
- Unlike game tree search, it is too costly to create a full UCT tree at each node
- Exploitation isn’t very meaningful after the true value of a node is revealed: no reason to repeatedly visit that node even if it is optimal
- LP relaxation: available for “free”, provides a guaranteed bound on the true value
  - averaging backups may not be the best strategy!
- Highly optimized commercial MIP solvers such as CPLEX are very hard to improve upon!
- Implementation: no easy access to CPLEX’s internal data structures; must maintain our own “shadow tree” for exploring UCT strategies, which adds overhead
Main Finding: Guidance near the top of the tree can improve performance across a variety of instances!

How does Search in CPLEX (roughly) work?
CPLEX explores the search tree by alternating between two operations:
I. Node Selection: Select the next open search node to continue search on
   - CPLEX selects the node with the best estimate E
II. Branching: Select the next variable to branch on (assume binary branching)
[Figure: search tree with root estimate E0, binary branches on x (10), y (5), z (2), and v (1), and node estimates E1 to E8]
- Node Selection: Initially only one node that can be selected
- Branching: Select variable x
- Node Selection: Select node with estimate E1
- Branching: Select variable y
- Node Selection: Select node with estimate E2
- Branching: Select variable z
- Node Selection: Select node with estimate E5
- Branching: Select variable v
Legend: CPLEX open nodes with the corresponding quality estimate E of the underlying sub-tree (e.g., LP objective value); CPLEX closed nodes

Guiding Node Selection in CPLEX with UCT
Goal: Balance exploration / exploitation in CPLEX search
Node Selection with UCT
- Idea: expand nodes in the order in which UCT would expand them
- Traverse the search tree from the root to a current leaf node (i.e., an “open” node), selecting at each node the child that has the highest UCT score s
- UCT score s: combines an estimate of the “quality” of a node (the same one CPLEX uses) with how often this node has already been visited
Tree Update Phase
- When node selection reaches a leaf node:
  - compute its quality estimate (e.g., objective value of the LP relaxation) and propagate it upwards towards the root
  - branch on this node using the default variable/value selection of CPLEX
- Update rule / backup operator: max of the two children (no averaging!) for a maximization problem; min for a minimization problem
- Result: the estimate at each node N along this leaf-to-root path equals the best value seen in the entire sub-tree under N

Guiding Search in CPLEX with UCT: Node Selection
- Node selection is now guided by UCT scores (as illustrated below)
- The UCT score is based on the estimate E and the number of visits to a search node
- In order to employ UCT, one needs to maintain a shadow tree of CPLEX’s search tree
  - CPLEX maintains just a frontier of open nodes; the underlying search tree only exists implicitly
[Figure: the same search tree, with each node annotated by its estimate (E0 to E8) and visit count (#visits0 to #visits8)]
- Node Selection: Initially only one node that can be selected
- Branching: Select variable x
- Node Selection: Select node with highest UCT score based on E1 and #visits1
- Branching: Select variable y
- Node Selection: Select node with highest UCT score based on E2 and #visits2
- …

Guiding Search in CPLEX with UCT: Tree Update Phase
- After selecting a node N and branching on a variable, two child nodes N_left and N_right are created with their corresponding estimates E_left and E_right
- When propagating estimates upwards, we only consider the best estimate (i.e., no averaging)
- Update using the “backup operator”
[Figure: search tree with root E0, children E1 and E2, and children E3 and E4 below E1]
- Propagate max(E1, E2) to E0
- Propagate max(E3, E4) to E1
as long as new estimates improve the current best estimate at a node on the path to the root. E.g., the new estimate E4 is propagated to the node labeled with E0 only if E4 improves on E0. However, visit counts are updated for each node on the path to the root.
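The tree update phase can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ShadowNode` class, the `backup` function, and the example values are hypothetical, and "improves" is assumed to mean "larger" for maximization and "smaller" for minimization. It only shows the max-style backup combined with unconditional visit-count updates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShadowNode:
    """Hypothetical shadow-tree node (not a CPLEX data structure)."""
    estimate: float                      # e.g., LP relaxation objective value
    parent: Optional["ShadowNode"] = None
    visits: int = 0


def backup(node: ShadowNode, e_left: float, e_right: float,
           maximize: bool = True) -> None:
    """Back up the better of two child estimates from `node` toward the root."""
    pick = max if maximize else min
    value = pick(e_left, e_right)        # best child estimate, no averaging
    propagating = True
    current: Optional[ShadowNode] = node
    while current is not None:
        current.visits += 1              # visit counts are always updated
        if propagating:
            if maximize:
                improves = value > current.estimate
            else:
                improves = value < current.estimate
            if improves:
                current.estimate = value
            else:
                propagating = False      # stop changing estimates further up
        current = current.parent


# Illustrative values only (assumed, not taken from the slides).
root = ShadowNode(estimate=10.0)
n1 = ShadowNode(estimate=8.0, parent=root)
backup(n1, e_left=12.0, e_right=9.0)     # 12.0 improves on n1 and on root
n2 = ShadowNode(estimate=7.0, parent=root)
backup(n2, e_left=6.0, e_right=5.0)      # no improvement; only visits change
```

In this sketch an estimate stops propagating at the first node it fails to improve, while the walk to the root continues so that every node on the path still records the visit.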
UCT Score: “Epsilon Greedy” Variant of UCB1
UCT score computation: [formula not recovered from the slide]
- N = tree node under consideration
- P = parent of N
- A constant balancing exploration and exploitation (0.7 in experiments)
- A second parameter: theoretically a number decreasing inversely proportional to visits(N) (a constant set to 0.01 in experiments)
Fast and accurate enough for our purposes, compared to the standard UCB1 formula

Experimental Evaluation
Starting with 1,024 publicly available MIP instances, we removed:
- All instances solved by default CPLEX within 10 seconds (too easy)
- All instances not solved by default CPLEX within 900 seconds (too hard)
Experimental evaluation is based on the 170 remaining instances
- Spanning a variety of domains
- Experimentation not limited to any particular instance family (e.g., TSP instances, set covering, etc.)
Experiments were conducted on:
- Intel Xeon CPU E5410, 2.33GHz with 8 cores, and 32GB of memory
- Only a single run per machine, since multiple CPLEX runs on one machine can (and often do!) interfere with each other
- OS: Ubuntu

Experimental Evaluation: Solvers
- Default CPLEX
  - Uses various strategies, including a combination of best-first node selection and depth-first “diving” to reach a leaf node from each best node
  - Highly optimized; very challenging to beat by a large margin across a large variety of problem domains
- CPLEX with node selection guided by UCT
  - Best results when guidance is limited to the top 5 levels of the tree; then revert to the default node selection of CPLEX
- Other standard exploration schemes
  - Best-first
  - Breadth-first
  - Depth-first

Preliminary Experimental Results [timeout: 600 sec]
Promising performance: UCT guidance results in
- the fewest instances timing out (8)
- being fastest on 39 instances
- the lowest average runtime (albeit only by a few seconds)

Preliminary Experimental Results
Pairwise performance measure (timeout: 600 sec): how often does the row solver outperform the column solver?
- E.g., UCT guidance outperforms default CPLEX on 64 instances; the reverse holds on 52 instances
Promising performance: UCT guidance outperforms default CPLEX and other natural alternatives

Conclusion
- Explored the use of MCTS/UCT in a combinatorial search setting
  - Specifically, for mixed integer programming (MIP) search, with CPLEX
  - Typical “random playouts” are very costly, but the LP relaxation objective value serves as a good estimate: a guaranteed one-sided bound!
  - Max-style update rule performs better here than the usual averaging backups
- Guiding combinatorial search with UCT holds promise!
  - Improving performance of highly optimized MIP solvers across a variety of problem domains is a huge challenge
  - UCT-inspired guidance for node selection shows promise
  - Most benefit when UCT is used only near the top of the search tree
- Further exploration along these lines appears fruitful, e.g.:
  - using UCT for variable or value selection (rather than node selection)
  - building a “full” UCT tree at each search tree node before branching
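The observation that UCT guidance helps most near the top of the tree can be illustrated with a small Python sketch of depth-limited node selection. Everything here is an assumption made for illustration: the `Node` class, the function names, and the fallback rule (best estimate among open nodes) are hypothetical, and the standard UCB1 bonus is swapped in as a stand-in score rather than the epsilon-greedy variant used in the talk. It is not CPLEX's API or the authors' code.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical shadow-tree node; names are illustrative only."""
    estimate: float
    visits: int = 0
    children: List["Node"] = field(default_factory=list)

    @property
    def is_open(self) -> bool:
        return not self.children         # a leaf of the shadow tree


def ucb1_score(node: Node, parent: Node, c: float = 0.7) -> float:
    """Standard UCB1-style score (a stand-in, not the talk's formula)."""
    if node.visits == 0:
        return math.inf                  # try unvisited children first
    bonus = math.sqrt(math.log(max(parent.visits, 1)) / node.visits)
    return node.estimate + c * bonus


def best_open_by_estimate(node: Node) -> Node:
    """Assumed stand-in for a default rule: best estimate among open nodes."""
    if node.is_open:
        return node
    candidates = (best_open_by_estimate(ch) for ch in node.children)
    return max(candidates, key=lambda n: n.estimate)


def select_node(root: Node, max_depth: int = 5,
                score: Callable[[Node, Node], float] = ucb1_score) -> Node:
    """Follow the highest score for at most `max_depth` levels, then fall back."""
    node, depth = root, 0
    while not node.is_open and depth < max_depth:
        parent = node
        node = max(parent.children, key=lambda ch: score(ch, parent))
        depth += 1
    return node if node.is_open else best_open_by_estimate(node)


# Illustrative tree (values assumed): root -> a, b; a -> a1, a2.
a1 = Node(estimate=5.0, visits=1)
a2 = Node(estimate=3.0, visits=0)
a = Node(estimate=5.0, visits=2, children=[a1, a2])
b = Node(estimate=4.0, visits=1)
root = Node(estimate=5.0, visits=3, children=[a, b])

guided = select_node(root)               # score-guided all the way down
shallow = select_node(root, max_depth=1) # one guided level, then fallback
```

Passing the score as a parameter keeps the depth cutoff separate from the scoring rule, so a different score (for example, one closer to the talk's variant) could be substituted without touching the traversal.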