Thesis Exposé
Likely topic:
Reinforcement learning for Combinatorial Optimization (CO): an in-depth look at the
Travelling Salesman Problem (TSP)
Optimization is the process of finding the optimal value among different possibilities for a
problem. Combinatorial Optimization (CO) problems have been conventionally understood to
be optimization problems in discrete space. One common example of a CO problem is the
Travelling Salesman Problem (TSP), where the goal is to find the shortest route that visits each vertex exactly once and returns to the starting vertex.
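To make this concrete, the following minimal sketch (purely illustrative, with Euclidean distances assumed) computes the length of a closed tour and finds the optimal tour of a tiny instance by exhaustively enumerating permutations; the factorial growth of that permutation space is what makes exact enumeration hopeless beyond small instances.

```python
import itertools
import math

# Minimal brute-force TSP sketch (illustrative only): feasible solutions are
# permutations of the cities, so the search space grows factorially with n.
def tour_length(tour, cities):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def brute_force_tsp(cities):
    """Shortest closed tour, found by enumerating all permutations."""
    n = len(cities)
    best = min(itertools.permutations(range(1, n)),   # fix city 0 as the start
               key=lambda rest: tour_length((0,) + rest, cities))
    return (0,) + best

cities = [(0, 0), (3, 0), (3, 2), (0, 2)]              # a tiny Euclidean instance
tour = brute_force_tsp(cities)
print(tour, round(tour_length(tour, cities), 3))       # e.g. (0, 1, 2, 3) 10.0
```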
Many CO problems, including the TSP, are NP-hard, and numerous algorithms have been designed that solve them approximately or heuristically. In recent years, training a machine learning (ML) model to solve CO problems has become increasingly popular. In this study, we focus on a particular branch of ML, reinforcement learning (RL), and how it stacks up against other ML approaches as well as classical heuristics in terms of performance, generalizability, efficiency, and related criteria. Note that, in this study, ML refers to popular learning approaches such as deep learning and other supervised methods based on neural networks.
RL can be used for end-to-end optimization or as a tool to support lower-level decision-making within a higher-level optimization framework. This study will examine how these approaches compare against one another. The study will be broadly divided into the following sections: Introduction & Related Work, Background, Approaches (Methodology), and Outlook (Conclusion).
Introduction
The introduction section will motivate the relevance of the RL approach to CO, despite the existing popular heuristic and other solvers. It will provide definitions of mixed-integer linear programs (MILP), the state space, action space, reward function, transition function, discount factor, etc. It will also introduce the general idea of the Markov decision process (MDP) through concepts such as the policy function and the optimal policy, including the basic differences between value-based and policy-based approaches. In addition, this section will introduce the relevant terminology for CO, with a particular focus on the TSP. It will end by describing the process of solving a CO problem using RL.
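As a concrete illustration of these definitions, the following minimal sketch (the class and method names are hypothetical, not an existing library's API) casts the TSP as a Markov decision process: the state is the partial tour plus the set of unvisited cities, an action selects the next city, and the step reward is the negative distance travelled, so maximizing the return minimizes the tour length.

```python
import math
import random

class TSPTourEnv:
    """Hypothetical sketch of the TSP cast as an MDP: the state is the partial
    tour plus the set of unvisited cities, an action picks the next city, and
    each step's reward is the negative distance travelled."""

    def __init__(self, cities):
        self.cities = cities

    def reset(self):
        self.tour = [0]                                 # always start at city 0
        self.unvisited = set(range(1, len(self.cities)))
        return self.tour, self.unvisited                # the initial state

    def step(self, city):
        assert city in self.unvisited, "actions are the unvisited cities"
        reward = -math.dist(self.cities[self.tour[-1]], self.cities[city])
        self.tour.append(city)
        self.unvisited.remove(city)
        done = not self.unvisited
        if done:                                        # close the tour back to the start
            reward -= math.dist(self.cities[city], self.cities[0])
        return (self.tour, self.unvisited), reward, done

# Rollout of a uniformly random policy; the undiscounted return equals
# minus the length of the sampled tour.
env = TSPTourEnv([(0, 0), (2, 0), (2, 2), (0, 2)])
state, ret, done = env.reset(), 0.0, False
while not done:
    state, reward, done = env.step(random.choice(sorted(env.unvisited)))
    ret += reward
print(env.tour, round(-ret, 3))
```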
Background
The background section will provide the mathematical formulation of combinatorial optimization problems and reinforcement learning algorithms, along with mixed-integer linear programming (MILP). Within combinatorial optimization, various classes of CO problems will be introduced, with a particular focus on the TSP. This section will dive deeper into RL terminology, giving the mathematical details of the Markov decision process and bringing together all the definitions mentioned in the Introduction, along with the concepts of Bellman equations, dynamic programming, the exploration-vs-exploitation dilemma, etc. It will also dive deeper into the TSP, with a brief review of the historical literature on how this problem has been addressed.
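For reference, a standard form of the Bellman optimality equations that this section will formalize, written in terms of the state space S, action space A, reward function R, transition probabilities P and discount factor γ introduced above, is:

```latex
V^{*}(s) = \max_{a \in \mathcal{A}(s)} \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, V^{*}(s') \Big],
\qquad
Q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a) \max_{a' \in \mathcal{A}(s')} Q^{*}(s',a').
```

An optimal policy then acts greedily with respect to Q*, picking in each state an action that maximizes Q*(s, a); dynamic programming methods such as value iteration solve these equations exactly, while RL methods approximate their solution from sampled transitions.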
Approaches / Methodology
The approaches/methodology section will make up the bulk of this study; here, the recent literature will be broadly surveyed in order to answer the following research questions:
1. Can the end-to-end reinforcement learning approach for combinatorial optimization be improved by combining reinforcement learning with classical CO (optimization) algorithms?
2. Does RL play a role as a modeling tool for CO, rather than merely improving performance within other optimization algorithms?
3. Up to what problem size can RL handle CO problems effectively? Does this vary across different CO problems? Does scaling cause difficulties?
4. Which of the approaches can be more readily generalized (i.e., applied more broadly across a range of problems)?
5. Is there an inherent characteristic of a CO problem that makes it more (or less) amenable to RL?
Brief description of the various approaches:
One approach to solving CO assumes theoretical and/or empirical knowledge about the decisions to be made, but relieves the computational burden of all or some of those decisions using ML and, more recently, RL. Hence the ML/RL model supports lower-level decisions while the high-level structure is controlled by a master algorithm. One popular master algorithm is the branch-and-bound tree for mixed-integer linear programming (MILP): the overall algorithm remains a branch-and-bound framework, while the task of selecting the branching variable is a good candidate for the application of ML/RL. Another such concept is found in the branch-and-cut algorithm, where ML/RL is used as a tool to estimate the bound improvement of a relaxation and thereby select the most promising cuts of the feasibility space, easing the very significant computational burden of solving, for example, semidefinite programming relaxations. In a related vein, He, Daumé III and Eisner (2014) devised a policy to select the node containing the optimal solution in its subtree.

This study will investigate whether there is a benefit to using RL here, and whether the readily available branching strategies are too heuristic or too slow, such that RL could help remedy some of these drawbacks. An additional objective of this study is to see how RL stacks up against other ML algorithms such as deep learning, as well as other supervised and unsupervised ML algorithms. Similarly, this study will also investigate how these various approaches perform across a range of problems within CO; the work of Lodi & Zarpellon (2017) as well as Hottung et al. (2017) addresses this issue and will be referred to in the survey. While the study will mostly focus on the TSP where literature is available, a more generic view of different problem classes within CO will also be provided.
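As a minimal illustration of this division of labour (the node representation and field names below are assumptions made only for this sketch, not a real solver's API), a hand-crafted branching rule and a learned scoring policy can share the same interface inside an otherwise unchanged branch-and-bound framework:

```python
import random

# Hypothetical sketch of where a learned rule plugs into branch-and-bound:
# the tree search stays untouched, while the choice of the branching
# variable is delegated to a pluggable selection function.

def most_fractional(candidates):
    """Classical hand-crafted rule: branch on the variable closest to 0.5."""
    return max(candidates, key=lambda v: 0.5 - abs(v["lp_value"] % 1 - 0.5))

def learned_rule(candidates):
    """Stand-in for an RL policy: in practice the scores would come from a
    trained model conditioned on the node; here they are random numbers."""
    return max(candidates, key=lambda _: random.random())

# Fractional variables of a toy LP relaxation at one branch-and-bound node.
node_candidates = [
    {"name": "x1", "lp_value": 0.50},
    {"name": "x2", "lp_value": 0.90},
    {"name": "x3", "lp_value": 0.30},
]

print("heuristic rule branches on:", most_fractional(node_candidates)["name"])
print("learned rule branches on:  ", learned_rule(node_candidates)["name"])
```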
Another approach is to train a model for end-to-end CO, using RL to learn a policy within the Markov decision process framework by matching the reward signal to the optimization objective. Although the idea of using ML to solve CO is not new and dates back to the nineties, pioneering recent work was done by Vinyals, Fortunato and Jaitly (2015), who introduce an encoder/decoder that produces a probability distribution over the nodes of the TSP, which makes it possible to use the network on different input graph sizes. The authors use a supervised ML method. Bello, Pham, Le, Norouzi and Bengio (2017) train a similar model using RL, with the tour length as the reward signal. This study will compare these two approaches and examine whether RL has advantages over other ML algorithms.
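To illustrate the flavour of the RL variant under heavy simplification, the sketch below replaces the encoder/decoder architecture of the cited works with a plain learnable matrix of transition logits and trains it with a REINFORCE update and a moving-average baseline on the negative tour length; it is an illustrative toy, not a reproduction of either paper's method.

```python
import numpy as np

# Toy REINFORCE sketch in the spirit of Bello et al. (2017): the negative
# tour length serves as the reward signal. Instead of an encoder/decoder
# network, the policy is deliberately reduced to a learnable logit matrix
# theta[i, j] over city transitions, purely to keep the sketch short.
rng = np.random.default_rng(0)
cities = rng.random((8, 2))                            # 8 random cities in the unit square
dist = np.linalg.norm(cities[:, None] - cities[None, :], axis=-1)
theta = np.zeros_like(dist)                            # policy parameters (transition logits)

def tour_length(tour):
    return sum(dist[tour[i], tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def sample_tour():
    """Sample a tour city by city and record the data needed for the update."""
    tour, trajectory = [0], []
    unvisited = list(range(1, len(cities)))
    while unvisited:
        logits = theta[tour[-1], unvisited]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                           # softmax over unvisited cities
        idx = rng.choice(len(unvisited), p=probs)
        trajectory.append((tour[-1], list(unvisited), idx, probs))
        tour.append(unvisited.pop(idx))
    return tour, trajectory

baseline, lr = None, 0.5
for step in range(2000):
    tour, trajectory = sample_tour()
    length = tour_length(tour)
    baseline = length if baseline is None else 0.9 * baseline + 0.1 * length
    advantage = baseline - length                      # reward = -length, minus a moving-average baseline
    for current, unvisited, idx, probs in trajectory:  # REINFORCE: theta += lr * advantage * grad log pi
        grad = -probs
        grad[idx] += 1.0
        theta[current, unvisited] += lr * advantage * grad
print("sampled tour length after training:", round(tour_length(sample_tour()[0]), 3))
```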
Finally, this study will also look across the two aforementioned approaches to investigate the setting in which RL contributes best, whether as an end-to-end approach or in combination with other optimization techniques.
Outlook
In this section, the main findings of the study will be presented along with the challenges of using RL for combinatorial optimization. The findings will attempt to address the research questions raised in the preceding section, focusing mainly on the TSP while also covering other CO problems. The study will conclude with an outline of future areas of work and recommended directions for further research.