Thesis Exposé

Likely topic: Reinforcement Learning for Combinatorial Optimization (CO): an in-depth look at the Travelling Salesman Problem (TSP)

Optimization is the process of finding the best value among the feasible possibilities for a problem. Combinatorial Optimization (CO) problems are conventionally understood to be optimization problems over discrete spaces. A common example of a CO problem is the Travelling Salesman Problem (TSP), whose goal is to find the shortest route that visits each vertex exactly once and returns to the starting vertex. Many CO problems, including the TSP, are NP-hard, and many algorithms have been designed that solve these problems approximately or heuristically. In recent years, training a machine learning (ML) model to solve CO problems has become increasingly popular. In this study, we focus on a particular branch of ML called reinforcement learning (RL) and how it stacks up against other ML approaches as well as classical heuristics in terms of performance, generalizability, efficiency, etc. Please note that in this study, ML refers to popular machine learning methods other than RL, such as deep learning and other supervised algorithms based on neural networks. RL can be used for end-to-end optimization or as a tool supporting lower-level decision-making within a different, higher-level optimization framework. This study will compare these different approaches against one another. The study will be divided broadly into the following sections: Introduction & Related Work, Background, Approaches (Methodology), and Outlook (Conclusion).

Introduction

The introduction section will provide the motivation for the relevance of the RL approach in CO, despite the existing popular heuristic and other solvers. It will also provide definitions of mixed-integer linear programs (MILP), state space, action space, reward function, transition function, discount factor, etc. It will further introduce the general idea of the Markov decision process by means of concepts such as the policy function and the optimal policy, including the basic differences between value-based and policy-based approaches. In addition, this section will introduce the various relevant terms for CO with a particular focus on the TSP. It will end by describing the process of solving a CO problem using RL.

Background

The background section will provide the mathematical formulation of combinatorial optimization problems and reinforcement learning algorithms, along with mixed-integer linear programming (MILP). Within combinatorial optimization, various classes of CO problems will be introduced with a particular focus on the TSP. This section will dive deeper into the RL terminology, giving the mathematical details of the Markov decision process and bringing together all the definitions mentioned in the Introduction, along with concepts such as the Bellman equations, dynamic programming, and the exploration vs. exploitation dilemma. This section will also dive deeper into the TSP, along with a brief outlook on the historical literature on how this problem has been addressed.
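To make the MDP view of the TSP concrete, the following is a minimal, illustrative Python sketch that casts a TSP instance as a Markov decision process: the state is the current city together with the set of already visited cities, an action is the choice of the next unvisited city, and the reward is the negative length of the chosen edge. The class and method names are assumptions made purely for illustration and are not taken from any of the cited works.

# A minimal, illustrative sketch of the TSP cast as a Markov decision
# process (assumed names and structure, not taken from the cited works).
# State  : current city plus the set of already visited cities.
# Action : an unvisited city to travel to next.
# Reward : negative length of the chosen edge, so maximizing the return
#          corresponds to minimizing the tour length.
import numpy as np

class TSPEnvironment:
    def __init__(self, coords):
        self.coords = np.asarray(coords)   # (n, 2) array of city coordinates
        self.n = len(self.coords)
        self.reset()

    def reset(self):
        self.current = 0                   # the tour starts (and ends) at city 0
        self.visited = {0}
        return self.current, frozenset(self.visited)

    def actions(self):
        # Feasible actions: all cities that have not been visited yet.
        return [c for c in range(self.n) if c not in self.visited]

    def step(self, city):
        dist = np.linalg.norm(self.coords[self.current] - self.coords[city])
        reward = -dist                     # shorter edges give higher reward
        self.current = city
        self.visited.add(city)
        done = len(self.visited) == self.n
        if done:                           # close the tour back to city 0
            reward -= np.linalg.norm(self.coords[city] - self.coords[0])
        return (self.current, frozenset(self.visited)), reward, done

A value-based or policy-based RL method would then be trained on episodes generated by such an environment; with a discount factor of 1, maximizing the return is equivalent to minimizing the length of the completed tour.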
Approaches / Methodology

The approaches/methodology section will make up the bulk of this study; the recent literature will be broadly investigated in order to answer the following research questions:

1. Can the end-to-end reinforcement learning approach to combinatorial optimization be improved by using reinforcement learning in combination with CO (optimization) algorithms?
2. Does RL play a role as a modeling tool for CO, rather than merely improving the performance within other optimization algorithms?
3. What is the appropriate size of CO problem instances that RL can handle effectively? Does that vary across different CO problems? Does scaling cause a problem?
4. Which of the approaches can be more readily generalized (i.e., applied more broadly to a range of problems)?
5. Is there an inherent characteristic of a CO problem that makes it prone to better or worse performance when using RL?

Brief description of the various approaches:

In one approach to solving CO, theoretical and/or empirical knowledge about the decisions to be made is assumed, but the computational burden of all or some of those decisions is relieved using ML and, more recently, RL. Hence the ML/RL model supports lower-level decisions while the high-level structure is controlled by a master algorithm. One such popular master algorithm is the branch-and-bound tree for mixed-integer linear programming (MILP), where the general algorithm remains a branch-and-bound framework while the task of selecting the branching variable is a good candidate for the application of ML/RL. A related idea is found in the branch-and-cut algorithm, which uses ML/RL as a tool to estimate the bound improvement of a relaxation in order to select the most promising cuts of the feasible region; this eases the very significant computational burden of, for example, solving semidefinite programs. In a similar spirit, He, Daumé III and Eisner (2014) devised a policy to select the node containing the optimal solution in its subtree. This study will investigate whether there is a benefit to using RL here, and whether the readily available branching strategies are too heuristic or too slow, such that RL could help remedy some of these drawbacks. An additional objective of this study is to see how RL stacks up against other ML algorithms such as deep learning, as well as other supervised and unsupervised ML algorithms. Similarly, this study will investigate how these various approaches perform across the range of problems within CO. The works of Lodi & Zarpellon (2017) and Hottung et al. (2017) address this issue and will be referred to in the survey. While the study will mostly focus on the TSP where literature is available, a more generic view of the different problem classes within CO will also be provided.

The other approach is to train a model for CO end-to-end, using RL to learn a policy within the Markov decision process framework by matching the reward signal with the optimization objective. Although the idea of using ML to solve CO is not new and dates back to the nineties, some pioneering recent work was done by Vinyals, Fortunato and Jaitly (2015), who introduced an encoder/decoder architecture that produces a probability distribution over the nodes of the TSP, which makes it possible to use the network over different input graph sizes. The authors use a supervised ML method for this task. Bello, Pham, Le, Norouzi and Bengio (2017) train a similar model using RL, with the tour length as the reward signal. This study will compare these two approaches and examine whether RL has advantages over other ML algorithms. Finally, this study will also look across the two aforementioned approaches to investigate the combination in which RL contributes best, whether it be an end-to-end approach or a combination of RL with other optimization techniques.
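To illustrate the reward design used in the end-to-end approach, the following is a minimal Python sketch of REINFORCE training with the negative tour length as the reward signal, in the spirit of Bello et al. (2017) but drastically simplified: instead of a neural encoder/decoder, the policy is a log-linear softmax over a single hand-crafted feature (the negative distance to each candidate city). The sketch only conveys the shape of the training loop; all names and hyperparameters are assumptions chosen for illustration and do not reproduce the models from the cited papers.

# Illustrative sketch only: REINFORCE with negative tour length as reward,
# using a deliberately simplified log-linear policy instead of a neural net.
import numpy as np

rng = np.random.default_rng(0)

def tour_length(coords, tour):
    # Total length of a closed tour (returning to the first city).
    return sum(np.linalg.norm(coords[a] - coords[b])
               for a, b in zip(tour, tour[1:] + tour[:1]))

def sample_tour(coords, theta):
    # Sample a tour from the log-linear policy and accumulate the gradient
    # of the log-probability of the sampled actions with respect to theta.
    n = len(coords)
    tour, grad_logp = [0], 0.0
    while len(tour) < n:
        current = tour[-1]
        candidates = [c for c in range(n) if c not in tour]
        feats = np.array([-np.linalg.norm(coords[current] - coords[c])
                          for c in candidates])
        logits = theta * feats
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = rng.choice(len(candidates), p=probs)
        # d/d theta of log softmax: chosen feature minus expected feature.
        grad_logp += feats[idx] - probs @ feats
        tour.append(candidates[idx])
    return tour, grad_logp

coords = rng.random((20, 2))               # a random 20-city instance
theta, lr, baseline = 0.0, 0.05, None
for _ in range(500):
    tour, grad_logp = sample_tour(coords, theta)
    reward = -tour_length(coords, tour)    # reward signal: negative tour length
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    theta += lr * (reward - baseline) * grad_logp   # REINFORCE update
print("learned theta:", theta, "last tour length:", tour_length(coords, tour))

In the pointer-network setting, the hand-crafted feature would be replaced by learned embeddings and the single scalar parameter by the network weights, but the policy-gradient update with a moving-average baseline has the same general form.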
Outlook

In this section, the main findings of the study will be presented, along with the challenges of using RL for combinatorial optimization. The findings of this study will attempt to address the research questions raised in the preceding section, focusing mainly on the TSP while also including other CO problems. Finally, the study will conclude with an outline of future areas of work and recommended directions for further studies.