Reinforcement Learning for Solving the Vehicle Routing Problem

Overview: This paper introduces a reinforcement learning method that tackles the vehicle routing problem (VRP) with policy gradient methods. By deploying a recurrent architecture, the proposed approach lowers the computational barrier: it removes the need, present in classical methods, to re-train the model on every problem instance it is presented with. In addition, the architecture is scalable and can be applied to many classes of discrete optimization problems. The take-home message is that reinforcement learning algorithms have the potential to solve traditional discrete optimization problems and can be effective when deployed in practice.

Related Work: The related work section broadly describes sequence-to-sequence architectures and neural combinatorial optimization. For the former, the authors highlight the use of encoders and decoders, which they later draw on for their own architecture, since their model employs a decoder RNN. For the latter, the authors introduce policy networks, which are used to solve combinatorial optimization problems, and argue that their approach is a simplified version of that architecture. Overall, the work is properly situated with respect to previous publications and builds on them with novel ideas.

Significance and Originality: The ideas presented are novel with respect to prior work. In particular, the authors propose removing the encoder RNN from the previously proposed pointer networks, both to eliminate the dependence on input order and to improve efficiency. The authors also investigate stochastic environments of the vehicle routing problem, which are not commonly addressed by classical methods. Thus, the architecture is unique to the problem at hand and encourages more research in that direction.

Soundness: The authors attach appendices with detailed algorithmic explanations of REINFORCE and of the other VRP baselines.
The authors also describe the attention mechanism and apply it to the problem at hand. There is no explicit theoretical justification that the algorithm performs optimally on the VRP; the approach instead rests on the assumption that reinforcement learning algorithms can perform well in a Markovian setting, which is a reasonable assumption for the given problem. Hence, the ideas presented are sound.

Evaluation: The authors evaluate their work empirically by running the algorithm on multiple benchmarks with various numbers of customer nodes (10, 20, 50, 100) and vehicle capacities. The baseline methods include the Clarke-Wright savings heuristic, the Sweep heuristic, and Google’s optimization tools. For the smaller numbers of customer nodes (10, 20), the metric used is the distance from the optimal solution, which is computed by mixed integer programming; this metric is termed the “optimality gap”. For experiments with a higher number of customer nodes, a pairwise comparison is made on a selected number of samples, reporting for each pair of methods the percentage of samples on which one method outperforms the other. Moreover, the authors compare the methods on the time taken to produce a solution, normalized by the number of customers. This comparison is relevant because many systems that tackle the vehicle routing problem need to do so quickly, in real time. The analysis shows that the normalized time of the proposed method stays constant as the number of customers grows, whereas it grows linearly for the other methods, suggesting that the proposed method is scalable. The authors do not investigate problems with more than 110 nodes. A suggestion to the authors would be to compare the number of parameters in each model deployed. The authors report a total training time of 13.5 hours. In the appendix, the authors also propose a stochastic version of the problem that is closer to real-world conditions.
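For readers who want to reproduce the three comparisons described in the Evaluation paragraph, the following is a minimal sketch in Python. It is not the authors' code: the function names are hypothetical, the optimality gap is assumed here to be a relative percentage above the optimal tour length, and "outperforms" is assumed to mean a strictly shorter tour on the same instance.

```python
from typing import Sequence


def optimality_gap(cost: float, optimal_cost: float) -> float:
    """Excess of a solution's cost over the optimal cost, in percent (assumed definition)."""
    if optimal_cost <= 0:
        raise ValueError("optimal_cost must be positive")
    return 100.0 * (cost - optimal_cost) / optimal_cost


def pairwise_win_rate(costs_a: Sequence[float], costs_b: Sequence[float]) -> float:
    """Percentage of instances on which method A finds a strictly shorter tour than method B."""
    if len(costs_a) != len(costs_b) or len(costs_a) == 0:
        raise ValueError("need two non-empty cost lists of equal length")
    wins = sum(a < b for a, b in zip(costs_a, costs_b))
    return 100.0 * wins / len(costs_a)


def time_per_customer(solve_seconds: float, num_customers: int) -> float:
    """Solution time normalized by the number of customers."""
    if num_customers <= 0:
        raise ValueError("num_customers must be positive")
    return solve_seconds / num_customers


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    print(optimality_gap(11.0, 10.0))
    print(pairwise_win_rate([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 5.0]))
    print(time_per_customer(2.0, 100))
```

Ties count as a loss for method A under this sketch; a review of the paper's tables would be needed to confirm how ties are actually treated.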
Readability:

Structure: Although the concepts are carefully explained, it would be beneficial if the authors introduced the vehicle routing problem earlier, rather than in Section 4. Moreover, further justification of why RNNs may be a good candidate solution would be helpful to readers.

Writing: The writing is clear and concise.

References:
[1] Nazari, M., Oroojlooy, A., Snyder, L., & Takáč, M. (2018). Reinforcement learning for solving the vehicle routing problem. Advances in Neural Information Processing Systems, 31.