Evolving Cost Functions for Model Predictive Control of Multi-Agent UAV Combat Swarms

David D. Fan, Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, david.fan@gatech.edu
Evangelos Theodorou, Dept. of Aerospace Engineering, Georgia Institute of Technology, evangelos.theodorou@gatech.edu
John Reeder, SPAWAR Systems Center Pacific, San Diego, CA, USA, john.d.reeder@navy.mil

ABSTRACT
Recent advances in sampling-based Model Predictive Control (MPC) methods have enabled the control of nonlinear stochastic dynamical systems with complex and non-smooth cost functions. The main drawback of these methods, however, is that they can be myopic with respect to high-level tasks, since MPC relies on predicting dynamics over a short time horizon. Furthermore, designing cost functions which capture high-level information may be prohibitive for complex tasks, especially in multi-agent scenarios. Here we propose a hierarchical approach to this problem in which the NeuroEvolution of Augmenting Topologies (NEAT) algorithm is used to build cost functions for an MPC trajectory optimization algorithm known as Model Predictive Path Integral (MPPI) control. MPPI and NEAT are particularly well-suited to one another: MPPI can control an agent so as to minimize a non-differentiable cost function (including logic-based or non-smooth functions), while NEAT can build a neural network out of arbitrary activation functions, including those which are non-differentiable or logic-based. We use this approach to control agile swarms of unmanned aerial vehicles (UAVs) in a simulated swarm-vs.-swarm combat scenario.

CCS CONCEPTS
• Computing methodologies → Cooperation and coordination; Computational control theory; Neural networks;

KEYWORDS
Model Predictive Control, Optimal Control, Path Integral Control, NEAT, Evolution, Multi-Agent Systems, Learning

ACM Reference format:
David D. Fan, Evangelos Theodorou, and John Reeder. 2017. Evolving Cost Functions for Model Predictive Control of Multi-Agent UAV Combat Swarms. In Proceedings of GECCO '17 Companion, Berlin, Germany, July 15-19, 2017, 2 pages. DOI: http://dx.doi.org/10.1145/3067695.3076019

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO '17 Companion, Berlin, Germany. © 2017 Copyright held by the owner/author(s). 978-1-4503-4939-0/17/07...$15.00. DOI: http://dx.doi.org/10.1145/3067695.3076019

1 MODEL PREDICTIVE PATH INTEGRAL CONTROL
The MPPI controller is a sampling-based control algorithm [5]. Calculating the optimal control involves sampling trajectories from the stochastic dynamics under randomly chosen controls; the total costs of these sampled trajectories are then used to compute a weighted average of the control sequences that generated them. Because MPPI is a sampling-based optimal control method, it does not require quadratic, or even differentiable, cost functions, and can handle complex nonlinear dynamics in stochastic settings. The optimal control that MPPI finds is

    u*(·) = argmin_{u(·)} E_Q [ ϕ(x_T, T) + ∫_{t_0}^{T} ( q(x_t, t) + (1/2) u_t^T R(x_t, t) u_t ) dt ],    (1)

where x_t is the state, u_t the control, ϕ the final cost, and q the running cost, and the expectation E_Q is taken over trajectories governed by the dynamics of the system. In typical control problems the final and running costs specify the task, such as target following or moving towards a goal. It is these functions that we learn using neuroevolution.

2 EVOLVING COST FUNCTIONS WITH NEAT
The NeuroEvolution of Augmenting Topologies (NEAT) algorithm is a well-proven neuroevolutionary technique which can continuously grow networks of increasing complexity and arbitrary topology [4]. The main theoretical contribution of our work lies in replacing the state-dependent cost term q(x_t, t) used by the MPPI controller with a neural network evolved by the NEAT algorithm. The fitness of the neural network is determined by the task performance of the agents being controlled by the MPPI algorithm. In this way the NEAT algorithm is able to evolve a simple, compact function which may drastically affect the behavior of the MPPI controller. This change in behavior affects the performance of the agents on the task, which in turn informs the NEAT algorithm which structures are beneficial and which are not. To speed up learning we seed the neuroevolution with a hand-crafted cost function. This hand-crafted cost function can encode easily designed behaviors such as simple goal-seeking, target tracking, and obstacle avoidance. The NEAT algorithm can then build more complex, coordinated behaviors on top of these baseline abilities.

3 SWARM VS. SWARM UAV COMBAT SIMULATION
We used the MASON simulator toolkit to simulate a 20 vs. 20 UAV combat scenario [2][1]. Our simulation includes realistic UAV flight dynamics, collisions, targeting and weapons models, and a red-vs.-blue competition and scoring system. We used a 3DOF rigid fixed-wing model for the UAV dynamics [6]. The nonlinear model consists of 9 states. The simulation environment consists of an open arena where the two teams face off and battle.
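Sections 1 and 2 together define the control loop: at every timestep MPPI samples noisy control sequences, rolls them out through the dynamics, scores the rollouts with Eq. (1) (where the running cost q(x_t, t) may be a NEAT-evolved network), and averages the sampled perturbations weighted by an exponentiated cost. The following is a minimal sketch of that update; the function and parameter names are ours, and the importance-sampling and control-cost terms of the full MPPI derivation [5] are omitted for brevity:

```python
import numpy as np

def mppi_step(x0, u_seq, dynamics, running_cost, final_cost,
              num_samples=256, noise_sigma=0.5, lam=1.0):
    """One MPPI update: sample noisy control sequences, roll out the
    dynamics, then average the perturbations weighted by trajectory cost.
    Simplified sketch: the R-weighted control-cost term of Eq. (1) and the
    importance-sampling correction are omitted."""
    T, m = u_seq.shape
    costs = np.zeros(num_samples)
    noise = np.random.randn(num_samples, T, m) * noise_sigma
    for k in range(num_samples):
        x = x0
        for t in range(T):
            u = u_seq[t] + noise[k, t]
            costs[k] += running_cost(x, t)   # q(x_t, t): may be an evolved net
            x = dynamics(x, u)
        costs[k] += final_cost(x)            # phi(x_T, T)
    # Exponentiated-cost (softmin) weighting; subtract the min for stability.
    beta = costs.min()
    weights = np.exp(-(costs - beta) / lam)
    weights /= weights.sum()
    # Weighted average of the sampled perturbations gives the control update.
    return u_seq + np.einsum('k,ktu->tu', weights, noise)
```

In a receding-horizon loop, the first control of the returned sequence is executed, the sequence is shifted forward one step, and the update repeats.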
UAVs can die in one of three ways: they can collide with another UAV or the ground, they can be shot down, or they can be removed after successfully conducting a suicide attack on the enemy base. The winner of a match was the team which earned the most points; a point was earned for shooting down an enemy or successfully attacking the enemy base, and a point was lost for being killed.

4 RESULTS
The simulation runs at about 10x real-time on a single thread of an Intel Core i7-4770 CPU at 3.40 GHz. We tried two different ways of assigning fitness. In the first, each team fought 5 games against an opposing team which used a hand-coded cost function, and received a fitness based on its performance in the match, including UAVs lost, enemies killed, etc. At the start of evolution the best fitness was close to 0.5, indicating ties; after 500 generations the fitness increased to 0.6837 (Figure 1). The highest-fitness team exhibited a behavior in which the top half of the UAVs would drop down and follow the bottom half, creating a more compact group flying closer to the ground. Upon reaching the battlefront, the UAVs would then sweep up and attack from below. This proved an effective strategy against the team using the hand-designed cost function. The second method evaluated fitness through a tournament, in which a team's fitness was determined by a series of matches against different teams. In each generation, each team played 3 games against each of 5 different opponents, and the team's fitness was its average score over the games it played. After fewer than 200 generations, the best fitness plateaued at around 0.5, indicating that strong chromosomes were evenly matched with other members of the population (Figure 1). We observed a behavior in which the team spreads out at the start of the match; then, as opponents are detected, the team converges on the enemy, attacking from three sides in a pincer-like move.
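The tournament scheme described above (each team playing 3 games against each of 5 sampled opponents, with fitness equal to average score) can be sketched as follows. This is an illustration only; `play_match` is a hypothetical stand-in for running a full 20 vs. 20 simulation and returning the first team's normalized match score:

```python
import random

def tournament_fitness(population, play_match, games=3, opponents=5):
    """Assign each genome a fitness equal to its average score over
    `games` matches against each of `opponents` other genomes.
    `play_match(a, b)` returns a's score in [0, 1] (1 = win, 0.5 = tie)."""
    fitness = {}
    for genome in population:
        # Sample distinct opponents from the rest of the population.
        rivals = random.sample(
            [g for g in population if g is not genome], opponents)
        scores = [play_match(genome, rival)
                  for rival in rivals for _ in range(games)]
        fitness[genome] = sum(scores) / len(scores)
    return fitness
```

A team that ties every match receives fitness 0.5, which is why a plateau at 0.5 indicates an evenly matched population.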
If the team won, the remaining UAVs would then fly in large sweeps, searching for enemies (Figure 2). These results can be seen in the supplementary video.

Figure 1: Fitness improvement over successive generations (left: evolution against the hand-crafted opponent; right: tournament evolution).

Figure 2: 20 vs. 20 UAV simulation with the red team using hand-designed cost functions and the blue team using a tournament-evolved cost function. At the start of the simulation (a), the blue team spreads out much more than red (b), then attacks the red team in a pincer move, partially surrounding them (c). This can place the red team at a disadvantage (d). If any blue UAVs survive, they then fly in wide circles searching for more opponents (not shown).

5 CONCLUSION
We proposed combining the NEAT algorithm with the MPPI controller for real-time control of a team of combat UAVs. This approach is very general and could be applied to a wide variety of single- or multi-agent control problems, from difficult manipulation tasks in robotics to coordination of multi-agent swarms. The scalability of the NEAT algorithm is increased by the MPPI controller, while the class of control problems we can practically solve with the MPPI controller is widened by the NEAT algorithm. Future work includes further investigation of learning cost functions with other reinforcement learning methods, especially in stochastic settings. Another avenue of investigation is the use of other neuroevolution variants, such as odNEAT [3], which supports online, distributed evolution.

REFERENCES
[1] Uwe Gaerther. 2013. UAV Swarm Tactics: An Agent-Based Simulation and Markov Process Analysis. Ph.D. dissertation, Naval Postgraduate School.
[2] Sean Luke, Claudio Cioffi-Revilla, Liviu Panait, and Keith Sullivan. 2004. MASON: A New Multi-Agent Simulation Toolkit. In Proceedings of the 2004 SwarmFest Workshop, Vol. 8. 316–327.
[3] F. Silva, P. Urbano, S. Oliveira, and A. L. Christensen. 2012. odNEAT: An Algorithm for Distributed Online, Onboard Evolution of Robot Behaviours. Artificial Life (2012).
[4] Kenneth O. Stanley. 2004. Competitive Coevolution through Evolutionary Complexification. J. Artif. Intell. Res. (JAIR) 21 (2004), 63–100.
[5] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. 2016. Aggressive Driving with Model Predictive Path Integral Control. In 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1433–1440.
[6] Li Jun Zhang, He Ming Jia, and Da Peng Jiang. 2014. Sliding Mode Prediction Control for 3D Path Following of an Underactuated AUV. In IFAC Proceedings Volumes (IFAC-PapersOnline), Vol. 19. 8799–8804.