
Evolving Cost Functions for Model Predictive Control of Multi-Agent UAV Combat Swarms

David D. Fan
Inst. for Robotics and Intelligent Machines
Georgia Institute of Technology
david.fan@gatech.edu

Evangelos Theodorou
Dept. of Aerospace Engineering
Georgia Institute of Technology
evangelos.theodorou@gatech.edu

John Reeder
SPAWAR Systems Center Pacific
San Diego, CA, USA
john.d.reeder@navy.mil

ABSTRACT
Recent advances in sampling-based Model Predictive Control (MPC) methods have enabled the control of nonlinear stochastic dynamical systems with complex and non-smooth cost functions. However, the main drawback of these methods is that they can be myopic with respect to high-level tasks, since MPC relies on predicting dynamics within a short time horizon. Furthermore, designing cost functions which capture high-level information may be prohibitive for complex tasks, especially multi-agent scenarios. Here we propose a hierarchical approach to this problem in which the NeuroEvolution of Augmenting Topologies (NEAT) algorithm is used to build cost functions for an MPC trajectory optimization algorithm known as Model Predictive Path Integral (MPPI) control. MPPI and NEAT are particularly well-suited to one another, since MPPI can control an agent in a way that minimizes a non-differentiable cost function (including logic-based or non-smooth functions), while NEAT can build a neural network comprised of arbitrary activation functions, including those which are non-differentiable or logic-based. We utilize this approach to control agile swarms of unmanned aerial vehicles (UAVs) in a simulated swarm vs. swarm combat scenario.

CCS CONCEPTS
• Computing methodologies → Cooperation and coordination; Computational control theory; Neural networks;

KEYWORDS
Model Predictive Control, Optimal Control, Path Integral Control, NEAT, Evolution, Multi-Agent Systems, Learning

ACM Reference format:
David D. Fan, Evangelos Theodorou, and John Reeder. 2017. Evolving Cost Functions for Model Predictive Control of Multi-Agent UAV Combat Swarms. In Proceedings of GECCO '17 Companion, Berlin, Germany, July 15-19, 2017, 2 pages.
DOI: http://dx.doi.org/10.1145/3067695.3076019

1 MODEL PREDICTIVE PATH INTEGRAL CONTROL
The MPPI controller is a sampling-based control algorithm [5].
Calculating the optimal control involves sampling trajectories from the stochastic dynamics under randomly perturbed controls. The total costs of these sampled trajectories are then used to compute a weighted average of the control sequences that generated them. Because MPPI is a sampling-based optimal control method, it does not require quadratic, or even differentiable, cost functions, and it can handle complex nonlinear dynamics in stochastic settings. The optimal control that MPPI finds is
u^*(\cdot) = \arg\min_{u(\cdot)} \mathbb{E}_Q \left[ \phi(x_T, T) + \int_{t_0}^{T} \left( q(x_t, t) + \tfrac{1}{2}\, u_t^\top R(x_t, t)\, u_t \right) dt \right],    (1)
where x_t is the state, u_t is the control, \phi is the final cost, q is the running cost, and the expectation E_Q is taken over trajectories governed by the system dynamics. In typical control problems the final and running costs specify the task, such as target following or moving towards a goal. It is these functions that we will learn using neuroevolution.
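To make the sampling-and-reweighting scheme concrete, the following is a minimal sketch of one MPPI update in Python, using a discrete-time approximation of Eq. (1). The function names (dynamics, running_cost, terminal_cost) and all hyperparameter values are illustrative assumptions on our part, not the paper's implementation.

```python
import numpy as np

def mppi_step(x0, U, dynamics, running_cost, terminal_cost,
              n_samples=256, noise_sigma=0.5, lam=1.0):
    """One MPPI update: sample perturbed control sequences, roll out the
    dynamics, and return the cost-weighted average control sequence."""
    T, m = U.shape                          # horizon length, control dimension
    noise = noise_sigma * np.random.randn(n_samples, T, m)
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0
        for t in range(T):
            u = U[t] + noise[k, t]          # randomly perturbed control
            x = dynamics(x, u)              # one step of the (stochastic) dynamics
            costs[k] += running_cost(x, u)  # q(x_t, t) plus quadratic control cost
        costs[k] += terminal_cost(x)        # final cost phi(x_T, T)
    # Exponentiate negative costs (a softmin with temperature lam) to get weights
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    # Weighted average of the sampled control sequences
    return U + np.einsum('k,ktm->tm', w, noise)
```

In a receding-horizon loop, the first control of the returned sequence is executed, the sequence is shifted forward one step, and the update is repeated from the new state.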
2 EVOLVING COST FUNCTIONS WITH NEAT
The NeuroEvolution of Augmenting Topologies (NEAT) algorithm is
a well-proven neuroevolutionary technique which can continuously
grow networks with increasing complexity and random topologies
[4]. The main theoretical contribution of our work lies in replacing
the state-dependent cost term q(x_t, t) used by the MPPI controller
with a neural network evolved by the NEAT algorithm. The fitness
of the neural network is determined by the task performance of
the agents being controlled by the MPPI algorithm. In this way
the NEAT algorithm is able to evolve a simple, compact function
which may drastically affect the behavior of the MPPI controller.
This change in behavior affects the performance of the agents on
the task, which then informs the NEAT algorithm which structures
are beneficial and which are not.
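A minimal sketch of this evolution loop follows, assuming the third-party neat-python package; run_match and features are hypothetical helpers (not from the paper) that run a simulated match with MPPI using the evolved network as its running cost and extract the network's input features from the state.

```python
import neat  # third-party package: neat-python

def eval_genomes(genomes, config):
    """Assign each genome a fitness equal to the match performance of an
    MPPI-controlled team whose running cost q(x_t, t) is the network output."""
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        # run_match and features are hypothetical helpers: run_match plays a
        # simulated match with MPPI using the evolved network as the running
        # cost and returns a normalized score (0.5 ~ tie).
        genome.fitness = run_match(lambda x, net=net: net.activate(features(x))[0])

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     'neat_config.ini')  # assumed NEAT parameter file
pop = neat.Population(config)
winner = pop.run(eval_genomes, 500)  # evolve for up to 500 generations
```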
To speed up learning, we seed the neuroevolution with a network which encodes a hand-crafted cost function. This hand-crafted cost function can encode easily designed behaviors such as simple goal-seeking, target tracking, and obstacle avoidance. The NEAT algorithm can then build more complex, coordinated behaviors on top of these baseline abilities.
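For illustration, a hand-crafted seed cost of this kind might look like the following sketch; the state layout, weights, and obstacle representation are our assumptions, not the paper's.

```python
import numpy as np

def seed_cost(x, goal, obstacles, w_goal=1.0, w_avoid=10.0):
    """Hand-crafted running cost: goal-seeking plus a non-smooth obstacle
    penalty. State layout, weights, and obstacle format are assumptions."""
    pos = x[:3]                                 # assume first 3 states are position
    cost = w_goal * np.linalg.norm(pos - goal)  # simple goal-seeking term
    for center, radius in obstacles:            # obstacles as (center, radius) pairs
        if np.linalg.norm(pos - center) < radius:  # logic-based penalty:
            cost += w_avoid                        # non-differentiable, but fine for MPPI
    return cost
```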
3 SWARM VS. SWARM UAV COMBAT SIMULATION
We used the MASON simulation toolkit [2] to simulate a 20 vs. 20 UAV combat scenario [1]. Our simulation includes realistic UAV flight
dynamics, collisions, targeting and weapons models, and a red
vs. blue competition and scoring system. We used a 3DOF rigid
fixed-wing model for the UAV dynamics [6]. The nonlinear model
has 9 states. The simulation environment consists of an open arena where the two teams face off and battle. UAVs can die in one of three ways: they can collide with another UAV or the ground, they can be shot down, or they can be removed after successfully conducting a suicide attack on the enemy base. The winner of a match is the team which earns the most points; a point is earned for shooting down an enemy or successfully attacking the enemy base, and a point is lost for being killed.
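As a concrete restatement of these scoring rules, a team's point total can be computed as below; the function name and argument names are ours, purely illustrative.

```python
def match_points(enemies_shot_down, base_attacks, uavs_lost):
    """Points as described above: +1 per enemy shot down, +1 per successful
    attack on the enemy base, -1 per friendly UAV killed. (Illustrative.)"""
    return enemies_shot_down + base_attacks - uavs_lost
```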
4 RESULTS
The simulation runs at about 10x real-time on an Intel Core i7-4770 CPU at 3.40GHz on a single thread. We tried two different ways of assigning fitness. The first was for each team to fight 5 games against an opposing team which used a hand-coded cost function, and to receive a fitness based on its performance in the match, including UAVs lost, enemies killed, etc. At the start of evolution the best fitness was close to 0.5, indicating ties; after 500 generations the fitness increased to 0.6837 (Figure 1). The highest-fitness team exhibited a behavior in which the top half of the UAVs would drop down and follow the bottom half, creating a more compact group flying closer to the ground. Upon reaching the battlefront the UAVs would then sweep up and attack from below (Figure ??). This seemed to be an effective strategy against the team using the hand-designed cost function.
The second method we used to evaluate fitness was a tournament in which a team's fitness was determined through a series of matches against different teams. In each generation, each team played 3 games against 5 different opponents. The team's fitness was the average score it earned over the games it played. After fewer than 200 generations, the best fitness plateaued at around 0.5, indicating that strong chromosomes were evenly matched with other members of the population (Figure 1). We observed a behavior in which the team spread out at the start of the match; then, as opponents were detected, the team converged on the enemy, attacking from three sides in a pincer-like move. If the team won, the remaining UAVs would then fly in large sweeps, searching for enemies (Figure 2). These results can be seen in the supplementary video.
Figure 1: Fitness improvement over successive generations. Left panel: evolution vs. the hand-crafted opponent; right panel: tournament evolution. Axes: fitness vs. generation number.

Figure 2: 20 vs. 20 UAV simulation with the red team using a hand-designed cost function and the blue team using a tournament-evolved cost function. Panels: (a) simulation begins; (b) blue spreads out; (c) blue attacks from the sides; (d) red is at a disadvantage. At the start of the simulation (a), the blue team spreads out much more than red (b), then attacks the red team in a pincer move, partially surrounding them (c). This can place the red team at a disadvantage (d). If any blue UAVs survive, they then fly in wide circles searching for more opponents (not shown).

5 CONCLUSION
We proposed the combination of the NEAT algorithm with the MPPI controller for real-time control of a team of combat UAVs. This approach is very general and could be applied to a wide variety of single- or multi-agent control problems, from difficult manipulation tasks in robotics to coordination of multi-agent swarms. The scalability of the NEAT algorithm is increased by the MPPI controller, while the class of control problems we can practically solve with the MPPI controller is widened by using the NEAT algorithm.
Future work includes further investigation of learning cost functions with other reinforcement learning methods, especially in stochastic settings. Another avenue of investigation is the use of other neuroevolution variants, such as odNEAT [3], which is designed for online, distributed evolution.
REFERENCES
[1] Uwe Gaerther. 2013. UAV swarm tactics: an agent-based simulation and Markov process analysis. Ph.D. diss., Naval Postgraduate School.
[2] Sean Luke, Claudio Cioffi-Revilla, Liviu Panait, and Keith Sullivan. 2004. MASON: A New Multi-Agent Simulation Toolkit. In Proceedings of the 2004 SwarmFest Workshop, Vol. 8. 316-327.
[3] Fernando Silva, Paulo Urbano, Sancho Oliveira, and Anders Lyhne Christensen. 2012. odNEAT: An Algorithm for Distributed Online, Onboard Evolution of Robot Behaviours. Artificial Life (2012).
[4] Kenneth O. Stanley and Risto Miikkulainen. 2004. Competitive Coevolution through Evolutionary Complexification. J. Artif. Intell. Res. (JAIR) 21 (2004), 63-100.
[5] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A.
Theodorou. 2016. Aggressive Driving with Model Predictive Path Integral Control.
In 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE,
1433–1440.
[6] Li Jun Zhang, He Ming Jia, and Da Peng Jiang. 2014. Sliding mode prediction control for 3D path following of an underactuated AUV. In IFAC Proceedings Volumes (IFAC-PapersOnLine), Vol. 19. 8799-8804.