Local Optimization for Simulation of Natural Motion

Tom Erez
Washington University in St. Louis
etom@cse.wustl.edu
Reinforcement Learning (Sutton and Barto 1998) is a theoretical framework for optimizing the behavior of artificial agents. The notion that behavior in the natural world is in some sense optimal is explored by areas such as biomechanics and physical anthropology. These fields propose a variety of candidate optimality criteria (e.g., Engelbrecht 2001) as possible formulations of the principles underlying natural motion. Recent developments in computational biomechanics allow us to create articulated models of living creatures with a significant degree of biological realism (Delp et al. 2007). I aim to bring these elements together in my research by using Reinforcement Learning to generate optimized behavior in biomechanical simulations. Such a generative approach will allow us to critically examine postulated optimality criteria and to investigate hypotheses that cannot be easily studied in the real world.

Background — Reinforcement Learning
The Reinforcement Learning (RL) agent interacts with a dynamical system whose state captures all the relevant information about the current configuration of the agent and its environment. By choosing a sequence of actions, the agent alters the state transitions of this dynamical system. The optimality criterion is formalized by a reward function defined over state-action pairs, and the agent's goal is to maximize the cumulative reward.
The target of optimization in RL is the policy, a mapping from states to actions; the optimal policy chooses the action that yields the maximum future reward. Another important construct is the value function, which measures the desirability of a state as the maximum cumulative reward that can be expected from that state. Given the optimal value function, computing the optimal policy reduces to a greedy choice.
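For concreteness, this greedy relationship between the optimal value function and the optimal policy can be written in standard discounted notation; the symbols below are generic and not tied to any specific model in this document:
\[
V^*(s) = \max_{a}\Big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\, V^*(s') \Big],
\qquad
\pi^*(s) = \arg\max_{a}\Big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\, V^*(s') \Big].
\]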
The basic RL algorithms for discrete state and action spaces use tabulation to approximate the value function. Such algorithms do not scale to the continuous case: if we tried to discretize a continuous state space and apply discrete RL methods, we would soon hit a wall, because the number of discretized states grows exponentially with the dimensionality of the continuous state space. Continuous domains therefore require a different set of tools, and we cannot rely on most of the convergence guarantees offered by well-studied discrete algorithms. In continuous domains with linear dynamics and quadratic rewards, the value function and the optimal policy can be found analytically using traditional LQG techniques. However, these methods do not apply to the general case, where the dynamics are nonlinear and the cost is non-quadratic. In such domains, generating an optimal feedback policy requires a numerical approximation of the value function, at least locally.
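As a brief illustration of the analytically solvable special case mentioned above, consider the discrete-time, finite-horizon linear-quadratic regulator (generic matrices A, B, Q, R, not tied to any model in this work): with dynamics and cost
\[
x_{t+1} = A x_t + B u_t, \qquad \sum_t \big( x_t^\top Q x_t + u_t^\top R u_t \big),
\]
the value function stays quadratic, V_t(x) = x^\top P_t x, with the Riccati recursion
\[
P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} B \big(R + B^\top P_{t+1} B\big)^{-1} B^\top P_{t+1} A,
\]
and the optimal feedback policy is linear, u_t = -K_t x_t with K_t = (R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A. No such closed form is available once the dynamics are nonlinear or the cost is non-quadratic, which is what motivates local numerical approximation.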

Research Principles
The research community of computational biomechanics has been studying articulated models of full-body motion with an ever-increasing degree of realism. In parallel, the scientific discussion about the optimality criteria that may underlie natural motion postulates multiple hypotheses (e.g., Engelbrecht 2001). I intend to use RL to bring the two together: to generate motion from the proposed first principles in realistic biomechanical models, and to compare the results to the behavior of living creatures. This is a non-trivial problem: biomechanical models are continuous, high-dimensional and nonlinear, and the optimality criteria considered in the literature are non-quadratic. In order to address these profound challenges, I propose three basic principles that can be employed to make the problem computationally tractable.
1. Both the policy and the value function have the state space as their domain. In the general case of high-dimensional, nonlinear domains, a globally-optimal solution is simply impossible to obtain, due to the curse of dimensionality. Therefore, much of my research has been focused on developing local methods of optimization (Tassa, Erez, and Smart 2008; Erez and Smart 2009; Erez and Smart, under review). Furthermore, I argue that local optimization should be good enough to recover natural motion: as anyone who has practiced yoga knows, there are vast areas of the configuration space of our own body that we rarely explore, and probably never consider in our motor optimization.
2. Local methods would be ineffective in tasks whose fitness landscape is beset with many local optima. In such
cases, I propose to find a solution by shaping the task
(Erez and Smart 2008): at first, a simplified variant of the
task is considered, with a smoother fitness landscape. The locally-optimal solution is then used as an initial guess for a slightly harder variant. This iterative process tackles the full complexity of the domain gradually, and allows local optimization to discover the more relevant parts of the state space (a minimal sketch of such a shaping loop follows this list).
3. Studies of the Hamilton-Jacobi-Bellman equation in optimal control have shown that the value function of a nonlinear domain can be discontinuous, even if the domain itself is perfectly continuous (Crandall and Lions 1983). This is a serious hurdle for many value-function approximation algorithms. However, the same theory also suggests that in the presence of domain stochasticity, these discontinuities dissipate and smooth out (Tassa and Erez 2007). I argue that this is a crucially beneficial feature, which can be harnessed to ensure the differentiability of dynamics with discontinuities, such as collisions (Erez and Smart, under review).
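The following Python sketch illustrates how the shaping principle (item 2) can wrap a local trajectory optimizer (item 1). It is a minimal sketch under stated assumptions, not the implementation used in the cited work: make_task, rollout_return, and the difficulty schedule are hypothetical placeholders, each task variant is assumed to expose the cumulative reward of a rolled-out control sequence, and the inner optimizer shown here is a naive finite-difference gradient ascent standing in for a proper second-order local method.

    import numpy as np

    def local_trajectory_optimization(task, u_init, num_iters=50, step=1e-2, eps=1e-4):
        # Local optimization: starting from u_init, repeatedly evaluate the
        # cumulative reward of the rolled-out trajectory and take small
        # uphill steps on the controls (a stand-in for DDP-style methods).
        u = u_init.copy()
        for _ in range(num_iters):
            base = task.rollout_return(u)          # assumed task interface
            grad = np.zeros(u.size)
            for i in range(u.size):
                perturbed = u.copy()
                perturbed.flat[i] += eps
                grad[i] = (task.rollout_return(perturbed) - base) / eps
            u = u + step * grad.reshape(u.shape)   # ascend the local fitness landscape
        return u

    def shaping_schedule(make_task, difficulties, horizon, action_dim):
        # Shaping: solve a sequence of task variants of increasing difficulty,
        # warm-starting each variant with the previous locally-optimal solution.
        u = np.zeros((horizon, action_dim))        # trivial initial guess
        for d in difficulties:                     # e.g., np.linspace(0.1, 1.0, 10)
            task = make_task(d)                    # smoother variant for small d
            u = local_trajectory_optimization(task, u)
        return u                                   # solution to the full task

In practice the inner optimizer would be a second-order local method such as differential dynamic programming, and, following item 3, noise can be injected into the simulated dynamics so that the value function (and hence its local models) remains smooth across events such as collisions.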
Proposed Work
The most important component of my future research is the advancement and development of algorithms for model-based local optimization. This direction is not perfectly aligned with current trends in RL (which focus on discovering the dynamical model, and on partially-observable systems), but I believe that it has the potential to open the door to a whole host of high-quality studies, with contributions to both RL and potential application domains.
Ultimately, the testbed for such local optimization techniques will be the generation of simulated natural motion
from first principles. In order to advance my research towards natural motion applications, I will use OpenSim (Delp
et al. 2007), a general-purpose tool for biomechanical modeling and simulation. As an active member of the development team, I recently built an interface between OpenSim
and MATLAB, and I can now harness existing RL algorithms
to work with OpenSim models. By applying local optimization to OpenSim models, I aim to take a generative approach
to the study of the various postulated optimality criteria.
Several studies in biomechanics and anthropology have simulated natural motion from first principles (Chowdhary and Challis 1999; Pandy and Anderson 2000; Nagano et al. 2005; Sellers and Manning 2007), employing a variety of optimization techniques, such as simulated annealing, genetic programming, and policy gradient. However, no previous study within these disciplines has tackled the generation of feedback policies, or harnessed the computational efficiency afforded by local approximation of the value function. The goal of my work is to develop techniques that could be adopted as standard tools of investigation by the scientific communities of computational biomechanics and physical anthropology.
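To make the contrast with open-loop optimization concrete, here is a hedged sketch, in generic trajectory-optimization notation (not tied to a specific algorithm in this document), of the feedback policy that a local quadratic approximation of the value function yields around a nominal trajectory (\bar{x}_t, \bar{u}_t):
\[
V_t(x) \approx V_t(\bar{x}_t) + V_x^\top \delta x + \tfrac{1}{2}\, \delta x^\top V_{xx}\, \delta x,
\qquad \delta x = x - \bar{x}_t ,
\]
\[
u_t(x) = \bar{u}_t + k_t + K_t\, \delta x ,
\]
where the open-loop correction k_t and the feedback gain K_t are computed from the local quadratic model (for example, k_t = -Q_{uu}^{-1} Q_u and K_t = -Q_{uu}^{-1} Q_{ux} in DDP-style backward passes). The result is a time-varying feedback law rather than a single open-loop control sequence of the kind produced by simulated annealing or genetic programming.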

References
Chowdhary, A. G., and Challis, J. H. 1999. Timing accuracy in human throwing. Journal of Theoretical Biology 201(4):219–229.
Crandall, M., and Lions, P. 1983. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society 277:1–42.
Delp, S. L.; Anderson, F. C.; Arnold, A. S.; Loan, P.; Habib, A.; John, C. T.; Guendelman, E.; and Thelen, D. G. 2007. OpenSim: open-source software to create and analyze dynamic simulations of movement. IEEE Transactions on Biomedical Engineering 54(11):1940–1950.
Engelbrecht, S. E. 2001. Minimum principles in motor control. Journal of Mathematical Psychology 45(3):497–542.
Erez, T., and Smart, W. D. A scalable method for solving high-dimensional continuous POMDPs using local approximation. Under review.
Erez, T., and Smart, W. D. 2008. What does shaping mean for computational reinforcement learning? In 7th IEEE International Conference on Development and Learning (ICDL), 215–219.
Erez, T., and Smart, W. D. 2009. Coupling perception and action using minimax optimal control. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 58–65.
Nagano, A.; Umberger, B. R.; Marzke, M. W.; and Gerritsen, K. G. 2005. Neuromusculoskeletal computer modeling and simulation of upright, straight-legged, bipedal locomotion of Australopithecus afarensis (A.L. 288-1). American Journal of Physical Anthropology 126(1):2–13.
Pandy, M. G., and Anderson, F. C. 2000. Dynamic simulation of human movement using large-scale models of the body. In IEEE International Conference on Robotics and Automation (ICRA), 676–681.
Pawłowski, B. 2007. Origins of Homininae and putative selection pressures acting on the early hominins. In Handbook of Paleoanthropology. Springer. 1409–1440.
Sellers, W. I., and Manning, P. L. 2007. Estimating dinosaur maximum running speeds using evolutionary robotics. Proceedings of the Royal Society B: Biological Sciences 274(1626):2711–2716.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
Tassa, Y., and Erez, T. 2007. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks 18(4):1031–1041.
Tassa, Y.; Erez, T.; and Smart, W. D. 2008. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS), volume 20. Cambridge, MA: MIT Press.