Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10)

Local Optimization for Simulation of Natural Motion

Tom Erez
Washington University in St. Louis
etom@cse.wustl.edu

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Reinforcement Learning (Sutton and Barto 1998) is a theoretical framework for optimizing the behavior of artificial agents. The notion that behavior in the natural world is in some sense optimal is explored by areas such as biomechanics and physical anthropology. These fields propose a variety of candidate optimality criteria (e.g., Engelbrecht 2001) as possible formulations of the principles underlying natural motion. Recent developments in computational biomechanics allow us to create articulated models of living creatures with a significant degree of biological realism (Delp et al. 2007). In my research, I aim to bring these elements together by using Reinforcement Learning to generate optimized behavior in biomechanical simulations. Such a generative approach will allow us to critically examine the postulated optimality criteria and to investigate hypotheses that cannot be easily studied in the real world.

Background — Reinforcement Learning

The Reinforcement Learning (RL) agent interacts with a dynamical system whose states capture all the relevant information about the current configuration of the agent and its environment. By specifying a sequence of actions, the agent alters the state transitions of this dynamical system. The optimality criterion is formalized by a reward function defined over state-action pairs, and the agent's goal is to maximize the cumulative reward. The target of optimization in RL is the policy, a mapping from states to actions: the optimal policy chooses the action that yields the maximum future reward. Another important construct is the value function, which determines the desirability of different states by measuring the maximum cumulative reward that can be expected from a given state. Given the optimal value function, computing the optimal policy reduces to a greedy choice.

The basic algorithms of RL in discrete state and action spaces use tabulation to approximate the value function. Such algorithms do not scale to the continuous case: if we tried to discretize a continuous state space and apply discrete RL methods, we would soon hit a wall, as the number of discretized states grows exponentially with the dimensionality of the continuous state space. Continuous domains therefore require a different set of tools, and we cannot rely on most of the convergence guarantees offered by well-studied discrete algorithms. In continuous domains with linear dynamics and quadratic rewards, the value function and the optimal policy can be found analytically using traditional LQG techniques. However, these methods do not apply to the general case, where the dynamics are non-linear and the cost is non-quadratic. In such domains, generating an optimal feedback policy requires a numerical approximation of the value function, at least locally.
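As a compact reference for these constructs, the equations below restate standard textbook material (cf. Sutton and Barto 1998), not results of this work: the Bellman optimality equation with its greedy policy, and the finite-horizon linear-quadratic case, the one setting mentioned above where the value function has an analytic form.

```latex
% Bellman optimality and the greedy policy, for reward r, discount \gamma,
% and transition distribution P:
\begin{align*}
V^{*}(s)   &= \max_{a}\Big[\, r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\, V^{*}(s') \,\Big],\\
\pi^{*}(s) &= \arg\max_{a}\Big[\, r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,a)}\, V^{*}(s') \,\Big].
\end{align*}
% Linear-quadratic special case (finite horizon, cost minimization): with
% dynamics x_{t+1} = A x_t + B u_t and stage cost x_t^T Q x_t + u_t^T R u_t,
% the value function stays quadratic, V_t(x) = x^T P_t x, and the backward
% Riccati recursion yields the optimal linear feedback in closed form:
\begin{align*}
P_{t} &= Q + A^{\top}P_{t+1}A - A^{\top}P_{t+1}B\,(R + B^{\top}P_{t+1}B)^{-1}B^{\top}P_{t+1}A,\\
u_{t} &= -(R + B^{\top}P_{t+1}B)^{-1}B^{\top}P_{t+1}A\, x_{t}.
\end{align*}
```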
Research Principles

The research community of computational biomechanics has been studying articulated models of full-body motion with an ever-increasing degree of realism. In parallel, the scientific discussion about potential optimality criteria that may underlie natural motion postulates multiple hypotheses (e.g., Engelbrecht 2001). I intend to use RL to bring the two together: to generate motion from the proposed first principles in realistic biomechanical models, and to compare the results to the behavior of living creatures. This is a nontrivial problem: biomechanical models are continuous, high-dimensional and nonlinear, and the optimality criteria considered in the literature are non-quadratic. In order to address these profound challenges, I propose three basic principles that can be employed to make this problem computationally tractable.

1. Both the policy and the value function have the state space as their domain. In the general case of high-dimensional, non-linear domains, a globally-optimal solution is simply impossible to obtain, due to the curse of dimensionality. Therefore, much of my research has been focused on developing local methods of optimization (Tassa, Erez, and Smart 2008; Erez and Smart 2009). Furthermore, I argue that local optimization should be good enough to recover natural motion: as anyone who has practiced yoga knows, there are vast areas of the configuration space of our own body that we rarely explore, and probably never consider in our motor optimization.

2. Local methods would be ineffective in tasks whose fitness landscape is beset with many local optima. In such cases, I propose to find a solution by shaping the task (Erez and Smart 2008): at first, a simplified variant of the task is considered, with a smoother fitness landscape. The locally-optimal solution is then used as an initial guess for a slightly harder variant. This iterative process tackles the full complexity of the domain gradually, and allows the local optimization to discover the more relevant parts of the state space (a schematic sketch of such a continuation loop appears below).

3. Studies of the Hamilton-Jacobi-Bellman (HJB) equation in optimal control have shown that the value function of a nonlinear domain can be discontinuous, even if the domain itself is perfectly continuous (Crandall and Lions 1983). This is a serious hurdle for many value-function approximation algorithms. However, the same theory also suggests that in the presence of domain stochasticity, these discontinuities dissipate and smooth out (Tassa and Erez 2007). I argue that this is a crucially beneficial feature, which can be harnessed to ensure differentiability in domains with discontinuous dynamics, such as collisions (Erez and Smart, under review); the smoothed equation is written out below.
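To make the third principle concrete, the equation below states the continuous-time, infinite-horizon HJB equation in its stochastic form. It is textbook material quoted for illustration, with cost rate ℓ, discount rate ρ, dynamics ẋ = f(x, u) plus additive noise of magnitude σ; it is not a result taken from the cited papers. The second-order trace term acts as a diffusion on the value function, which is what smooths out the discontinuities mentioned above.

```latex
% Stochastic HJB equation for dx = f(x,u)\,dt + \sigma\,dW and discounted cost
% E[\int e^{-\rho t}\,\ell(x,u)\,dt]; setting \sigma = 0 recovers the
% deterministic case, whose solutions may be non-smooth (viscosity solutions).
\begin{equation*}
\rho V(x) \;=\; \min_{u}\Big[\, \ell(x,u) + \nabla V(x)^{\top} f(x,u) \,\Big]
\;+\; \tfrac{1}{2}\,\sigma^{2}\,\operatorname{tr}\!\big(\nabla^{2} V(x)\big).
\end{equation*}
```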
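The shaping principle can likewise be illustrated with a short Python sketch of a continuation loop that warm-starts each harder task variant with the previous locally-optimal solution. Everything here is a hypothetical stand-in chosen for the example: the point-mass task family, the blending schedule, and the local optimizer (a plain finite-difference gradient descent on an open-loop control sequence, substituting for methods such as differential dynamic programming). It is a minimal sketch of the idea, not the method used in the cited work.

```python
import numpy as np

def rollout_cost(controls, dynamics, cost, x0):
    """Accumulate the total cost of an open-loop control sequence."""
    x, total = x0, 0.0
    for u in controls:
        total += cost(x, u)
        x = dynamics(x, u)
    return total

def local_optimize(controls, dynamics, cost, x0, iters=200, lr=1e-2, eps=1e-4):
    """Crude local trajectory optimization: finite-difference gradient descent
    on the open-loop controls, standing in for DDP-style methods."""
    controls = controls.copy()
    for _ in range(iters):
        base = rollout_cost(controls, dynamics, cost, x0)
        grad = np.zeros_like(controls)
        for idx in np.ndindex(controls.shape):
            perturbed = controls.copy()
            perturbed[idx] += eps
            grad[idx] = (rollout_cost(perturbed, dynamics, cost, x0) - base) / eps
        controls = controls - lr * grad
    return controls

def shaped_solve(task_family, x0, horizon, n_stages=5):
    """Shaping: solve a smoothed task variant first, then use each locally
    optimal solution as the initial guess for a slightly harder variant."""
    controls = np.zeros((horizon, 1))                # initial guess: do nothing
    for alpha in np.linspace(0.0, 1.0, n_stages):    # 0 = easiest, 1 = full task
        dynamics, cost = task_family(alpha)
        controls = local_optimize(controls, dynamics, cost, x0)
    return controls

# Hypothetical task family: drive a 1-D point mass to a target. The stage cost
# blends from a broad quadratic bowl (alpha = 0, easy to descend) toward a
# narrow basin that is nearly flat far from the target (alpha = 1).
def point_mass_task(alpha, dt=0.1, target=2.0):
    def dynamics(x, u):
        pos, vel = x
        return np.array([pos + dt * vel, vel + dt * float(u[0])])
    def cost(x, u):
        err = x[0] - target
        easy = err ** 2
        hard = 1.0 - np.exp(-4.0 * err ** 2)         # flat far from the target
        return (1 - alpha) * easy + alpha * hard + 0.01 * float(u[0]) ** 2
    return dynamics, cost

if __name__ == "__main__":
    u_star = shaped_solve(point_mass_task, x0=np.array([0.0, 0.0]), horizon=30)
    print("norm of optimized control sequence:", np.linalg.norm(u_star))
```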
Proposed Work

The most important component of my future research is the advancement and development of algorithms for model-based local optimization. This direction is not perfectly aligned with current trends in RL (which focus on discovering the dynamical model and on partially observable systems), but I believe it has the potential to open the door to a whole host of high-quality studies, contributing both to RL and to potential application domains. Ultimately, the testbed for such local optimization techniques will be the generation of simulated natural motion from first principles.

In order to advance my research towards natural motion applications, I will use OpenSim (Delp et al. 2007), a general-purpose tool for biomechanical modeling and simulation. As an active member of the development team, I recently built an interface between OpenSim and MATLAB, and I can now harness existing RL algorithms to work with OpenSim models. By applying local optimization to OpenSim models, I aim to take a generative approach to the study of the various postulated optimality criteria. Several studies in biomechanics and anthropology have simulated natural motion from first principles (Chowdhary and Challis 1999; Pandy and Anderson 2000; Nagano et al. 2005; Sellers and Manning 2007), employing a variety of optimization techniques, such as simulated annealing, genetic programming, and policy gradient. However, no previous study within these disciplines has tackled the generation of feedback policies, or harnessed the computational efficiency afforded by local approximation of the value function. The goal of my work is to develop techniques that could be adopted as standard tools of investigation by the scientific communities of computational biomechanics and physical anthropology.

References

Chowdhary, A. G., and Challis, J. H. 1999. Timing accuracy in human throwing. Journal of Theoretical Biology 201(4):219–229.

Crandall, M., and Lions, P. 1983. Viscosity solutions of Hamilton-Jacobi equations. Transactions of the American Mathematical Society 277:1–42.

Delp, S. L.; Anderson, F. C.; Arnold, A. S.; Loan, P.; Habib, A.; John, C. T.; Guendelman, E.; and Thelen, D. G. 2007. OpenSim: open-source software to create and analyze dynamic simulations of movement. IEEE Transactions on Biomedical Engineering 54(11):1940–1950.

Engelbrecht, S. E. 2001. Minimum principles in motor control. Journal of Mathematical Psychology 45(3):497–542.

Erez, T., and Smart, W. D. A scalable method for solving high-dimensional continuous POMDPs using local approximation. Under review.

Erez, T., and Smart, W. D. 2008. What does shaping mean for computational reinforcement learning? In 7th IEEE International Conference on Development and Learning (ICDL), 215–219.

Erez, T., and Smart, W. D. 2009. Coupling perception and action using minimax optimal control. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 58–65.

Nagano, A.; Umberger, B. R.; Marzke, M. W.; and Gerritsen, K. G. 2005. Neuromusculoskeletal computer modeling and simulation of upright, straight-legged, bipedal locomotion of Australopithecus afarensis (A.L. 288-1). American Journal of Physical Anthropology 126(1):2–13.

Pandy, M. G., and Anderson, F. C. 2000. Dynamic simulation of human movement using large-scale models of the body. In IEEE International Conference on Robotics and Automation (ICRA), 676–681.

Pawłowski, B. 2007. Origins of homininae and putative selection pressures acting on the early hominins. In Handbook of Paleoanthropology. Springer. 1409–1440.

Sellers, W. I., and Manning, P. L. 2007. Estimating dinosaur maximum running speeds using evolutionary robotics. Proceedings of the Royal Society B: Biological Sciences 274(1626):2711–2716.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.

Tassa, Y., and Erez, T. 2007. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks 18(4):1031–1041.

Tassa, Y.; Erez, T.; and Smart, W. D. 2008. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems (NIPS) 20. Cambridge, MA: MIT Press.