Octopus Arm Project
Final Presentation
Dmitry Volkinshtein & Peter Szabo
Supervised by: Yaki Engel

Contents
1. Project Goal
2. The Octopus and its Arm
3. Octopus Arm Model
4. Reinforcement Learning
5. On-Line GPTD Algorithm
6. System Implementation
7. Debug and Testing
8. Learning Process
9. Challenges and Problems
10. Results
11. Conclusions

1. Project Goal
Teach a 2D octopus arm model to reach a given point in space by novel means of reinforcement learning.

2. The Octopus and its Arm (1/2)
An octopus is a carnivorous, eight-armed sea creature with a large soft head and two rows of suckers on the underside of each arm. It usually lives on the bottom of the ocean.

2. The Octopus and its Arm (2/2)
An octopus arm is a muscular hydrostat: an organ capable of exerting force using muscles alone, without requiring a rigid skeleton.

3. Octopus Arm Model (1/3)
The octopus arm model is the outcome of a previous project in this lab. It is a 2D model written in C. We upgraded it to C++ and added some features for better performance and compatibility with the other parts of our project.

3. Octopus Arm Model (2/3)
The model simulates the physical behavior of the arm in its natural environment, considering:
• Muscle forces.
• Internal forces (keeping the volume of the arm, which is filled with fluids, constant).
• Gravity and buoyancy (the vertical forces).
• Drag forces caused by the water.

3. Octopus Arm Model (3/3)
The model approximates the continuous octopus arm by dividing it into segments and calculating the forces on each segment separately.

4. Reinforcement Learning (1/4)
Reinforcement Learning (RL) - Learning a behavior through trial-and-error interactions with a dynamic environment.
Basic RL definitions:
• Agent - A learning and decision-making entity.
• Environment - The entity the agent interacts with, comprising everything that cannot be changed arbitrarily by the agent.

4. Reinforcement Learning (2/4)
Basic RL definitions (cont.):
• State - The condition of the environment.
• Action - A choice made by the agent, based on the state.
• Reward - The input by which the agent evaluates its actions.

4. Reinforcement Learning (3/4)
The agent uses two functions to produce its next action:
• Value - For a given state X, the value function is the expected (discounted) sum of rewards when the agent starts from X and follows the current policy.
• Policy - For a given value function, the policy returns the action to take when the agent is in state X.

4. Reinforcement Learning (4/4)
In our case:
• The State is the location and speed of all the vertices of the arm. An arm represented by N segments has 2(N+1) vertices and therefore 8(N+1) dimensions (each vertex contributes a 2D position and a 2D velocity).
• The Action is a muscle activation of the arm, chosen by the agent module from a finite, constant set of activations.
• The Environment is the arm simulation, which provides the result of the activation (the next state of the arm).
• The Reward the agent gets is state-dependent, set according to the proximity between the arm and the goal.
• The Agent is an On-Line GPTD based learning module.

5. On-Line GPTD Algorithm (1/3)
TD(λ) - A family of algorithms in which temporal differences are used to estimate the value function on-line.
GPTD - Gaussian Processes for TD Learning: assuming the sequence of rewards is a Gaussian random process (with noise), and the observed rewards are samples of that process, we can estimate the value function using Gaussian process estimation and a kernel function.

5. On-Line GPTD Algorithm (2/3)
GPTD disadvantages:
• Space consumption of O(t²).
• Time consumption of O(t³).
The proposed solution: on-line sparsification applied to the GPTD algorithm. Instead of keeping a large number of results of a vector function, we keep a dictionary of input vectors that spans, up to an accuracy threshold, the space of the original vector function.

5. On-Line GPTD Algorithm (3/3)
Applying on-line sparsification to GPTD yields:
• Recursive update rules.
• No matrix inversion needed.
• Matrix dimensions that depend on m_t (the dictionary size at time t), which is generally not linear in t.
Using this, we can calculate:
• The value estimate in O(m_t) time (see the sketch below).
• The value variance in O(m_t²) time.
• Each dictionary update in O(m_t²) time.
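To make the dictionary idea concrete, here is a minimal C++ sketch (not the project's actual code) of how a sparsified value estimate can be computed: the value of a query state is a weighted sum of kernel evaluations against the m_t dictionary states, which is why a single estimate costs O(m_t). The Gaussian kernel, its width sigma_k, and the flat std::vector<double> state layout are illustrative assumptions.

// A minimal sketch, not the project's code: the sparsified GPTD value
// estimate.  The Gaussian kernel, its width sigma_k, and the flat
// std::vector<double> state layout are assumptions for illustration.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// An arm state: 8(N+1) numbers, a 2D position and a 2D velocity per vertex.
using State = std::vector<double>;

// Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma_k^2)).
double gaussianKernel(const State& x, const State& y, double sigma_k) {
    double sq = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = x[i] - y[i];
        sq += d * d;
    }
    return std::exp(-sq / (2.0 * sigma_k * sigma_k));
}

// Value estimate V(x) = sum_i alpha[i] * k(dict[i], x): one kernel
// evaluation per dictionary member, hence O(m_t) time per query.
double valueEstimate(const State& x,
                     const std::vector<State>& dict,   // m_t dictionary states
                     const std::vector<double>& alpha, // m_t GPTD coefficients
                     double sigma_k) {
    double v = 0.0;
    for (std::size_t i = 0; i < dict.size(); ++i) {
        v += alpha[i] * gaussianKernel(dict[i], x, sigma_k);
    }
    return v;
}

int main() {
    // Toy 2-dimensional "arm" states, just to exercise the functions.
    std::vector<State> dict = {{0.0, 0.0}, {1.0, 1.0}};
    std::vector<double> alpha = {0.5, -0.2};
    State query = {0.5, 0.5};
    std::cout << "V(query) = " << valueEstimate(query, dict, alpha, 1.0) << "\n";
    return 0;
}

The dictionary update itself, deciding whether a new state is already approximately spanned by the existing dictionary, is the part that costs O(m_t²) per step.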
6. System Implementation (1/5)
[Block diagram: the Octopus Arm Environment (the Simulation plus the Rewarder) exchanges states, actions and rewards with the Agent (an Explorer using OPI / Actor-Critic on top of On-Line GPTD).]

6. System Implementation (2/5)
The Agent
• Uses the On-Line GPTD algorithm for value estimation.
• Chooses its policy by one of the following:
1. Optimistic Policy Iteration (OPI), with a choice between two methods:
   • The ε-greedy policy: ε = ε₀ · C^(N_goals), i.e. the exploration rate decays by a factor C each time the goal is reached.
   • The Softmax policy: P(a_i) = e^(V(s_i)/T) / Σ_n e^(V(s_n)/T), with temperature T.
2. Actor-Critic - using two GPTD instances.
3. Interval Estimation - greedy over V(s_i) + C·σ_V²(s_i), the value estimate plus a multiple of its variance.

6. System Implementation (3/5)
The GPTD algorithm's two versions:
1. Deterministic - Calculates the value for an MDP with deterministic state transitions.
   • Works better with Actor-Critic.
   • Relatively fast convergence.
2. Stochastic - Calculates the value for an MDP with stochastic state transitions.
   • Works better with Softmax and ε-greedy.
   • Slow convergence (more parameters).

6. System Implementation (4/5)
Goal types
• The goal is a small circular zone in the 2D space.
• It can be reached in two predefined ways:
1. With the tip of the arm.
2. With any vertex along the arm.
• In both cases it can be reached with either side of the arm (dorsal or ventral).
• A speed limit can be applied, i.e. reaching the goal zone at a speed above the limit does not count as reaching the goal.

6. System Implementation (5/5)
The Rewarder
There are three types of rewarding (d is the arm-goal distance; a code sketch follows this slide):
1. Discrete: r_goal = 10, r_no-goal = -1.
2. Exponential: r_goal = 10, r_no-goal(d) = 10·e^(-d/2).
3. Negative exponential: r_goal = 10, r_no-goal(d) = 11·e^(-d/2) - 1.
In all cases:
► Hitting an obstacle yields a reward of -r_goal and ends the trial.
► A penalty proportional to the arm's energy consumption during the last transition is subtracted from the reward.
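To tie the three schemes together, here is a minimal C++ sketch (not the project's rewarder) implementing them as functions of the arm-goal distance d, together with the energy penalty. The constants and the e^(-d/2) form follow the slide formulas as reconstructed above and should be read as assumptions; the energy-penalty factor is likewise illustrative.

// A minimal sketch, not the project's rewarder: the three rewarding
// schemes as functions of the arm-goal distance d.  The constants and
// the exp(-d/2) form follow the reconstructed slide formulas and are
// assumptions; the energy-penalty factor is also illustrative.
#include <cmath>
#include <iostream>

const double R_GOAL = 10.0;

// 1. Discrete: +10 at the goal, -1 everywhere else.
double discreteReward(bool atGoal) {
    return atGoal ? R_GOAL : -1.0;
}

// 2. Exponential: decays from 10 near the goal towards 0 with distance.
double exponentialReward(bool atGoal, double d) {
    return atGoal ? R_GOAL : 10.0 * std::exp(-d / 2.0);
}

// 3. Negative exponential: decays from 10 near the goal towards -1.
double negativeExponentialReward(bool atGoal, double d) {
    return atGoal ? R_GOAL : 11.0 * std::exp(-d / 2.0) - 1.0;
}

// In every scheme a penalty proportional to the energy spent in the last
// transition is subtracted from the base reward.
double shapedReward(double baseReward, double energyUsed, double energyFactor) {
    return baseReward - energyFactor * energyUsed;
}

int main() {
    const double d = 2.0;  // example arm-goal distance
    std::cout << "discrete:    " << discreteReward(false) << "\n"
              << "exponential: " << shapedReward(exponentialReward(false, d), 0.3, 0.1) << "\n"
              << "neg. exp.:   " << shapedReward(negativeExponentialReward(false, d), 0.3, 0.1) << "\n";
    return 0;
}

In this reconstruction, the negative-exponential scheme is simply the exponential one shifted so that rewards far from the goal approach the discrete -1 penalty.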
7. Debug and Testing
• The GPTD implementation included the non-recursive version of the algorithm (used only for debugging).
• We tested GPTD (both deterministic and stochastic) on a discrete maze and witnessed correct value convergence.
• We tested the system with a short, degenerate arm of 3 segments.
• We ran the system in supervised-learning mode (with γ = 0) and witnessed the value converging to the reward function.

8. Learning Process (1/4)
Agent's Learning
1. Running a large number (~10³-10⁴) of consecutive learning trials. Each trial starts from a random initial state.
2. Saving the GPTD parameters every N-th trial.

8. Learning Process (2/4)
Agent's Performance Measurement
1. Creating a pool of evenly distributed arm states.
2. Loading each N-th GPTD save.
3. Running the greedy policy over this manifold of initial states, noting whether and when the goal was reached.
4. Merging the raw data into statistics: the goal percentage for each GPTD save, and the goal times for each initial state.

8. Learning Process (3/4)
Parameters' Calibration
• Simulation time - Total time, interval between activations.
• Activations - Total number and types.
• GPTD - Stochastic vs. Deterministic; general parameters - σ₀² (noise variance), ν (sparsification accuracy threshold), γ (discount factor), λ (forgetting factor); kernel function - Gaussian, Polynomial.

8. Learning Process (4/4)
Parameters' Calibration (cont.)
• Exploration:
   - OPI - Softmax temperature, ε-greedy descent factors.
   - Actor-Critic - Trial interval between GPTD switches.
   - Interval Estimation - Variance multiplying factor.
• Reward function - Discrete vs. Exponential, energy penalty factor, obstacles.
• Goal - Size, location, speed limit.

9. Challenges and Problems
► A 10-segment arm has 88 dimensions. It uses a huge amount of memory and lengthens the simulation time. Solution: occupy all of the lab's available computers for months.
► Too many activations cause greedy moves to consume a lot of time. Solution: creating a small set of activations that still supports the arm's full range of movement.
► GPTD estimates the value function of a fixed MDP; under OPI the policy keeps changing, so the observed process is not a fixed MDP. Solution: forgetfulness (λ) or Actor-Critic.
► Starting from a random state improves learning, yet generating a random state on-line is problematic with our simulation. Solution: selecting random states from GPTD's dictionary, using the less visited among them.

10. Results (1/8)
10-Segment Arm - OPI, Gaussian Kernel. [figure]

10. Results (2/8)
10-Segment Arm - OPI, Polynomial Kernel. [figure]

10. Results (3/8)
10-Segment Arm - OPI, Polynomial Kernel. [figure]

10. Results (4/8)
10-Segment Arm - OPI, Polynomial Kernel (92%). [figure]

10. Results (5/8)
10-Segment Arm - OPI, Polynomial Kernel (96%). [figure]

10. Results (6/8)
10-Segment Arm - OPI, Polynomial Kernel (90%). [figure]

10. Results (7/8)
10-Segment Arm - Actor-Critic, Gaussian Kernel (96%). [figure]

10. Results (8/8)
Summary
1. OPI yields a peak performance of 96%.
2. Actor-Critic yields a peak performance of 96%.
GOAL !!!

11. Conclusions (1/2)
1. It is evident that the arm has gone through a significant learning process. Though not to perfection, the arm clearly improved its behavior, from random movements to goal-oriented movements.
2. OPI and Actor-Critic both prove to be effective learning strategies. Some further investigation is needed to establish the performance differences between the two.
3. Despite some disappointing results, we are optimistic (hence still using OPI ;-). We believe that further fine-tuning of the system parameters and longer training time will result in a more decisive convergence.

11. Conclusions (2/2)
4. The system's large number of dimensions slows down the simulation and inflates the dictionary. This is the current bottleneck of the learning process. More computing power is needed for a reasonable learning time.
5. Based on our latest results, we believe we will establish that On-Line GPTD is a practical algorithm that successfully tackles, with moderate computing resources, the difficult problem of reinforcement learning in a continuous state space with a vast number of dimensions.

Questions?

THE END