Reinforcement Learning in Simulated Soccer with Kohonen Networks
Chris White and David Brogan
University of Virginia, Department of Computer Science

Simulated Soccer
- How does an agent decide what to do with the ball?
- Complexities:
  - Continuous inputs
  - High dimensionality

Reinforcement Learning (RL)
- Learning to associate utility values with state-action pairs
- The agent incrementally updates the value associated with each state-action pair based on its interaction with the environment (Russell & Norvig)
- Problems:
  - The state space explodes exponentially with dimensionality
  - Current methods of managing state-space explosion lack automation
  - RL does not scale well to problems with the complexities of simulated soccer

Quantization
- Divide the state space into regions of interest
- No automated method for choosing the regions
- Tile Coding (Sutton & Barto, 1998): granularity, heterogeneity, location
- Prefer a learned abstraction of the state space

Kohonen Networks
- Clustering algorithm
- Data driven
- Example clusters: no nearby opponents, agent near opponent goal, teammate near opponent goal
- Learned prototypes partition the state space into a Voronoi diagram

State Space Reduction
- 90 continuous-valued inputs describe the state of a soccer game
- Naïve discretization: 2^90 states
- Filtering out unnecessary inputs: still 2^18 states
- Clustering algorithm: only 5000 states
- Big win!

Two-Pass Algorithm
- Pass 1: Use a Kohonen network and a large training set to learn the state space (a sketch follows below)
- Pass 2: Use reinforcement learning (SARSA) to learn utilities for those states
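As a rough illustration of Pass 1 (a sketch, not the authors' implementation), the code below trains a small Kohonen map over recorded state vectors and then maps each new state to the index of its best-matching prototype, which becomes the discrete state handed to Pass 2. The 90 inputs come from the State Space Reduction slide; the map size, learning-rate and neighborhood schedules, and the NumPy implementation are assumptions made for illustration.

```python
import numpy as np

class KohonenQuantizer:
    """Minimal 1-D Kohonen map: clusters raw state vectors and maps each
    vector to the index of its best-matching prototype (its Voronoi cell)."""

    def __init__(self, n_prototypes=5000, n_inputs=90, seed=0):
        rng = np.random.default_rng(seed)
        # One prototype vector per discrete state.
        self.weights = rng.uniform(-1.0, 1.0, size=(n_prototypes, n_inputs))

    def best_matching_unit(self, state):
        # Index of the prototype closest to this state (Euclidean distance).
        return int(np.argmin(np.linalg.norm(self.weights - state, axis=1)))

    def train(self, states, epochs=10, lr0=0.5, radius0=50.0):
        """Pass 1: fit the prototypes to a large set of recorded game states."""
        idx = np.arange(len(self.weights))
        for epoch in range(epochs):
            # Decay the learning rate and neighborhood radius over time.
            lr = lr0 * np.exp(-epoch / epochs)
            radius = max(radius0 * np.exp(-epoch / epochs), 1.0)
            for state in states:
                bmu = self.best_matching_unit(state)
                # Prototypes near the winner (in map index space) are pulled
                # toward the sample, the winner most of all.
                influence = np.exp(-((idx - bmu) ** 2) / (2 * radius ** 2))
                self.weights += lr * influence[:, None] * (state - self.weights)

    def quantize(self, state):
        """Discrete state id used by the Pass-2 learner."""
        return self.best_matching_unit(state)
```

Because every prototype owns the inputs closest to it, the trained map induces exactly the Voronoi partition mentioned on the Kohonen Networks slide, and the granularity of the abstraction is set by the training data rather than by a hand-chosen tiling.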
Fragility of Learned Actions
- What happens to the attacker's utility if the goalie crosses the dotted line?

Unresolved Issues
- Increased generalization leads to frequency aliasing
- Example: a Riemann sum with few samples vs. many samples
- This becomes a sampling problem

Aliasing & Sampling
- The utility function is not band-limited
- How can we sample to reduce error?
  - Uniformly increase the sampling rate? (not the best idea)
  - Adaptively supersample?
  - Choose sample points based on special criteria?

Forcing Functions
- Use a forcing function to sample an action in a state only when the action is likely to be effective (valleys are ignored)
- Reduces the variance in the experienced reward for a state-action pair
- How do we create such a forcing function? (one hand-built gate is sketched at the end of the deck)

Results
- Evaluate three systems:
  - Control: random action selection
  - SARSA
  - SARSA with forcing functions
- Evaluation criteria: goals scored, time of possession, cumulative score

SARSA vs. Random Policy
[Chart: cumulative goals scored vs. games played for the learning team and the random team]

Time of Possession
[Chart: time of possession vs. games played for the team with forcing functions]

SARSA with Forcing Function vs. Random Policy
[Chart: cumulative score vs. games played for the learning team with forcing functions and the random team playing against it]

With Forcing vs. Without
[Chart: cumulative score vs. games played for the learning and random teams, with and without forcing functions]

Summary
- Two-pass learning algorithm for simulated soccer
- State space abstraction is automated
- Data-driven technique
- Improved state of the art for simulated soccer

Future Work
- Learned distance metric
- Additional automation in the process
- Better generalization
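To make the two-pass summary concrete, here is a hedged sketch of Pass 2: tabular SARSA over the cluster indices produced by a quantizer like the one above, with the forcing function acting as a gate on which actions are sampled in a state. The slides do not say how their forcing functions were constructed, so the likely_effective predicate, the action set, the reward signal, and the hyperparameters below are hypothetical placeholders.

```python
import random
from collections import defaultdict

ACTIONS = ["dribble", "pass", "shoot", "clear"]  # illustrative action set

def likely_effective(state_features, action):
    """Hypothetical forcing function: allow an action only when it is
    plausibly useful, so low-utility 'valleys' are never sampled."""
    if action == "shoot":
        return state_features.get("dist_to_goal", 99.0) < 20.0
    if action == "pass":
        return state_features.get("open_teammate", False)
    return True

class SarsaAgent:
    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)  # Q[(cluster_id, action)]
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, cluster_id, state_features):
        # The forcing function prunes the candidate actions before selection.
        allowed = [a for a in ACTIONS if likely_effective(state_features, a)]
        if not allowed:
            allowed = ACTIONS  # fall back if the gate rejects everything
        if random.random() < self.epsilon:
            return random.choice(allowed)  # exploration stays inside the gate
        return max(allowed, key=lambda a: self.q[(cluster_id, a)])

    def update(self, s, a, reward, s_next, a_next):
        # SARSA: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))
        td_target = reward + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (td_target - self.q[(s, a)])
```

In an on-policy loop the agent quantizes each raw 90-dimensional observation, picks an action with choose, and calls update with the next state-action pair; pruning unlikely actions before selection is what keeps the variance of the experienced reward low for the pairs that remain, which is the benefit the Forcing Functions slide attributes to this idea.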