
Reinforcement Learning in Simulated Soccer with Kohonen Networks
Chris White and David Brogan
University of Virginia, Department of Computer Science
Simulated Soccer

- How does an agent decide what to do with the ball?
- Complexities:
  - Continuous inputs
  - High dimensionality
Reinforcement Learning (RL)

- Learning to associate utility values with state-action pairs
- The agent incrementally updates the value associated with each state-action pair based on interaction with the environment (Russell & Norvig); a minimal update sketch follows
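A minimal sketch of the incremental update described above, assuming a tabular utility store and a fixed learning rate (both illustrative; the slides do not specify them):

```python
# Hedged sketch: incremental utility updates for state-action pairs.
# The table, learning rate, and update target are illustrative
# assumptions, not the authors' exact implementation.
from collections import defaultdict

ALPHA = 0.1                   # learning rate (assumed value)
utility = defaultdict(float)  # utility[(state, action)], starts at 0

def update(state, action, target):
    """Nudge the stored utility toward an observed target return."""
    key = (state, action)
    utility[key] += ALPHA * (target - utility[key])
```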
Problems

- State space explodes exponentially in terms of dimensionality
- Current methods of managing state space explosion lack automation
- RL does not scale well to problems with the complexities of simulated soccer...
Quantization

- Divide the state space into regions of interest
- No automated method for choosing regions
- Tile Coding (Sutton & Barto, 1998) requires hand-picking:
  - granularity
  - heterogeneity
  - location
- Prefer a learned abstraction of the state space (a tile-coding sketch follows)
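To make the criticism concrete, here is a minimal 1-D tile-coding sketch in the spirit of Sutton & Barto (1998). The tile width, number of tilings, and offsets are hand-picked illustrative values, which is exactly the manual tuning the slide objects to:

```python
def tile_indices(x, n_tilings=4, tile_width=0.5, low=0.0):
    """Return one active (tiling, tile) pair per tiling for input x.

    Granularity (tile_width), heterogeneity (per-tiling offsets), and
    location (low) are all chosen by hand; nothing here is learned.
    """
    indices = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings  # uniformly staggered tilings
        indices.append((t, int((x - low + offset) // tile_width)))
    return indices

# Example: tile_indices(1.3) -> [(0, 2), (1, 2), (2, 3), (3, 3)]
```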
Kohonen Networks

- Clustering algorithm
- Data driven
- Example learned clusters: no nearby opponents; agent near opponent goal; teammate near opponent goal
- The learned prototypes partition the state space as a Voronoi diagram (a training sketch follows)
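A minimal Kohonen network (self-organizing map) training sketch, assuming 90 continuous inputs and 5000 prototype units to match the numbers on the next slide. The learning rate, neighborhood shape, and 1-D unit topology are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
units = rng.random((5000, 90))   # 5000 prototype vectors over 90 inputs

def train_step(x, lr=0.1, neighborhood=2.0):
    """Move the winning prototype (and its neighbors) toward sample x."""
    winner = np.argmin(np.linalg.norm(units - x, axis=1))
    # Gaussian neighborhood over unit indices -- a 1-D simplification;
    # Kohonen maps more commonly use a 2-D grid topology.
    influence = np.exp(-((np.arange(len(units)) - winner) ** 2)
                       / (2 * neighborhood ** 2))
    units[:] += lr * influence[:, None] * (x - units)

def quantize(x):
    """Map a continuous game state to its nearest prototype's index."""
    return int(np.argmin(np.linalg.norm(units - x, axis=1)))
```

The nearest-prototype rule in quantize is what induces the Voronoi partition: each unit owns the region of state space closer to it than to any other unit.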
State Space Reduction

- 90 continuous-valued inputs describe the state of a soccer game
- Naïve discretization: 2^90 states
- Filter out unnecessary inputs: still 2^18 states
- Clustering algorithm: only 5000 states
- Big Win!!!
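A back-of-the-envelope check of these counts, assuming a binary discretization of each input (the slide's exponents suggest two levels per input):

```python
print(2 ** 90)   # naive discretization of 90 inputs: ~1.24e27 states
print(2 ** 18)   # after filtering down to 18 inputs: 262,144 states
print(5000)      # Kohonen clustering: 5,000 states
```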
Two-Pass Algorithm

- Pass 1: use a Kohonen network and a large training set to learn the state space
- Pass 2: use reinforcement learning (SARSA) to learn utilities for states, as sketched below
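A sketch of pass 2, assuming tabular SARSA over the Kohonen-quantized states with ε-greedy exploration. quantize() is the function from the Kohonen sketch above; the environment interface, hyperparameters, and action set are illustrative assumptions:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1   # assumed hyperparameters
Q = defaultdict(float)              # Q[(cluster_index, action)]

def epsilon_greedy(s, actions, eps=EPS):
    """Mostly pick the highest-utility action; occasionally explore."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, actions):
    """On-policy episode: Q(s,a) += alpha*(r + gamma*Q(s',a') - Q(s,a))."""
    s = quantize(env.reset())
    a = epsilon_greedy(s, actions)
    done = False
    while not done:
        obs, r, done = env.step(a)    # assumed environment interface
        s2 = quantize(obs)
        a2 = epsilon_greedy(s2, actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
```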
Fragility of Learned Actions

- What happens to the attacker's utility if the goalie crosses the dotted line (in the slide's figure)?
Unresolved Issues

- Increased generalization leads to frequency aliasing...
- Example: a Riemann sum with few samples vs. many samples
- This becomes a sampling problem... (a tiny illustration follows)
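A tiny numeric illustration of the point; the rapidly varying integrand is an arbitrary stand-in for a utility function, not anything from the paper:

```python
import math

def riemann(f, a, b, n):
    """Left Riemann sum of f over [a, b] with n uniform samples."""
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h

f = lambda x: math.sin(40 * x) ** 2       # rapidly varying "utility"
print(riemann(f, 0.0, 1.0, 8))      # few samples: noticeably off (~0.46)
print(riemann(f, 0.0, 1.0, 8000))   # many samples: near the true ~0.506
```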
Aliasing & Sampling

- The utility function is not band limited
- How can we sample to reduce error?
  - Uniformly increase the sampling rate? (not the best idea)
  - Adaptively super-sample?
  - Choose sample points based on special criteria?
Forcing Functions

- Use a forcing function to sample an action in a state only when it is likely to be effective (valleys are ignored)
- Reduces variance in the experienced reward for a state-action pair
- How do we create such a forcing function? (a speculative sketch follows)
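The slides leave the construction of the forcing function open, so the gate below is a purely hypothetical stand-in; it sketches only how such a function would restrict sampling:

```python
def gated_action(Q, s, actions, forcing, fallback):
    """Sample an action in state s only if the forcing function admits it.

    forcing(s, a) is a hypothetical predicate, true where the action is
    likely to be effective; utility "valleys" are never sampled, which
    reduces variance in the experienced reward for (s, a).
    """
    admissible = [a for a in actions if forcing(s, a)]
    if not admissible:          # every action sits in a valley here
        return fallback(s)
    return max(admissible, key=lambda a: Q[(s, a)])
```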
Results

- Evaluate three systems:
  - Control: random action selection
  - SARSA
  - SARSA with forcing functions
- Evaluation criteria:
  - Goals scored
  - Time of possession
  - Cumulative score
SARSA vs. Random Policy

[Chart: cumulative goals scored vs. games played (~900 games) for the Learning Team and the Random Team]
Time of Possession

[Chart: time of possession vs. games played (~945 games); series label: Team with Forcing Functions]
SARSA with Forcing Function vs. Random Policy

[Chart: cumulative score vs. games played (~900 games) for the Learning Team with Forcing Functions and the Random Team Against Team with Forcing Functions]
With Forcing vs. Without

[Chart: Performance With Forcing Functions vs. Performance Without Forcing Functions; cumulative score vs. games played (~940 games) for the learning and random teams, with and without forcing functions]
Summary

- Two-pass learning algorithm for simulated soccer:
  - State space abstraction is automated
  - Data-driven technique
  - Improved the state of the art for simulated soccer
Future Work

- Learned distance metric:
  - Additional automation in the process
  - Better generalization