Combining Genetics, Learning and Parenting
Michael Berger
Based on:
“When to Apply the Fifth Commandment: The Effects of Parenting on Genetic and Learning Agents” / Michael Berger and Jeffrey S. Rosenschein
Submitted to AAMAS 2004
Abstract Problem
•Hidden state
•Metric defined over state space
•Condition C1: When the state changes, it is only to an “adjacent” state
•Condition C2: State changes occur at a low, but positive, rate
The Environment
[Figure: example grid with food patches; per-square food probabilities shown (values 0.2, 0.5, 0.7)]
•2D Grid
  - Cyclic (reduces analytic complexity)
•Food patches
  - Probabilistic
  - Unlimited (reduces analytical complexity)
  - May move to adjacent squares (retains structure)
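To ground these properties, a minimal, purely illustrative Python sketch of such an environment follows; the grid size, patch shape and movement probability used below are hypothetical placeholders, not the paper's values.

    import random

    class GridEnvironment:
        """Cyclic 2D grid with a probabilistic food patch that may drift to an adjacent square."""

        def __init__(self, width, height, patch, move_prob):
            self.width, self.height = width, height
            self.patch = patch            # maps (dx, dy) offsets from the patch centre to food probabilities
            self.centre = (width // 2, height // 2)
            self.move_prob = move_prob    # low but positive rate of change (condition C2)

        def step(self):
            if random.random() < self.move_prob:
                dx, dy = random.choice([(0, 1), (1, 0), (0, -1), (-1, 0)])  # adjacent move only (condition C1)
                self.centre = ((self.centre[0] + dx) % self.width,          # cyclic grid
                               (self.centre[1] + dy) % self.height)

        def food_present(self, x, y):
            dx = (x - self.centre[0]) % self.width
            dy = (y - self.centre[1]) % self.height
            if dx > self.width // 2:      # wrap to signed offsets so squares left of / above the centre match
                dx -= self.width
            if dy > self.height // 2:
                dy -= self.height
            return random.random() < self.patch.get((dx, dy), 0.0)

    # Hypothetical usage: a 3x3 patch centred on the middle of a 20x20 grid
    env = GridEnvironment(20, 20, {(dx, dy): 0.7 if (dx, dy) == (0, 0) else 0.2
                                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)}, move_prob=1e-4)
    env.step()
    print(env.food_present(10, 10))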
Agent Definitions
•Reward = Food Presence (0 or 1)
•Perception = <Position, Food Presence>
•Action ∈ {NORTH, EAST, SOUTH, WEST, HALT}
•Memory = <<Per, Ac>, …, <Per, Ac>, Per>
  - Memory length = |Mem| = no. of elements in memory
  - No. of possible memories = (2 · <Grid Width> · <Grid Height>)^|Mem| · 5^(|Mem|−1)
•MAM - Memory-Action Mapper
  - Table
  - One entry for every possible memory
•ASF - Action-Selection Filter
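A minimal illustrative sketch of these definitions in Python; representing memories as tuples and a MAM as a dict is an assumption made for the example.

    ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

    # Perception = <Position, Food Presence>; Memory = <<Per, Ac>, ..., <Per, Ac>, Per>,
    # represented here as a tuple (per_1, ac_1, ..., per_{k-1}, ac_{k-1}, per_k).

    def num_possible_memories(grid_width, grid_height, mem_len):
        """(2 * W * H)^|Mem| * 5^(|Mem| - 1): |Mem| perceptions and |Mem| - 1 actions."""
        num_perceptions = 2 * grid_width * grid_height   # position x food presence (0 or 1)
        return num_perceptions ** mem_len * len(ACTIONS) ** (mem_len - 1)

    # A MAM (Memory-Action Mapper) is then a table with one entry per possible memory,
    # e.g. a dict {memory_tuple: action}; an ASF (Action-Selection Filter) chooses among
    # the actions suggested by several MAMs.

    print(num_possible_memories(20, 20, 1))   # 800 possible memories when |Mem| = 1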
Genetic Algorithm (I)
•Algorithm on a complete population, not on a single agent
•Requires introduction of generations
  - Every generation consists of a new group of agents
  - Each agent is created at the beginning of a generation, and is terminated at its end
  - Agent’s life cycle: Birth --> Run (foraging) --> Possible matings --> Death
Genetic Algorithm (II)
•Each agent carries a gene sequence
  - Each gene has a key (memory) and a value (action)
  - A given memory determines the resultant action
  - Gene sequence remains constant during the lifetime of an agent
  - Gene sequence is determined at the mating stage of an agent’s parents
Genetic Algorithm (III)
•Mating consists of two stages:
  - Selection stage - determining mating rights. Should be performed according to two principles:
    • Survival of the fittest (as indicated by performance during the lifetime)
    • Preservation of genetic variance
  - Offspring creation stage:
    • One or more parents create one or more offspring
    • Offspring inherit some combination of the parents’ gene sequences
•Each of the stages has many variants
Genetic Algorithm Variant
•Selection:
  - Will be discussed later (under Complex Agent)
•Offspring creation:
  - Two parents mate and create two offspring
  - Gene sequences of the parents are aligned against one another, and then two processes occur:
    • Random crossover
    • Random mutation
  - The resultant pair of gene sequences is inherited by the offspring (one by each offspring); see the sketch after the Genetic Inheritance figure
Genetic Inheritance
[Figure: Parent1’s gene sequence (K1,V1)…(K5,V5) and Parent2’s (K1,U1)…(K5,U5) are aligned; random crossover swaps the values of some gene pairs (here K3 and K4), random mutation alters some values (marked *), and the two resultant sequences are inherited by the two offspring.]
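A minimal illustrative sketch of this offspring-creation variant; representing a gene sequence as a dict (memory key -> action value) and the parameter names p_cros, p_mut (from the genetic agent's parameters) are assumptions made for the example.

    import random

    ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

    def create_offspring(parent1, parent2, p_cros, p_mut):
        """Align two gene sequences (dicts over the same keys), apply random crossover
        and random mutation, and return the two resultant offspring sequences."""
        child1, child2 = dict(parent1), dict(parent2)
        for key in parent1:                       # sequences are aligned gene by gene (same keys)
            if random.random() < p_cros:          # random crossover: swap the values of this gene pair
                child1[key], child2[key] = child2[key], child1[key]
        for child in (child1, child2):
            for key in child:
                if random.random() < p_mut:       # random mutation: replace the value with a random action
                    child[key] = random.choice(ACTIONS)
        return child1, child2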
Genetic Agent
•MAM:
  - Every entry is considered a gene
    • First column - possible memory (key)
    • Second column - action to take (value)
  - No changes after creation
•Parameters:
  - m_Gen - memory length
  - P_Cros - crossover probability for each gene pair
  - P_Mut - mutation probability for each gene
Learning Algorithm
•Reinforcement Learning type algorithm:
  - After performing an action, agents receive a signal informing them how good their choice of action was (in this case, the reward)
•Selected algorithm: Q-learning with Boltzmann exploration
Basic Q-Learning (I)
•Definitions:
  - γ - discount factor (non-negative, less than 1)
  - r_j - reward at round j
  - Σ_{i=0..∞} γ^i · r_{n+i} - rewards’ discounted sum at round n
•Q-learning attempts to maximize the expected rewards’ discounted sum of an agent as a function of any given memory at any round n
Basic Q-Learning (II)
•Q(s,a) - “Q-value”. The expected discounted sum of future rewards for an agent when its memory is s, it selects action a, and it follows an optimal policy thereafter.
•Q(s,a) is updated every time an agent selects action a when its memory is s. After action execution, the agent receives reward r and its memory becomes s’. Q(s,a) is updated as follows:

  Q(s,a) ← Q(s,a) + α_Lrn · [ r + γ_Lrn · max_b Q(s’,b) − Q(s,a) ]
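A minimal illustrative sketch of this update, assuming the Q-table is stored as a dict of dicts (one row per memory, one column per action):

    ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

    def q_update(Q, s, a, r, s_next, alpha_lrn, gamma_lrn):
        """Q(s,a) <- Q(s,a) + alpha_Lrn * [r + gamma_Lrn * max_b Q(s',b) - Q(s,a)]."""
        best_next = max(Q[s_next].values())   # max over actions b of Q(s', b)
        Q[s][a] += alpha_lrn * (r + gamma_lrn * best_next - Q[s][a])

    # Example: a Q-table with one row per memory and one column per action
    Q = {"s0": {a: 0.0 for a in ACTIONS}, "s1": {a: 0.0 for a in ACTIONS}}
    q_update(Q, "s0", "NORTH", 1.0, "s1", alpha_lrn=0.2, gamma_lrn=0.95)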
Basic Q-Learning (III)
•Q(s,a) values can be stored in different forms:
  - Neural network
  - Table (nicknamed a Q-table)
•When stored as a Q-table, each row corresponds to a possible memory s, and each column to a possible action a
•When an agent’s memory is s, should it simply select an action a that maximizes Q(s,a)? Right???!!! WRONG (see next slide)
Boltzmann Exploration (I)
•Full exploitation of a Q-value might hide other, better Q-values
•Exploration of Q-values needed, at least in early stages
•Boltzmann exploration - the probability of selecting action a_i:

  p(a_i) = e^{Q(s,a_i)/t} / Σ_a e^{Q(s,a)/t}
Boltzmann Exploration (II)
•t - an annealing temperature
•At round n: t = f_Temp(n)
•t decreases ==> exploration decreases, exploitation increases
  - For a given s, the probability of selecting its best Q-value approaches 1 as n increases
•Variant here uses a freezing temperature t_Freeze:
  - When t drops below t_Freeze, exploration is replaced by full exploitation
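A minimal illustrative sketch of Boltzmann exploration with annealing and freezing; the annealing function and freezing value passed in the example call are placeholders.

    import math
    import random

    def boltzmann_action(q_row, t):
        """p(a_i) = exp(Q(s,a_i)/t) / sum_a exp(Q(s,a)/t)."""
        actions = list(q_row)
        weights = [math.exp(q_row[a] / t) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    def select_action(Q, s, n, f_temp, t_freeze):
        t = f_temp(n)                        # annealing temperature at round n
        if t < t_freeze:                     # frozen: full exploitation
            return max(Q[s], key=Q[s].get)
        return boltzmann_action(Q[s], t)     # otherwise: Boltzmann exploration

    # Illustrative call (placeholder annealing function and freezing temperature)
    Q = {"s0": {"NORTH": 0.4, "EAST": 0.1, "SOUTH": 0.0, "WEST": 0.0, "HALT": 0.2}}
    a = select_action(Q, "s0", n=100, f_temp=lambda n: 5 * 0.999 ** n, t_freeze=0.2)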
Learning Agent
•MAM:
  - A Q-table (dynamic)
•Parameters:
  - m_Lrn - memory length
  - α_Lrn - learning rate
  - γ_Lrn - rewards’ discount factor
  - f_Temp(n) - temperature annealing function
  - t_Freeze - freezing temperature
Parenting Algorithm
•No classical “parenting” algorithm exists, so it needs to be simulated
•Selected algorithm: Monte-Carlo (another Reinforcement Learning type algorithm)
Monte-Carlo (I)
•Some similarity to Q-learning:
  - A table (nicknamed an “MC-table”) stores values (“MC-values”) that describe how good it is to take action a given memory s
  - The table dictates a policy of action-selection
•Major differences from Q-learning:
  - The table isn’t modified after every round, but only after episodes of rounds (in our case, a generation)
  - Q-values and MC-values have different meanings
Monte-Carlo (II)
•“Off-line” version of Monte-Carlo:
  - After completing an episode (generation) in which one table has dictated the action-selection policy, a new, second table is constructed from scratch to evaluate how good any action a is for a given memory s
  - The second table will dictate the policy in the next episode (generation)
  - Equivalent to considering the second table as being built during the current episode, as long as it isn’t used in the current episode
Monte-Carlo (III)
•MC(s,a) is defined as the average of all rewards received after memory s was encountered and action a was selected
•What if (s,a) was encountered more than once?
•“Every-visit” variant:
  - The average of all subsequent rewards is calculated for each occurrence of (s,a)
  - MC(s,a) is the average of all calculated averages
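A minimal illustrative sketch of the every-visit computation, assuming the episode is given as a list of (memory, action, reward) triples and that “all subsequent rewards” means the rewards from that occurrence to the end of the episode:

    from collections import defaultdict

    def build_mc_table(episode):
        """episode: list of (memory, action, reward) triples for one generation.
        For every occurrence of (s, a), average all rewards received from that point on;
        MC(s, a) is then the average of these per-occurrence averages ("every-visit")."""
        per_occurrence = defaultdict(list)
        for i, (s, a, _) in enumerate(episode):
            later_rewards = [r for (_, _, r) in episode[i:]]
            per_occurrence[(s, a)].append(sum(later_rewards) / len(later_rewards))
        return {sa: sum(avgs) / len(avgs) for sa, avgs in per_occurrence.items()}

    episode = [("s0", "NORTH", 0), ("s1", "EAST", 1), ("s0", "NORTH", 1)]
    print(build_mc_table(episode))   # MC("s0","NORTH") = (2/3 + 1)/2 ≈ 0.83, MC("s1","EAST") = 1.0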
Monte-Carlo (IV)
•“Every-visit” variant more suitable than the “first-visit” variant (where only the first encounter with (s,a) counts)
  - The environment can change a lot since the first encounter with (s,a)
•Exploration variants not used here
  - For a given memory s, the action a with the highest MC-value is selected
  - Full exploitation here, because we have the experience of the previous episode of rounds
Parenting Agent
•MAM:
  - An MC-table (doesn’t matter if dynamic or static)
  - Dictates action-selection for offspring only
•ASF:
  - Selects between the actions suggested by the two parents with equal chance
•Parameters:
  - m_Par - memory length
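A minimal illustrative sketch of the parenting agent's action choice, assuming each parent's MC-table is a dict keyed by (memory, action) as in the Monte-Carlo sketch above:

    import random

    ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

    def parent_suggestion(mc_table, memory):
        """Greedy lookup in one parent's MC-table: the action with the highest MC-value."""
        return max(ACTIONS, key=lambda a: mc_table.get((memory, a), 0.0))

    def parenting_action(mc_table_parent1, mc_table_parent2, memory):
        """ASF: choose between the two parents' suggested actions with equal chance."""
        return random.choice([parent_suggestion(mc_table_parent1, memory),
                              parent_suggestion(mc_table_parent2, memory)])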
Complex Agent (I)
•Contains a genetic agent, a learning agent and a parenting agent in a subsumption architecture
•Mating selection (the stage deferred earlier) occurs among complex agents:
  - At a generation’s end, each agent’s average reward serves as its score
  - Agents receive mating rights according to score “strata” (determined by the scores’ average and standard deviation)
Complex Agent (II)
•Mediates between the inner agents and the environment
•Perceptions are passed directly to the inner agents
•Actions suggested by all inner agents are passed through an ASF, which selects one of them
•Parameters:
  - P_Gen - ASF’s probability of selecting the genetic action
  - P_Lrn - ASF’s probability of selecting the learning action
  - P_Par - ASF’s probability of selecting the parenting action
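A minimal illustrative sketch of the complex agent's ASF; the three suggested actions are assumed to come from the inner agents, and the three probabilities are assumed to sum to 1.

    import random

    def complex_asf(genetic_action, learning_action, parenting_action, p_gen, p_lrn, p_par):
        """Select one of the three suggested actions with probabilities (P_Gen, P_Lrn, P_Par)."""
        return random.choices([genetic_action, learning_action, parenting_action],
                              weights=[p_gen, p_lrn, p_par], k=1)[0]

    # Example call, using the (0.7, 0, 0.3) combination that appears in the results
    action = complex_asf("NORTH", "EAST", "HALT", p_gen=0.7, p_lrn=0.0, p_par=0.3)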
Complex Agent - Mating
[Diagram: two complex agents of the previous generation, each containing Genetic, Learning and Parenting inner agents (Memory + MAM, combined through ASFs), mate to produce a complex agent of the current generation; the environment is shown alongside.]
Complex Agent - Perception
[Diagram: perceptions from the environment are passed by the complex agent directly to its inner Genetic, Learning and Parenting agents; the parent complex agents of the previous generation are also shown.]
Complex Agent - Action
[Diagram: the actions suggested by the inner Genetic, Learning and Parenting agents pass through the complex agent’s ASF, which selects one of them with probabilities P_Gen, P_Lrn, P_Par and executes it in the environment.]
Experiment (I)
•Measures:
  - Eating-rate: average reward of a given agent (throughout its generation)
  - BER: Best Eating-Rate (in a generation)
•Framework:
  - 20 agents per generation
  - 9500 generations
  - 30000 rounds per generation
•Dependent variable:
  - Success measure (Lambda): average of the BERs over the last 1000 generations
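A minimal illustrative sketch of these measures, assuming per-agent reward sequences are collected for each generation:

    def eating_rate(rewards):
        """Average reward of one agent over its generation."""
        return sum(rewards) / len(rewards)

    def best_eating_rate(rewards_per_agent):
        """BER: the best eating-rate in a generation."""
        return max(eating_rate(r) for r in rewards_per_agent)

    def success_measure(bers, last=1000):
        """Lambda: average of the BERs over the last `last` generations."""
        tail = bers[-last:]
        return sum(tail) / len(tail)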
Experiment (II)
•Environment:
  - Grid: 20 x 20
  - A single food patch, 5 x 5 in size, with per-square food probabilities:

      0.2  0.2  0.2  0.2  0.2
      0.2  0.4  0.4  0.4  0.2
      0.2  0.4  0.8  0.4  0.2
      0.2  0.4  0.4  0.4  0.2
      0.2  0.2  0.2  0.2  0.2
Experiment (III)
•Constant values:
  - m_Gen = 1
  - P_Cros = 0.02
  - P_Mut = 0.005
  - m_Lrn = 1
  - α_Lrn = 0.2
  - γ_Lrn = 0.95
  - f_Temp(n) = 5 · 0.999^n
  - t_Freeze = 0.2
  - m_Par = 1
Experiment (IV)
•Independent variables:
  - Complex agent parameters: the ASF probabilities (P_Gen, P_Lrn, P_Par) (111 combinations)
  - Environment parameter, the “Movement Probability” P_Env: the probability that, in a given round, the food patch moves in a random direction (0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1)
•One run for each combination of values
Results: Static Environment
•Best combination: a Genetic-Parenting hybrid (P_Lrn = 0), with P_Gen > P_Par

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  0            (0.7, 0, 0.3)                0.7988

•Pure genetics doesn’t perform well
  - The GA converges more slowly if not assisted by learning or parenting
•Pure parenting performs poorly
•For a given P_Par, success improves as P_Lrn decreases
(Graph for movement prob. 0)
Results: Low Dynamic Rate
•Best combination: a Genetic-Learning-Parenting hybrid
  - P_Lrn > P_Gen + P_Par
  - P_Par >= P_Gen

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  10^-6        (0.15, 0.7, 0.15)            0.7528
  10^-5        (0, 0.9, 0.1)                0.7011
  10^-4        (0.03, 0.9, 0.07)            0.6021
  10^-3        (0.02, 0.8, 0.18)            0.3647

•Pure parenting performs poorly
(Graph for movement prob. 10^-4)
Results: High Dynamic Rate
•Best combination: pure learning (P_Gen = 0, P_Par = 0)

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  10^-2        (0, 1, 0)                    0.1834
  10^-1        (0, 1, 0)                    0.0698

•Pure parenting performs poorly
•Parenting loses effectiveness:
  - Non-parenting agents have better success
(Graph for movement prob. 10^-2)
Conclusions
•Pure parenting doesn’t work
•Agent algorithm A is defined as an action-augmentor of agent algorithm B if:
  - A and B are always used for receiving perceptions
  - B is applied for executing an action in most steps
  - A is applied for executing an action in at least 50% of the other steps
•In a static environment (C1 + ~C2), parenting helps when used as an action-augmentor for genetics
•In slowly changing environments (C1 + C2), parenting helps when used as an action-augmentor for learning
•In quickly changing environments (C1 only), parenting doesn’t work; pure learning is best
Bibliography (I)
•Genetic Algorithm:
  - R. Axelrod. The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, 1997.
  - H.G. Cobb and J.J. Grefenstette. Genetic algorithms for tracking changing environments. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 523-530, San Mateo, 1993.
•Q-Learning:
  - T.W. Sandholm and R.H. Crites. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems, 37: 147-166, 1996.
•Monte-Carlo methods, Q-Learning, Reinforcement Learning:
  - R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
Bibliography (II)
•Genetic-Learning combinations:
  - G.E. Hinton and S.J. Nowlan. How learning can guide evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 447-454. Addison-Wesley, 1996.
  - T.D. Johnston. Selective costs and benefits in the evolution of learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 315-358. Addison-Wesley, 1996.
  - M. Littman. Simulations combining evolution and learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 465-477. Addison-Wesley, 1996.
  - G. Mayley. Landscapes, learning costs and genetic assimilation. Evolutionary Computation, 4(3): 213-234, 1996.
Bibliography (III)
•Genetic-Learning combinations (cont.):
  - S. Nolfi, J.L. Elman and D. Parisi. Learning and evolution in neural networks. Adaptive Behavior, 3(1): 5-28, 1994.
  - S. Nolfi and D. Parisi. Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, 5(1): 75-98, 1997.
  - D. Parisi and S. Nolfi. The influence of learning on evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 419-428. Addison-Wesley, 1996.
  - P.M. Todd and G.F. Miller. Exploring adaptive agency II: Simulating the evolution of associative learning. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 306-315, San Mateo, 1991.
Bibliography (IV)
•Exploitation vs. Exploration:
  - D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-agent Systems, 2(2): 141-172, 1999.
•Subsumption architecture:
  - R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1): 14-23, March 1986.
Backup - Qualitative Data
Qual. Data: Mov. Prob. 0
[Graph comparing Pure Parenting, Pure Genetics, Pure Learning, and the best combination (0.7, 0, 0.3)]
Qual. Data: Mov. Prob. 10^-4
[Graph comparing Pure Parenting, Pure Learning, and the best combination (0.03, 0.9, 0.07)]
Qual. Data: Mov. Prob. 10^-2
[Graph comparing Pure Parenting, the combination (0.09, 0.9, 0.01), and the best combination (pure learning)]