Combining Genetics, Learning and Parenting
Michael Berger
Based on: "When to Apply the Fifth Commandment: The Effects of Parenting on Genetic and Learning Agents" / Michael Berger and Jeffrey S. Rosenschein, submitted to AAMAS 2004

Abstract Problem
•Hidden state
•Metric defined over the state space
•Condition C1: When the state changes, it changes only to an "adjacent" state
•Condition C2: State changes occur at a low, but positive, rate

The Environment
[Illustration: a grid whose squares carry food probabilities such as 0.2, 0.5 and 0.7]
•2D grid
•Food patches:
  Probabilistic
  Unlimited (reduces analytic complexity)
  May move to adjacent squares (retains structure)
•Cyclic (reduces analytic complexity)

Agent Definitions
•Reward = food presence (0 or 1)
•Perception = <Position, Food Presence>
•Action ∈ {NORTH, EAST, SOUTH, WEST, HALT}
•Memory = <<Per, Ac>, ..., <Per, Ac>, Per>
  Memory length = |Mem| = number of elements in memory
  Number of possible memories = (2 × <Grid Width> × <Grid Height>)^|Mem| × 5^(|Mem|−1)
  Example: on a 20 × 20 grid with |Mem| = 1, there are 2 × 20 × 20 = 800 possible memories
•MAM - Memory-Action Mapper
  A table with one entry for every possible memory
•ASF - Action-Selection Filter

Genetic Algorithm (I)
•The algorithm operates on a complete population, not on a single agent
•Requires the introduction of generations
  Every generation consists of a new group of agents
  Each agent is created at the beginning of a generation and terminated at its end
  An agent's life cycle: Birth --> Run (foraging) --> Possible matings --> Death

Genetic Algorithm (II)
•Each agent carries a gene sequence
  Each gene has a key (a memory) and a value (an action)
  A given memory determines the resulting action
  The gene sequence remains constant during an agent's lifetime
  The gene sequence is determined at the mating stage of the agent's parents

Genetic Algorithm (III)
•Mating consists of two stages:
  Selection stage - determining mating rights, performed according to two principles:
  • Survival of the fittest (as indicated by performance during the lifetime)
  • Preservation of genetic variance
  Offspring-creation stage:
  • One or more parents create one or more offspring
  • Offspring inherit some combination of the parents' gene sequences
•Each of the stages has many variants

Genetic Algorithm Variant
•Selection: discussed later (under the Complex Agent slides).
•Offspring creation: two parents mate and create two offspring (sketched in code below)
  The parents' gene sequences are aligned against one another, and then two processes occur:
  • Random crossover
  • Random mutation
  The resulting pair of gene sequences is inherited by the offspring (one sequence by each offspring).
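The offspring-creation variant above can be pictured with a short sketch. This is a minimal illustration, not the paper's implementation: it assumes gene sequences are stored as ordered lists of (memory key, action value) pairs, already aligned so that entry i of both parents carries the same key; the function and argument names are illustrative.

```python
import random

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def create_offspring(parent1, parent2, p_cross, p_mut, rng=random):
    """Create two offspring gene sequences from two aligned parent sequences.

    parent1, parent2: lists of (memory key, action value) pairs, aligned so
                      that entry i of both parents refers to the same key.
    p_cross: crossover probability for each gene pair (P_Cros^Gen).
    p_mut:   mutation probability for each gene (P_Mut^Gen).
    """
    child1, child2 = [], []
    for (key, v1), (_, v2) in zip(parent1, parent2):
        # Random crossover: with probability p_cross, swap this gene pair's
        # values between the two offspring sequences.
        if rng.random() < p_cross:
            v1, v2 = v2, v1
        # Random mutation: each inherited gene may be replaced by a
        # uniformly chosen action.
        if rng.random() < p_mut:
            v1 = rng.choice(ACTIONS)
        if rng.random() < p_mut:
            v2 = rng.choice(ACTIONS)
        child1.append((key, v1))
        child2.append((key, v2))
    return child1, child2
```

Each offspring then inherits one of the two resulting sequences, matching the variant described above.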
Genetic Inheritance
[Diagram: the aligned gene sequences of Parent1 (K1,V1 ... K5,V5) and Parent2 (K1,U1 ... K5,U5) pass through crossover (here the values of K3 and K4 are swapped between the sequences) and mutation (mutated genes marked with *), producing the two offspring sequences]

Genetic Agent
•MAM: every entry is considered a gene
  First column - possible memory (key)
  Second column - action to take (value)
  No changes after creation
•Parameters:
  m^Gen       Memory length
  P^Gen_Cros  Crossover probability for each gene pair
  P^Gen_Mut   Mutation probability for each gene

Learning Algorithm
•A Reinforcement Learning type algorithm: after performing an action, agents receive a signal informing them how good their choice of action was (in this case, the reward)
•Selected algorithm: Q-learning with Boltzmann exploration

Basic Q-Learning (I)
•Definitions:
  γ               Discount factor (non-negative, less than 1)
  r_j             Reward at round j
  Σ_{i≥0} γ^i · r_{n+i}   Rewards' discounted sum at round n
•Q-learning attempts to maximize an agent's expected rewards' discounted sum as a function of any given memory at any round n

Basic Q-Learning (II)
•Q(s,a) - the "Q-value": the expected discounted sum of future rewards for an agent whose memory is s, that selects action a, and that follows an optimal policy thereafter.
•Q(s,a) is updated every time the agent selects action a when its memory is s. After executing the action, the agent receives reward r and holds memory s'. Q(s,a) is updated as follows (sketched in code below):
  Q(s,a) ← Q(s,a) + α^Lrn · [r + γ^Lrn · max_b Q(s',b) − Q(s,a)]

Basic Q-Learning (III)
•Q(s,a) values can be stored in different forms:
  A neural network
  A table (nicknamed a Q-table)
•When saved as a Q-table, each row corresponds to a possible memory s and each column to a possible action a.
•When an agent holds memory s, should it simply select an action a that maximizes Q(s,a)? Wrong - as the next slides show, exploration is needed.
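The update rule above can be written as a small tabular sketch. It is a generic illustration of the Q(s,a) ← Q(s,a) + α^Lrn[...] rule, assuming memories are hashable objects (e.g., tuples of perceptions and actions); the QTable class and its method names are illustrative, not taken from the paper.

```python
from collections import defaultdict

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

class QTable:
    """Tabular Q-learning: one row per memory s, one column per action a."""

    def __init__(self, alpha, gamma):
        self.alpha = alpha            # learning rate (alpha^Lrn)
        self.gamma = gamma            # rewards' discount factor (gamma^Lrn)
        self.q = defaultdict(float)   # (memory, action) -> Q-value, 0 by default

    def update(self, s, a, r, s_next):
        """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
        best_next = max(self.q[(s_next, b)] for b in ACTIONS)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def greedy_action(self, s):
        """Pure exploitation: the action with the highest Q-value for memory s."""
        return max(ACTIONS, key=lambda a: self.q[(s, a)])
```

As the next slides argue, greedy_action alone is not enough; action selection should use Boltzmann exploration instead.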
Boltzmann Exploration (I)
•Full exploitation of a Q-value might hide other, better Q-values
•Exploration of Q-values is needed, at least in the early stages
•Boltzmann exploration: the probability of selecting action a_i is
  p(a_i) = e^{Q(s,a_i)/t} / Σ_a e^{Q(s,a)/t}

Boltzmann Exploration (II)
•t - an annealing temperature
•At round n: t = f^Lrn_Temp(n)
•t decreases ==> exploration decreases, exploitation increases
  For a given s, the probability of selecting its best Q-value approaches 1 as n increases
•The variant used here adds a freezing temperature t^Lrn_Freeze: when t drops below it, exploration is replaced by full exploitation (see the code sketch further below)

Learning Agent
•MAM: a Q-table (dynamic)
•Parameters:
  m^Lrn           Memory length
  α^Lrn           Learning rate
  γ^Lrn           Rewards' discount factor
  f^Lrn_Temp(n)   Temperature annealing function
  t^Lrn_Freeze    Freezing temperature

Parenting Algorithm
•There is no classical "parenting" algorithm around, so it needs to be simulated
•Selected algorithm: Monte-Carlo (another Reinforcement Learning type algorithm)

Monte-Carlo (I)
•Some similarity to Q-learning:
  A table (nicknamed an "MC-table") stores values ("MC-values") that describe how good it is to take action a given memory s
  The table dictates a policy of action selection
•Major differences from Q-learning:
  The table isn't modified after every round, but only after episodes of rounds (in our case, a generation)
  Q-values and MC-values have different meanings

Monte-Carlo (II)
•The "off-line" version of Monte-Carlo:
  After completing an episode (generation) in which one table has dictated the action-selection policy, a new, second table is constructed from scratch to evaluate how good any action a is for a given memory s
  The second table will dictate the policy in the next episode (generation)
  Equivalent to considering the second table as being built during the current episode, as long as it isn't used in the current episode

Monte-Carlo (III)
•MC(s,a) is defined as the average of all rewards received after memory s was encountered and action a was selected
•What if (s,a) was encountered more than once?
•The "every-visit" variant (see the code sketch further below):
  The average of all subsequent rewards is calculated for each occurrence of (s,a)
  MC(s,a) is the average of all the calculated averages

Monte-Carlo (IV)
•The "every-visit" variant is more suitable than the "first-visit" variant (where only the first encounter with (s,a) counts)
  The environment can change a lot after the first encounter with (s,a)
•Exploration variants are not used here
  For a given memory s, the action a with the highest MC-value is selected
  Full exploitation is used here because we have the experience of the previous episode of rounds

Parenting Agent
•MAM: an MC-table (it doesn't matter whether it is dynamic or static)
  Dictates action selection for offspring only
•ASF: selects between the actions suggested by the two parents with equal chance
•Parameters:
  m^Par   Memory length

Complex Agent (I)
•Contains a genetic agent, a learning agent and a parenting agent in a subsumption architecture
•Mating selection (the stage deferred earlier) occurs among complex agents:
  At a generation's end, each agent's average reward serves as its score
  Agents receive mating rights according to score "strata" (determined by the scores' average and standard deviation)

Complex Agent (II)
•Mediates between the inner agents and the environment
•Perceptions are passed directly to the inner agents
•The actions suggested by all inner agents are passed through an ASF, which selects one of them
•Parameters (abbreviated (P_Gen, P_Lrn, P_Par) below):
  P^Comp_Gen   ASF's prob. to select the genetic action
  P^Comp_Lrn   ASF's prob. to select the learning action
  P^Comp_Par   ASF's prob. to select the parenting action
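The Boltzmann selection rule from the exploration slides above can be sketched as follows. This is a minimal sketch, not the paper's code: the default annealing function 5 · 0.999^n and freezing temperature 0.2 are taken from the experiment constants listed later, and the function name and arguments are illustrative.

```python
import math
import random

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def boltzmann_action(q_values, n, f_temp=lambda n: 5 * (0.999 ** n),
                     t_freeze=0.2, rng=random):
    """Select an action using Boltzmann exploration with a freezing temperature.

    q_values: dict mapping each action to its Q-value for the current memory s.
    n:        current round, fed to the annealing function f_temp.
    f_temp:   temperature annealing function f_Temp(n); the default is the
              experiments' 5 * 0.999^n (an assumption for illustration).
    t_freeze: freezing temperature; below it, full exploitation takes over.
    """
    t = f_temp(n)
    if t < t_freeze:
        # Frozen: exploration is replaced by full exploitation.
        return max(ACTIONS, key=lambda a: q_values[a])
    # p(a_i) = exp(Q(s, a_i) / t) / sum_a exp(Q(s, a) / t)
    weights = [math.exp(q_values[a] / t) for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights)[0]
```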
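The every-visit MC-table construction from the Monte-Carlo slides above can be sketched as well. This is an illustrative interpretation, not the paper's implementation: it assumes the episode history is a list of (memory, action, reward) triples and that "subsequent rewards" for an occurrence of (s,a) means the rewards from that selection onward.

```python
from collections import defaultdict

ACTIONS = ["NORTH", "EAST", "SOUTH", "WEST", "HALT"]

def build_mc_table(history):
    """Build an every-visit MC-table from one finished episode (generation).

    history: list of (memory, action, reward) triples in the order they
    occurred; the reward at index i is the one received after selecting
    that action (an assumption made for this sketch).
    """
    rewards = [r for _, _, r in history]
    per_visit = defaultdict(list)
    for i, (s, a, _) in enumerate(history):
        # Every-visit: for each occurrence of (s, a), average all rewards
        # received from that selection onward.
        tail = rewards[i:]
        per_visit[(s, a)].append(sum(tail) / len(tail))
    # MC(s, a) is the average of all the per-occurrence averages.
    return {sa: sum(avgs) / len(avgs) for sa, avgs in per_visit.items()}

def mc_action(mc_table, s):
    """Full exploitation: pick the action with the highest MC-value for memory s."""
    return max(ACTIONS, key=lambda a: mc_table.get((s, a), 0.0))
```

The parenting agent uses a table built this way to suggest actions to its offspring in the next generation, with no further exploration.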
Complex Agent - Mating
[Diagram: two complex agents of the previous generation and an offspring complex agent of the current generation, each containing genetic, learning and parenting inner agents (memory + MAM) connected through ASFs to the environment; arrows show the mating flow from the parents to the offspring]

Complex Agent - Perception
[Same diagram; arrows show perceptions flowing from the environment through the complex agent directly to all inner agents]

Complex Agent - Action
[Same diagram; arrows show the inner agents' suggested actions passing through the complex agent's ASF, which selects one of them with probabilities P_Gen, P_Lrn, P_Par and executes it in the environment]

Experiment (I)
•Measures:
  Eating-rate: the average reward of a given agent (throughout its generation)
  BER: Best Eating-Rate (in a generation)
•Framework:
  20 agents per generation
  9500 generations
  30000 rounds per generation
•Dependent variable:
  Success measure (Lambda) - the average of the BERs over the last 1000 generations

Experiment (II)
•Environment:
  Grid: 20 x 20
  A single food patch, 5 x 5 in size, with food probabilities:
    0.2 0.2 0.2 0.2 0.2
    0.2 0.4 0.4 0.4 0.2
    0.2 0.4 0.8 0.4 0.2
    0.2 0.4 0.4 0.4 0.2
    0.2 0.2 0.2 0.2 0.2

Experiment (III)
•Constant values:
  m^Gen = 1
  m^Lrn = 1
  P^Gen_Cros = 0.02
  P^Gen_Mut = 0.005
  α^Lrn = 0.2
  γ^Lrn = 0.95
  f^Lrn_Temp(n) = 5 * 0.999^n
  t^Lrn_Freeze = 0.2
  m^Par = 1

Experiment (IV)
•Independent variables:
  Complex agent parameters: the ASF probabilities (P_Gen, P_Lrn, P_Par) - 111 combinations
  Environment parameter: P^Env_Mov, the "Movement Probability" - the probability that in a given round the food patch moves in a random direction (0, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1)
•One run for each combination of values

Results: Static Environment
•Best combination: a Genetic-Parenting hybrid
  P_Lrn = 0
  P_Gen > P_Par

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  0            (0.7, 0, 0.3)                0.7988

•Pure genetics doesn't perform well
  The GA converges more slowly if not assisted by learning or parenting
•Pure parenting performs poorly
•For a given P_Par, success improves as P_Lrn decreases
(Graph for movement prob. 0 in the backup slides)

Results: Low Dynamic Rate
•Best combination: a Genetic-Learning-Parenting hybrid
  P_Lrn > P_Gen + P_Par
  P_Par >= P_Gen

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  10^-6        (0.15, 0.7, 0.15)            0.7528
  10^-5        (0, 0.9, 0.1)                0.7011
  10^-4        (0.03, 0.9, 0.07)            0.6021
  10^-3        (0.02, 0.8, 0.18)            0.3647

•Pure parenting performs poorly
(Graph for movement prob. 10^-4 in the backup slides)

Results: High Dynamic Rate
•Best combination: pure learning (P_Gen = 0, P_Par = 0)

  Mov. Prob.   Best (P_Gen, P_Lrn, P_Par)   Success
  10^-2        (0, 1, 0)                    0.1834
  10^-1        (0, 1, 0)                    0.0698

•Pure parenting performs poorly
•Parenting loses effectiveness: non-parenting agents have better success
(Graph for movement prob. 10^-2 in the backup slides)
Conclusions
•Pure parenting doesn't work
•Agent algorithm A is defined as an action-augmentor of agent algorithm B if:
  A and B are always used for receiving perceptions
  B is applied for executing an action in most steps
  A is applied for executing an action in at least 50% of the remaining steps
•In a static environment (C1 + ~C2), parenting helps when used as an action-augmentor for genetics
•In slowly changing environments (C1 + C2), parenting helps when used as an action-augmentor for learning
•In quickly changing environments (C1 only), parenting doesn't work - pure learning is best

Bibliography (I)
•Genetic Algorithm:
  R. Axelrod. The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration. Princeton University Press, 1997.
  H.G. Cobb and J.J. Grefenstette. Genetic algorithms for tracking changing environments. In Proceedings of the Fifth International Conference on Genetic Algorithms, pages 523-530, San Mateo, 1993.
•Q-Learning:
  T.W. Sandholm and R.H. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37: 147-166, 1996.
•Monte-Carlo methods, Q-Learning, Reinforcement Learning:
  R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Bibliography (II)
•Genetic-Learning combinations:
  G.E. Hinton and S.J. Nowlan. How learning can guide evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 447-454. Addison-Wesley, 1996.
  T.D. Johnston. Selective costs and benefits in the evolution of learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 315-358. Addison-Wesley, 1996.
  M. Littman. Simulations combining evolution and learning. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 465-477. Addison-Wesley, 1996.
  G. Mayley. Landscapes, learning costs and genetic assimilation. Evolutionary Computation, 4(3): 213-234, 1996.

Bibliography (III)
•Genetic-Learning combinations (cont.):
  S. Nolfi, J.L. Elman and D. Parisi. Learning and evolution in neural networks. Adaptive Behavior, 3(1): 5-28, 1994.
  S. Nolfi and D. Parisi. Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, 5(1): 75-98, 1997.
  D. Parisi and S. Nolfi. The influence of learning on evolution. In Adaptive Individuals in Evolving Populations: Models and Algorithms, pages 419-428. Addison-Wesley, 1996.
  P.M. Todd and G.F. Miller. Exploring adaptive agency II: Simulating the evolution of associative learning. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 306-315, San Mateo, 1991.

Bibliography (IV)
•Exploitation vs. Exploration:
  D. Carmel and S. Markovitch. Exploration strategies for model-based learning in multiagent systems. Autonomous Agents and Multi-Agent Systems, 2(2): 141-172, 1999.
•Subsumption architecture:
  R.A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1): 14-23, March 1986.

Backup - Qualitative Data

Qual. Data: Mov. Prob. 0
[Graph comparing Pure Parenting, Pure Genetics, Pure Learning and the best combination (0.7, 0, 0.3)]

Qual. Data: Mov. Prob. 10^-4
[Graph comparing Pure Parenting, Pure Learning and the best combination (0.03, 0.9, 0.07)]

Qual. Data: Mov. Prob. 10^-2
[Graph comparing Pure Parenting, the combination (0.09, 0.9, 0.01) and the best, Pure Learning]