Learning in Multiagent Systems
Advanced AI Seminar
Michael Weinberg
The Hebrew University of Jerusalem, Israel
March 2003

Agenda
- What is learning in MAS?
- General Characterization
- Learning and Activity Coordination
- Learning about and from Other Agents
- Learning and Communication
- Conclusions

What is Learning
Learning can be informally defined as:
- The acquisition of new knowledge and motor or cognitive skills, and the incorporation of the acquired knowledge and skills in future system activities, provided that this acquisition and incorporation is conducted by the system itself and leads to an improvement in its performance

Learning in Multiagent Systems
- The intersection of DAI and ML
- Why bring them together?
  - There is a strong need to equip multiagent systems with learning abilities
  - The extended view of ML as multiagent learning is qualitatively different from traditional ML and can lead to novel ML techniques and algorithms

General Characterization
- The principal categories of learning
- The features in which learning approaches may differ
- The fundamental learning problem known as the credit-assignment problem

Principal Categories: Centralized Learning (isolated learning)
- Learning executed by a single agent, with no interaction with other agents
- Several centralized learners may pursue different or identical goals at the same time

Principal Categories: Decentralized Learning (interactive learning)
- Several agents are engaged in the same learning process
- Several groups of agents may pursue different or identical learning goals at the same time
- A single agent may be involved in several centralized/decentralized learning processes at the same time

Differencing Features: The degree of decentralization
- Distributedness
- Parallelism

Differencing Features: Interaction-specific features
Classification of the interactions required for realizing a decentralized learning process:
- The level of interaction
- The persistence of interaction
- The frequency of interaction
- The variability of interaction

Differencing Features: Involvement-specific features
Features that characterize the involvement of an agent in a learning process:
- The relevance of involvement
- The role played during involvement

Differencing Features: Goal-specific features
Features that characterize the learning goal:
- The type of improvement that learning is intended to achieve
- The compatibility of the learning goals pursued by the agents

Differencing Features: The learning method
The following learning methods are distinguished:
- Rote learning
- Learning from instruction and by advice taking
- Learning from examples and by practice
- Learning by analogy
- Learning by discovery
The main difference lies in the amount of learning effort required

Differencing Features: The learning feedback
The learning feedback indicates the performance level achieved so far
The following learning feedbacks are distinguished:
- Supervised learning (teacher)
- Reinforcement learning (critic)
- Unsupervised learning (observer)

The Credit-Assignment Problem
- The problem of properly assigning the feedback for an overall performance change to each of the system activities that contributed to that change
- Can be usefully decomposed into two subproblems:
  - The inter-agent CAP
  - The intra-agent CAP

The inter-agent CAP
- The assignment of credit or blame for an overall performance change to the external actions of the agents

The intra-agent CAP
- The assignment of credit or blame for a particular external action of an agent to its underlying internal inferences and decisions

Learning and Activity Coordination
- Previous research on coordination focused on off-line design of behavioral rules, negotiation protocols, etc.
- Agents operating in open, dynamic environments must be able to adapt to changing demands and opportunities
- How can agents learn to appropriately coordinate their activities?

Reinforcement Learning
- Agents choose the next action so as to maximize a scalar reinforcement or feedback received after each action
- The learner's environment can be modeled by a discrete-time, finite-state Markov Decision Process (MDP)

Markov Decision Process (MDP)
- MDP: a reinforcement learning task that satisfies the Markov state property
- A state signal has the Markov property if $\Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, \ldots, s_0, a_0\} = \Pr\{s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t\}$, i.e., the next state and reward depend only on the current state and action

Reinforcement Learning (cont)
- The environment in an MDP is represented by a 4-tuple $\langle S, A, P, r \rangle$:
  - $S$ is a set of states
  - $A$ is a set of actions
  - $P : S \times S \times A \to [0,1]$ is the state transition probability function
  - $r : S \times A \to \mathbb{R}$ is the reward function
- Each agent maintains a policy that maps states into desirable actions

Q-Learning Algorithm
- A reinforcement learning algorithm
- Maintains a table of Q-values: $Q(x,a)$ — "how good is action $a$ in state $x$?"
- Converges to the optimal Q-values with probability 1
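To make the tabular mechanics concrete, here is a minimal Python sketch of a single Q-learning step (not part of the original slides). The environment interface, the state/action encoding, and the constants ALPHA, GAMMA and EPSILON are illustrative assumptions; the adjustment it performs is the update rule spelled out on the following slides.

# Minimal, illustrative tabular Q-learning step (assumed environment interface).
from collections import defaultdict
import random

ALPHA = 0.1    # learning rate alpha (assumed value)
GAMMA = 0.9    # discount factor gamma (assumed value)
EPSILON = 0.1  # exploration rate for epsilon-greedy action selection (assumed)

Q = defaultdict(float)  # Q[(state, action)] -> estimated value, defaults to 0

def select_action(state, actions):
    # Epsilon-greedy selection over the current Q estimates.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_step(env, state, actions):
    # One step: observe x_n, perform a_n, observe y_n and payoff r_n, adjust Q.
    action = select_action(state, actions)
    next_state, reward = env.step(state, action)   # assumed environment API
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
    return next_state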
Q-Learning Algorithm (cont)
At step n the agent performs the following steps:
- Observe its current state $x_n$
- Select and perform action $a_n$
- Observe the subsequent state $y_n$
- Receive the immediate payoff $r_n$
- Adjust its $Q_{n-1}$ values

Discounted Sum of Future Rewards
- Q-learning finds an optimal policy that maximizes the total discounted expected reward
- Discounted reward: a reward received $s$ steps hence is worth less than a reward received now, by a factor of $\gamma^s$

Evaluating the Policy
Under policy $\pi$ the value of state $x$ is:
$V^{\pi}(x) = R_x(\pi(x)) + \gamma \sum_y P_{xy}[\pi(x)]\, V^{\pi}(y)$
The optimal policy $\pi^*$ satisfies:
$V^{*}(x) = \max_a \{\, R_x(a) + \gamma \sum_y P_{xy}[a]\, V^{*}(y) \,\}$

Q-Values
Under policy $\pi$, define the Q-values as:
$Q^{\pi}(x,a) = R_x(a) + \gamma \sum_y P_{xy}[a]\, V^{\pi}(y)$
i.e., the expected return of executing action $a$ and following policy $\pi$ thereafter

Adjusting Q-Values
Update the Q-values as follows:
- If $x = x_n$ and $a = a_n$:
  $Q_n(x,a) = (1-\alpha)\, Q_{n-1}(x,a) + \alpha\, [\, r_n + \gamma V_{n-1}(y_n) \,]$
- Otherwise:
  $Q_n(x,a) = Q_{n-1}(x,a)$
where $V_{n-1}(y) = \max_b Q_{n-1}(y,b)$

Isolated, Concurrent Reinforcement Learners
- Reinforcement learners develop action-selection policies that optimize environmental feedback
- Can be used in domains
  - with no pre-existing domain expertise
  - with no information about other agents
- RL can serve as the basis for new coordination techniques where currently available coordination schemes are ineffective

Isolated, Concurrent Reinforcement Learners
- Each agent learns to optimize its reinforcement from the environment
- Other agents are not explicitly modeled
- An interesting research question is whether it is feasible for such an agent to use the same learning mechanism in both cooperative and non-cooperative environments
(A small sketch of two such isolated learners appears at the end of this section.)

Isolated, Concurrent Reinforcement Learners
- An assumption of most RL techniques is that the dynamics of the environment is not affected by other agents
- This assumption is invalid in domains with multiple, concurrent learners
- Standard RL is probably not adequate for concurrent, isolated learning of coordination

Isolated, Concurrent Reinforcement Learners
The following dimensions were identified to characterize domains amenable to CIRL:
- Agent coupling (tightly/loosely)
- Agent relationships (cooperative/adversarial)
- Feedback timing (immediate/delayed)
- Optimal behavior combinations

Experiments with CIRL
Conclusions:
- Through CIRL both friends and foes can concurrently acquire useful coordination information
- No prior knowledge of the domain is needed
- No explicit model of the capabilities of other agents is required
Limitations:
- Inability to develop effective coordination when agents are strongly coupled, feedback is delayed, and there are only a few optimal behavior combinations

Experiments with CIRL
A possible fix to the last limitation is "lockstep learning":
- Two agents synchronize their behavior so that one is learning while the other is following a fixed policy, and vice versa

Interactive Reinforcement Learning of Coordination
- Agents can explicitly communicate to decide on individual and group actions
- A few algorithms for interactive RL have been proposed:
  - Action Estimation Algorithm
  - Action Group Estimation Algorithm
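As a concrete illustration of isolated, concurrent learners (not from the original slides), the following Python sketch lets two independent learners repeatedly play a small two-action, single-state coordination game; the payoff matrix and the learning constants are illustrative assumptions. Each learner updates only its own table and treats the other agent as part of the environment, which is exactly why the environment looks non-stationary from each learner's point of view.

# Two isolated, concurrent learners in a single-state coordination game (illustrative).
import random

ACTIONS = [0, 1]
ALPHA, EPSILON = 0.1, 0.1   # assumed learning and exploration rates

def payoff(a1, a2):
    # Assumed joint payoff: both agents are rewarded only when their actions match.
    return 1.0 if a1 == a2 else 0.0

q1 = {a: 0.0 for a in ACTIONS}   # agent 1's action-value table
q2 = {a: 0.0 for a in ACTIONS}   # agent 2's action-value table

def choose(q):
    # Epsilon-greedy over a single-state action-value table.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[a])

for _ in range(5000):
    a1, a2 = choose(q1), choose(q2)
    r = payoff(a1, a2)
    # Each learner updates in isolation; the other agent is just "environment".
    q1[a1] += ALPHA * (r - q1[a1])
    q2[a2] += ALPHA * (r - q2[a2])

print(q1, q2)   # the learners typically settle on one of the matching joint actions

In this loosely coupled, immediate-feedback setting the two learners usually converge on a common action; under tighter coupling or delayed feedback, the limitations discussed above appear.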
Learning about and from Other Agents
- Agents learn to improve their individual performance
- They can better capitalize on available opportunities by predicting the behavior of other agents (preferences, strategies, intentions, etc.)

Learning Organizational Roles
- Assume agents have the capability of playing one of several roles in a situation
- Agents need to learn role assignments to effectively complement each other

Learning Organizational Roles
The framework includes Utility, Probability and Cost (UPC) estimates of a role adopted in a particular situation:
- Utility: the worth of the desired final state if the agent adopts the given role in the current situation
- Probability: the likelihood of reaching a successful final state (given the role/situation)
- Cost: the associated computational cost incurred
- Potential: the usefulness of a role in discovering pertinent global information

Learning Organizational Roles: Theoretical Framework
- $S_k, R_k$ are the sets of situations and roles for agent $k$
- An agent maintains $|S_k| \times |R_k|$ vectors of UPC values
- During the learning phase, roles are chosen probabilistically:
  $\Pr(r) = \frac{f(U_{rs}, P_{rs}, C_{rs}, Potential_{rs})}{\sum_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})}$
- $f$ rates a role by combining the component measures

Learning Organizational Roles: Theoretical Framework
- After the learning phase is over, the role to be played in situation $s$ is:
  $r = \arg\max_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})$
- UPC values are learned using reinforcement learning
- UPC estimates after $n$ updates: $\hat{U}^n_{rs}, \hat{P}^n_{rs}, \widehat{Potential}^n_{rs}$

Learning Organizational Roles: Updating the Utility
- $S$ is the set of situations encountered between the time of adopting role $r$ in situation $s$ and reaching a final state $F$ with utility $U_F$
- The utility values for all roles chosen in each of the situations in $S$ are updated:
  $\hat{U}^{n+1}_{rs} = (1-\alpha)\, \hat{U}^n_{rs} + \alpha\, U_F$

Learning Organizational Roles: Updating the Probability
- $O : S \to [0,1]$ returns 1 if the given final state is successful
- The update rule for the probability:
  $\hat{P}^{n+1}_{rs} = (1-\alpha)\, \hat{P}^n_{rs} + \alpha\, O(F)$

Learning Organizational Roles: Updating the Potential
- $Conf(S)$ returns 1 if, on the path to the final state, conflicts are detected and resolved by information exchange
- The update rule for the potential:
  $\widehat{Potential}^{n+1}_{rs} = (1-\alpha)\, \widehat{Potential}^n_{rs} + \alpha\, Conf(S)$
(A small sketch of these UPC updates appears at the end of this section.)

Learning Organizational Roles: Robotic Soccer
- Most implementations of robotic soccer teams use the approach of learning organizational roles
- They use a layered learning methodology:
  - Low-level skills (e.g. shooting the ball)
  - High-level decision making (e.g. whom to pass to)
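A minimal Python sketch of the UPC bookkeeping described above, not taken from the original slides: the rating function f, the learning rate, and the situation/role encoding are illustrative assumptions, and the updates mirror the exponentially weighted rules on the previous slides.

# Illustrative UPC (Utility, Probability, Cost, Potential) estimates for role selection.
import random
from collections import defaultdict

ALPHA = 0.1   # assumed learning rate

# upc[(situation, role)] = [utility, probability, cost, potential] (assumed initial values)
upc = defaultdict(lambda: [0.0, 0.5, 0.1, 0.0])

def f(u, p, c, potential):
    # Assumed rating function combining the component measures.
    return u * p - c + potential

def choose_role(situation, roles):
    # During the learning phase, pick a role with probability proportional to its rating.
    ratings = [max(f(*upc[(situation, r)]), 1e-6) for r in roles]
    total = sum(ratings)
    return random.choices(roles, weights=[x / total for x in ratings])[0]

def update(path, final_utility, success, conflicts_resolved):
    # path: list of (situation, role) pairs visited before reaching the final state F.
    for situation, role in path:
        u, p, c, pot = upc[(situation, role)]
        upc[(situation, role)] = [
            (1 - ALPHA) * u + ALPHA * final_utility,               # utility update
            (1 - ALPHA) * p + ALPHA * (1.0 if success else 0.0),   # probability update
            c,                                                     # cost kept fixed in this sketch
            (1 - ALPHA) * pot + ALPHA * (1.0 if conflicts_resolved else 0.0),  # potential update
        ]

After learning, the same ratings would be used greedily: the agent plays the role that maximizes f, as in the arg-max rule above.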
Learning in Market Environments
- Buyers and sellers trade in electronic marketplaces
- Three types of agents:
  - 0-level agents: do not model the behaviour of others
  - 1-level agents: model others as 0-level agents
  - 2-level agents: model others as 1-level agents

Learning to Exploit an Opponent: Model-Based Approach
- The most prominent approach in AI for developing playing strategies is the minimax algorithm
- It assumes that the opponent will choose the worst move (from the player's point of view)
- An accurate model of the opponent can be used to develop better strategies

Learning to Exploit an Opponent: Model-Based Approach
- The main problem of RL is its slow convergence
- The model-based approach tries to reduce the number of interaction examples needed for learning
- It performs a deeper analysis of past interaction experience

Model-Based Approach
The learning process is split into two separate stages:
- Infer a model of the other agent based on past experience
- Utilize the learned model to design an effective interaction strategy for the future

Inferring a Best-Response Strategy
- Represent the opponent's model as a DFA
- Example: the TFT (tit-for-tat) strategy for the IPD (Iterated Prisoner's Dilemma) game
- Theorem: given a DFA opponent model $\hat{M}$, there exists a best-response DFA $M^{opt}$ that can be computed in time polynomial in the size of $\hat{M}$

Learning Models of Opponents
- The US-L* algorithm infers a DFA that is consistent with the sample of the opponent's behavior
- The US-L* algorithm extends the model according to three guiding principles:
  - Consistency: the new model must be consistent with the given sample
  - Compactness: a smaller model is better
  - Stability: the new model should be as similar as possible to the previous one

Exploring the Opponent's Strategy
- One of the weaknesses of the model-based approach is that it can converge to sub-optimal behavior
- Best-response play ignores the possibility that the current model is not identical to the opponent's strategy
- This is known as the exploration vs. exploitation dilemma

Exploration vs Exploitation
- The learning player has to choose between the wish to exploit the current model and the desire to explore other alternatives
- For stationary opponents it is rational to explore more frequently at early stages
- Use of the Boltzmann distribution
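For concreteness, here is a small Python sketch (not from the original slides) of Boltzmann action selection; the temperature schedule and the value estimates are illustrative assumptions. A high temperature early on gives near-uniform exploration, and as the temperature falls the choice concentrates on the action with the highest estimated value.

# Illustrative Boltzmann (softmax) exploration over estimated action values.
import math
import random

def boltzmann_choice(values, temperature):
    # values: dict mapping action -> current estimated value.
    # Higher temperature -> closer to uniform; lower temperature -> closer to greedy.
    actions = list(values)
    weights = [math.exp(values[a] / temperature) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]

# Assumed cooling schedule: explore frequently at early stages, exploit later.
values = {"cooperate": 0.4, "defect": 0.6}   # hypothetical value estimates
for step in range(1, 1001):
    temperature = max(0.05, 1.0 / step)      # simple cooling schedule (assumed)
    action = boltzmann_choice(values, temperature)
    # ... play `action` against the opponent and update the estimates here ...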
Reducing Communication by Learning
- Learning is a method for reducing the communication load among agents
- Consider the contract-net approach:
  - Broadcasting of task announcements is assumed
  - This causes scalability problems when the number of managers/tasks increases

Reducing Communication in Contract Net
- A flexible learning-based mechanism called addressee learning
- It enables agents to acquire knowledge about the other agents' task-solving abilities
- Tasks may then be assigned more directly

Reducing Communication in Contract Net
- Case-based reasoning is used for knowledge acquisition and refinement
- Humans often solve problems by reusing solutions that worked well for similar problems
- Construct cases: problem-solution pairs

Case-Based Reasoning in Contract Net
- Each agent maintains its own case base
- A case consists of:
  - A task specification $T_i = \langle A_{i1} = V_{i1}, \ldots, A_{im_i} = V_{im_i} \rangle$
  - Information about which agent already solved the task and the quality of the solution
- A similarity measure for tasks is needed

Case-Based Reasoning in Contract Net
- The distance between two attributes, $Dist(A_{ir}, A_{js})$, is domain-specific
- The similarity between two tasks $T_i$ and $T_j$:
  $Similar(T_i, T_j) = \sum_r \sum_s Dist(A_{ir}, A_{js})$
- For a task $T_i$, the set of similar tasks is:
  $S(T_i) = \{\, T_j : Similar(T_i, T_j) \geq 0.85 \,\}$

Case-Based Reasoning in Contract Net
- An agent has to assign task $T_i$ to another agent
- It selects the most appropriate agents by computing their suitability:
  $Suit(A, T_i) = \frac{1}{|S(T_i)|} \sum_{T_j \in S(T_i)} Perform(A, T_j)$

Improving Learning by Communication
Two forms of improving learning by communication are distinguished:
- Learning based on low-level communication (e.g. exchanging missing information)
- Learning based on high-level communication (e.g. mutual explanation)

Improving Learning by Communication
Example: the predator-prey domain
- The predators are Q-learners
- Each predator has limited visual perception
- They exchange sensor data (low-level communication)
- Experiments show that this clearly leads to improved learning results

Some Open Questions
- What are the unique requirements and conditions for multiagent learning?
- Do centralized and decentralized learning qualitatively differ from each other?
- Development of theoretical foundations of decentralized learning
- Applications of multiagent learning in complex real-world environments

Conclusions
- This area is of particular interest to DAI as well as to ML
- We discussed a characterization of learning in MAS
- We showed several concrete learning approaches and the main streams in this area:
  - Learning and activity coordination
  - Learning about and from other agents
  - Learning and communication

Bibliography
- "Multiagent Systems", Chapter 6
- Watkins, "Q-Learning", Machine Learning, vol. 8 (1992)
- Carmel & Markovitch, "Model-based Learning of Interaction Strategies in MAS"
- Stone & Veloso, "Multiagent Systems: A Survey from a Machine Learning Perspective"

Thank You