Cooperation via Policy Search and Unconstrained Minimization
Brendan and Yifang
Feb 24 2015
Paper: Learning to Cooperate via Policy Search
Peshkin, Leonid and Kim, Kee-Eung and Meuleau, Nicolas and Kaelbling, Leslie Pack. 2000.

Introduction
• Previously we've been concerned with single agents, oblivious environments, and one-to-one value functions
• Impractical for large/diverse environments
• Impractical for complicated/reactive systems
• A more generalizable approach is desirable
• RL Chapter 8 introduced function approximation to build more accurate value functions from small samples
• Peshkin et al. introduce multiple agents and world-state information asymmetry
• Wide range of potential new applications
• Particularly relevant to game theory
• Compatible with connectionist models

Game Theory
• "Game theory is the study of the ways in which interacting choices of economic agents produce outcomes with respect to the preferences (or utilities) of those agents, where the outcomes in question might have been intended by none of the agents." – Stanford Encyclopedia of Philosophy
• Example – Volunteer's Dilemma: N agents share a common reward if at least one agent volunteers, but volunteering has an associated cost (e.g. the power goes out and at least one person has to make an expensive sat-phone call to the electric company to get power restored)
• Cooperative
• Multi-agent
• Imperfect world state information
• Analytic solutions

Multi-Agent Environments
• Simple generalization
• Instead of a single agent, we have a set of agents, 𝒢
• Each agent has its own action space, A^g
• A is now the joint action space, A = ×_{g∈𝒢} A^g
• Large expansion of potential applications
• Exponential expansion of the search space
  – n agents with k possible states each ⇒ k^n world states
• Makes many approaches computationally impractical

Paper Specifics
• Paper adopts the Partially Observable Identical Payoff Stochastic Game model with some caveats:
  – Picked to be interesting
  – Allows theoretic guarantees
  – Requires a non-trivial distributed algorithm
  – Some game theory applications

Identical Payoff Stochastic Game (IPSG)
• Set of agents in a Markov environment
• All agents share a common reward function
• Represented by ⟨S, p₀, 𝒢, T, R⟩
  – S is a discrete state space
  – p₀ is a probability distribution over the initial state
  – 𝒢 is the collection of agents
  – T: S × A → PD(S) maps a state and joint action to a probability distribution over next states
  – R is the reward function (discussed later)
• Each agent g is represented by ⟨A^g, O^g, B^g⟩
  – O^g is the observation space
  – B^g is the observation function

Reward Questions
• Question: "What is the reward r(t) in POIPSG? The paper says that different actions are taken by each agent to form a compound action, but then the reward is identical? How is this reward calculated?" – Tavish
• R: S × A → ℝ
• S is the state space, A is the joint action space, and ℝ is the reward space
• The reward is an explicit result of the compound action at time t and is shared by all agents

Reward Questions Cont.
• Question: "Its title is about cooperation. Does the reward function r reflect the idea of cooperation?" – Jiyun Luo
• Paper's definition – "Cooperative games are those in which both agents share the same payoff structure."
• Identical payoff stochastic games meet this requirement by definition
• A cooperation effect is seemingly evident in some examples – e.g. soccer
• More generally – "a cooperative game is a game where groups of players ("coalitions") may enforce cooperative behavior, hence the game is a competition between coalitions of players, rather than between individual players."
• That typically entails communication and enforcement not present in the model
• Similarly, joint strategies are not explicit in this model
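To make the shared reward R: S × A → ℝ concrete, here is a minimal Python sketch (our own illustration, not from the paper) of an identical-payoff variant of the volunteer's dilemma, in which the volunteering cost is split equally so that every agent really does receive the same r; the function name and the prize/cost values are hypothetical.

    # Illustrative sketch (not from the paper): an identical-payoff variant of the
    # volunteer's dilemma. Every agent receives the same scalar reward, computed
    # from the compound (joint) action; the volunteering cost is shared equally
    # so the payoff is identical across agents.
    def volunteer_reward(joint_action, prize=10.0, cost=4.0):
        """joint_action: tuple with one 0/1 entry per agent, 1 = volunteer."""
        n = len(joint_action)
        shared_prize = prize if any(joint_action) else 0.0
        shared_cost = cost * sum(joint_action) / n
        r = shared_prize - shared_cost
        return [r] * n  # identical payoff for every agent

    print(volunteer_reward((0, 0, 0)))  # nobody volunteers -> [0.0, 0.0, 0.0]
    print(volunteer_reward((1, 0, 0)))  # one volunteer -> prize minus the shared cost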
Partially Observable IPSG (POIPSG)
• Recall O^g and B^g
• B^g maps a state to a distribution over observations, B^g: S → PD(O^g)
• Normally, B(s) = s for all s (i.e. completely observable)
• In the more general form, the game is "partially observable"
• Each agent observes o^g(t), drawn according to B^g(s(t))
• Even with identical policies, agents act differently based on o^g
• Each agent tries to independently maximize the game value,
  V(π) = Σ_{t=0}^∞ γ^t E_π[r(t)]
  – where γ ∈ [0,1) is the discount factor
  – π is the set of strategies

Asymmetric Information Question
• Question: "The paper investigates the problem in which 'agents all receive the shared reward signal, but have incomplete, unreliable, and generally different perceptions of the world state.' How important is this problem? What are the real world applications of this model?" – Yuankai
• Economics: equal shareholders in a company all benefit from its success, but may have different world state information
• Politics: citizens all have a common goal, but differ in knowledge
• Shared rewards are common (e.g. the volunteer's dilemma), but identical rewards are somewhat unrealistic
• The model arguably lacks the communication and enforcement necessary for more complex economic problems; potentially representable with the observation function?

Model Assumptions
• The state space is discrete
• Agents have finite memory
  – Important for realism
  – The paper outlines a finite state controller for each agent, but uses fully reactive policies in practice
• Factored controller (i.e. distributed as opposed to centralized); each agent has its own sub-policy
  – Not compatible with simple communication models
• Learning is simultaneous
  – Actions happen simultaneously, coordinated by the reward result
• The policy defines the probability of each action as a continuously differentiable function

Joint Controller vs. Factored Controller
• A factored action consists of multiple components, a = ⟨a¹, …, aⁿ⟩
• A joint controller maps observations to a complete joint distribution, PD(A)
• A factored controller uses an independent sub-policy to determine each action component, π^g: O^g → PD(A^g)
• Joint controllers are significantly more powerful than factored controllers because they can coordinate actions
• Any set of factored controllers can be represented by a joint controller, Pr(a) = ∏_{g=1}^{n} Pr(a^g)
• Not all stochastic joint controllers can be represented with a factored controller, i.e. coordinated randomness (see the sketch after the next slide)

Control Question
• "What is the major difference (mathematically) between central control vs. distributed control of factored actions?" – Dr. Yang
• In general, see the previous slide
• In the case of gradient descent, none
• The action probabilities of each agent (and consequently the distributed partial derivatives) are independent
• As such, weight updates can be distributed without cooperation
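The joint-vs-factored distinction above can be seen in a few lines of Python. This is an illustrative sketch (the helper name factored_joint is ours, not the paper's): a factored controller's joint action distribution is just the product of independent per-agent distributions, while a "coordinated randomness" joint distribution has no such factorization.

    import itertools

    # Sketch: the joint action distribution induced by a factored controller is
    # the product of the independent per-agent action distributions.
    def factored_joint(per_agent_dists):
        """per_agent_dists: one dict {action: probability} per agent."""
        joint = {}
        for combo in itertools.product(*(d.items() for d in per_agent_dists)):
            actions = tuple(a for a, _ in combo)
            prob = 1.0
            for _, p in combo:
                prob *= p
            joint[actions] = prob
        return joint

    # Two agents, each choosing 'a' or 'b' uniformly and independently:
    print(factored_joint([{'a': 0.5, 'b': 0.5}, {'a': 0.5, 'b': 0.5}]))
    # -> each of ('a','a'), ('a','b'), ('b','a'), ('b','b') gets probability 0.25

    # A joint controller with coordinated randomness:
    coordinated = {('a', 'a'): 0.5, ('b', 'b'): 0.5}
    # No independent per-agent distributions reproduce it: putting positive
    # probability on ('a','a') and ('b','b') forces positive probability on
    # ('a','b') as well, which `coordinated` assigns zero.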
Algorithm Question
• "What is the REINFORCE algorithm? Please provide details, including derivation. Also show the connection between this generation of RL and neural networks." – Dr. Yang
• In a connectionist network (i.e. an artificial neural network)
  – Given a network of simple independent nodes/agents
  – Each node draws an output y_i based on its input vector x^i and associated weights w_ij
• REINFORCE introduced the concept of policy search to RL
• The major intuition here is that most policies can be represented as a set of weights
• Optimization of the policy can then be seen as a weight optimization problem
• Good problem formulation for techniques like gradient descent
• The authors use a more standard gradient descent for their actual weight adjustments

Simple Artificial Neuron

Algorithm – REINFORCE
• Given a network and an associative immediate-reinforcement learning task
• Simplify the time dependence such that weights are adjusted following receipt of the reinforcement value r at each trial
• Suppose that at each step, each weight w_ij is incremented by an amount
  Δw_ij = α_ij (r − b_ij) e_ij
  – α_ij is the learning rate factor
  – b_ij is the reinforcement baseline
  – e_ij is the characteristic eligibility of w_ij
• Call learning algorithms with this form "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility" (REINFORCE) algorithms
• Note: the paper uses a simpler weight adjustment formulation

Algorithm – Gradient Descent for Policy Search
• Let h be an action/state history at time t, and H_t the set of all possible sequences of length t
  V(π) = Σ_{t=0}^∞ γ^t Σ_{h∈H_t} Pr(h|π) r(t, h)
• i.e., the value of a policy set is, for each time step, the discount factor multiplied by the probability of encountering that history given the policy and the reward of that history – this is just an expected value
• Let w be the vector of weights representing a policy
• We can calculate the partial derivatives (which give us the gradient) of the policy value function with respect to each weight,
  ∂V(π)/∂w_k = Σ_{t=0}^∞ γ^t Σ_{h∈H_t} [∂Pr(h|π)/∂w_k] r(t, h)

Gradient Descent for Policy Search Cont.
• Supposing each time step in a history is represented by ⟨s(t), o(t), a(t), r(t)⟩ (i.e. state, observation, action, reward)
• Expanding the partial derivative we have
  Σ_{t=0}^∞ γ^t Σ_{h∈H_t} Pr(h|π) r(t, h) Σ_{τ=1}^{t} ∂ ln Pr(o(τ, h), a(τ, h) | h_{τ−1}, π) / ∂w_k
• Pr(h|π) assumes complete knowledge, so for a sampled history we substitute our estimate of the gradient,
  γ^t r(t, h) Σ_{τ=1}^{t} ∂ ln Pr(o(τ, h), a(τ, h) | h_{τ−1}, π) / ∂w_k

Algorithm – Distributed Gradient Descent
• Now that we can calculate the gradient for each trial, using gradient descent is trivial (a per-agent sketch follows the Guarantees slide)
• Gradient descent implementations will be covered later
• To distribute the action determination to individual agents (consistent with our model), simply have each agent perform gradient descent locally
• Each agent is given some set of initial weights
• Each agent adjusts its own weights given its local observation and the shared reward for each round

Guarantees
• For factored controllers, distributed gradient descent is equivalent to joint gradient descent
• Thus, distributed gradient descent finds a local optimum
  – Centralized gradient descent finds a local optimum simply because it's a descent algorithm that stops at an optimum
  – Distributed is equivalent to joint
• Every strict Nash equilibrium is a local optimum for gradient descent
• The equivalence isn't two-way (discussed later)
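As a concrete (and simplified) picture of the per-agent updates described above, here is a sketch of one agent running a REINFORCE-style update on a reactive softmax policy over its own observations, driven only by the shared reward. The class name, the softmax parameterization, and the absence of a baseline are our illustrative assumptions, not the paper's exact formulation.

    import math, random

    # Sketch of one agent's local policy-gradient update. Each agent runs this
    # independently; the only shared signal is the common reward r(t).
    class ReactiveAgent:
        def __init__(self, n_obs, n_actions, lr=0.1, gamma=0.95):
            self.w = [[0.0] * n_actions for _ in range(n_obs)]  # one logit per (obs, action)
            self.lr, self.gamma = lr, gamma

        def policy(self, obs):
            logits = self.w[obs]
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]  # softmax over this agent's actions
            z = sum(exps)
            return [e / z for e in exps]

        def act(self, obs):
            probs = self.policy(obs)
            return random.choices(range(len(probs)), weights=probs)[0]

        def update(self, trajectory):
            """trajectory: list of (obs, action, shared_reward) tuples for one episode."""
            # Discounted return built from the shared reward signal.
            ret = sum(self.gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
            # Characteristic eligibility of each weight: d/dw ln pi(action | obs).
            for obs, action, _ in trajectory:
                probs = self.policy(obs)
                for a, p in enumerate(probs):
                    grad_log = (1.0 if a == action else 0.0) - p
                    self.w[obs][a] += self.lr * ret * grad_log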
  V(π₁*, π₂*) ≥ V(π₁, π₂*) and V(π₁*, π₂*) ≥ V(π₁*, π₂)
  for all π₁, π₂
• Nash equilibrium is significant because it entails a stable noncooperative system
• Widely applicable to economic and political strategies, e.g. the Prisoner's Dilemma
• The comparison here is a little odd, since this system is "cooperative"

NE Question
• "If we are modeling two agents in a system, say a user and a machine, to reach a Nash equilibrium, can we restrict what the user can or should do in the process? Is this reasonable? If not, that means the user can do whatever he/she wants and act non-optimally; any comments on the Nash equilibrium under such a condition? How can the machine still optimize for both parties?"
• A NE is two-sided by definition. If a user can improve their return (i.e. isn't acting optimally), it isn't a NE.
• Rational agents are typically a good assumption, though it often requires a very sophisticated model to reflect real life
• Partial observability may be a good way to model mostly rational agents, e.g. a user interacting with a search engine
• The machine may have a dominant strategy irrespective of the user's actions, but it's generally not interesting to model such problems with multiple agents

NE Equivalence Question
• "Can you explain when a local optimum for gradient descent would not be a Nash equilibrium?" – Brad
• Gradient descent is only guaranteed to search locally
• One agent may have a better strategy (far removed) that improves their payoff regardless of the other agents
• "We can construct a value function V(w1, w2) such that for some c, V(·, c) has two modes, one at V(a, c) and the other at V(b, c), such that V(b, c) > V(a, c). Further assume that V(a, ·) and V(b, ·) each have global maxima V(a, c) and V(b, c). Then V(a, c) is a local optimum that is not a Nash equilibrium."

Example – Toy
• Coordination problem
• [State diagram: from s1, the matched joint actions ⟨a,a⟩ / ⟨b,b⟩ and mismatched ⟨a,b⟩ / ⟨b,a⟩ lead through states s2–s6 to payoffs of +10, −10, and +5.]

Experimental Results

Example – Soccer
• 6x5 grid with 2 randomly placed agents and 1 defender
• Agents can {North, South, East, West, Stay, Pass}
• Agents observe who possesses the ball and the status of the surrounding squares
• Defenders can be Random, Greedy (goes out to block the ball), or Defensive (stays in the goal to block)

Soccer optimal solution question
• "The paper shows a distributed learning algorithm in a cooperative multi-agent domain via a 2-agent soccer game example. It points out that an algorithm which can achieve optimal payoff for all agents may not be possible in general. Why not? Can you explain this in class?" – Sicong
• Solving a POIPSG completely is intractable
• Analytical solutions are sometimes possible

Q-learning
• Used for policy search motivation as well as a baseline (a small code sketch follows the results slides below)
• Recall –
  Initialize Q(s, a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
      Choose a from s using a policy derived from Q
      Take action a, observe r, s′
      Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
      s ← s′
    until s is terminal

Experimental Results – Defensive Opponents
Experimental Results – Greedy Opponents
Experimental Results – Random Opponents
Experimental Results – Mixed Opponents
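Referring back to the Q-learning slide, here is a minimal tabular sketch of that pseudocode. The environment interface (reset() returning a state, step(a) returning (reward, next_state, done)) and the epsilon-greedy exploration are assumed stand-ins, not anything specified in the paper.

    import random
    from collections import defaultdict

    def q_learning(env, n_actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
        """Tabular Q-learning; `env` must provide reset() -> s and step(a) -> (r, s', done)."""
        Q = defaultdict(lambda: [0.0] * n_actions)  # Q(s, a), initialized to zeros
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy policy derived from Q
                if random.random() < eps:
                    a = random.randrange(n_actions)
                else:
                    a = max(range(n_actions), key=lambda i: Q[s][i])
                r, s_next, done = env.step(a)
                # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
                s = s_next
        return Q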
Tractability Question
• "The paper gives the example of a tractable soccer game with two learning agents playing against one opponent with a fixed strategy, and then it mentions that it becomes difficult to even store the Q table. The question is, what can be the possible ways to solve complex environments that are useful for practical applications, if any? Discussion on this?" – Tavish
• Recall: Q-learning was mainly introduced as motivation
• Removing the Q table in favor of a compact function approximation based on sampling is much more practical for large examples and often not too bad

Conclusions
• Reinforcement learning can easily be generalized to include multiple agents and applied to game theory
• Distributed gradient descent has nice guarantees and performs well in POIPSGs
• Distributed gradient descent can produce "cooperation" against consistent and coordinated opponents

Convex Optimization: Section 9.1
Stephen Boyd and Lieven Vandenberghe. 2004.

Problem
• Our goal is to solve the unconstrained minimization problem: min f(x)
• where f(x) is convex and twice continuously differentiable
• In this case, a necessary and sufficient condition for x* to be optimal is
  ∇f(x*) = 0
• So what is convex?

Convex, strictly convex
• f(x) is called convex if f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y) for all x, y and all θ ∈ [0, 1]
• f(x) is called strictly convex if the inequality is strict whenever x ≠ y and θ ∈ (0, 1)

Strongly convex
• f(x) is twice continuously differentiable, and it is strongly convex if it satisfies the condition
  ∇²f(x) − mI ⪰ 0 for some m > 0,
  where ∇²f(x) is the Hessian matrix

Strong convexity, cont'd
  ∇²f(x) = PΛPᵀ, where Pᵀ = P⁻¹ and Λ = diag(m₁, m₂, …, mₙ)
  ∇²f(x) − mI = PΛPᵀ − mPIPᵀ = P(Λ − mI)Pᵀ
• Thus, m should be no larger than the smallest eigenvalue of the Hessian matrix.

Lower bound from ∇²f(x) ⪰ mI
• According to Taylor's theorem, we have
  f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½ (y − x)ᵀ ∇²f(z) (y − x)
       ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2) ‖y − x‖₂²
       ≥ f(x) + ∇f(x)ᵀ(ỹ − x) + (m/2) ‖ỹ − x‖₂², where ỹ = x − (1/m) ∇f(x) minimizes the right-hand side over y
       = f(x) − (1/(2m)) ‖∇f(x)‖₂²
• Then we have p* ≥ f(x) − (1/(2m)) ‖∇f(x)‖₂²
• It means that when ‖∇f(x)‖₂² is small, f(x) is near its optimal value.

Upper bound from ∇²f(x) ⪯ MI
• Similar to the lower bound, ∇²f(x) also has an upper bound:
  ∇²f(x) − mI ⪰ 0,  ∇²f(x) − MI ⪯ 0
  f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½ (y − x)ᵀ ∇²f(z) (y − x)
       ≤ f(x) + ∇f(x)ᵀ(y − x) + (M/2) ‖y − x‖₂²
• Taking y = ỹ = x − (1/M) ∇f(x),
  p* ≤ f(ỹ) ≤ f(x) + ∇f(x)ᵀ(ỹ − x) + (M/2) ‖ỹ − x‖₂² = f(x) − (1/(2M)) ‖∇f(x)‖₂²

Condition number of sublevel sets
• We define the width of a convex set C in the direction q, where q is a unit vector:
  W(C, q) = sup_{z∈C} qᵀz − inf_{z∈C} qᵀz
• Further, we have the minimum and maximum width,
  W_min = inf_{‖q‖₂=1} W(C, q),  W_max = sup_{‖q‖₂=1} W(C, q)
• The condition number of a convex set C is
  cond(C) = W_max² / W_min²

• Question 1: What is the Hessian matrix? (From Brad)
• Answer: It is the matrix of second-order partial derivatives.
• Question 2: Given convexity, is a local optimum guaranteed to be a global optimum?
• Answer: Yes, for a convex objective a local optimum is guaranteed to be a global optimum.
• Question 3: Convexity is a strong assumption. Do real-world problems always satisfy this assumption? What if it does not hold?
• Answer: Real-world problems do not always satisfy this assumption. For example,
  maximize f(x, y) subject to x + y < 10, x > 0, y > 0, and (x + 2y > 4 or 2x + y > 4)
  has a non-convex feasible region because of the "or" constraint.
• In that case we can adopt other strategies, e.g. follow a direction other than the gradient-descent direction with a small probability.
• Question 4: Can you explain the math symbols in Section 9.1? (From Jiyun)
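The lower bound p* ≥ f(x) − ‖∇f(x)‖₂²/(2m) derived above is easy to sanity-check numerically. The sketch below (our own illustration, not from the book) uses a strongly convex quadratic whose Hessian is constant, so m is just its smallest eigenvalue.

    import numpy as np

    # f(x) = 0.5 x^T A x - b^T x is strongly convex with m = smallest eigenvalue of A.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite Hessian
    b = np.array([1.0, -2.0])

    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad = lambda x: A @ x - b

    x_star = np.linalg.solve(A, b)        # the unconstrained minimizer (gradient = 0)
    p_star = f(x_star)
    m = np.linalg.eigvalsh(A).min()       # strong convexity constant

    x = np.array([4.0, -3.0])             # an arbitrary trial point
    lower = f(x) - grad(x) @ grad(x) / (2 * m)
    print(p_star >= lower)                # the bound p* >= f(x) - ||grad f(x)||^2 / (2m) holds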
Convex Optimization: Sections 9.2 and 9.3
Stephen Boyd and Lieven Vandenberghe. 2004.

Descent Methods
• Goal: produce a minimizing sequence x^(k), k = 1, …, s.t.
  x^(k+1) = x^(k) + t^(k) Δx^(k), where t^(k) is the step size at step k
• t^(k) > 0 (except when x^(k) is already optimal)
• Focus is on descent, consistent with the convex theme of the book, i.e. f(x^(k+1)) < f(x^(k))
• Recall: ∇f(x^(k))ᵀ(y − x^(k)) ≥ 0 ⇒ f(y) ≥ f(x^(k))
• The search direction Δx^(k) must therefore make an acute angle with the negative gradient:
  ∇f(x^(k))ᵀ Δx^(k) < 0

General descent method
• Algorithm 9.1 –
  given a starting point x ∈ dom f
  repeat
    1. Determine a descent direction Δx.
    2. Line search. Choose a step size t > 0.
    3. Update. x := x + t Δx.
  until the stopping criterion is satisfied.
• To avoid confusion, call the direction of search a ray, {x + tΔx | t ≥ 0}

Descent Direction Question
• "In Section 9.2, how did it decide a descent direction Δx?" – Jiyun Luo
• The book is more theory oriented, so the actual choice of descent direction is left general here
• You could choose any direction that makes an acute angle with the negative gradient and still fall under their convergence proofs

Exact Line Search
• When the cost of the one-dimensional minimization is small compared to the cost of finding the ray itself, we can simply use that minimizer as the step size,
  t = argmin_{s≥0} f(x + sΔx)

Backtracking Line Search
• When exact minimization along the ray is hard, we can approximate it as well, e.g. backtracking line search (a code sketch appears after the parameter-choice question below)
• Algorithm 9.2 –
  given a descent direction Δx for f at x ∈ dom f, α ∈ (0, 0.5), and β ∈ (0, 1)
  t := 1
  while f(x + tΔx) > f(x) + αt ∇f(x)ᵀΔx:
    t := βt
• i.e., start with a unit step size
• Reduce the step size by a factor of β until the stopping condition holds:
  f(x + tΔx) ≤ f(x) + αt ∇f(x)ᵀΔx

Backtracking Stopping Condition Question
• "In the 'while' line of Algorithm 9.2, α is a constant, t is the descent length, Δx is the descent direction; what is the remaining part (∇f(x)ᵀ)? What is its role?" – Jiyun Luo
• ∇f(x)ᵀ denotes the transpose of the gradient
• Recall from convexity that ∇f(x)ᵀΔx < 0
• It follows (non-trivially) that on the interval (0, t₀],
  f(x + tΔx) ≤ f(x) + αt ∇f(x)ᵀΔx
• This lets us bound the optimum and show that the loop terminates

Backtracking Line Search Ex.

Backtracking Stopping Condition Question
• "Could you show us graphically what it looks like when the stopping condition of the backtracking line search holds?" – Brad

Backtracking Stop Ex.

Backtracking Parameter Choice Question
• "For the backtracking line search method, how will the selection of α and β values affect the convergence and accuracy of the result? Also, can the α and β values be estimated beforehand instead of arbitrarily specifying them?" – Tavish
• Estimating optimal α and β values beforehand would be difficult and problem specific
• Choices of α and β affect convergence speed, but precision is guaranteed by your stopping threshold regardless
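Here is a direct sketch of Algorithm 9.2 in Python. The function name, the choice of α = 0.3 and β = 0.8, and the test function are illustrative assumptions; f and grad_f are whatever callables the caller supplies.

    import numpy as np

    # Sketch of Algorithm 9.2 (backtracking line search) over numpy vectors.
    def backtracking_line_search(f, grad_f, x, dx, alpha=0.3, beta=0.8):
        """Return a step size t satisfying f(x + t dx) <= f(x) + alpha t grad_f(x)^T dx."""
        t = 1.0                                # start with unit step size
        slope = float(grad_f(x) @ dx)          # grad f(x)^T dx; negative for a descent direction
        while f(x + t * dx) > f(x) + alpha * t * slope:
            t *= beta                          # shrink the step by a factor of beta
        return t

    # Example on f(x) = x1^2 + 10 x2^2 with the negative gradient as the direction.
    f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
    grad = lambda x: np.array([2 * x[0], 20 * x[1]])
    x0 = np.array([1.0, 1.0])
    print(backtracking_line_search(f, grad, x0, -grad(x0)))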
Line Search Question
• "This is a question about the line search in Chapter 9.2. Since the curve f is convex, I believe there are many other simple and easy ways to approach the minimum point. For instance, initiate t as a not-very-small step length. Then consider f(x), f(x+t), f(x+2t), f(x+3t). These four values trisect the range [x, x+3t]. If f(x+t) and f(x+2t) are the two minimum values among the four, we know that the minimum point must lie within [x+t, x+2t]. From then on, we can shrink the search range and keep doing this trisection; for every iteration after that, the minimum point must lie within the sub-interval between the two minimum values. This algorithm can guarantee a fast convergence speed and requires similar calculation of the f values compared with the gradient descent methods. Hence, my question is, why do we use line search (and backtracking line search) methods in the convex setting? What are their advantages compared with other methods that look also pretty simple, for instance the one I just mentioned?" – Sicong
• This algorithm seems fine, but it is not any faster than backtracking (the range is reduced by a factor of 1/3 as opposed to β at every step)
• Exact line search is the most direct approach you can take, and backtracking is very general while still facilitating their convergence guarantees

Gradient Descent
• We've left the search direction general
• The natural choice is simply the negative gradient, Δx = −∇f(x)
• Plugging back into Algorithm 9.1, this gives Algorithm 9.3 –
  given a starting point x ∈ dom f
  repeat
    1. Δx := −∇f(x).
    2. Line search. Choose a step size t via exact or backtracking line search.
    3. Update. x := x + t Δx.
  until the stopping criterion is satisfied.
• ‖∇f(x)‖₂ ≤ η is a natural stopping condition, i.e. the Euclidean norm of the gradient is small, where ‖x‖₂ = √(x₁² + ⋯ + xₙ²)
• (A small code sketch of this loop appears after the conclusions slide.)

Line Search Questions Cont.
• "Since the setting is convex and twice continuously differentiable, can you compare exact line search, backtracking line search, and Newton's method?" – Sicong
  – Newton's method is a descent method, but it relies on the Hessian rather than the gradient, so it uses a different search direction
  – Its convergence guarantees rely on an additional self-concordance assumption
• "When would one use exact vs. backtracking line search?" – Brad
  – Exact line search typically reduces the required number of steps (not guaranteed)
  – However, computing the exact minimum is often not feasible or efficient
• "What are the major differences between the gradient method with exact line search and the gradient method with backtracking line search? Please illustrate with an example how they differ in reaching the point of convergence." – Tavish

Newton's Method
Gradient Descent with Exact Line Search
Gradient Descent with Backtracking Line Search
Exact vs. Backtracking Speed

Conclusions
• Based on the toy experimental results:
• The gradient method offers roughly linear convergence
• The choice of backtracking parameters has a noticeable but not dramatic effect
• Gradient descent is advantageous due to its simplicity
• Gradient descent may converge slowly in certain cases
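Putting the pieces together, the sketch below implements Algorithm 9.3 (gradient descent with an inlined backtracking line search) and the stopping rule ‖∇f(x)‖₂ ≤ η. All names, defaults, and the example quadratic are our illustrative choices, not prescribed by the book.

    import numpy as np

    # Sketch of Algorithm 9.3: gradient descent with backtracking line search.
    def gradient_descent(f, grad_f, x0, eta=1e-6, alpha=0.3, beta=0.8, max_iter=10000):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= eta:       # stopping criterion ||grad f(x)||_2 <= eta
                break
            dx = -g                            # search direction: negative gradient
            t = 1.0                            # backtracking line search (Algorithm 9.2)
            while f(x + t * dx) > f(x) + alpha * t * float(g @ dx):
                t *= beta
            x = x + t * dx                     # update
        return x

    # Example: minimize the ill-conditioned quadratic f(x) = x1^2 + 10 x2^2.
    f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
    grad = lambda x: np.array([2 * x[0], 20 * x[1]])
    print(gradient_descent(f, grad, [1.0, 1.0]))   # should approach [0, 0]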