The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence
Artificial Intelligence for Smart Grids and Smart Buildings: Technical Report WS-16-04

Proactive Dynamic DCOPs

Khoi Hoang,† Ferdinando Fioretto,† Ping Hou,† Makoto Yokoo,‡ William Yeoh,† and Roie Zivan⋆
† Department of Computer Science, New Mexico State University
‡ Department of Informatics, Kyushu University
⋆ Department of Industrial Engineering and Management, Ben Gurion University of the Negev

Abstract

Current approaches to model dynamism in DCOPs solve a sequence of static problems, reacting to the changes in the environment as the agents observe them. Such approaches thus ignore possible predictions on the evolution of the environment. To overcome this limitation, we introduce the Proactive Dynamic DCOP (PD-DCOP) model, a novel formalism to model dynamic DCOPs in the presence of exogenous uncertainty. In contrast to reactive approaches, PD-DCOPs explicitly model the possible changes to the problem and take such information into account proactively when solving the dynamically changing problem.

Introduction

Distributed Constraint Optimization Problems (DCOPs) are problems where agents need to coordinate their value assignments to maximize the sum of the resulting constraint utilities (Modi et al. 2005; Yeoh and Yokoo 2012). They are well suited for modeling multi-agent coordination problems where the primary interactions are between local subsets of agents (Maheswaran et al. 2004; Farinelli et al. 2008; Ueda, Iwasaki, and Yokoo 2010). Unfortunately, DCOPs only model static problems, that is, problems that do not evolve over time. However, in real-world applications, agents often act in dynamic environments. For instance, in a disaster management scenario, new information (e.g., weather forecasts, priorities on buildings to evacuate) typically becomes available in an incremental manner. Thus, the information flow modifies the environment over time.
Consequently, researchers have introduced the Dynamic DCOP (D-DCOP) model (Petcu and Faltings 2005b; 2007; Lass, Sultanik, and Regli 2008), where utility functions can change during the problem-solving process. All of these models make the common assumption that information on how the problem might change is unavailable. As such, existing approaches react to the changes in the problem and solve the current problem at hand. However, in many applications, information on how the problem might change is indeed available, within some degree of uncertainty. Therefore, in this paper, (i) we introduce the Proactive Dynamic DCOP (PD-DCOP) model, which explicitly models how the DCOP will change over time; and (ii) we develop exact and approximation algorithms with quality guarantees to solve PD-DCOPs proactively.

Background

DCOPs: A Distributed Constraint Optimization Problem (DCOP) is a tuple $\langle \mathcal{A}, \mathcal{X}, \mathcal{D}, \mathcal{F}, \alpha \rangle$, where $\mathcal{A} = \{a_i\}_{i=1}^{p}$ is a set of agents; $\mathcal{X} = \{x_i\}_{i=1}^{n}$ is a set of decision variables; $\mathcal{D} = \{D_x\}_{x \in \mathcal{X}}$ is a set of finite domains, where each variable $x \in \mathcal{X}$ takes values from the set $D_x \in \mathcal{D}$; $\mathcal{F} = \{f_i\}_{i=1}^{m}$ is a set of utility functions, each defined over a set of decision variables: $f_i : \prod_{x \in \mathbf{x}^{f_i}} D_x \to \mathbb{R}^{+} \cup \{\bot\}$, where $\mathbf{x}^{f_i} \subseteq \mathcal{X}$ is the scope of $f_i$ and $\bot$ is a special element used to denote that a given combination of values for the variables in $\mathbf{x}^{f_i}$ is not allowed; and $\alpha : \mathcal{X} \to \mathcal{A}$ is a function that associates each decision variable to one agent. A solution $\sigma$ is a value assignment for a set $\mathbf{x}_\sigma \subseteq \mathcal{X}$ of variables that is consistent with their respective domains. The utility $\mathcal{F}(\sigma) = \sum_{f \in \mathcal{F},\, \mathbf{x}^{f} \subseteq \mathbf{x}_\sigma} f(\sigma)$ is the sum of the utilities across all the applicable utility functions in $\sigma$. A solution $\sigma$ is complete if $\mathbf{x}_\sigma = \mathcal{X}$. The goal is to find an optimal complete solution $\mathbf{x}^{*} = \operatorname{argmax}_{\mathbf{x}} \mathcal{F}(\mathbf{x})$.

Given a DCOP $P$, $G = (\mathcal{X}, E)$ is the constraint graph of $P$, where $\{x, y\} \in E$ iff $\exists f_i \in \mathcal{F}$ such that $\{x, y\} = \mathbf{x}^{f_i}$. A DFS pseudo-tree arrangement for $G$ is a spanning tree $T = \langle \mathcal{X}, E_T \rangle$ of $G$ such that if $f_i \in \mathcal{F}$ and $\{x, y\} \subseteq \mathbf{x}^{f_i}$, then $x$ and $y$ appear in the same branch of $T$. We use $N(a_i) = \{a_j \in \mathcal{A} \mid \{x_i, x_j\} \in E\}$ to denote the neighbors of agent $a_i$.

Dynamic DCOPs: A Dynamic DCOP (D-DCOP) is defined as a sequence of DCOPs with changes between them, without an explicit model of how the DCOP will change over time. Solving a D-DCOP optimally means finding a utility-maximal solution for each DCOP in the sequence. This is therefore a reactive approach, since it does not consider future changes. The advantage of this approach is that solving a D-DCOP is no harder than solving h DCOPs, where h is the horizon of the problem. Petcu and Faltings (2005b) have used this approach to solve D-DCOPs, introducing a super-stabilizing DPOP algorithm that is able to reuse information from previous DCOPs to speed up the search for the solution of the current DCOP. Alternatively, a proactive approach predicts future changes in the D-DCOP and finds robust solutions that require little or no changes despite future changes.
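To make the (static) DCOP objective above concrete, the following is a minimal sketch in Python. It is illustrative only, not the authors' implementation; the names `Function`, `total_utility`, and `brute_force_optimum` are hypothetical, and the brute-force search merely spells out the objective that DCOP algorithms optimize cooperatively and distributedly.

```python
from itertools import product

class Function:
    """A utility function with a scope (tuple of variable names) and a table
    mapping value combinations to utilities; None plays the role of ⊥."""
    def __init__(self, scope, table):
        self.scope = scope      # e.g., ("x1", "x2")
        self.table = table      # e.g., {("a", "b"): 5, ...}

    def utility(self, assignment):
        key = tuple(assignment[v] for v in self.scope)
        return self.table.get(key)  # None if the combination is disallowed

def total_utility(functions, assignment):
    """Sum the utilities of all functions whose scope is covered by the
    (possibly partial) assignment; return None if any combination is ⊥."""
    total = 0
    for f in functions:
        if all(v in assignment for v in f.scope):
            u = f.utility(assignment)
            if u is None:
                return None
            total += u
    return total

def brute_force_optimum(functions, domains):
    """Enumerate complete assignments and keep the utility-maximal one
    (only sensible for tiny problems)."""
    variables = list(domains)
    best, best_u = None, float("-inf")
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        u = total_utility(functions, assignment)
        if u is not None and u > best_u:
            best, best_u = assignment, u
    return best, best_u
```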
Researchers have also proposed other models for D-DCOPs, including a model where agents have deadlines to choose their values (Petcu and Faltings 2007), a model where agents can have imperfect knowledge about their environment (Lass, Sultanik, and Regli 2008), and a model where changes in the constraint graph depend on the value assignments of agents (Zivan et al. 2015).

DPOP and S-DPOP: The Distributed Pseudo-tree Optimization Procedure (DPOP) (Petcu and Faltings 2005a) is a complete inference algorithm composed of three phases:
• Pseudo-tree Generation: Agents coordinate to build a pseudo-tree (Hamadi, Bessière, and Quinqueton 1998).
• UTIL Propagation: Each agent, starting from the leaves of the pseudo-tree, computes the optimal sum of utilities in its subtree for each value combination of the variables in its separator (the separator of x_i contains all ancestors of x_i in the pseudo-tree that are connected to x_i or to one of its descendants). It does so by adding the utilities of its functions with the variables in its separator and the utilities in the UTIL messages received from its children agents, and projecting out its own variables by optimizing over them.
• VALUE Propagation: Each agent, starting from the pseudo-tree root, determines the optimal value for its variables. The root agent does so by choosing the values of its variables from its UTIL computations.

Super-stabilizing DPOP (S-DPOP) (Petcu and Faltings 2005b) is a self-stabilizing extension of DPOP, in which the agents restart the DPOP phases when they detect changes in the problem. S-DPOP reuses information that is not affected by the changes in the problem.
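As a concrete illustration of the UTIL phase, the sketch below (Python; `util_message` and the data layout are hypothetical, not DPOP's actual code) shows how a leaf agent could project its own variable out of its local utility functions, producing a message indexed by the values of its separator. Missing table entries are treated as ⊥ (disallowed).

```python
from itertools import product

def util_message(own_var, own_domain, separator, sep_domains, functions):
    """For each combination of separator values, return the best utility
    achievable by optimizing over the leaf agent's own variable.
    `functions` is a list of (scope, table) pairs, as in the DCOP sketch."""
    message = {}
    for sep_values in product(*(sep_domains[v] for v in separator)):
        context = dict(zip(separator, sep_values))
        best = float("-inf")
        for value in own_domain:
            context[own_var] = value
            total = 0
            for scope, table in functions:
                if all(v in context for v in scope):
                    key = tuple(context[v] for v in scope)
                    total += table.get(key, float("-inf"))  # ⊥ -> -inf
            best = max(best, total)
        message[sep_values] = best  # own variable has been projected out
    return message
```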
Motivating Domain

We now introduce a Dynamic Distributed Electric Vehicle (EV) Charging Scheduling Problem (D-DEVCSP), which we use as a representative domain to motivate the introduction of PD-DCOPs. We consider an n-day scheduling problem with hourly time intervals. In our scenario, we consider a set of houses H, each of which constitutes a charging station for an EV v ∈ H. Neighboring houses are connected to each other via transmission lines, and power is provided by an electricity provider.

The set of houses and transmission lines can be visualized as an undirected graph G = (H, L), whose vertices represent houses and whose edges represent transmission lines connecting pairs of neighboring houses. EVs can be charged exclusively when connected to their own station (i.e., when parked at home), and each EV v needs to be charged for a given minimum amount of energy p_v, which depends on the user's usage and is influenced by the amount of time the EV is driven when not connected to the charging station. In a realistic scenario, such information is inevitably stochastic, as factors such as the users' leaving times and driving distances are outside the control of the decision process. In addition, each house v ∈ H has a background load b_v^t and a maximal energy usage limit q_v^t for each time step t. The latter is a common factor in several European countries. Furthermore, a transmission line l ∈ L is associated with a maximal thermal capacity c_l, which limits the maximal amount of energy that can travel on l at a given time. Users can express their preferences on the charging schedule, which may, for instance, reflect the dynamic price schema imposed by the retailer. The goal is to find a feasible charging schedule for all the EVs that satisfies the capacity constraints of the underlying electricity network while maximizing the utilities associated with the user preferences.

The D-DEVCSP can be formalized as a dynamic DCOP, where each EV v is modeled by an agent a_v, and start times for the charging schedules are modeled via decision variables x_v. Other decision variables model the power flow on the transmission lines l_{uv}, for each pair of neighboring houses u, v. All the capacity constraints, relative to houses and transmission lines, can be captured via hard constraints, and user preferences on scheduling times can be captured via soft constraints. Notice that the D-DCOP formulation can only partially capture the complexity of the D-DEVCSP. Indeed, such a formulation has some shortcomings, as it: (a) fails to capture the presence of exogenous factors in the dynamic aspect of the problem, and (b) does not take into account the inconvenience caused to users by changing their schedules when the preferences of some agents are updated. The PD-DCOP model addresses these shortcomings by acting proactively during the problem resolution, allowing us to take a step toward more refined dynamic solutions.

PD-DCOP Model

A Proactive Dynamic DCOP (PD-DCOP) is a tuple $\langle \mathcal{A}, \mathcal{X}, \mathcal{D}, \mathcal{F}, \mathcal{T}, \gamma, h, c, p^{0}_{\mathcal{Y}}, \alpha \rangle$, where:
• $\mathcal{A} = \{a_i\}_{i=1}^{p}$ is a set of agents.
• $\mathcal{X} = \{x_i\}_{i=1}^{n}$ is a mixed set of decision and random variables. To differentiate between decision and random variables, we use $\mathcal{Y} \subseteq \mathcal{X}$ to denote the set of random variables, which model uncontrollable stochastic events (e.g., weather, malfunctioning devices).
• $\mathcal{D} = \{D_x\}_{x \in \mathcal{X}}$ is a set of finite domains. Each variable $x \in \mathcal{X}$ takes values from the set $D_x \in \mathcal{D}$. We also use $\Omega = \{\Omega_y\}_{y \in \mathcal{Y}} \subseteq \mathcal{D}$ to denote the set of event spaces for the random variables (e.g., different weather conditions or stress levels to which a device is subjected), such that each $y \in \mathcal{Y}$ takes values in $\Omega_y$.
• $\mathcal{F} = \{f_i\}_{i=1}^{m}$ is a set of reward functions, each defined over a mixed set of decision variables and random variables: $f_i : \prod_{x \in \mathbf{x}^{f_i}} D_x \to \mathbb{R}^{+} \cup \{\bot\}$, where $\mathbf{x}^{f_i} \subseteq \mathcal{X}$ is the scope of $f_i$ and $\bot$ is a special element used to denote that a given combination of values for the variables in $\mathbf{x}^{f_i}$ is not allowed.
• $h \in \mathbb{N}$ is a finite horizon within which the agents can change the values of their variables.
• $\mathcal{T} = \{T_y\}_{y \in \mathcal{Y}}$ is the set of transition functions $T_y : \Omega_y \times \Omega_y \to [0, 1] \subseteq \mathbb{R}$ for the random variables $y \in \mathcal{Y}$, describing the probability for a random variable to change its value in successive time steps. For a time step $t > 0$ and values $\omega_i, \omega_j \in \Omega_y$, $T_y(\omega_i, \omega_j) = P(y^{t} = \omega_j \mid y^{t-1} = \omega_i)$, where $y^{t}$ denotes the value of the variable $y$ at time step $t$ and $P$ is a probability measure. Thus, $T_y(\omega_i, \omega_j)$ describes the probability for the random variable $y$ to change its value from $\omega_i$ at time step $t-1$ to $\omega_j$ at time step $t$. Finally, $\sum_{\omega_j \in \Omega_y} T_y(\omega_i, \omega_j) = 1$ for all $\omega_i \in \Omega_y$.
• $c \in \mathbb{R}^{+}$ is a switching cost, defined as the cost associated with changing the value of a decision variable from one time step to the next.
• $\gamma \in [0, 1)$ is a discount factor, which represents the difference in importance between future rewards/costs and present rewards/costs.
• $p^{0}_{\mathcal{Y}} = \{p^{0}_y\}_{y \in \mathcal{Y}}$ is a set of initial probability distributions for the random variables $y \in \mathcal{Y}$.
• $\alpha : \mathcal{X} \setminus \mathcal{Y} \to \mathcal{A}$ is a function that associates each decision variable to one agent.

We assume that the random variables are not under the control of the agents and are independent of the decision variables. Thus, their values are determined solely by their transition functions. Throughout this paper, we refer to decision (resp. random) variables with the letter x (resp. y). We also assume that each agent controls exactly one decision variable (thus, α is a bijection), and that each reward function $f_i \in \mathcal{F}$ is associated with at most one random variable $y_i$ (if multiple random variables are associated with a reward function, they can, w.l.o.g., be merged into a single variable).

The goal of a PD-DCOP is to find a sequence of h+1 assignments $\mathbf{x}^{*}$ for all the decision variables in $\mathcal{X} \setminus \mathcal{Y}$:

$$\mathbf{x}^{*} = \operatorname*{argmax}_{\mathbf{x} = \langle \mathbf{x}^{0}, \ldots, \mathbf{x}^{h} \rangle \in \Sigma^{h+1}} F^{h}(\mathbf{x}) \tag{1}$$

$$F^{h}(\mathbf{x}) = \sum_{t=0}^{h-1} \gamma^{t} \left[ F_{x}^{t}(\mathbf{x}^{t}) + F_{y}^{t}(\mathbf{x}^{t}) \right] \tag{2}$$
$$\qquad\qquad - \sum_{t=0}^{h-1} \gamma^{t} \, c \cdot \Delta(\mathbf{x}^{t}, \mathbf{x}^{t+1}) \tag{3}$$
$$\qquad\qquad + \tilde{F}_{x}(\mathbf{x}^{h}) + \tilde{F}_{y}(\mathbf{x}^{h}) \tag{4}$$

where $\Sigma$ is the assignment space for the decision variables of the PD-DCOP at each time step. Equation (2) refers to the optimization over the first h time steps, with:

$$F_{x}^{t}(\mathbf{x}) = \sum_{f_i \in \mathcal{F} \setminus \mathcal{F}_{\mathcal{Y}}} f_i(\mathbf{x}_i) \tag{5}$$
$$F_{y}^{t}(\mathbf{x}) = \sum_{f_i \in \mathcal{F}_{\mathcal{Y}}} \sum_{\omega \in \Omega_{y_i}} f_i(\mathbf{x}_i|_{y_i = \omega}) \cdot p^{t}_{y_i}(\omega) \tag{6}$$

where $\mathbf{x}_i$ is an assignment for all the variables in $\mathbf{x}^{f_i}$; we write $\mathbf{x}_i|_{y_i = \omega}$ to indicate that the random variable $y_i \in \mathbf{x}^{f_i}$ takes on the event $\omega \in \Omega_{y_i}$; $\mathcal{F}_{\mathcal{Y}} = \{f_i \in \mathcal{F} \mid \mathbf{x}^{f_i} \cap \mathcal{Y} \neq \emptyset\}$ is the set of functions in $\mathcal{F}$ that involve random variables; and $p^{t}_{y_i}(\omega)$ is the probability for the random variable $y_i$ to assume value $\omega$ at time $t$, defined as

$$p^{t}_{y_i}(\omega) = \sum_{\omega' \in \Omega_{y_i}} T_{y_i}(\omega', \omega) \cdot p^{t-1}_{y_i}(\omega'). \tag{7}$$

Equation (3) takes into account the penalties due to the changes in the decision variables' values during the optimization process, where $\Delta : \Sigma \times \Sigma \to \mathbb{N}$ is a function that counts the number of assignments to decision variables that differ from one time step to the next. Equation (4) refers to the optimization over the last time step, which further accounts for discounted future rewards:

$$\tilde{F}_{x}(\mathbf{x}) = \frac{\gamma^{h}}{1-\gamma} F_{x}^{h}(\mathbf{x}) \tag{8}$$
$$\tilde{F}_{y}(\mathbf{x}) = \sum_{f_i \in \mathcal{F}_{\mathcal{Y}}} \sum_{\omega \in \Omega_{y_i}} \tilde{f}_i(\mathbf{x}_i|_{y_i = \omega}) \cdot p^{h}_{y_i}(\omega) \tag{9}$$
$$\tilde{f}_i(\mathbf{x}_i|_{y_i = \omega}) = \gamma^{h} \cdot f_i(\mathbf{x}_i|_{y_i = \omega}) + \gamma \sum_{\omega' \in \Omega_{y_i}} T_{y_i}(\omega, \omega') \cdot \tilde{f}_i(\mathbf{x}_i|_{y_i = \omega'}) \tag{10}$$

In summary, the goal of a PD-DCOP is to find an assignment of values to its decision variables that maximizes the sum of two terms. The first term maximizes the discounted net utility, that is, the discounted rewards for the functions that do not involve exogenous factors (F_x) and the expected discounted random rewards (F_y), minus the discounted penalties, over the first h time steps. The second term maximizes the discounted future rewards for the problem.
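To illustrate how the random variables enter the objective, here is a small sketch (Python; the function and variable names are illustrative, not taken from the paper's implementation) of the probability update of Equation (7) and the expected reward of Equation (6) for a single function with one associated random variable:

```python
def propagate(p_prev, T):
    """Equation (7): p^t(ω) = Σ_ω' T(ω', ω) · p^{t-1}(ω').
    `p_prev` maps events to probabilities; `T[ω'][ω]` is the transition
    probability from ω' to ω."""
    events = list(p_prev)
    return {w: sum(T[w_prev][w] * p_prev[w_prev] for w_prev in events)
            for w in events}

def expected_reward(f_table, x_i, p_t):
    """Equation (6), restricted to one function: Σ_ω f_i(x_i | y_i = ω) · p^t(ω).
    `f_table[(x_i, ω)]` is the reward when the decision variables take x_i
    and the random variable takes event ω."""
    return sum(f_table[(x_i, w)] * p_t[w] for w in p_t)

# A toy example with two weather events and a fixed decision value:
T = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
p0 = {"sunny": 0.5, "rainy": 0.5}
f = {(("on",), "sunny"): 10, (("on",), "rainy"): 2}

p1 = propagate(p0, T)                   # distribution at t = 1: {sunny: 0.6, rainy: 0.4}
print(expected_reward(f, ("on",), p1))  # expected reward 6.8
```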
D-DEVCSP as a PD-DCOP: PD-DCOPs can naturally handle the dynamic character of the D-DEVCSP presented in the previous section. Uncontrollable events, such as user leaving times and travel distances affecting the agents' charging-time decisions, can be modeled via random variables. In particular, in our model each agent's variable is associated with two random variables y_v^l, y_v^d ∈ Y, describing the different user leaving times and travel distances per day, which can affect the agents' preferences. Hourly user profiles on both leaving times and travel distances can be modeled via transition functions T_y(t, t'), for every pair of times of the day t, t'. Rescheduling EV charging over time is inconvenient for the EV drivers, thus it is important to find stable schedules past the given horizon. This characteristic is modeled via switching costs for each EV schedule, which are incurred when a charging plan is forced to be rescheduled. Finally, the PD-DCOP horizon captures the dynamic character of the domain.

PD-DCOP Algorithms

We introduce two approaches: an exact approach, which transforms a PD-DCOP into an equivalent DCOP and solves it using any off-the-shelf DCOP algorithm, and an approximation approach, which uses local search.

Exact Approach

We first propose an exact approach, which transforms a PD-DCOP into an equivalent DCOP and solves it using any off-the-shelf DCOP algorithm. Since the transition of each random variable is independent of the assignment of values to decision variables, this problem can be viewed as a Markov chain. Thus, it is possible to collapse an entire PD-DCOP into a single DCOP, where (1) each reward function F_i in this new DCOP captures the sum of rewards of the reward function f_i ∈ F across all time steps, and (2) the domain of each decision variable is the set of all possible combinations of values of that decision variable across all time steps. However, this process needs to be done in a distributed manner.
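As a sketch of the collapse step, the snippet below (Python; the names are illustrative, not the paper's code) builds the domain of a collapsed decision variable as all value combinations across the h+1 time steps, which is what makes the collapsed DCOP exponentially larger in the horizon:

```python
from itertools import product

def collapsed_domain(domain, horizon):
    """Domain of a collapsed variable: one tuple of h+1 values per element,
    i.e., |domain| ** (horizon + 1) combinations."""
    return list(product(domain, repeat=horizon + 1))

# Example: a variable with 3 values and horizon h = 3 yields 3**4 = 81 tuples.
print(len(collapsed_domain(["a", "b", "c"], horizon=3)))  # 81
```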
We divide the reward functions into two types: (1) the functions f_i ∈ F whose scope includes exclusively decision variables ($\mathbf{x}^{f_i} \cap \mathcal{Y} = \emptyset$), and (2) the functions f_i ∈ F whose scope includes some random variable ($\mathbf{x}^{f_i} \cap \mathcal{Y} \neq \emptyset$). In both cases, let $\mathbf{x}_i = \langle \mathbf{x}_i^{0}, \ldots, \mathbf{x}_i^{h} \rangle$ denote the vector of value assignments to all decision variables in $\mathbf{x}^{f_i}$ for each time step.

Then, each function $f_i \in \mathcal{F}$ whose scope includes only decision variables can be replaced by a function $F_i$:

$$F_i(\mathbf{x}_i) = \left[ \sum_{t=0}^{h-1} \gamma^{t} \cdot f_i(\mathbf{x}_i^{t}) \right] + \left[ \frac{\gamma^{h}}{1-\gamma} f_i(\mathbf{x}_i^{h}) \right] \tag{11}$$

where the first term with the summation is the reward for the first h time steps and the second term is the reward for the remaining time steps. Each function $f_i \in \mathcal{F}$ whose scope includes random variables can be replaced by a unary function $F_i$:

$$F_i(\mathbf{x}_i) = \sum_{t=0}^{h-1} \gamma^{t} \sum_{\omega \in \Omega_{y_i}} f_i(\mathbf{x}_i^{t}|_{y_i = \omega}) \cdot p^{t}_{y_i}(\omega) \tag{12}$$
$$\qquad\qquad + \sum_{\omega \in \Omega_{y_i}} \tilde{f}_i(\mathbf{x}_i^{h}|_{y_i = \omega}) \cdot p^{h}_{y_i}(\omega) \tag{13}$$

where the first term (Equation (12)) is the reward for the first h time steps and the second term (Equation (13)) is the reward for the remaining time steps. The function $\tilde{f}_i$ is recursively defined according to Equation (10). Additionally, each decision variable $x_i$ will have a unary function $C_i$:

$$C_i(\mathbf{x}_i) = - \sum_{t=0}^{h-1} \gamma^{t} \, c \cdot \Delta(\mathbf{x}_i^{t}, \mathbf{x}_i^{t+1}) \tag{14}$$

which captures the cost of switching values across time steps. This collapsed DCOP can then be solved with any off-the-shelf DCOP algorithm.

Theorem 1 This collapsed DCOP is equivalent to the original PD-DCOP.
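The following sketch (Python, illustrative only) shows how Equations (11) and (14) could be evaluated for a collapsed assignment, i.e., a tuple of values across the h+1 time steps; `f_i` is assumed to be a callable that returns the reward of a single time step's assignment:

```python
def collapsed_reward(f_i, x_tuple, gamma):
    """Equation (11): discounted rewards for t = 0..h-1 plus the
    geometric tail gamma^h / (1 - gamma) applied to the last assignment."""
    h = len(x_tuple) - 1
    head = sum(gamma ** t * f_i(x_tuple[t]) for t in range(h))
    tail = (gamma ** h / (1.0 - gamma)) * f_i(x_tuple[h])
    return head + tail

def switching_cost(x_tuple, gamma, c):
    """Equation (14): discounted penalty c for every change of value
    between consecutive time steps (note the negative sign)."""
    h = len(x_tuple) - 1
    return -sum(gamma ** t * c
                for t in range(h) if x_tuple[t] != x_tuple[t + 1])
```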
Approximation Approach

Since solving PD-DCOPs is PSPACE-hard, incomplete approaches are necessary to solve interesting problems. Our local search algorithm to solve PD-DCOPs is inspired by MGM (Maheswaran, Pearce, and Tambe 2006), which has been shown to be robust in dynamically changing environments. Algorithm 1 shows its pseudocode, where each agent a_i maintains the following data structures:
• iter is the current iteration number.
• context is a vector of tuples (x_j, t, v_j^t) for all its neighboring variables x_j ∈ N(a_i). Each of these tuples represents the agent's assumption that the variable x_j is assigned value v_j^t at time step t.
• ⟨v_i^0, v_i^1, ..., v_i^h⟩ is a vector of the agent's current value assignment for its variable x_i at each time step t.
• ⟨v_i^{0*}, v_i^{1*}, ..., v_i^{h*}⟩ is a vector of the agent's best value assignment for its variable x_i at each time step t.
• ⟨u_i^0, u_i^1, ..., u_i^h⟩ is a vector of the agent's utility (rewards from reward functions minus costs from switching costs) given its current value assignment at each time step t.
• ⟨u_i^{0*}, u_i^{1*}, ..., u_i^{h*}⟩ is a vector of the agent's best utility given its best value assignment at each time step t.
• ⟨û_i^0, û_i^1, ..., û_i^h⟩ is a vector of the agent's best gain in utility at each time step t.

Algorithm 1: LOCALSEARCH()
1   iter ← 1
2   ⟨v_i^{0*}, v_i^{1*}, ..., v_i^{h*}⟩ ← ⟨Null, Null, ..., Null⟩
3   ⟨v_i^0, v_i^1, ..., v_i^h⟩ ← INITIALASSIGNMENT()
4   context ← ⟨(x_j, t, Null) | x_j ∈ N(a_i), 0 ≤ t ≤ h⟩
5   Send VALUE(⟨v_i^0, v_i^1, ..., v_i^h⟩) to all neighbors

Procedure CALCGAIN()
6   ⟨u_i^0, ..., u_i^h⟩ ← CALCUTILS(⟨v_i^0, ..., v_i^h⟩)
7   u* ← −∞
8   foreach ⟨d_i^0, ..., d_i^h⟩ in ×_{t=0}^{h} D_{x_i} do
9       u ← CALCCUMULATIVEUTIL(⟨d_i^0, ..., d_i^h⟩)
10      if u > u* then
11          u* ← u
12          ⟨v_i^{0*}, ..., v_i^{h*}⟩ ← ⟨d_i^0, ..., d_i^h⟩
13  if u* ≠ −∞ then
14      ⟨u_i^{0*}, ..., u_i^{h*}⟩ ← CALCUTILS(⟨v_i^{0*}, ..., v_i^{h*}⟩)
15      ⟨û_i^0, ..., û_i^h⟩ ← ⟨u_i^{0*}, ..., u_i^{h*}⟩ − ⟨u_i^0, ..., u_i^h⟩
16  else
17      ⟨û_i^0, ..., û_i^h⟩ ← ⟨Null, ..., Null⟩
18  Send GAIN(⟨û_i^0, ..., û_i^h⟩) to all neighbors

Procedure When Receive VALUE(⟨v_s^{0*}, ..., v_s^{h*}⟩)
19  foreach t from 0 to h do
20      if v_s^{t*} ≠ Null then
21          update (x_s, t, v_s^t) ∈ context with (x_s, t, v_s^{t*})
22  if received VALUE messages from all neighbors in this iteration then
23      CALCGAIN()
24      iter ← iter + 1

Procedure When Receive GAIN(⟨û_s^0, ..., û_s^h⟩)
25  if ⟨û_s^0, ..., û_s^h⟩ ≠ ⟨Null, ..., Null⟩ then
26      foreach t from 0 to h do
27          if û_i^t ≤ 0 ∨ û_s^t > û_i^t then v_i^{t*} ← Null
28  if received GAIN messages from all neighbors in this iteration then
29      foreach t from 0 to h do
30          if v_i^{t*} ≠ Null then
31              v_i^t ← v_i^{t*}
32      Send VALUE(⟨v_i^{0*}, v_i^{1*}, ..., v_i^{h*}⟩) to all neighbors

Function CALCUTILS(⟨v_i^0, ..., v_i^h⟩)
34  foreach t from 0 to h do
35      if t = 0 then
36          c_i^t ← cost(v_i^0, v_i^1)
37      else if t = h then
38          c_i^t ← cost(v_i^{h−1}, v_i^h)
39      else
40          c_i^t ← cost(v_i^{t−1}, v_i^t) + cost(v_i^t, v_i^{t+1})
41      u_i^t ← Σ_{f_j : x_i ∈ x^{f_j}} f_j(v_i^t, v_j^t | x_j ∈ x^{f_j}, (x_j, t, v_j^t) ∈ context) − c_i^t
42  return ⟨u_i^0, ..., u_i^h⟩

Function CALCCUMULATIVEUTIL(⟨v_i^0, ..., v_i^h⟩)
43  c ← 0
44  foreach t from 0 to h − 1 do
45      c ← c + cost(v_i^t, v_i^{t+1})
46  u ← Σ_{t=0}^{h} Σ_{f_j : x_i ∈ x^{f_j}} f_j(v_i^t, v_j^t | x_j ∈ x^{f_j}, (x_j, t, v_j^t) ∈ context)
47  return u − c
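To make the utility computations in CALCUTILS and CALCCUMULATIVEUTIL concrete, here is a small sketch (Python; the names and the `local_reward(t, value, context)` helper are assumptions for illustration, not the authors' code):

```python
def calc_utils(values, local_reward, context, cost):
    """Per-time-step net utility: local reward at each time step minus the
    switching costs at that time step's boundaries (cf. lines 34-42)."""
    h = len(values) - 1
    utils = []
    for t in range(h + 1):
        if h == 0:
            c_t = 0
        elif t == 0:
            c_t = cost(values[0], values[1])
        elif t == h:
            c_t = cost(values[h - 1], values[h])
        else:
            c_t = cost(values[t - 1], values[t]) + cost(values[t], values[t + 1])
        utils.append(local_reward(t, values[t], context) - c_t)
    return utils

def calc_cumulative_util(values, local_reward, context, cost):
    """Total reward across all time steps minus the total switching cost
    (cf. lines 43-47)."""
    h = len(values) - 1
    c = sum(cost(values[t], values[t + 1]) for t in range(h))
    u = sum(local_reward(t, values[t], context) for t in range(h + 1))
    return u - c

# `cost` mirrors the pseudocode's cost(): the switching cost c if the two
# values differ and 0 otherwise, e.g.: cost = lambda a, b: 50 if a != b else 0
```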
The high-level ideas are as follows: (1) Each agent a_i starts by finding an initial value assignment to its variable x_i for each time step 0 ≤ t ≤ h and initializes its context. (2) Each agent uses VALUE messages to ensure that it has the correct assumption on its neighboring agents' variables' values. (3) Each agent computes its current utilities given its current value assignments, its best utilities over all possible value assignments, and its best gain in utilities, and sends this gain in a GAIN message to all its neighbors. (4) Each agent changes the value of its variable for time step t if its gain for that time step is the largest over all its neighbors' gains for that time step, and repeats Steps 2 through 4 until a termination condition is met.

In more detail:

Step 1: Each agent initializes its vector of best values to a vector of Nulls (line 2) and calls INITIALASSIGNMENT to initialize its current values (line 3). The values can be initialized randomly or according to some heuristic function; we describe one such heuristic later. Finally, the agent initializes its context, where it assumes that the values of its neighbors are null for all time steps (line 4).

Step 2: The agent sends its current value assignment in a VALUE message to all neighbors (line 5). When it receives a VALUE message from a neighbor, it updates its context with the value assignments in that message (lines 19-21). When it has received VALUE messages from all neighbors in the current iteration, its context correctly reflects the neighbors' actual values. It then calls CALCGAIN to start Step 3 (line 23).

Step 3: In the CALCGAIN procedure, the agent calls CALCUTILS to calculate its utility for each time step given its current value assignments and its neighbors' current value assignments recorded in its context (line 6). The utility for a time step t is made up of two components. The first component is the sum of rewards over all reward functions f_j with x_i ∈ x^{f_j} that involve the agent, under the assumption that the agent takes on its current value and its neighbors take on their values according to its context. The second component is the cost of switching values from the previous time step t−1 to the current time step t and from the current time step to the next time step t+1. This cost is c if the values in two subsequent time steps are different and 0 otherwise. The variable c_i^t captures this cost (lines 35-40). The (net) utility is thus the reward from the reward functions minus the switching cost (line 41). The agent then searches over all possible combinations of values for its variable across all time steps to find the best value assignment, that is, the one that results in the largest cumulative utility across all time steps (lines 8-12). It then computes the net gain in utility at each time step by subtracting the utility of the current value assignment from the utility of the best value assignment at each time step (lines 13-15).

Step 4: The agent sends its gains in a GAIN message to all neighbors (line 18). When it receives a GAIN message from a neighbor, it updates its best value v_i^{t*} for time step t to null if its own gain is non-positive (i.e., û_i^t ≤ 0) or its neighbor has a larger gain (i.e., û_s^t > û_i^t) for that time step (line 27). When it has received GAIN messages from all neighbors in the current iteration, it has identified, for each time step, whether its gain is the largest over all its neighbors' gains. The time steps where it has the largest gain are exactly those time steps t where v_i^{t*} is not null. The agent thus assigns its best value for these time steps as its current value and restarts Step 2 by sending a VALUE message that contains its new values to all its neighbors (lines 29-32).

Related Work

Aside from the D-DCOPs described in the introduction and background, several approaches have been proposed to proactively solve centralized Dynamic CSPs, where value assignments of variables or utilities of constraints may change according to some probabilistic model (Wallace and Freuder 1998; Holland and O'Sullivan 2005). The goal is typically to find a solution that is robust to possible changes. Other related models include Mixed CSPs (Fargier, Lang, and Schiex 1996), which model decision problems under uncertainty by introducing state variables that are not under the control of the solver and seek assignments that are consistent with any state of the world; and Stochastic CSPs (Walsh 2002; Tarim, Manandhar, and Walsh 2006), which introduce probability distributions associated with the outcomes of state variables and seek solutions that maximize the probability of constraint consistency. While these proactive approaches have been used to solve CSP variants, they have not been used to solve Dynamic DCOPs to the best of our knowledge.

Researchers have also introduced Markovian D-DCOPs (MD-DCOPs), which model D-DCOPs with state variables that are beyond the control of agents (Nguyen et al. 2014). However, they assume that the state is observable to the agents, while PD-DCOPs assume otherwise. Additionally, agents do not incur a cost for changing values in MD-DCOPs, and only a reactive online approach to solving the problem has been proposed thus far.

Another related body of work is Decentralized Markov Decision Processes (Dec-MDPs) and Decentralized Partially Observable MDPs (Dec-POMDPs) (Bernstein et al. 2002). In a Dec-MDP, each agent can observe its local state (the global state is the combination of all local states). In a Dec-POMDP, agents may not accurately observe their local states and, thus, maintain a belief over their local states. The goal is to find a policy that maps each local state (in a Dec-MDP) or each belief (in a Dec-POMDP) to an action for each agent. Thus, like PD-DCOPs, they too solve a sequential decision-making problem. However, Dec-MDPs and Dec-POMDPs are typically solved in a centralized manner (Bernstein et al. 2002; Becker et al. 2004; Dibangoye et al. 2013; Hansen, Bernstein, and Zilberstein 2004; Szer, Charpillet, and Zilberstein 2005; Oliehoek et al. 2013) due to their high complexity: solving Dec-(PO)MDPs optimally is NEXP-hard even for the case with only two agents (Bernstein et al. 2002).
In contrast, PD-DCOPs are solved in a decentralized manner, and their complexity is only PSPACE-hard. The reason for the lower complexity is that the solutions of PD-DCOPs are open-loop policies, that is, policies that are independent of state observations. In summary, one can view DCOPs and Dec-(PO)MDPs as two ends of a spectrum of offline distributed planning models. In terms of expressiveness, DCOPs can solve single-time-step problems, while Dec-(PO)MDPs can solve sequential problems. However, DCOPs are only NP-hard, while Dec-(PO)MDPs are NEXP-hard. PD-DCOPs attempt to balance the trade-off between expressiveness and complexity by searching for open-loop policies instead of the closed-loop policies of Dec-(PO)MDPs. They are thus more expressive than DCOPs at the cost of a higher complexity, yet not as expressive as Dec-(PO)MDPs, but also without their prohibitive complexity.

Experimental Results

We empirically evaluate our three PD-DCOP algorithms: Collapsed DPOP (C-DPOP), which collapses the PD-DCOP into a DCOP and solves it with DPOP; Local Search (Random) (LS-RAND), which runs the local search algorithm with random initial values; and Local Search (S-DPOP) (LS-SDPOP), which runs the local search algorithm with initial values found by S-DPOP. In contrast to many experiments in the literature, our experiments are performed on an actual distributed system, where each agent is an Intel i7 Quadcore 3.4GHz machine with 16GB of RAM, connected in a local area network. We thus report actual distributed runtimes. We impose a timeout of 30 minutes and a memory limit of 16 GB. Results are averaged over 30 runs. We use the following default configuration: number of agents and decision variables |A| = |X \ Y| = 12; number of random variables |Y| = 0.25 · |X \ Y|; domain size |D_x| = |Ω_y| = 3; horizon h = 3; switching cost c = 50; constraint densities p_1^a = p_1^b = p_1^c = 0.5 (p_1^a is the density of functions between two decision variables, p_1^b is the density of functions between a decision variable and a random variable, and p_1^c is the fraction of decision variables that are constrained with random variables); and constraint tightness p_2 = 0.8.

[Figure 1: Experimental results varying the switching cost (top: number of iterations to converge) and the horizon (bottom: runtime in ms).]

We first vary the switching cost c of the problem from 0 to 500. Figure 1 (top) shows the number of iterations it takes for the local search algorithms to converge from the initial solution. When c = 0, the initial solution found by LS-SDPOP is an optimal solution, since the initial solution already optimizes the utilities of the problem over all time steps ignoring switching costs. Thus, it requires 0 iterations to converge. For sufficiently large costs (c ≥ 100), the optimal solution is one where the values for each agent are the same across all time steps, since the cost of changing values is larger than the gain in utility. Thus, the number of iterations they take to converge is the same for all large switching costs. At intermediate cost values (0 < c < 100), they require an increasing number of iterations to converge. Finally, LS-RAND requires more iterations to converge than LS-SDPOP since it starts with poorer initial solutions.

We also vary the horizon h of the problem from 2 to 10 (in this experiment, we set the number of decision variables to 6 in order for the algorithms to scale to larger horizons). Figure 1 (bottom) shows the runtimes of all three algorithms. As expected, the runtimes increase when the horizon increases. When the horizon is sufficiently large (h ≥ 6), LS-SDPOP is faster than LS-RAND, indicating that the overhead of finding good initial solutions with S-DPOP is worth the savings in runtime to converge to the final solution.
Finally, we vary the number of agents |A| (and thus the number of decision variables) of the problem from 2 to 16. Table 1 tabulates the runtimes and the approximation ratio ρ for all three algorithms. The runtimes of LS-SDPOP without reusing information are shown in parentheses. C-DPOP times out when |A| ≥ 8. In general, the runtimes of C-DPOP are the largest, followed by the runtimes of LS-SDPOP and the runtimes of LS-RAND. The difference in runtimes increases with an increasing number of agents, indicating that the overhead to find good initial solutions with S-DPOP is not worth the savings in convergence runtime. As expected, the approximation ratio ρ of C-DPOP is the smallest, since it finds optimal solutions, while the ratios of the local search algorithms are of similar values, indicating that they converge to solutions of similar quality. Therefore, LS-SDPOP is preferred in problems with few agents but large horizons, and LS-RAND is preferred in problems with many agents but small horizons.

Table 1: Experimental Results Varying Number of Agents

|A|    C-DPOP time (ms)   C-DPOP ρ   LS-SDPOP time (ms)   LS-SDPOP ρ   LS-RAND time (ms)   LS-RAND ρ
2      223                1.001      197.5 (207.7)        1.003        203.7               1.019
4      489                1.000      255.7 (307.3)        1.009        273.4               1.037
6      5547               1.000      382.3 (456.3)        1.011        385.9               1.045
8      —                  —          739.2 (838.1)        1.001        556.0               1.034
12     —                  —          4821.6 (7091.1)      1.003        1092.9              1.031
16     —                  —          264897 (595245)      1.033        2203.0              1.015

Conclusions

Within real-world multi-agent applications, agents often act in dynamic environments. Thus, the Dynamic DCOP formulation is attractive for modeling such problems. Unfortunately, current research has focused on solving such problems reactively, thus discarding information on the evolution of the environment (i.e., possible future changes), which is often available in many applications. To cope with this limitation, we (i) introduce a Proactive Dynamic DCOP (PD-DCOP) model, which models the dynamism in Dynamic DCOPs; and (ii) develop an exact PD-DCOP algorithm that solves the problem proactively, as well as an approximate algorithm with quality guarantees that can scale to larger and more complex problems. Finally, in contrast to many experiments in the literature, we evaluate our algorithms on an actual distributed system. In the future, we will perform an extensive evaluation to compare the proposed PD-DCOP model against a reactive DCOP approach, and we will evaluate our algorithm within the D-DEVCSP domain. Finally, we will enrich the current D-DEVCSP model by considering factors such as changes in the price of electricity at different times (based on the distribution of energy consumption in the area) or limitations on total in-house energy consumption at different times. These factors can be elegantly modeled as random variables in our PD-DCOP model.

Acknowledgments

The team from NMSU is partially supported by NSF grants 1345232 and 1540168. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsoring organizations, agencies, or the U.S. government. Makoto Yokoo is partially supported by JSPS KAKENHI Grant Number 24220003.

References

Becker, R.; Zilberstein, S.; Lesser, V.; and Goldman, C. 2004. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research 22:423–455.
Bernstein, D.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819–840.
Dibangoye, J. S.; Amato, C.; Doniec, A.; and Charpillet, F. 2013. Producing efficient error-bounded solutions for transition independent decentralized MDPs. In Proceedings of AAMAS, 539–546.
Fargier, H.; Lang, J.; and Schiex, T. 1996. Mixed constraint satisfaction: A framework for decision problems under incomplete knowledge. In Proceedings of AAAI, 175–180.
Farinelli, A.; Rogers, A.; Petcu, A.; and Jennings, N. 2008. Decentralised coordination of low-power embedded devices using the Max-Sum algorithm. In Proceedings of AAMAS, 639–646.
Hamadi, Y.; Bessière, C.; and Quinqueton, J. 1998. Distributed intelligent backtracking. In Proceedings of ECAI, 219–223.
Hansen, E. A.; Bernstein, D. S.; and Zilberstein, S. 2004. Dynamic programming for partially observable stochastic games. In Proceedings of AAAI, 709–715.
Holland, A., and O'Sullivan, B. 2005. Weighted super solutions for constraint programs. In Proceedings of AAAI, 378–383.
Lass, R.; Sultanik, E.; and Regli, W. 2008. Dynamic distributed constraint reasoning. In Proceedings of AAAI, 1466–1469.
Maheswaran, R.; Tambe, M.; Bowring, E.; Pearce, J.; and Varakantham, P. 2004. Taking DCOP to the real world: Efficient complete solutions for distributed event scheduling. In Proceedings of AAMAS, 310–317.
Maheswaran, R. T.; Pearce, J. P.; and Tambe, M. 2006. A family of graphical-game-based algorithms for distributed constraint optimization problems. In Coordination of Large-Scale Multiagent Systems. Springer. 127–146.
Modi, P.; Shen, W.-M.; Tambe, M.; and Yokoo, M. 2005. ADOPT: Asynchronous distributed constraint optimization with quality guarantees. Artificial Intelligence 161(1–2):149–180.
Nguyen, D. T.; Yeoh, W.; Lau, H. C.; Zilberstein, S.; and Zhang, C. 2014. Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In Proceedings of AAAI, 1447–1455.
Oliehoek, F.; Spaan, M.; Amato, C.; and Whiteson, S. 2013. Incremental clustering and expansion for faster optimal planning in Dec-POMDPs. Journal of Artificial Intelligence Research 46:449–509.
Petcu, A., and Faltings, B. 2005a. A scalable method for multiagent constraint optimization. In Proceedings of IJCAI, 1413–1420.
Petcu, A., and Faltings, B. 2005b. Superstabilizing, fault-containing multiagent combinatorial optimization. In Proceedings of AAAI, 449–454.
Petcu, A., and Faltings, B. 2007. Optimal solution stability in dynamic, distributed constraint optimization. In Proceedings of IAT, 321–327.
Szer, D.; Charpillet, F.; and Zilberstein, S. 2005. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of UAI, 576–590.
Tarim, S. A.; Manandhar, S.; and Walsh, T. 2006. Stochastic constraint programming: A scenario-based approach. Constraints 11(1):53–80.
Ueda, S.; Iwasaki, A.; and Yokoo, M. 2010. Coalition structure generation based on distributed constraint optimization. In Proceedings of AAAI, 197–203.
Wallace, R., and Freuder, E. 1998. Stable solutions for dynamic constraint satisfaction problems. In Proceedings of CP, 447–461.
Walsh, T. 2002. Stochastic constraint programming. In Proceedings of ECAI, 111–115.
Yeoh, W., and Yokoo, M. 2012. Distributed problem solving. AI Magazine 33(3):53–65.
Zivan, R.; Yedidsion, H.; Okamoto, S.; Glinton, R.; and Sycara, K. 2015. Distributed constraint optimization for teams of mobile sensing agents. Autonomous Agents and Multi-Agent Systems 29(3):495–536.