Allocation Algorithms for Real-Time Systems as Applied to Battle Management

by Kin-Joe Sham

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, May 2002.

© Kin-Joe Sham, MMII. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, May 24, 2002

Certified by: Dr. Leslie P. Kaelbling, Professor of Computer Science and Engineering, MIT, Thesis Supervisor

Certified by: Dr. Michael E. Cleary, The Charles Stark Draper Laboratory, Inc., Technical Supervisor

Accepted by: Dr. Arthur C. Smith, Chairman, Department Committee on Graduate Students

Allocation Algorithms for Real-Time Systems as Applied to Battle Management

by Kin-Joe Sham

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2002, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering

Abstract

The ability to distribute the proper number of weapons and planes depending on the available information and resources is crucial to a successful air strike campaign. Current algorithms for this problem domain, such as Yost's hybrid approach and the Markov task decomposition (MTD) approach, model the given problem using Markov decision processes, but only in minimal detail for computational feasibility. This thesis extends the MTD approach to more closely match realistic situations. The new technique introduces different weapon types and incorporates constraints on the number of planes available and the number of weapons each plane can carry. Although it is impossible to prove that solutions produced by the modified MTD approach are close to optimal for large-scale problems, tests have shown that the modified MTD approach generates solutions with higher expected utility than other heuristics. Experiments were also conducted to determine the effect of varying each element in an allocation problem on the algorithm's overall computation time. The results demonstrated that only the off-line phase's running time increases significantly due to the extension. Thus, real-time distribution analysis is possible, since the on-line phase requires a negligible amount of time to execute.

Thesis Supervisor: Dr. Leslie P. Kaelbling
Title: Professor of Computer Science and Engineering, MIT

Technical Supervisor: Dr. Michael E. Cleary
Title: Principal Member of the Technical Staff, The Charles Stark Draper Laboratory, Inc.

Acknowledgements

I would like to thank my Draper Laboratory supervisor, Dr. Michael E. Cleary, for all his help and technical advice, which assisted me in completing my thesis. In addition, I am grateful to him for accepting me into the Draper Fellowship program and providing me with an interesting topic to research. I would also like to thank my thesis advisor, Professor Leslie P. Kaelbling, for the constant feedback and technical support that she offered. Without her expertise in the stochastic planning problem domain, my research would not have gone as smoothly as it did.
This thesis was prepared at The Charles Stark Draper Laboratory, Inc., under the Internal Research and Development Program (Account #18556). Publication of this thesis does not constitute approval by Draper Laboratory of the findings or conclusions contained herein. It is published for the exchange and stimulation of ideas.

Assignment

Draper Laboratory Report Number T-1428. In consideration for the research opportunity and permission to prepare my thesis by and at The Charles Stark Draper Laboratory, Inc., I hereby assign my copyright of the thesis to The Charles Stark Draper Laboratory, Inc., Cambridge, Massachusetts.

Contents

1 Introduction
1.1 Battle Management Scenario
1.2 General Statement of The Problem
1.3 Approaches
1.3.1 Linear Programming Approach
1.3.2 Partially Observable Markov Decision Process and Linear Programming Hybrid Approach
1.3.3 Markov Task Decomposition Approach with On-Line and Off-Line Phases
1.4 Experimental Design
1.5 Thesis Contributions
1.6 Thesis Organization

2 Technical Background
2.1 Markov Decision Process (MDP)
2.1.1 The MDP Model
2.1.2 Value Iteration Solution Method
2.2 Markov Task Decomposition (MTD)
2.2.1 Off-Line Value Table Calculations
2.2.2 On-Line Value Maximization

3 Markov Task Decomposition with Only a Global Constraint
3.1 Model Formulation
3.2 Modification to MTD Approach
3.3 Architecture Overview
3.4 Implementation Details
3.4.1 Damage State Model
3.4.2 Battle Management World
3.4.3 Off-Line Value Function Tables
3.4.4 On-Line Policy Mapper
3.5 Replication of Meuleau's Results

4 Markovian Task Decomposition with Multiple Constraints
4.1 New Additions to the Modified MTD Approach
4.2 Architecture Improvements
4.3 Implementation Details
4.3.1 Weapon Types
4.3.2 Available Planes and Weapons per Plane Constraints

5 Experimental Results
5.1 Experimental Approach
5.2 Modified MTD Algorithm vs Other Heuristic
5.3 Effect of Model Changes
5.3.1 Additional Targets
5.3.2 Additional Target Types
5.3.3 Multiple Damage States
5.3.4 Additional Weapons
5.3.5 Additional Weapon Types
5.3.6 Additional Planes

6 Conclusions and Future Research Areas
6.1 Conclusions
6.2 Potential Applications
6.3 Future Work
List of Figures

1-1 Yost's Decomposition Algorithm [9]
3-1 Markov Task Decomposition Approach
3-2 An instance of optimal policy for a single-target problem using Meuleau's approach [8] (left bars) and the MTD approach (right bars)
5-1 Comparison of the quality of policies generated by Modified MTD and a Greedy Strategy for a 100-target problem
5-2 Modified MTD's running time with a varying number of targets
5-3 Modified MTD's running time with a varying number of target types
5-4 Modified MTD's running time with a varying number of damage states
5-5 Modified MTD's running time with a varying number of weapons
5-6 Modified MTD's running time with a varying number of weapon types
5-7 Modified MTD's running time with a varying number of planes

List of Tables

5.1 A list of variables and their corresponding values for each experiment
5.2 A summary of the resulting trends seen in the computation time by varying the given variable

Chapter 1
Introduction

Finding efficient ways to distribute limited resources is a problem that can be found in many different domains. Although techniques such as linear programming have already been used to tackle general deterministic allocation problems, they have not been successfully applied to allocation problems where allotting resources to objects can change the state of each object according to its stochastic model. This type of problem appears in a range of areas, from distributing doctors within a hospital to running a successful business corporation. One of the places where it is a major problem, however, is in the military.

The ability to take appropriate actions depending on the available information and resources is crucial to the success of military campaigns. Typically, combat mission design will attempt to make good use of resources such as the limited number of fighting units and weapons. However, with so many different variables involved in a real battle, it is difficult to allocate the appropriate amount of resources and determine the best sequence of actions to maximize the enemy's damage while incurring the lowest associated costs. The work described herein is undertaken with the goal of making the process of allocating resources more effective by using more sophisticated mathematical models of battle environments. Following a statement of the problem in section 1.2, several allocation methods are described in detail in section 1.3.

1.1 Battle Management Scenario

The particular resource allocation problem in a combat scenario that is being examined can be described as follows. There are limited numbers of available weapons and strike aircraft with associated costs, which need to be assigned to attack a set of enemy targets within a finite time horizon that varies per target. Each target has two observable states, dead or alive. Furthermore, there is an a priori reward for a target being in each of the two states. At any given time interval t, a certain number of weapons and strike aircraft can be allocated to a target. The target's state in the next time interval t + 1 is determined by a probability function based only on the target's current state and the allocation given in time t.
In this model, each target is independent of the other targets; thus, changing the state of one target will not affect the outcome of other allocations.

1.2 General Statement of The Problem

The domain of the allocation problem described above has several important traits. First, there are fixed amounts of various resources (e.g., weapons and aircraft) and a finite set of objects (e.g., targets), where each object has a known number of states. A limited set of actions, each of which consumes probabilistic amounts of available resources, causes objects to change states. There is an a priori reward for having an object in each of its possible states. The total value of the solution at a given time step can be calculated by summing the expected rewards for each object. Time is divided into discrete intervals over a finite horizon, and actions can be applied to each object in each time interval. Since every action uses resources, the actions performed on each object must be constrained so that in total they consume less than the total available resources at every time interval and over the entire time horizon. A global constraint is defined as a restriction that is enforced over the entire time horizon (e.g., the limited amount of resources). An instantaneous constraint is another type of constraint that can also be incorporated into the problem. This constraint must be independently satisfied at each time interval (e.g., the limited number of planes and the holding capacity of each plane).

This problem domain is modelled with a Markov decision process (MDP) where actions change the states of objects probabilistically [7]. It is assumed that the objects are all independent of each other, implying that the Markov models are all independent. However, the overall problem cannot be solved without merging together the solution to each independent MDP, because the objects are still competing for the same pool of resources. For further discussion of Markov models, see section 2.1.

1.3 Approaches

A feasible solution to the resource allocation problem can be obtained using known heuristics [9] that determine reasonable weapon distributions. However, such methods do not give information on the effectiveness of the allocation. Using a feasible solution derived from a heuristic could be sufficient in certain applications, but if a better solution is used, additional resources could be saved for future missions. Currently, there are several different methods being researched to solve the problem in the battle management scenario. This section will describe three of the techniques being explored and demonstrate how each technique focuses only on part of the entire problem. The three are the linear programming (LP) approach, the off-line POMDP and LP hybrid approach created by Yost, and the Markov task decomposition approach established by Meuleau, et al.

1.3.1 Linear Programming Approach

Currently, the military applies methods such as linear programming (LP) [5] to allocate resources in weapon/plane allotment problems similar to the problem within the battle management scenario. Linear programming is a commonly used approach because it maximizes or minimizes linear functions with many variables to obtain the desired optimal results within a finite set of constraints.
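To make the deterministic LP formulation concrete, the sketch below poses a toy weapon-allotment problem with SciPy's linprog. The target values, the linearized "expected reward per weapon" coefficients, the single stockpile constraint, and the relaxation to fractional weapon counts are all assumptions made purely for illustration; they are not taken from the thesis or from any actual military model.

```python
# Hypothetical toy LP: choose x[i], the number of weapons assigned to target i
# (relaxed to a continuous variable), to maximize expected reward minus weapon cost.
from scipy.optimize import linprog

values = [90.0, 60.0, 40.0]              # assumed a priori reward for destroying each target
reward_per_weapon = [0.25, 0.30, 0.50]   # assumed linearized effectiveness coefficients
weapon_cost = 1.0
total_weapons = 10

# linprog minimizes, so negate the net gain per weapon on each target.
c = [-(v * e - weapon_cost) for v, e in zip(values, reward_per_weapon)]
A_ub = [[1.0, 1.0, 1.0]]                 # weapons used cannot exceed the stockpile
b_ub = [total_weapons]
bounds = [(0, None)] * len(values)       # allocations cannot be negative

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x)   # fractional allocation; a realistic model adds integrality and many more constraints
```

Because the objective is linear and deterministic, the solution simply pours the entire stockpile onto the target with the best net gain per weapon, which is exactly the kind of behavior that ignores damage states and miss probabilities.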
However, these deterministic techniques do not capture the probabilistic elements in the problem, since a deterministic resource allocation does not take into account the damage state of the enemy targets and the probability that the weapons applied to the targets could miss or fail to damage them sufficiently. Incorporating these probabilistic elements and the sequential decision making into the model causes the problem to become intractable for even simple problems [9]. For example, assume that each target can only be dead or alive and that a global state of the model is a combination of every target's status; then the total number of global states is

|S| = 2^|T|,

where |T| is the number of targets. Thus, with each target added to the model, the problem grows exponentially, because the number of states |S| increases by a factor of two for each new target. This assumes that total probabilities must be computed for all the possible combinations of alternatives to find the optimal solution to any probabilistic problem. Hence, even a reasonably small problem with a probabilistic model can take significant computation to solve using only linear programming [9].

1.3.2 Partially Observable Markov Decision Process and Linear Programming Hybrid Approach

Research has been done on ways to find the optimal solution for problems that require allocating resources among activities that either gather information on different objects or take actions in an attempt to change the state of these objects [6]. The model described here can be directly applied to the battle management scenario by viewing the problem as a weapon and sensor allocation problem where sensors can observe targets and weapons can cause damage to the targets. As mentioned in the previous section, using linear programming to find the optimal answer will make even a small-scale problem intractable, since every probabilistic combination must be examined. To further complicate the problem, it may be impossible to get complete information about the state of the objects. Another way to say this is that the state of the objects is only partially observable [7]. Such a problem can be modelled with one large partially observable Markov decision process (POMDP) [2]. However, a POMDP is significantly more complicated than an MDP and becomes intractable much more quickly than the linear programming approach. In order to overcome this challenge, small independent POMDPs can be used to model the behavior of individual targets.

Yost constructed a decomposition procedure, shown in Figure 1-1, that combines linear programming (LP) and POMDPs to determine the resource allocation strategy, in addition to finding the optimal policy for each target [9]. A policy in this case is a sequence of sensor actions and weapon usage.

Figure 1-1: Yost's Decomposition Algorithm [9]

The approach uses linear programming to compute the marginal costs of resources, which is the cost of using one additional resource. Each POMDP then takes these costs to determine an optimal policy for allocating resources to the targets of its type. The resulting policies are sent back to the master LP, which allocates these policies to targets and recalculates the marginal costs.
The iterative loop is terminated once the overall optimal policy is discovered [9]. Although an exact solution can be found, the entire calculation process is extremely time-consuming for a large-scale problem, so it is computed off-line. Therefore, this method cannot be used for real-time analysis or decisions.

1.3.3 Markov Task Decomposition Approach with On-Line and Off-Line Phases

To decrease the computation time for solving large-scale problems, Meuleau, et al. [8] created the Markov Task Decomposition approach to separate the solution process into two phases, allowing the more time-consuming calculations to occur separately from the actual algorithm for allocating weapons. The result is a relatively fast technique that can find near-optimal solutions for distributing weapons to maximize the targets' damage.

Due to the stochastic nature of the objects' states, Markov decision processes (MDPs) are used to model individual targets in this approach. However, unlike the hybrid approach, this technique assumes that all information about the targets is completely observable; thus, MDPs are used instead of POMDPs. An MDP is created to model a system and is defined by a set of states and a set of probabilistic rules that governs the object's state transitions. The model is applied to problems that describe a series of dependent trials, such as a sequence of events and decisions made based on the current status at each time step [3]. The MDP model and solution process are explained in more detail in section 2.1.

If a large-scale problem is modelled with one MDP, then it would be intractable due to the amount of computation needed in the traditional dynamic programming technique [1] to solve the problem. Like the hybrid approach in section 1.3.2, Meuleau's method is to decompose the large MDP into a set of smaller MDPs. However, instead of solving the set of smaller MDPs and constructing a global solution using LP all in a single computation process, Meuleau first calculates a set of value function tables off-line that correspond to each sub-MDP. These off-line values represent the expected utility or reward for every action taken in every state that the target could be in at each time interval, under various assumptions about how many weapons will be allocated to the target in the future. Using the value function tables, an on-line allocation algorithm then optimizes the distribution of weapons among all the targets in order to maximize the expected reward [8].

One advantage of this method is that it is able to analyze real-time situations, since the deterministic allocation algorithm that distributes weapons can be solved in a relatively short amount of time. This is because a significant portion of the calculation comes from the value function formulation of the stochastic processes, which is done off-line. The real-time analysis uses the value tables created off-line to combine the results of all the sub-MDPs in a deterministic manner. Furthermore, once the value function tables have been created for the different weapon/target combinations, they can be reused if a similar situation occurs again.

Despite its tractability, Meuleau's model of the battle management scenario is very simple. It did not take into account more realistic constraints, such as the fact that there is a limited number of available planes and that each plane can carry only a certain number of weapons. It also did not model more than one type of weapon.
Therefore, future work can extend this promising weapon allocation technique to model a more practical scenario, which is the focus of this thesis.

1.4 Experimental Design

To examine the tractability of the modified approach discussed in this thesis, experiments are conducted to test the effect of changing different aspects of the model. The following variables are increased incrementally in order to observe the change in computation time:

* Targets
* Target Types
* Damage States
* Weapons
* Weapon Types
* Planes

The solution time for the off-line phase will be compared to the time spent on the on-line phase in order to determine which phase needs to be improved upon in future work.

1.5 Thesis Contributions

The main goal of this thesis is to extend Meuleau's approach to more closely match real-life situations. The thesis makes the following contributions in the process of achieving the main objective:

* Introduction of different weapon types, which makes the model more realistic.
* Incorporation of new constraints, such as a limit on the number of bombs per plane and a cap on the number of planes available.
* Development of a new greedy algorithm that incorporates the above constraints.

1.6 Thesis Organization

The remainder of this thesis is organized as follows: Chapter 2 provides the technical background information, including a detailed description of a Markov decision process (MDP) and an explanation of Meuleau's allocation algorithm. Chapter 3 presents the infrastructure of the simulator and describes the implementation of the entire system. Chapter 4 examines the on-line and off-line components of the system, including the various new constraints that were implemented to make the model more realistic. The allocation algorithms used to handle the different constraints are also explained in detail. Chapter 5 describes the different scenarios used for testing the effect of varying different components of the model and the results of the simulations. Chapter 6 presents conclusions drawn from this research, discusses potential applications of the techniques developed in previous chapters, and presents ideas for future work that could extend the work reported here.

Chapter 2
Technical Background

Since this thesis describes several extensions that were made to Meuleau's approach [8], it is important to have a clear understanding of his original allocation algorithm. Section 2.1 describes the Markov decision process (MDP), which is the model used for the weapon/plane allocation problem. In addition, the section also explains the value iteration method used to solve such a process. Section 2.2 discusses Meuleau's method, which divides the solution into an off-line value iteration phase that computes the value tables using dynamic programming and an on-line value maximization phase that allocates resources. Both phases are important, since the extensions in this thesis require modifications of the value table formulation and the value maximization algorithm.

2.1 Markov Decision Process (MDP)

Choices are made based on information that has uncertainty associated with it. Assuming that events exhibit some degree of regularity, their uncertainty can be described by a probability model.
The Markov decision process (MDP) is a way of characterizing a series of dependent probabilistic trials that has a unique property in the way future events depend on past events: the state of each object in the next time interval depends only on its current state and the current action [3]. A solution to an MDP is called a policy, which indicates the action that needs to be performed depending on the current status of the MDP and the past events that occurred [7]. The MDP model will be described in more detail in section 2.1.1. Although an MDP can model a stochastic problem, a method is needed to solve it. The value iteration method is a common technique used for solving MDPs and will be explained in section 2.1.2.

2.1.1 The MDP Model

Consider an object that may be described at any time as being in one of a set of N mutually exclusive states. For the state of an object to change, an action must be performed from within a finite set of actions that are allowed to take place at the given state. When an action does take place, the object may undergo a state transition according to a set of probabilistic rules given by the transition matrix. In this formulation, the transition matrix is static, which means that the probabilistic rules are independent of time.

A distinctive feature of any Markov process is that the transition probabilities for a series of dependent state transitions satisfy the Markov condition. This condition states that the conditional probability of the object being in state s at time t depends only on the state of the object at time step t − 1 [3]. In other words, the present state of the object encompasses all the historical information relevant to the future behavior of an MDP. Thus, the transition probability for state s to change to state s' due to an action a can be represented by Pr(s'|a, s).

Formally, an MDP is defined by a 4-tuple <S, A, T, R>, where S is the set of discrete object states {s_1, s_2, ..., s_n} and A is the set of available actions {a_1, a_2, ..., a_m}. Note that the set of available actions is the collection of existing actions at the current state. The collection of probabilistic rules, also known as the transition matrix, is denoted as

T = { Pr(s'|a, s) : a ∈ A, s ∈ S, s' ∈ S },   (2.1)

which specifies the likelihood that the state of the object will change from s to s' due to action a. Finally, the set

R = { r_{s,s'} : s, s' ∈ S }   (2.2)

represents the reward space of the process. In particular, r_{s,s'} is the reward acquired when the object is in state s and transitions to state s'. Note that the reward could have a negative value, which would represent a positive cost.

2.1.2 Value Iteration Solution Method

An MDP, as specified in section 2.1.1, provides a mathematical model of a dynamic system, but not a method to determine the best course of actions for the system to take. A solution to such a model requires finding a policy, which is a rule for taking actions that maximizes the rewards received or minimizes the cost sustained. More formally, a policy for an MDP is a mapping from the current state to an action for every time period. Since the model only defines rewards for single transitions, but the aim is to maximize the expected reward attained at the end of the global finite time horizon H, a method for finding the optimal value of a state over time needs to be defined. The optimal value of a state, V(s, t), is the expected sum of the rewards r_t that will be gained over the time horizon if the transitions follow the optimal policy π,

V(s, 0) = max_π E[ Σ_{t=0}^{H} r_t ].   (2.3)

Since π is unknown, V(s, t) is determined by solving the simultaneous equations

V(s, t) = max_a [ Σ_{s'∈S} T(s, a, s')(V(s', t + 1) + R(s, s')) − C(a) ],  ∀ s ∈ S,   (2.4)

where T(·) is the transition probability represented in equation 2.1, R(·) is the transition reward described in equation 2.2, and C(a) is the cost of the action performed. This function states that the value of state s at time t is the expected value at the next time t + 1 minus the cost of taking the best available action when in state s, for all possible states [7].
The optimal value of a state, V(s, 1), is the expected sum of rewards rt at time t that will be gained over the time horizon if the transitions follow the optimal policy 7r, H V(s, 0):= max E Y: rt .(2.3) Since 7 is unknown, V(s, t) is determined by solving the simultaneous equations V(s, t) = max K [T(s, a, s')(V(s', t + 1) + R(s, '))] - C(a) , VsES (2.4) where To is the transition probability represented in equation 2.1. R() is the transition reward described in equation 2.2 and C(a) is the cost of the action performed. This function states that the value of state s at time t is the expected value at the 27 next time t + 1 minus the cost of taking the best available action when in state s, for all possible states [7]. For a given state s and a given time t, there exists an action that maximizes the value function. The optimal policy, therefore, is mapping of states to best available actions that satisfy the optimal value function in equation 2.4: 7r(s, t) = arg max [Z[T(S, a., s')(V(s', t + 1) + R(s, s'))] - C(a)j VsES, (2.5) where a can be any action that is available in state s at time t. The policy 7F is determined for every possible state over the entire time horizon H to formulate a table of actions that corresponds to the policy of the MDP. Equations 2.4 and 2.5 can be solved simultaneously so that when the maximum value at a given state and time is determined, the corresponding action that gave the maximum value is stored as the best available action in the policy. To find the optimal value function for a finite time-horizon H, the simultaneous equations can be solved using dynamic programming through an iterative algorithm called value iteration. The value of being in any state at the final time is assumed to be zero, V(s. H) = 0. VscS, (2.6) because the MDP only models events up to the time-horizon and not events at the time-horizon or beyond. Thus, using equation 2.6, as the base case, the equations can be traversed backwards from H to time zero to calculate the optimal values. For example, since T(s, a, s'), R(s, s') and C(a) are all known, equation 2.4, at time H-i1, can be reduced to, V(s, H - 1) = max 1 [T(s, a, s')R(s, s')] - C(a)j ,VSES where all the values at time H - 1 are computable. This iteration will occur until t = 0, thus forming a table of optimal values for every state at each time period [8]. 28 Equations 2.4 and 2.5 are observed to have a solution time that grows at an order of O(1 A -. S12 ). This illustrates that the computation time grows linearly with respect to the number of actions JAI and polynomially with respect to the number of states ISI, since each additional state introduces another equation in the simultaneous equations. However, when multiple objects are described by a single MDP, where a global state is a combination of each object's state, the computation time of the iteration value method will explode. This is due to the fact that the global state space grows exponentially with respect to the N number of objects such that O(IS N), as described in section 1.3.1. Because realistic descriptions of real-life problems involves many objects, the state space S can have countless number of global states, making computational tractability an issue. Therefore, new ways for solving MDPs have been explored such as the Markov Task Decomposition method. 
2.2 Markov Task Decomposition (MTD)

Although MDPs are valuable in modeling stochastic planning problems such as weapon/plane allocation to targets, the computational time of the value iteration method grows exponentially when dealing with multiple objects, making large state and action spaces intractable. One technique for solving a large MDP is called the Markov Task Decomposition method [8]. MTD separates a large MDP with concurrent processes into individual sub-MDPs and then merges the solutions to approximate a global solution. This dramatically reduces the solution time, since each sub-MDP has exponentially smaller state and action spaces and can be solved in a relatively short amount of time compared to the full MDP. The tradeoff for using MTD is that the computed policy is not optimal. However, it has been shown empirically that MTD produces policies for small problems that are close to the optimal results [8].

The difficulty with MTD lies in combining the policies from each sub-MDP effectively. The rewards and transition probabilities of each concurrent process are assumed to be independent. This assumes that an assignment of resources to one task does not affect the expected utility of another task. However, this is sometimes not the case, since global or instantaneous constraints can tie the sub-MDPs together, making the algorithm that combines the local policies a non-trivial task. The MTD approach sacrifices optimality for faster computation time.

MTD is performed in two phases. An off-line phase (§2.2.1) solves the sub-MDPs associated with individual tasks by calculating the respective optimal value functions and the associated optimal policies using the value iteration method. An on-line phase (§2.2.2) uses the optimal policies determined in the off-line phase to maximize the overall value based on the current state. The resources that maximize the overall value are then assigned to each task, which performs the best available action according to its local policy [8].

2.2.1 Off-line Value Table Calculations

The value of a state can vary, depending on the action that is performed at that state. To find the optimal value of a state, an action must be selected that maximizes the value at a given time with a given allocation by comparing the values derived from each action. When an action a is taken at time t, a reward r is received for transitioning from the current state s to another state s'. In addition to the reward, the value V(s', t + 1, m − a) must also be considered, for being in s' at time t + 1 with the remaining allocation after the action a is executed. The expected value of moving to the resulting state s' is the sum, over resulting states, of the transition probability from s to s' due to action a multiplied by the reward plus the expected value of s'. Since taking an action will typically incur a cost, the final value is the expected value for the resulting state minus the cost of the action, as shown in equation 2.4.

Now consider an MDP that is separated into N individual tasks, each having its own sub-MDP. For task i, where s_i is the current state of the task and m_i is the amount of resources allocated to the task, the Bellman equation for the optimal value function is

V_i(s_i, t, m_i) = max_{a ≤ m_i} [ Σ_{s'∈S_i} T_i(s_i, a, s')(V_i(s', t + 1, m_i − a) + R_i(s_i, s')) − c · a ],   (2.7)

where the action a must be less than or equal to the allocation m_i given to the target and c is the cost of using a single resource. This equation for a single task is derived from the more general value function, equation 2.4.
Since the process has a finite time horizon where 0 ≤ t ≤ H, all the values at the time horizon H are assumed to be zero, as explained in section 2.1.2. Using

V_i(s_i, H, m) = 0,  ∀ s_i ∈ S_i,  0 ≤ m ≤ M,  1 ≤ i ≤ N,   (2.8)

as the base case, one can use dynamic programming [5] to compute a sub-table of expected cumulative rewards for each task. The resulting |S| × H × M × N table, V, will consist of all the optimal values of each task for being in each state at each time when allocated 0 ≤ m ≤ M resources.

Besides calculating the off-line value table, the off-line phase also creates a table of actions. Each task has its own optimal policy that is determined when computing the value function. The best action at a given state, time, and allocation is the action that maximizes the value. When a maximum value for a set of variables is calculated, the corresponding action is extracted. Derived from equation 2.5, this can be formally stated as

a_i = arg max_{a ≤ m_i} [ Σ_{s'∈S_i} T_i(s_i, a, s')(V_i(s', t + 1, m_i − a) + R_i(s_i, s')) − c · a ],  ∀ s_i ∈ S_i.   (2.9)

Thus, if one knows the current state of task i and the resources m_i allocated to it at time t, then it is possible to find the best available action a_i associated with the optimal value V_i(s_i, t, m_i). Notice that the off-line phase only takes into account one resource type. In Chapter 4, multiple resource types will be considered.

2.2.2 Online Value Maximization

Using the table V constructed in the off-line phase, the on-line phase allocates the resources to each task in order to maximize the total value of the process. From the declared set of allocations and the optimal values of each task, a set of actions A = <a_1, a_2, ..., a_i, ...> is extracted from the action table created in the off-line phase. There are many ways to perform this value maximization; however, most algorithms are either sub-optimal or not computationally feasible. In this thesis, a greedy algorithm adopted from Meuleau's work is used for the on-line value maximization. Even though the algorithm does not promise an optimal solution, Meuleau gave empirical evidence that the solution is better than the solutions produced by other known heuristics [8].

The greedy algorithm is adapted for a weapon-to-target allocation problem where a constraint on the number of weapons is present. Given the current state s_i of every target, the number of weapons remaining M, and the time t, the objective is to choose m_i, the number of weapons assigned to each target i with state s_i, so that the sum of V_i(s_i, m_i, t) is maximized and Σ_i m_i ≤ M. To solve the value maximization problem, the marginal utility ΔV_i of assigning an additional weapon to target i, given that m_i weapons have already been allocated to it, is defined as

ΔV_i(s_i, m_i, t) = V_i(s_i, m_i + 1, t) − V_i(s_i, m_i, t).   (2.10)

Weapons are assigned one by one to the target that has the highest ΔV_i for the current allocation of m_i. Once a weapon is allotted to target i, an updated ΔV_i is computed for the next marginal utility with a new allocation of m_i + 1. The method terminates when all M weapons have been distributed or ΔV_i(s_i, m_i, t) < 0 for all i. This is a reasonable termination condition because the marginal utility function for a given target is monotonically decreasing: there is never a situation where adding one additional weapon decreases the expected utility but adding six weapons suddenly increases it. Because the process above is a gradient ascent on Σ_i V_i, it could be trapped in a local maximum, thus resulting in a sub-optimal allocation.
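The sketch below illustrates the greedy on-line value maximization just described. It assumes the off-line phase produced, for each target i, a table V[i] keyed by (state, allocation, time) that already contains the values of equation 2.7 up to the largest allocation of interest; that representation is an assumption for this illustration, not the thesis's code.

```python
# A sketch of the greedy weapon allocation of section 2.2.2 for a single weapon type.

def greedy_allocation(V, current_states, M, t):
    """Distribute up to M identical weapons among targets by marginal utility (eq. 2.10)."""
    alloc = [0] * len(current_states)
    while M > 0:
        best_gain, best_i = 0.0, None
        for i, s in enumerate(current_states):
            m = alloc[i]
            gain = V[i][(s, m + 1, t)] - V[i][(s, m, t)]   # marginal utility of one more weapon
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:        # every marginal utility is non-positive: stop allocating
            break
        alloc[best_i] += 1        # give one weapon to the target with the largest gain
        M -= 1
    return alloc
```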
Again, notice that the greedy algorithm in this case is only adapted for one weapon type and does not handle plane types or plane (instantaneous) constraints. The algorithm will be modified in Chapter 4 to handle these constraints.

Chapter 3
Markov Task Decomposition with Only a Global Constraint

The Markov Task Decomposition approach established by Meuleau, et al. solves weapon allocation problems with only a single global constraint. Section 3.1 describes a representation of the battle management scenario that can be solved using the MTD approach. In order to apply this technique to air strike campaigns with different types of targets, changes were made to the original approach while maintaining the single-global-constraint criterion. These adjustments are identified in section 3.2, and the system architecture of the modified MTD algorithm is described in section 3.3. Furthermore, section 3.4 explains the various modifications made in the simulator implementation to accommodate the changes in the MTD approach. The performance of the revised method is determined by comparing it to that of the original MTD technique, since the two approaches should arrive at the same solution when the weapon allocation problem has only one target type. The result of this experiment is discussed in section 3.5.

3.1 Model Formulation

The battle management scenario creates an environment for solving military campaign planning problems. Each target in the scenario is represented by an individual task modelled with an independent sub-MDP. The global constraint in the problem is the total number of weapons available for the campaign. To solve the stochastic planning system, the on-line phase merges the policies of each sub-MDP sub-optimally without violating the global constraint. The overall policy determines the actions taken on each target, which have inherently probabilistic outcomes. The result is completely observable, meaning that the information received after an action is taken is accurate. The model formulation here is very flexible; thus, additional amendments, such as the plane and instantaneous weapon constraints described in Chapter 4, can be made with little difficulty.

3.2 Modification to MTD Approach

Adjustments were made to Meuleau's approach to create a more realistic model of the battle management world under a global resource constraint. In the modified model, there are different target types, where each type has its own transition probability matrix, reward for being destroyed, and number of damage states. These features will be discussed in more detail in section 3.4. Furthermore, multiple targets in the environment can be represented by the same target type. The battle management scenario allows targets to have windows of vulnerability within the time horizon. The goal of the system created here is to produce a sub-optimal policy for distributing a limited number of weapons for attacking targets that appear within various time intervals within the finite time horizon.

3.3 Architecture Overview

The system constructed with the changes stated in section 3.2 can be divided into three distinctive sections: the on-line phase, the value function tables, and the simulator. The value function tables for each target type are computed off-line using the value iteration method during the initialization of the system's environment. This procedure requires the transition matrix, the reward for being destroyed, and the set of damage states that describe the sub-MDP of each target type.
Although the on-line value maximization looks up the value tables at every time step, the value computations are only carried out once during the off-line phase. Thus, the off-line section saves a significant amount of running time by not having to calculate a value in real time whenever the value is needed by the value maximization method.

After the table calculations are completed, the system begins its on-line simulation with the world simulator. The simulator presents the current states of all the targets, the current time step, and the remaining resources to the on-line value maximization algorithm. The algorithm then greedily maximizes the overall value to establish a good resource allocation. The results are passed into the policy mapper, which decides the action that needs to be taken on each target, in terms of the number of weapons allocated to it. The actions that are taken on the active targets are passed back to the world simulator. This allows the simulator to probabilistically determine the resulting state of each active target after performing the assigned action. The loop is repeated until the time step reaches the final time horizon. Figure 3-1 shows a flow diagram that illustrates the operation of the system.

Figure 3-1: Markov Task Decomposition Approach

3.4 Implementation Details

The implementation of the system described here requires several unique components. The simulator, called the battle management world, offers an environment for modelling existing targets. The means of interaction between the model and the real-time decisions made by the MTD approach are provided by the world simulator. For each target, a damage state model is used to determine the status of the target in a probabilistic manner. The most important parts of the system are the off-line value tables and the on-line policy mapper. These components are necessary for solving the large-scale stochastic planning problem. The sections below will describe each component in detail.

3.4.1 Damage State Model

The damage state model is used to describe the severity of the damage done to a target. The status of a target i is established as one of the states in the set of all possible damage states. This set of damage states is modelled mathematically by S_i = {u, d_1, ..., d_N}, where u is undamaged and d_1 to d_N are degrees of harm inflicted on the target, with d_1 being the least damage done and d_N being the most. If the target is damaged during its window of opportunity, the current state of the target is changed to the state representing the severity of the damage, and a certain amount of reward is received, depending on the amount of damage done. When a change of state occurs, it can only go from a lesser damage state to a higher damage state.

A single weapon can damage a target, causing a change in its damage status from state d_i to state d_j with a probability of P(d_i → d_j). The transition probability matrix for one weapon is given in the initialization file. From this, a "noisy-or" model is assumed for multiple weapons, in which a single strike is sufficient to inflict damage on the target.
More specifically, since individual weapons' strike probabilities are independent, when several weapons hit a target at one time step, they can all potentially cause different levels of damage to the target. It is assumed, however, that the damage status of the target is equivalent to the highest level of damage triggered by any of the multiple weapons that land on the target. For example, if all three weapons cause the damage status to jump from d_3 to d_5 independently, then the target's condition is equal to d_5. However, if instead one of the three weapons causes the damage status to jump from d_3 to d_6, then the target's current state is equal to d_6.

The transition probability matrix for multiple weapons can be derived from the single-weapon transition probability matrix that is given during the initialization of the system. Assuming P_1(d_i → d_j) is the probability of one weapon causing the damage state to jump from d_i to d_j, then

P_a(d_i → d_j) = 1 − [ Σ_{k=0}^{j−1} P_1(d_i → d_k) ]^a − Σ_{k=j+1}^{|S|} P_a(d_i → d_k).   (3.1)

This equation states that the sum of all the strike probabilities using a weapons, starting from the initial state d_i, is equal to 1. To find the probability P_a(d_i → d_j), the probabilities P_a(d_i → d_i) to P_a(d_i → d_{j−1}) and P_a(d_i → d_{j+1}) to P_a(d_i → d_{|S|}) should be subtracted from 1. The expression [ Σ_{k=0}^{j−1} P_1(d_i → d_k) ]^a is the combination of all single-weapon probabilities that cause the state to change from d_i to d_k for 0 ≤ k ≤ j − 1, which equals the sum of the probabilities P_a(d_i → d_i) to P_a(d_i → d_{j−1}). Subsequently, the expression Σ_{k=j+1}^{|S|} P_a(d_i → d_k) is the sum of P_a(d_i → d_{j+1}) to P_a(d_i → d_{|S|}).

To solve the equation above, dynamic programming is used. The base case is when j = |S|, which sets Σ_{k=j+1}^{|S|} P_a(d_i → d_k) to 0. This leaves

P_a(d_i → d_{|S|}) = 1 − [ Σ_{k=0}^{|S|−1} P_1(d_i → d_k) ]^a,   (3.2)

which can be calculated from the information given at initialization. Iterating downward from this base case allows one to compute the probabilities P_a for every possible state transition.
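The short sketch below computes the multiple-weapon transition probabilities of equations 3.1 and 3.2 by working downward from the highest damage state, exactly as the dynamic program above describes. The list-of-lists encoding of the single-weapon matrix is an assumed representation for illustration only.

```python
# A sketch of the "noisy-or" multiple-weapon transition probabilities (eqs. 3.1 and 3.2).
# P1[i][j] = probability that one weapon moves the target from damage state i to state j;
# states are indexed 0..|S|-1 with 0 = undamaged, and each row of P1 sums to 1.

def multi_weapon_probs(P1, a):
    """Return Pa, the transition matrix for dropping `a` weapons in a single time step."""
    n = len(P1)
    Pa = [[0.0] * n for _ in range(n)]
    for i in range(n):
        tail = 0.0                                    # running sum of Pa[i][k] for k > j
        for j in range(n - 1, i - 1, -1):             # highest state first (eq. 3.2 base case)
            below = sum(P1[i][k] for k in range(j))   # prob. one weapon lands below state j
            Pa[i][j] = 1.0 - below ** a - tail        # eq. 3.1
            tail += Pa[i][j]
    return Pa
```

Each row of the resulting matrix again sums to 1, which is the property the simulator exploits when it samples a resulting state in the next section.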
3.4.2 Battle Management World

The battle management world is initialized at the beginning of a simulation. The world contains a set of distinct target types, a set of targets, the available resources remaining, and the current time step. Targets with the same target type are indistinguishable from each other, since they are modelled by the same sub-MDP. However, the targets have different windows of availability, which are the periods of time when they are vulnerable to an attack. Thus, at a given time step, some targets can be attacked but others will not be available for an air strike. Each target type is initialized by computing its damage state model as shown in section 3.4.1. The target type also indicates the amount of reward received for causing a damage state change.

During the on-line simulation, the set of targets and the available resources are all updated at each time step. When the on-line phase of the MTD approach determines a set of actions to be taken on the active targets, it is passed back into the simulator to update the world. To find the resulting state for a given target, the vector of probabilities for transitioning from the current state d_i after taking an action a_i is extracted:

V_p = < P_a(d_i → d_i), ..., P_a(d_i → d_maxstate) >.

Notice that the vector components add up to 1, as stated in the previous section. A random number generator picks a number between 0 and 1. If the number is between Σ_{k<j} P_a(d_i → d_k) and Σ_{k≤j} P_a(d_i → d_k), then the resulting state for the given target is d_j.

For example, assume that there are three states and the target is currently in damage state d_1; then the vector of transition probabilities might look like V_p = <0, 0.2, 0.8>. There is a zero probability of going back to the undamaged state, a 0.2 probability of staying in the same state d_1, and a 0.8 probability of the target being completely destroyed (d_2). If the random number generator picks a number between 0 and 0.2, then the resulting state is damage state d_1. However, if the generator picks a number between 0.2 and 1.0, then the final state is damage state d_2 (completely destroyed).

Besides updating the current states of the targets, the remaining resources also need to be computed. The new amount of resources is

M' = M − Σ_i a_i,   (3.3)

where M is the previous amount of resources and M' is the new amount of available resources after taking the set of actions provided by the policy mapper.

3.4.3 Off-Line Value Function Tables

In this planning problem, the computation of the value function tables can be noticeably simplified compared to the generic off-line value calculation discussed in section 2.2.1. Since every target of the same target type is indistinguishable from the others, only a single table is computed for each target type and not for each target. The only difference between the targets is their window of opportunity. For a target i whose window of availability is from t to t + k, the value function for that target corresponds to the value function of its target type from H − k to H, where H is the finite time horizon. Another simplification that can be made is that the value is 0 for every target type whose status is "destroyed". Thus, it is unnecessary to compute those values.

From the value tables, it can be shown that V(s_i, m_i, t) increases monotonically with m until it plateaus at m*_i. This is the point where the marginal utility of allocating one more weapon becomes zero, because the marginal utility of using an additional weapon is negative. This implies that even when a weapon is allocated at this point, it will never be used, because the cost of the weapon outweighs the benefit of using it. Since it is known that the values beyond m*_i remain constant, the off-line value calculations only need to be evaluated up to that point. The values past that allocation can simply be set to the value at m*_i, instead of being calculated with the value iteration method. This is another way to significantly decrease the computation time of the off-line phase.

3.4.4 On-Line Policy Mapper

The policy mapper implements the value maximization algorithm specified in section 2.2.2 and searches for a set of actions that maximizes the expected utility given a set of allocations. Using a greedy strategy, the on-line phase allocates weapons to each active target in order to maximize Σ_i V(s_i, m_i, t). The weapon allocation is then passed to the search command, which looks up the action a_i to be taken on each target i using the current state, the allocation m_i, and the time as indices. Each action a_i maximizes V given the current state, the allocation, and the time. The array of actions A = <a_1, ..., a_n> is executed in the simulator, which stochastically determines the resulting states S = <s_1, ..., s_n>. Note that it is never optimal to drop all the allocated weapons at once, which will be explained in the next section.
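Putting the pieces of sections 3.4.2–3.4.4 together, the sketch below walks through one time step of the on-line loop: greedily allocate the remaining weapons (reusing the greedy_allocation sketch from Chapter 2), look up each active target's action in the off-line action table, sample the resulting damage state, and update the stockpile. The table layouts and target fields are assumed for illustration only, not taken from the thesis's implementation.

```python
# A sketch of one on-line time step: allocation, policy lookup, stochastic world update.
import random

def simulate_step(targets, value_tables, action_tables, weapons_left, t):
    states = [tgt["state"] for tgt in targets]
    alloc = greedy_allocation(value_tables, states, weapons_left, t)   # section 2.2.2 sketch
    used = 0
    for i, tgt in enumerate(targets):
        a = action_tables[i][(tgt["state"], alloc[i], t)]   # on-line policy mapper lookup
        probs = tgt["Pa"][a][tgt["state"]]                  # a-weapon transition row (sec. 3.4.1)
        r, cumulative = random.random(), 0.0
        for j, p in enumerate(probs):                       # sample the resulting damage state
            cumulative += p
            if r < cumulative:
                tgt["state"] = j
                break
        used += a
    return weapons_left - used                              # remaining resources (eq. 3.3)
```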
3.5 Replication of Meuleau's Results

In order to demonstrate that the MTD approach described in this chapter arrives at the same quality of policy as the original MTD approach, the same problem is given to the two models and the policies of the two approaches are compared. The problem consists of a single target that has a window of opportunity spanning the entire time horizon, where H = 10. In this case, the target can only be in two states, either undamaged or damaged. Furthermore, the probability of hitting the target is p_i = 0.25 and the reward received for damaging the target is r_i = 90. Only one weapon type is available and the cost of using the weapon is c = 1. With the given scenario, the same number of weapons is sent at each time t in both approaches, assuming the target is still undamaged. Thus, the result verifies that the MTD approach described here produces the same quality of policy as Meuleau's approach.

In Figure 3-2, the two approaches are shown to deliver an increasing number of weapons at each step as the window of opportunity comes to an end. Since the probability of damaging the target only depends on the number of weapons used at the time and not the order in which the weapons are used, the number of weapons sent at each time step increases, because there is less time left to damage the target if the weapons sent at the current time step miss.

Figure 3-2: An instance of the optimal policy for the single-target problem using Meuleau's approach [8] (left bars) and the modified MTD approach (right bars)

Chapter 4
Markovian Task Decomposition with Multiple Constraints

In order to construct a more realistic model, several features and constraints are added to the MTD described in Chapter 3. Section 4.1 lists the modifications and explains how each one relates to a limitation seen in realistic situations. Minor changes are also made in the architecture of the simulator so that it can support the additional features. Furthermore, the algorithms used in the on-line and off-line phases of the MTD approach are extended from the previous chapter to include the new constraints. These changes are discussed in section 4.2 and section 4.3, respectively.

4.1 New Additions to the Modified MTD Approach

Several new features are introduced to more realistically model air strike campaigns. The ability to have different weapon types is included in the modified MTD approach to model the various weapons that could be used in an air strike. Each weapon type has different damage capabilities depending on the target type it hits. This concept is discussed in further detail in section 4.3.1. Another attribute added is that only a limited number of weapons can be delivered by one plane to attack a single target on a given time step. Specifically, there is a finite supply of planes, where each plane has a defined weapon capacity. Weapons are distributed among the targets by plane loads. Once all the planes are assigned to the different targets at a given time, no extra weapon can be used for a strike at that instant, even if it could increase the expected utility. The goal of the system created here is to produce a near-optimal policy for distributing a limited number of weapons, carried by a specified number of planes, for attacking targets that appear within the finite time horizon during their windows of vulnerability.
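Before describing the architectural changes, the hypothetical data structures below summarize the new model elements of this section: weapon types with their own stockpiles and costs, a plane fleet with a per-step limit and a carrying capacity, and target types whose effectiveness now depends on the weapon type. The field names are invented for this sketch and do not come from the thesis's implementation.

```python
# Hypothetical containers for the extended model of section 4.1 (names are illustrative only).
from dataclasses import dataclass
from typing import Dict

@dataclass
class WeaponType:
    name: str
    stockpile: int          # M_j: total weapons of this type available over the campaign
    cost: float             # cost incurred each time one weapon of this type is used

@dataclass
class PlaneFleet:
    planes_per_step: int    # Q: planes available at each time step
    capacity: int           # maximum number of weapons a single plane can carry

@dataclass
class TargetType:
    num_damage_states: int              # size of the damage state model
    rewards: Dict[tuple, float]         # reward for each damage-state transition
    transitions: Dict[str, list]        # single-weapon transition matrix per weapon type
```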
4.2 Architecture Improvements

The basic components of the system designed here are similar to the architecture described in Chapter 3. However, minor changes in the simulator, the on-line phase, and the off-line phase are made in order to adapt the MTD model to the new scenario. Since the new model introduces multiple weapon types, the total number of weapons of weapon type j is denoted M_j. In this model, the simulator still assigns a weapon allocation to each target at every time step. The only difference is that a weapon allocation for target i is now comprised of an array of weapons W_i,

W_i = < w_1, w_2, ..., w_K >,  0 ≤ w_j ≤ M_j for 1 ≤ j ≤ K,   (4.1)

where K is the number of weapon types and M_j is the amount of remaining weapons of type j. The set of weapons that is dropped on a target is defined as the action A_i taken on target i, which is characterized by

A_i = < a_1, a_2, ..., a_K >,  a_j ≤ w_j for 1 ≤ j ≤ K,   (4.2)

where w_j is the number of weapons of type j allocated to target i and a_j is the number of weapons being dropped. Note that a weapon allocation W_i is the number of weapons assigned to a target over the entire time horizon H, but an action assignment A_i is the number of weapons that is used on a target at a given time step. For example, if there is a total of 5 weapons of type I and 6 weapons of type II, then a possible weapon allocation for a given target i is W_i = <2, 3>. When the action A_i = <1, 2> is taken on the target, the remaining weapon allocation becomes W_i = <1, 1>. At the next time step, another valid action A_i = <1, 1> can be taken, which will use up all the weapons allocated to target i.

Modifications in the off-line phase enhance the value iteration method so that it can calculate a value for each of the new weapon allocations denoted in equation 4.1. This procedure requires a different transition matrix for each target-type/weapon-type combination and a separate cost for each weapon type. The computation time and the size of the value tables will, therefore, depend largely on the number of weapon types. This extension will be described in detail in section 4.3.1.

The on-line phase is altered by considering the number of planes available. In addition to assigning weapons of different types to each target, planes are now allocated at every time step. The number of weapons used on a target at any given time must not exceed the total capacity of the planes assigned to that target (§4.3.2). These constraints are added in the value maximization algorithm by making a loop of allocating and deallocating weapons until the constraints are satisfied. The policy mapper, which extracts the set of actions performed on the targets, is almost identical to the one designed in Chapter 3. The only difference is that the policy mapper will determine an action array of different weapon types to be used on a target instead of assigning weapons of only one type.

4.3 Implementation Details

4.3.1 Weapon Types

The limited supply of weapons that is used to attack the targets can be of different types. More specifically, a weapon type is defined by several different attributes, such as the total available number of weapons of that kind and a distinct cost for attacking with it. Furthermore, a weapon type has different effectiveness on various targets. This is described by a transition matrix

T_i = { T_i(s_i, A_i, s') : ∀ s_i, s' ∈ S_i,  ∀ a_j ≤ w_j for 1 ≤ j ≤ K },   (4.3)
4.3 Implementation Details

4.3.1 Weapon Types

The limited supply of weapons that is used to attack the targets can be of different types. More specifically, a weapon type is defined by several attributes, such as the total available number of weapons of this kind and a distinct cost for attacking with it. Furthermore, a weapon type has a different effectiveness on different targets. This is described by a transition matrix,

T_i = T_i(s_i, A_i, s'_i), \qquad \forall s_i, s'_i \in S, \ \forall a_j \le w_j \ \text{for} \ 1 \le j \le K,   (4.3)

for each target type i, giving the probability of transitioning from state s_i to state s'_i when the array of actions A_i is performed on the target. During the on-line simulation, a weapon allocation and an action assignment for a given target contain the mixture of different weapon types that produces the highest expected utility. Thus, unlike the system described in Chapter 3, where allocations are represented solely by a number, weapon assignments are now arrays of weapons that can include many types.

Off-Line Changes

The value function tables computed off-line are modified to incorporate the idea that a weapon allocation can consist of different weapon types. In this model formulation, the off-line value calculations still use the basic value iteration method as described in section 3.4.3, but the Bellman equation shown in equation 2.7 is extended to represent multiple weapon types. Similar to the previous model, every target of the same target type is still identical except for its window of availability within the finite time horizon. Another resemblance to the previous model is that the value is 0 for every target type when it is completely destroyed. The optimal value of state s_i for target type i can be determined for each array of assigned weapons, W_i. With this in mind, the Bellman equation for computing the optimal value function of target type i is defined as

V_i(s_i, t, W_i) = \max_{a_j \le w_j \ \text{for} \ 1 \le j \le K} \Bigl[ \sum_{s'_i \in S} T_i(s_i, A_i, s'_i) \bigl( V_i(s'_i, t + 1, W_i - A_i) + R_i(s_i, s'_i) \bigr) - \sum_{j=1}^{K} a_j c_j \Bigr]   (4.4)

for 0 \le t < H, where c_j is the cost of using weapon type j and W_i - A_i = \langle w_1 - a_1, w_2 - a_2, \ldots, w_K - a_K \rangle. Using dynamic programming, the value iteration method solves the Bellman equation recursively for each target type i as described in section 2.1.2. The action assignment A_i that maximizes V_i() is considered the best available action that can be taken in that situation. This is similar to equation 2.9 except that each action is now an array of actions as defined in equation 4.2. The sizes of the final action table and value table depend on the number of weapon types and the number of weapons of each type. Each additional weapon type adds an extra dimension to the table; thus, the size is |S| \times H \times \prod_{j=1}^{K} M_j.

On-Line Changes

The on-line value maximization has to accommodate the multiple weapon types that are introduced in this model. In particular, the greedy algorithm used to maximize the global value has to take into account that once a weapon type is used up, a weapon of that type cannot be allocated to another target. The marginal utility for using an additional weapon of type j on target i is now defined as

\Delta V_{i,j}(s_i, W_i, t) = V_i(s_i, W_i', t) - V_i(s_i, W_i, t), \qquad \text{where } W_i' = \langle w_1, \ldots, w_j + 1, \ldots, w_K \rangle.   (4.5)

\Delta V is initially calculated for each weapon-type/target combination. Weapons are then assigned one by one to the target with the highest \Delta V for the current set of allocations. When a weapon of type j is allocated to a target i, \Delta V_{i,j}(s_i, W_i, t) is updated to \Delta V_{i,j}(s_i, W_i', t). Once all the weapons of the same type are assigned, the \Delta V values for using another weapon of that type are set to -1 for every target. This is done to establish a negative marginal utility for assigning an extra weapon so that weapons of that type will not be allocated to another target. The greedy algorithm terminates when \Delta V_{i,j} < 0 for all i and j. Adding the feature of multiple weapon types increases the on-line computation time to O(M_1 \cdots M_K), where M_j is the number of weapons available of type j. This is because the greedy algorithm traverses through all the weapons and assigns them to the targets one at a time.
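The greedy on-line maximization just described can be sketched as follows. The function below assumes the off-line tables are exposed through a callable value(i, W, t), standing in for a lookup of V_i(s_i, W, t) with the target's current state folded in; that callable, the function name, and the termination on non-positive marginal utility are assumptions of this sketch rather than the simulator's exact code.

```python
# Sketch of the greedy on-line value maximization with multiple weapon types
# (section 4.3.1).  `value(i, W, t)` stands in for a lookup into the off-line
# value tables; all names here are assumptions of the sketch.

def greedy_allocate(num_targets, supply, value, t):
    """Assign weapons one at a time to the target with the highest marginal
    utility, never exceeding the remaining supply M_j of each weapon type."""
    K = len(supply)
    remaining = list(supply)                          # weapons of each type left
    alloc = [[0] * K for _ in range(num_targets)]     # W_i for each target

    def marginal(i, j):
        # Equation 4.5: gain from one more weapon of type j on target i,
        # or a negative sentinel once type j is exhausted.
        if remaining[j] == 0:
            return -1.0
        plus_one = alloc[i][:]
        plus_one[j] += 1
        return value(i, plus_one, t) - value(i, alloc[i], t)

    while True:
        gain, i, j = max((marginal(i, j), i, j)
                         for i in range(num_targets) for j in range(K))
        if gain <= 0:                 # stop when no positive marginal utility remains
            break
        alloc[i][j] += 1
        remaining[j] -= 1
    return alloc
```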
4.3.2 Available Planes and Weapons per Plane Constraints

For a more realistic model of an air strike, two new constraints are added to the model. Planes are introduced to restrict the number of weapons that can be used on the targets. A plane adds a fixed cost for transporting weapons to a target and is defined by a limit on the weapons it can carry. A limited number of planes is available at each time step for weapon deliveries. Furthermore, a plane can only hold a finite number of weapons per time step. Because the value table formulation is independent of the number of planes, only the on-line phase is modified by adding planes.

The on-line value maximization begins by determining the "best" array of weapon allocations, W_i, with the greedy algorithm described in section 4.3.1. An assignment of actions, A_i, is established by searching the table of actions for an assignment corresponding to the value V_i and the array of weapon allocations W_i. A sufficient number of planes, q_i, is then assigned to each target to carry all the weapons that will be used on that target. However, if the total number of planes assigned to all the targets exceeds the number of planes available, Q, then the deallocation-reallocation process is initiated.

To satisfy the plane constraint, planes are deallocated greedily until \sum_i q_i \le Q. A plane is deallocated if it causes the least change in the overall value at that time step. When a plane is deallocated from a target i, the weapons it holds are reallocated to different targets at the current time, or to the same target but forced to be used at a later time, in order to satisfy the current plane constraint. The marginal utility of target i for reallocating weapons of type j is

\Delta V_{i,j}(s_i, W_i, t) = V_i(s_i, W_i'', t + 1) - V_i(s_i, W_i', t + 1),

where W_i' = \langle w_1, \ldots, w_j - a_j, \ldots, w_K \rangle, W_i'' = \langle w_1, \ldots, w_j - a_j + 1, \ldots, w_K \rangle, and a_j is the number of weapons of type j carried by one deallocated plane. After recalculating the marginal utility for all the targets and weapon types, the greedy algorithm reassigns the deallocated weapons and distributes the planes accordingly to deliver the weapon allocations to the targets. If the plane constraint is violated again, the process is reiterated until the constraint is satisfied. Note that when a plane is deallocated from a target, the target is added to a list that contains all the targets that will only be assigned weapons for later use during the reallocation process. This ensures that the process will eventually reach a point where \sum_i q_i \le Q, since no deallocated target can ever be reassigned weapons for current use, which prevents additional planes from being allocated to targets that have already had planes deallocated.
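A simplified sketch of the plane-constraint check appears below. It covers only the deallocation half of the process: planes of capacity cap are counted per target and removed greedily until at most Q are in use, with the removed plane-loads collected for the reallocation step handled by the greedy routine of section 4.3.1. The loss() callback, which estimates the change in overall value from removing one plane-load from a target, and all other names are assumptions of the sketch.

```python
# Simplified sketch of the plane constraint from section 4.3.2.  Only the
# greedy deallocation loop is shown; the freed weapons would then be handed
# back to the greedy allocator for reassignment.  All names are assumptions.

import math

def enforce_plane_constraint(actions, cap, Q, loss):
    """actions[i] = number of weapons dropped on target i at this time step.
    loss(i) estimates the drop in overall value from removing one plane-load
    (up to `cap` weapons) from target i."""
    planes = [math.ceil(a / cap) for a in actions]          # q_i per target
    deallocated = []                                        # weapons to reassign later
    while sum(planes) > Q:
        # Deallocate the plane whose removal changes the overall value least.
        candidates = [i for i, q in enumerate(planes) if q > 0]
        i = min(candidates, key=loss)
        removed = min(cap, actions[i])
        actions[i] -= removed
        planes[i] -= 1
        deallocated.append((i, removed))                    # marked for later use only
    return actions, planes, deallocated
```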
Chapter 5

Experimental Results

In order to validate the modified Markov task decomposition approach, two experiments were conducted. First, the quality of solutions produced by the technique was compared to the solutions from another heuristic (§5.2). Second, the effect of changing different variables within the modified MTD approach (MMTD) on computation time was examined to establish the upper bound for the algorithm's running time. The results of changing individual elements of the model are given in detail in section 5.3.

5.1 Experimental Approach

The same conditions were held constant throughout each experiment to prevent any deviations due to sources other than the variables in the model that were intentionally varied. A 1.2 GHz Pentium III computer with 256 MB of RAM was used to run all the experiments to ensure that the computation time for each trial was not influenced by differences in speed and memory among computers. No other programs were running simultaneously, so that the entire processor was dedicated to the experiment modulo stochastic effects from the OS (Windows 2000). Besides ensuring that the external factors were constant in each experiment, the problem given to the modified MTD algorithm was also constant except for the variables in the model that were being observed.

Variable            §5.2     §5.3.1        §5.3.2   §5.3.3   §5.3.4   §5.3.5   §5.3.6
Targets             100      1-200, 1-50   200      10       10       10       10
Target Types        1        1             1-50     1        1        1        1
Damage States       2        2             2        1-30     2        2        2
Weapons             1-800    300, 50       30       30       1-100    5 each   30
Weapon Types        1        1             1        1        1        1-5      1
Planes              200      200           30       30       30       30       1-50
Time Horizon (H)    10       10            10       10       10       10       10
p_i                 0.2      0.2           0.2      0.2      0.2      0.2      0.2
r_i                 100      100           100      100      100      100      100
Cost                1        1             1        1        1        1        1
Capacity            2        2             2        2        2        2        2

Table 5.1: A list of variables and their corresponding values for each experiment.

The standard scenario consisted of 10 targets of the same target type, where each target can be in one of two possible states. The targets' windows of availability were randomly generated so that each target is vulnerable for at least 1 time step within the finite time horizon H. In addition, 30 weapons of a single weapon type were available for attacking, with a p_i = 0.2 probability of hitting target i. A reward r_i = 100 was received for actually damaging target i. To ensure that the deallocation-reallocation process did not occur unless otherwise specified, 30 planes were available for allocation, each with a capacity of carrying at most 2 weapons. Table 5.1 shows the values used for each experiment to guarantee that the change in computation time was only due to the variables that were being observed, since all other elements were kept constant.

5.2 Modified MTD Algorithm vs Other Heuristic

Because the process of calculating the optimal policy becomes intractable for a large-scale allocation problem, an optimal policy cannot be used as a reference point in validating the performance of the MMTD approach. Instead of comparing the quality of solutions produced by MMTD to the optimal, they were compared to the solutions of other heuristics. In this experiment, all the variables in the standard scenario were kept the same except that the number of weapons was varied and the number of targets was raised to 100, as stated in Table 5.1. Each trial had a different number of weapons available, covering a range from 1 to 800 weapons. The problem was solved twice: once with the modified MTD approach and a second time with a greedy policy that applies the action with the highest expected immediate reward [8]. The overall expected utility was determined for the two different approaches in each trial.
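For reference, the kind of one-step greedy baseline used in this comparison can be sketched as below: each weapon is given to whichever target offers the largest increase in expected immediate reward, with no look-ahead to later time steps. This is an illustration of the idea only, assuming independent per-weapon hit probabilities; it is not the exact heuristic implementation of [8] or of the experiment.

```python
# Illustrative one-step greedy baseline: maximize expected immediate reward
# only, ignoring future time steps.  Names and the independence assumption
# are choices made for this sketch.

def immediate_reward(i, weapons_on_i, state, p, r, cost):
    """Expected immediate reward of dropping `weapons_on_i` weapons on target i."""
    if state[i] == "damaged":
        return 0.0
    p_hit = 1.0 - (1.0 - p) ** weapons_on_i
    return p_hit * r - weapons_on_i * cost

def greedy_immediate_policy(num_targets, weapons, state, p, r, cost):
    plan = [0] * num_targets
    for _ in range(weapons):
        # Marginal immediate gain of one more weapon on each target.
        gains = [immediate_reward(i, plan[i] + 1, state, p, r, cost)
                 - immediate_reward(i, plan[i], state, p, r, cost)
                 for i in range(num_targets)]
        best = max(range(num_targets), key=lambda i: gains[i])
        if gains[best] <= 0:
            break
        plan[best] += 1
    return plan
```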
Figure 5-1: Comparison of the quality of policies generated by the modified MTD approach and the greedy strategy for a 100-target problem.

Figure 5-1 shows that the performance of the modified MTD approach was much better than that of the greedy approach. From the graph, the expected utility of the modified MTD approach leveled off at a maximum value of approximately 1600 after 200 weapons were used. Using the same number of weapons, the greedy policy, however, only yielded an expected utility of about 600, which is significantly lower than the expected utility produced by MMTD. Therefore, this provides evidence that the modified MTD approach produces higher quality solutions compared to at least one simple heuristic.

5.3 Effect of Model Changes

The computation time of the modified MTD approach varies depending on the size of the problem that is being solved. However, each variable within the problem does not affect the solution time in the same way. Adjusting some variables will increase the amount of time spent in the off-line phase, while changing other variables will lengthen the computation time in the on-line phase. Furthermore, the order in which the computation time grows differs depending on the variable that is being varied. In the experiments described in the following sections, each element was varied over a specified range while keeping the other variables constant. A set of scenarios was solved using the modified MTD approach, where each scenario had a different amount of the element being adjusted. In these experiments, a trial was defined as the process of solving each scenario ten times. The times spent in the on-line and off-line phases were averaged over the ten runs in order to ensure that the stochastic nature of the approach had negligible effects on the computation time. These experiments showed how the running time grew according to the incremental change of each element, as summarized in Table 5.2.

Test Variable     Effect
Targets           Linear
Target Types      Linear
Damage States     Quadratic
Weapons           Linear
Weapon Types      Exponential
Planes            Constant

Table 5.2: A summary of the resulting trends seen in the computation time by varying the given variable.

5.3.1 Additional Targets

To examine the effect of additional targets on the computation time for solving a weapon/plane allocation problem, this experiment was conducted twice with a different number of weapons available each time. In the first experiment, the number of targets ranged from 1 to 200. The number of weapons available was set at 300 so that there were potentially enough weapons to attack every target. There were also 200 planes, each with a maximum capacity of carrying two weapons, to guarantee that the deallocation-reallocation process was not activated in this experiment. All the other variables had the values stated in Table 5.1. In the second experiment, the number of targets only ranged from 1 to 50, with the number of weapons set at 50. All the other variables had the values stated in Table 5.1, as in the first set of trials.

As shown in Figure 5-2, the computation time of the off-line calculation did not grow as the number of targets increased because it only increases with the number of target types. The on-line phase's running time, however, grew linearly relative to the number of targets. This was due to the fact that the value maximization algorithm traverses once through every target to allocate each weapon (§4.3.1). Thus, as the number of targets increased, the time it took to assign all the weapons also lengthened. Note that the running time for the off-line phase was significantly longer than the running time for the on-line phase as the number of weapons increased.
As shown in Figure 5-2(b), the total running time was approximately constant relative to the number of targets when the number of weapons available was set at 300, since the linear increase in the computation time of the on-line phase is negligible compared to the time it takes for the off-line phase computation.

Figure 5-2: Modified MTD's running time with a varying number of targets: (a) with 50 weapons available; (b) with 300 weapons available.

5.3.2 Additional Target Types

The effect of varying the number of target types on the MMTD's running time was analyzed by solving weapon/plane allocation problems with a range of 1 to 50 different target types. All the target types had, however, identical representations to ensure that the time taken to calculate the off-line value tables did not fluctuate between different target types. In order to increase the likelihood of having at least one target of each type by random assignment, 200 targets were introduced into the model. The other variables in the model are shown in Table 5.1.

Figure 5-3: Modified MTD's running time with a varying number of target types.

As shown in Figure 5-3, the running time of the on-line phase stayed constant as the number of target types changed. However, the off-line phase's computation time was linearly proportional to the number of target types, because a value table was created for each type. Since the computation time of the off-line phase took up a large portion of the total time used, the total running time of the MMTD increased linearly relative to the number of target types.

5.3.3 Multiple Damage States

The number of damage states that a target can be in influences the running time of the modified MTD approach. In order to study this effect, the variables were set to the values stated in Table 5.1. Note that the number of different damage states was varied from 1 to 30. Figure 5-4 illustrates that the significant portion of the total running time was due to the off-line phase. Although in theory the solution time of the off-line phase grows at O(|S|^2) (§2.1.2), where |S| is the number of states, the graph shows that the computation time initially increases at a slower rate than expected relative to the number of damage states. However, the running time eventually increases quadratically as predicted. This is because several simplifications were made in the value table formulation, as stated in section 3.4.3, such as setting all values at the final damage state to 0, which made the approach more efficient initially. However, as the number of damage states increases, the time saved by the simplifications becomes negligible compared to the time it takes for the overwhelming number of calculations needed to create value tables for multiple damage states.

Figure 5-4: Modified MTD's running time with a varying number of damage states.
5.3.4 Additional Weapons

The effect on the total computation time was studied in this experiment as the number of weapons available was varied. Each trial used the values stated in Table 5.1 as the basic problem, with a distinct number of weapons ranging from 1 to 100. Since there was only one weapon type, the worst-case running time for the new approach should theoretically be linear relative to the weapons available. The linear relationship is based on the fact that a value is calculated for each weapon allocation during the off-line value table formulation. However, as shown in Figure 5-5, the total time increased at a slightly faster rate than predicted.

Figure 5-5: Modified MTD's running time with a varying number of weapons.

One possible cause of such deviation is the way a multi-dimensional array is implemented in the simulator. The simulator's multi-dimensional array, which is the data type used to hold the transition matrices, the value table, and the action table, is represented by a single array, where each element of the single array maps to a distinct element in the multi-dimensional array. In order to retrieve or store a value, the indices of the N-dimensional array are converted into a single index, and vice versa, through the following equation,

[a_1, \ldots, a_{N-1}, a_N] = (a_1 \times (A_2 \times A_3 \times \cdots \times A_N)) + \cdots + (a_{N-1} \times A_N) + a_N,   (5.1)

where [A_1, \ldots, A_N] are the dimensions of the N-dimensional array. This conversion process is linearly proportional to the number of weapons. Thus, the additional time required to retrieve and store each value due to this particular implementation of the multi-dimensional array caused the computation time to increase at a faster rate than predicted.
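The index-flattening scheme of equation 5.1 corresponds to the usual row-major layout, sketched below with 0-based indices; the function names are illustrative only, not the simulator's.

```python
# Sketch of the row-major index flattening in equation 5.1: an index
# (a_1, ..., a_N) into an array with dimensions (A_1, ..., A_N) maps to a
# single offset, and back again.  0-based indices are assumed.

def flatten(index, dims):
    """index = (a_1, ..., a_N), dims = (A_1, ..., A_N)."""
    offset = 0
    for a, dim in zip(index, dims):
        offset = offset * dim + a        # same expansion as equation 5.1
    return offset

def unflatten(offset, dims):
    index = []
    for dim in reversed(dims):
        index.append(offset % dim)
        offset //= dim
    return tuple(reversed(index))

dims = (3, 4, 5)
assert flatten((2, 1, 3), dims) == 2 * (4 * 5) + 1 * 5 + 3 == 48
assert unflatten(48, dims) == (2, 1, 3)
```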
5.3.5 Additional Weapon Types

To examine the effect on the overall computation time due to an increase in the number of weapon types, an experiment consisting of 5 trials was conducted. Each trial had a different number of weapon types ranging from 1 to 5, with 5 weapons per type. The values of the other variables are listed in Table 5.1. The number of weapons was decreased to 5 per type so that the computation time was still under 24 hours when solving an allocation problem with 5 weapon types.

The trials' running times increase exponentially with the number of weapon types, as shown in Figure 5-6. The total time to solve each of the problems with more than one weapon type is mostly contributed by the off-line phase. During the value table formulation, a value is calculated for every possible combination of weapons of each weapon type. Since there are only 5 weapons per type in this scenario, the time required is O(5^{|W|}), where |W| is the total number of weapon types. The computation time of the on-line phase, in contrast, is independent of the number of weapon types. The exponential trend is therefore present from the first trial, but to a lesser degree; it is not evident there because the time required by the on-line phase with only one weapon type is much greater than the off-line phase's running time.

Figure 5-6: Modified MTD's running time with a varying number of weapon types.

5.3.6 Additional Planes

Adding an additional plane to a scenario affects the total computation time of the modified MTD approach. To study the effect, the values listed in Table 5.1 were used to model the set of problems, each with a different number of planes from a range of 1 to 50. In this experiment, the off-line phase's computation time stayed fairly constant, as shown in Figure 5-7, because the value table formulation is independent of the number of planes, which are only considered during the on-line phase. In addition, the on-line phase will require more time if the deallocation-reallocation process occurs. Since there were 30 weapons available and each plane could only carry a maximum of 2 weapons, the deallocation-reallocation process would definitely take place initially. This process varied in computation time according to the number of iterations required to find a permissible allocation. The variation depends on which targets are available for attack at a given time and on the number of planes. As the number of planes increased, the likelihood of having to deallocate weapons due to an insufficient number of planes decreased. Computation time decreased at 25 planes, which marked the point in the experiment when the deallocation-reallocation process stopped occurring. Because the off-line phase took a much longer time than the on-line phase, the increase in running time due to the deallocation-reallocation process was, therefore, negligible.

Figure 5-7: Modified MTD's running time with a varying number of planes.

Chapter 6

Conclusions and Future Research Areas

6.1 Conclusions

The modified Markov task decomposition approach (MMTD) has been shown to be a feasible method to solve large-scale allocation problems, particularly in the weapon/plane allocation domain. The technique incorporates the idea that weapons can be of different types. Furthermore, the approach also includes the deallocation-reallocation process to model the fact that there are a limited number of planes that can carry weapons to their targets. These extensions to the original MTD approach provide a more comprehensive model in which realistic situations can be described. Although it cannot be proven that the technique generates solutions close to optimal for large-scale problems, the experiment described in section 5.2 demonstrated that it produces solutions with a higher expected utility compared to a straightforward heuristic.

MMTD divides the process of solving an allocation problem into two phases and breaks up the problem into smaller sub-problems (§2.2). As shown in section 5.3, the off-line phase requires a significant amount of time to generate the value table. Its worst-case computation time is O(H \cdot \prod_{j=1}^{K} M_j \cdot |S|^2 \cdot N), where M_j is the number of weapons of type j, K is the number of weapon types, |S| is the number of damage states, and N is the number of target types. A mitigating factor is that once the value table is created, it can be re-used in on-line analysis for similar scenarios. The on-line phase uses the off-line value table and produces weapon/plane allocations with the highest expected utility. Although its running time increases as the number of targets and planes varies, the amount of time it takes to determine a solution is negligible compared to the off-line computation time.
This could make real-time analysis possible, since the actual allocation of weapons and planes is done on-line in a fraction of a minute.

6.2 Potential Applications

The simulator built for this research models a more sophisticated air strike campaign than its predecessor. If the model is further improved so that the simulation can give better predictions of the damage taken by the targets, then the system could be used effectively in military planning. Moreover, this system could be adapted and used for other large-scale allocation problems. For example, the distribution of doctors and nurses to various departments within a hospital can be determined using a similar MMTD approach. This is because the departments can be modelled with MDPs, where assigning an extra nurse or doctor can cause a department to vary its state of efficiency and effectiveness. In addition, the features added in this thesis to the original MTD approach can model the different kinds of personnel available for distribution, since they could be considered analogous to the different weapon types. Thus, any large-scale stochastic planning problem with different resource types and certain constraints can use the techniques developed in the preceding chapters to find a near-optimal policy.

6.3 Future Work

The research reported in this thesis provides the foundation for many different areas in which future work could be done. One area is the potential research for creating an even more realistic model of an air strike campaign. Features such as the implementation of different plane types with varying capacities and a way of prioritizing targets could be added to the model. In addition, the concept of geography could be incorporated into the system so that targets at a farther distance are harder to attack than targets in close proximity. Another important aspect that should be included is the property of partial observability, for instance by using partially observable MDPs (POMDPs) to model the targets [4]. This property states that the information gathered on the targets after an attack has a probability of being erroneous. These modifications will enhance the model and most likely increase the computation time dramatically. Thus, further research is needed to create faster and more reliable solution algorithms.

The work achieved for this thesis demonstrates that the MMTD technique only arrives at a near-optimal policy. Research should, therefore, be done to find new algorithms that solve large-scale stochastic planning problems optimally. Since MMTD decreases the solution time significantly compared to the time it takes to solve one large MDP, the on-line value maximization method could be improved to keep the same computational tractability but find a higher quality policy. Other possible value maximization methods, such as local search and linear programming, should be researched to determine whether these techniques would yield solutions with higher expected utility. These possibilities illustrate that there are still many areas involving stochastic planning problems that remain to be worked on.

Bibliography

[1] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. McGraw-Hill Book Company, Cambridge, Massachusetts, first edition, 2000.

[2] Thomas Dean, Leslie Pack Kaelbling, Jak Kirman, and Ann Nicholson. Acting optimally in partially observable stochastic domains. Artificial Intelligence, 76, 1995.
[3] Alvin W. Drake. Fundamentals of Applied Probability Theory. McGraw-Hill Book Company, New York, 1967.

[4] Spencer Firestone. A Partially Observable Approach to Allocating Resources in a Dynamic Battle Scenario. M.Eng dissertation, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 77 Massachusetts Avenue, Cambridge, MA 02141, June 2002.

[5] Frederick S. Hillier and Gerald J. Lieberman. Introduction to Operations Research. Holden-Day, Inc., Oakland, California, fourth edition, 1986.

[6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, February 1998.

[7] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[8] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled Markov decision processes. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 165-172, July 1998.

[9] Kirk A. Yost. Solution of Large-Scale Allocation Problems with Partially Observable Outcomes. PhD dissertation, Naval Postgraduate School, Department of Operations Research, Monterey, CA 93943-5000, September 1998.