A Partially Observable Approach to Allocating Resources in a Dynamic Battle Scenario

by Spencer James Firestone

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2002.

© Spencer James Firestone, MMII. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, May 24, 2002
Certified by: Richard Hildebrant, Principal Member of Technical Staff, Draper Laboratory, Technical Supervisor
Certified by: Leslie Pack Kaelbling, Professor of Computer Science and Engineering, MIT, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

This thesis presents a new approach to allocating resources (weapons) in a partially observable dynamic battle management scenario by combining partially observable Markov decision process (POMDP) algorithmic techniques with an existing approach for allocating resources when the state is completely observable. The existing approach computes values for target Markov decision processes offline, then uses these values in an online loop to perform resource allocation and action assignment. The state space of the POMDP is augmented in a novel way to address conservation of resource constraints inherent to the problem. Though this state space augmentation does not increase the total possible number of vectors in every time step, it does have a significant impact on the offline running time. Different scenarios are constructed and tested with the new model, and the results show the correctness of the model and the relative importance of information.

Technical Supervisor: Richard Hildebrant
Title: Principal Member of Technical Staff, Draper Laboratory

Thesis Supervisor: Leslie Pack Kaelbling
Title: Professor of Computer Science and Engineering, MIT

Acknowledgments

This thesis was prepared at the Charles Stark Draper Laboratory, Inc., under Internal Research and Development. Publication of this report does not constitute approval by the Draper Laboratory or any sponsor of the findings or conclusions contained herein. It is published for the exchange and stimulation of ideas.

Permission is hereby granted by the author to the Massachusetts Institute of Technology to reproduce any or all of this thesis.

Spencer Firestone
May 24, 2002

Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Problem Approaches
  1.4 Thesis Approach
  1.5 Thesis Roadmap

2 Background
  2.1 Markov Decision Processes
    2.1.1 MDP Model
    2.1.2 MDP Solution Method
  2.2 POMDP
    2.2.1 POMDP Model Extension
    2.2.2 POMDP Solutions
    2.2.3 POMDP Solution Algorithms
  2.3 Other Approaches Details
    2.3.1 Markov Task Decomposition
    2.3.2 Yost
    2.3.3 Castañon

3 Dynamic Completely Observable Implementation
  3.1 Differences to MTD
    3.1.1 Two-State vs. Multi-State
    3.1.2 Damage Model
    3.1.3 Multiple Target Types
  3.2 Implementation
    3.2.1 Architecture
    3.2.2 Modelling
    3.2.3 Offline-MDP Calculation
    3.2.4 Online-Resource Allocation and Simulation
  3.3 Implementation Optimization
    3.3.1 Reducing the Number of MDPs Calculated
    3.3.2 Reducing the Computational Complexity of MDPs
  3.4 Implementation Flexibility
  3.5 Experimental Comparison

4 Dynamic Partially Observable Implementation
  4.1 Additions to the Completely Observable Model
    4.1.1 POMDPs
    4.1.2 Strike Actions vs. Sensor Actions
    4.1.3 Belief State and State Estimator
  4.2 The Partially Observable Approach
    4.2.1 Resource Constraint Problem
    4.2.2 Impossible Action Problem
    4.2.3 Sensor Actions
    4.2.4 Belief States
  4.3 Implementation
    4.3.1 Architecture
    4.3.2 Modelling
    4.3.3 Offline-POMDP Calculations
    4.3.4 Online-Resource Allocation
    4.3.5 Online-Simulator
    4.3.6 Online-State Estimator
  4.4 Implementation Optimization
    4.4.1 Removing the Nothing Observation
    4.4.2 Calculating Maximum Action
    4.4.3 Defining Maximum Allocation
    4.4.4 One Target Type, One POMDP
    4.4.5 Maximum Target Type Horizon
  4.5 Experimental Results
    4.5.1 Completely Observable Experiment
    4.5.2 Monte Carlo Simulations

5 Conclusion
  5.1 Thesis Contribution
  5.2 Future Work

A Cassandra's POMDP Software
  A.1 Header
  A.2 Transition Probabilities
  A.3 Observation Probabilities
  A.4 Rewards
  A.5 Output Files
  A.6 Example-The Tiger Problem
  A.7 Running the POMDP
  A.8 Linear Program Solving
  A.9 Porting Considerations

List of Figures

2-1 A sample Markov chain
2-2 A second Markov chain
2-3 A combination of the two Markov chains
2-4 A belief state update
2-5 A sample two-state POMDP vector set
2-6 The corresponding two-state POMDP parsimonious set
2-7 The dynamic programming exact POMDP solution method
2-8 Architecture of the Meuleau et al. approach
2-9 Architecture of the Yost approach, from [9]
2-10 Architecture of the Castañon approach
3-1 A two-state target
3-2 A three-state target
3-3 The completely observable implementation architecture
3-4 A sample target file
3-5 A sample world file
3-6 Timeline of sample targets
3-7 A sample data file
3-8 A timeline depicting the three types of target windows
3-9 Meuleau et al.'s graph of optimal actions across time
3-10 Optimal actions across time and allocation
4-1 The general state estimator
4-2 The state estimator for a strike action
4-3 The state estimator for a sensor action
4-4 Expanded transition model for M = 3
4-5 Expanded transition model for M = 3 with the limbo state
4-6 The partially observable implementation architecture
4-7 A sample partially observable target file
4-8 Optimal policy for M = 11
4-9 Score histogram of a single POMDP target with no sensor action
4-10 Score histogram of a single completely observable target
4-11 Score histogram of a single POMDP target with a perfect sensor action
4-12 Score histogram of a single POMDP target with a realistic sensor action
4-13 Score histogram of 100 POMDP targets with no sensor action
4-14 Score histogram of 100 POMDP targets with a realistic sensor action
4-15 Score histogram of 100 POMDP targets with a perfect sensor action
4-16 Score histogram of 100 completely observable targets
A-1 The input file for the tiger POMDP
A-2 The converged .alpha file for the tiger POMDP
A-3 The alpha vectors of the two-state tiger POMDP
A-4 A screenshot of the POMDP solver

Chapter 1

Introduction

The task of planning problems in which future actions are based on the current state of objects is made difficult by uncertainty about that state. State knowledge is often fundamentally uncertain, because objects are not necessarily in the same location as the person or sensor making the observation. Problems of this type can be found anywhere from business to the military [9, 5].

For an example of a business planning problem, consider a scenario where a multinational corporation produces a product. This product is either liked or disliked by the general populace. The company has an idea of how well the product is liked based on sales, but this knowledge is uncertain because it is impossible to ask every person how satisfied they are. The company wishes to know whether it should produce more of the product, change the product in some way, or give up on the product altogether. Of course, an improper action could be costly to the company. To become more certain, the company can perform information actions such as polls or surveys, then use this knowledge to guide its actions.

Military applications are even clearer examples of how partial knowledge can present problems. In a bombing scenario, there are several targets with a "degree of damage" state, anywhere from "undamaged" to "damaged". The goal is to damage the targets with a finite amount of resources over a period of time. Dropping a bomb on a damaged target is a waste of resources, while assuming a target is damaged when it is not can cost lives. Unfortunately, misclassifications are frequent [9]. This thesis will focus on a more developed battle management scenario in a military application.

1.1 Motivation

Battle management consists of allocating resources among many targets over a period of time and/or assessing the state of these targets. However, current models of a battle management world are fairly specific. Some models are concerned only with bombing targets, while others focus on target assessment. Some calculate a plan before the mission, while others update the actions and allocations based on real-time information. A more realistic model could be created by combining components of these models. This new model could then be used to create more accurate and effective battle plans.

1.2 Problem Statement

In a combat mission, the objective of the military is to maximize target damage in the shortest time with the lowest cost. This thesis will examine a battle scenario where there are several targets to which weapons can be allocated. The general problem to be investigated has the following characteristics:

* Objective: The goal of the problem is to maximize the reward attained from damaging targets over a mission time horizon. The reward is defined as the value of the damage done to the target less the cost of the weapons used.
* Resources: The resources in this research are weapons of a single type.
For each individual problem, there is a finite number of weapons to be allocated, M. Once a weapon is used, it is consumed, and some cost is associated with its use.

* Targets and States: For each individual problem, there is a finite number of targets to be attacked. Each target is a particular type, and there are one or more different target types. Each target type has a number of states. There are at least two states: undamaged and destroyed.
* Time Horizons: The battle scenario exists over a discrete finite time horizon. Each of the H discrete steps in this horizon is called a time step, t. Each individual target is available for attacking over an individual discrete finite time horizon. Target transitions and rewards can only be attained when the target exists.
* Actions: There are two classes of actions. A strike action consists of using zero or more weapons on a target in a given time step. There is also a sensor class of actions, which does not affect the target's state, but instead determines more information about its state. Sensor class actions are only necessary in certain models, as discussed in the Observations item below. Sensor class and strike class actions are mutually exclusive and cannot be performed at the same time. There are N actions at every time step, where N is the number of targets.
* State Transitions: Each individual weapon has a probability of damaging the target. Using multiple weapons on a target increases the probability that the target will be damaged. Damage is characterized by a transition from one state to another. It is assumed that targets cannot repair themselves, so targets can only transition from a state to a more damaged state, if they transition at all.
* Allocation: Allocating resources to targets means dividing the total number of resources among the different targets. However, allocating x resources does not imply that all x resources will be used at that time step.
* Resource Constraints: For simplicity, it is assumed that there is no constraint limiting the number of weapons that can be allocated to any one target at any time step. The only resource constraint is that the sum of all targets' weapon allocations is limited to the total number of resources currently available.
* Rewards: Rewards are attached to the transitions from one state to another. Different target classes can have different reward values.
* Observations: After each action, an observation is made as to the state of each target. There are two cases: definitive knowledge of the state of the target, called complete or total observability, or probabilistic knowledge of the state of the target, called partial observability.
  - Total Observability: In the totally observable class of planning problems, after every action, the state of the targets is implicitly known with certainty. The next actions can then be planned based on this knowledge. Only strike actions are necessary in this class of problems.
  - Partial Observability: In the partially observable class of planning problems, after every action, there is a probabilistic degree of certainty that a target is in a given state. It is assumed that a strike action returns no information about the state of the target. In addition, a strike action is not coupled with a sensor action, so to determine more accurate information about a target, a sensor class action must be used.
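To make these elements concrete, the following is a minimal sketch of how a scenario matching this problem statement might be represented in code. It is illustrative only; the field names (for example `begin`, `end`, `target_type`) are hypothetical and are not taken from the thesis implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TargetType:
    # Ordered from least to most damaged; at minimum "undamaged" and "destroyed".
    states: list[str]
    cost_per_weapon: float                 # cost of expending one weapon
    rewards: list[list[float]]             # rewards[i][j]: reward for a transition s_i -> s_j
    trans_one_weapon: list[list[float]]    # T[i][j]: transition probability for a single weapon

@dataclass
class Target:
    target_type: TargetType
    begin: int       # first time step of the target's individual window
    end: int         # last time step of the target's individual window
    state: int = 0   # index into target_type.states; 0 = undamaged

@dataclass
class Scenario:
    horizon: int     # H, number of discrete mission time steps
    weapons: int     # M, total weapons available for the whole mission
    targets: list[Target] = field(default_factory=list)
```

A totally observable planner would act on `Target.state` directly, while a partially observable planner would replace that field with a belief distribution over the target type's states.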
1.3 Problem Approaches

There are current solution methods for battle planning problems in totally observable worlds [7], and solution methods in partially observable ones [9, 5]. However, these solutions explore slightly different concepts and applications.

There are other differences in battle management solution approaches. Some solutions plan a policy of actions determined without any real-time feedback [9]. When all calculations and policies are determined before the mission begins, this is called offline planning. On the other hand, some solutions dynamically change the policy based on observations (completely accurate [7] or not [5]) from the world. When the process is iterative, with observations from the real world or a simulator, this is called online planning. Some solutions deal with strike actions [7], some focus on bomb damage assessment (BDA) of targets, some do both [9], and others use related techniques to examine different aspects of battle management [5].

This thesis will look at three seminal papers which describe different ways to approach the battle management problem, analyze them in some detail, and combine them into a more realistic model.

Meuleau et al. [7] examine a totally observable battle management world much like the problem statement described above. The model consists of an offline and an online portion, and considers only strike actions. The allocation algorithm used is a simple greedy algorithm combined with Markov decision processes and dynamic programming.

Yost's Ph.D. dissertation [9] executes much offline calculation and planning to determine the best policy for allocating weapons to damage targets and assessing their damage using sensors in a partially observable world. His allocation method is a coupling of linear programming and partially observable Markov decision processes.

Castañon [5] looks at a different aspect of the battle management world. His focus is on intelligence, surveillance, and reconnaissance (ISR), where target type is identified. This is closely related to determining the state of a target in a partially observable world, which is the topic of this thesis. He uses online calculations with partially observable Markov decision processes, dynamic programming, and Lagrangian relaxation to allocate sensor resources.

1.4 Thesis Approach

To establish a baseline for comparison, we begin by creating a simple totally observable model. The model will be almost identical to the one in Meuleau et al.'s paper, with a few extensions and clarifications. The model will consist of both online and offline phases. Examples similar to the paper's will be run and compared to the paper's results. The model will then be extended by expanding the initial implementation to operate in a partially observable world with additional characteristics. Then we will run experiments, analyze and compare results to prior work, and draw conclusions from the experiments.

Another graduate student, Kin-Joe Sham, is concurrently working in the same problem domain. However, he is focusing on improving the allocation algorithms and adding additional constraints to make the model more realistic. We have collaborated to implement the totally observable model, but from there, our extensions are separate, and the code we created diverges. Future work in this problem domain could combine our two areas of research, as we shared code for the completely observable implementation and started with the same model for our individual contributions.
1.5 Thesis Roadmap

The remainder of this thesis is laid out as follows: Chapter 2 gives background on totally and partially observable Markov decision processes. It also describes the three other papers in much more depth. Chapter 3 discusses the totally observable model fashioned after Meuleau et al.'s research, including the differences between their approach and ours, additional considerations encountered while developing the model, implementation optimizations, and experimental result comparisons. Chapter 4 develops the partially observable model, comparing it to the completely observable one defined in the previous chapter. It discusses interesting new implementation optimizations and then presents experimental results and analysis of different real-world scenarios. Chapter 5 describes the conclusions drawn from this research and possible future extensions to the project. Appendix A contains a detailed description of Cassandra's POMDP solver application.

Chapter 2

Background

Our work is founded on totally and partially observable Markov decision processes, so we begin with a description of them. With the understanding gained from those descriptions, the other approaches can be explained in more detail.

2.1 Markov Decision Processes

A Markov decision process (MDP) is used to model an agent interacting with the world [6]. The agent receives the state of the world as input, then generates actions that modify the state of the world. The completely observable environment is accessible, which means the agent's observation completely defines the current world state [8].

A Markov decision process can be viewed as a combination of several similar Markov chains with different transition probabilities. The sum of the probabilities on the arcs out of each node must be 1. As an example, a sample Markov chain is shown in Figure 2-1. In this chain, there are four states, a, b, c, and d, which are connected with arcs labelled with the transition probability from state i to state j, P_ij. Figure 2-2 displays these same four states, but with different transition probabilities. If Figure 2-1 is considered to be a Markov chain for action a_1 and Figure 2-2 represents a chain for action a_2, the combination of the two is shown in Figure 2-3. Now different paths with different probabilities can be taken to get from one state to another by choosing different actions. If the states have reward values, it is now possible to maximize the reward received for transitions from state to state by choosing the appropriate action with the highest expected reward. This is a simplification of an MDP.

Figure 2-1: A sample Markov chain
Figure 2-2: A second Markov chain
Figure 2-3: A combination of the two Markov chains

2.1.1 MDP Model

An MDP model consists of four elements:

* S is a finite set of states of the world.
* A is a finite set of actions.
* T : S × A → Π(S) is a state transition function. For every action a ∈ A and state s ∈ S, the transition function gives the probability that an object will transition to state s' ∈ S. This will be written as T(s, a, s'), where s is the original state, a is the action performed on the object, and s' is the ending state.
* R : S × S → ℝ is the reward function. For every state s ∈ S, R is the reward for a transition to state s' ∈ S. This will be written as R(s, s').
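As an illustration of these four elements, here is a minimal sketch of an MDP for the four-state chain of Figures 2-1 through 2-3, written in Python. The specific probability and reward values below are hypothetical placeholders (the figures' exact numbers are not reproduced here); the point is only the structure: a transition table indexed by action and a reward table indexed by state pair.

```python
import numpy as np

states = ["a", "b", "c", "d"]
actions = ["a1", "a2"]

# T[action][i, j] = probability of moving from state i to state j under that action.
# Each row sums to 1. The numbers are illustrative, not taken from the figures.
T = {
    "a1": np.array([[0.0, 0.6, 0.4, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.3, 0.0, 0.0, 0.7],
                    [0.0, 0.1, 0.9, 0.0]]),
    "a2": np.array([[0.0, 0.0, 0.7, 0.3],
                    [1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.2, 0.8]]),
}

# R[i, j] = reward for a transition from state i to state j (shared by both actions here).
R = np.zeros((4, 4))
R[0, 3] = 10.0   # e.g., a transition from state a to state d is worth 10

def expected_reward(s: int, a: str) -> float:
    """Expected one-step reward of taking action a in state s."""
    return float(T[a][s] @ R[s])
```

Choosing, in each state, the action with the highest expected reward plus the value of the successor state is exactly the maximization that the value iteration of Section 2.1.2 performs.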
2.1.2 MDP Solution Method

This thesis is concerned with problems of a finite horizon, defined as a fixed number of discrete time steps in the MDP. Thus the desired solution is one in which a set of optimal actions, or a policy, is found. An optimization criterion would be as follows:

$$\max E\left[\sum_{t=0}^{H-1} r_t\right],$$

where r_t is the reward attained at time t. Since we assume that the time horizon is known for every target, this model is appropriate.

To maximize the reward in an MDP, value iteration is employed. Value iteration is a dynamic programming solution to the reward maximization problem. The procedure calculates the expected utility, or value, of being in a given state s. To do this, for every possible s' it adds the immediate reward for a transition to s' to the expected value of being in the new state s'. Then value iteration takes the sum of these values, weighted by the action's transition probabilities, for a given action a. The expected value V(s) is set to be the maximum value produced across all possible actions:

$$V(s) = \max_{a} \sum_{s' \in S} T(s, a, s')\,[R(s, s') + V(s')]. \qquad (2.1)$$

A dynamic programming concept is used to apply equation 2.1 to multiple time steps. By definition, the value of being in any state in the final time step, H, is zero, since there cannot be a reward for acting on a target after the time horizon has expired. Next, the values for time step H - 1 are calculated. For this iteration of equation 2.1, V(s') refers to the expected value in the next time step, H, which is zero. The expected value for every state is calculated in time step H - 1, then these values are used in time step H - 2. The value iteration equation is adjusted for time:

$$V_t(s) = \max_{a} \sum_{s' \in S} T(s, a, s')\,[R(s, s') + V_{t+1}(s')], \quad \text{where } V_H(s) = 0.$$

In this manner, all possible expected values are calculated from the last time step to the first using the previously calculated results. The final optimal policy is determined by listing the maximizing action for each time step.

2.2 POMDP

The world is rarely completely observable. In reality, the exact state of an object may not be known because of some uncertain element, such as faulty sensors, blurred eyeglasses, or wrong results in a poll, for example. This uncertainty must be addressed.

2.2.1 POMDP Model Extension

A POMDP model has three more elements than that of the MDP:

* b is the belief state, which is a probability distribution over the states. Each element of the belief state, b(s) for s ∈ S, contains the probability that the world is in the corresponding state. The sum of all components of the belief state is 1. A belief state is written as:

$$b = [\,b(s_0)\;\; b(s_1)\;\; b(s_2)\;\; \ldots\;\; b(s_{|S|-1})\,].$$

* Z is a finite set of all possible observations.
* O : S × A → Π(Z) is the observation function, where O(s, a, z) is the probability of receiving observation z when action a is taken, resulting in state s.

In the totally observable case, after every action there is an implied observation that the agent is in a particular state with probability 1. But since state knowledge is now uncertain, each action must have an associated set of observation probabilities, though every observation does not necessarily need to correspond to a particular state. After every action, the belief state gets updated, dependent on the previous belief state, the transition probabilities associated with the action, and the observation probabilities associated with the action and the new state [3], as shown in Figure 2-4.

Figure 2-4: A belief state update
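The belief update described above can be written compactly: the new belief in state s' is proportional to O(s', a, z) times the probability of reaching s' from the old belief under action a. The following is a small illustrative sketch of that computation; it is a generic POMDP belief update, not code from the thesis implementation, and the example numbers are hypothetical.

```python
import numpy as np

def belief_update(b, T_a, O_a, z):
    """One step of a POMDP state estimator.

    b   : length-|S| belief vector (sums to 1)
    T_a : |S| x |S| transition matrix for the chosen action a
    O_a : |S| x |Z| observation matrix for action a, O_a[s', z] = O(s', a, z)
    z   : index of the observation actually received
    """
    # Predict: probability of each next state before seeing the observation.
    predicted = b @ T_a
    # Correct: weight each next state by how likely it is to produce observation z.
    unnormalized = predicted * O_a[:, z]
    # Normalize so the belief sums to 1 (the denominator is Pr(z | b, a)).
    return unnormalized / unnormalized.sum()

# Example for a two-state target (alive, dead) with a hypothetical sensor action.
b = np.array([0.5, 0.5])
T_sense = np.eye(2)                  # a pure sensor action does not change the state
O_sense = np.array([[0.8, 0.2],      # observe "alive" with p = 0.8 when the target is alive
                    [0.1, 0.9]])     # observe "dead" with p = 0.9 when the target is dead
print(belief_update(b, T_sense, O_sense, z=1))   # belief shifts toward "dead"
```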
2.2.2 POMDP Solutions Solving a POMDP is not as straightforward as the dynamic programming value iteration used to solve an MDP. Value functions at every time step are now represented as a set of ISI-dimensional vectors. The set is defined as parsimonious if every vector in the set dominates all other vectors at some point in the ISI-dimensional belief space. A vector is dominated when another vector produces a higher value at every point. 23 V(b) 72 7 0 ........... . 73................. ................ o b1 (dead) (alive) Figure 2-5: A sample two-state POMDP vector set Ft represents a set of vectors at time step t, -y represents an individual vector in the set, and F* represents the parsimonious set at time t. The value of a target given a belief state, V(b), is the maximum of the dot product of b and each -y in F*. There will be one parsimonious set solution for each time step in the POMDP. Like MDPs, POMDPs are solved from the final time step backwards, so there will be one 17* for each time step t in the POMDP, and the solutions build off of the previous solution, t-. The value function of the parsimonious set is the set of vector segments that comprise the maximum value for every point in the belief space. The value function is always piecewise linear and convex [2]. Figure 2-5 shows a sample vector set for a two-state POMDP. With a two-state POMDP, if the probability of being in one of the states is p, the probability of being in the other state must be 1 - p. Therefore the entire space of belief states can be represented as a line segment, and the solution can be depicted on a graph. In the figure, the belief space is labelled with a 0 on the left and a 1 on the right. This is the probability that the target is in state 1, dead. To the far left is the belief state that the target is dead with probability 0, and thus alive with probability 1. To the far right is the belief state that the target is dead with probability 1, and thus alive 24 V(b) 0 (dead) (alive) Figure 2-6: The corresponding two-state POMDP parsimonious set with probability 0. Each vector has an action associated with it. Distinct vectors can have the same action, as the vector represents the value of taking a particular action at the current time step, and a policy of actions in the future time steps. The dashed vectors represent action a1 , the dotted vectors represent action a2 , and the solid vectors represent action a3 . There are six vectors in this set, but not all are useful. Both 'y4 and ye are completely dominated, so are not be included in the final optimal value function. It is useful to note that there are two types of vector domination. The first is complete domination by one other vector, shown in the figure as 'y4 is completely dominated at every point in the belief space by 73. The second is piecewise domination, shown in the figure as Y6 is dominated at various points in the belief space by 71i, 72 73 and 75 Various solution algorithms make use of the differences between these two types of domination to optimize computation time. Figure 2-6 shows the resulting parsimonious set and the sections that the vectors partition the belief space into. In this particular problem, the belief space has been partitioned into four sections, with a1 being the optimal action for the first and fourth sections, a2 producing the optimal value for the second section, and a3 being 25 F* t -11 a, z.a)VvE ral ral al -*-F ][aAl a2j iFa2P f e e ra l ra A ][ fah 2 . 
falAl U Figure 2-7: The dynamic programming exact POMDP solution method the optimal action for the third. The heavy line at the top of the graph represents the value function across the belief space. 2.2.3 POMDP Solution Algorithms How the solutions for each time step are created is dependent on the POMDP solution algorithm used. However, this thesis does not focus on POMDP solution algorithms, but rather uses them as a tool to produce a parsimonious set of vectors. This section will discuss general POMDP solution algorithms at a high level, and also the incremental pruning algorithm used in this research. Cassandra's website [2] has an excellent overview of many solution algorithms. General Algorithms There are two types of POMDP solution algorithms: exact and approximate. Exact algorithms tend to be more computationally expensive, but produce more accurate solutions. Ve chose among several exact dynamic programming algorithms. 26 The current general solution method for an exact DP algorithm uses value iteration for all vectors. The algorithm path can be seen in figure 2-7. Every -Y in the previously calculated parsimonious set F* is transformed to a new vector given an action a and an observation :, according to a function defined as (-y, a, -). These vectors are then pruned, which means that all vectors in the set that are dominated are removed. This produces the set of vectors F'. This is done for all possible actions and observations. Next, for a given action a, all observations are considered, and the cross-sum, (, of every F is calculated. This vector set is once again pruned, and this produces 1a. This is done for all a C A. Finally, the algorithms take the union of every 1a set, purge those vectors, and produce the parsimonious set F*. The purging step involves creating linear programs which determine whether a vector is dominated by any other vector at some point. However, this is optimized by performing a domination check first. For every vector, the domination check compares it against every other vector to determine if a single other vector dominates it at every point in the belief space. If this is the case, the vector is removed from the set, making the LPs more manageable. Several algorithms were considered for this research. In 1971, Sondik proposed a complete enumeration algorithm, then later that year updated it to the One-Pass algorithm. Cheng used less strict constraints in his Linear Support algorithm in 1988. In 1994, Littman et al. came up with the widely used Witness algorithm, which is the basis for the above discussion of a general POMDP solution method. Finally, in 1996, Zhang and Liu came up with an improvement on the Witness algorithm, calling it incremental pruning. This is currently one of the fastest algorithms for solving most classes of problems [4]. Incremental Pruning The incremental pruning algorithm optimizes the calculation of the IF sets from the 17 sets. The way to get the F' sets is to take the cross sum of the action/observation 27 vector sets and prune the results, in the following manner: T4purge @ r this is equivalent to: purge(ra 2a .. Za IZI. where k Incremental pruning notes that this method takes the cross sum of all possible vectors, creating a large set, then pruning this large set. However, this is more efficiently done if the calculation is done as follows: p pruage(... 
purge(purge(1 9 1a7) D r) (D r A more detailed description of the algorithm is contained in Cassandra et al.'s published paper [4] which reviews and analyzes the incremental pruning algorithm in some depth. 2.3 Other Approaches Details As mentioned before, three papers have looked at problems similar to the one this thesis discusses. Each of these papers has had significant impact on the creation of this model. 2.3.1 Markov Task Decomposition Meuleau et al.'s paper focuses on solving a resource allocation problem in a completely observable world. Meuleau et al. use a solution method they call Markov Task Decomposition (MTD), in which there are two phases to solving the problem: an online and an offline phase, thus making it dynamic. The problem they choose to solve is to optimally allocate resources (bombs) among several targets of the same type such that at the end of the mission, the reward obtained is maximized. Each 28 target is accessible within a time window over the total mission time horizon, and has two states: alive or dead. There is one type of action, a strike action, which is to drop anywhere from 0 to M bombs on a target. Each bomb has a cost and an associated probability of hitting the target, and the probability of a successful hit goes up with the number of bombs dropped according to a noisy-or damage model, which assumes each of the a bombs dropped has an independent chance of causing the target to transition to the dead state. The offline phase uses value iteration to calculate the maximum values for being in a particular state given an allocation and a time step, for all states, allocations, and time steps. The actions associated with these maximum values are stored for use in online policy generation. In the online phase, a greedy algorithm calculates the marginal reward of allocating each remaining bomb to a target, using the previously calculated offline values. At the end of this step, every target i will have an allocation mi. This vector of allocations is passed to the next component of the online phase, the policy mapper. For each target, the policy mapper looks up the optimal action for that target's time step ti, mi, and state si. Then the policy mapper has a vector of actions consisting of one action for each target. These actions are passed to a simulator which models real world transitions. Every action has a probabilistic effect on the state, and the simulator calculates each target's new state, puts them into a vector, then passes this state vector back to the greedy algorithm. The greedy algorithm then calculates a new allocation based on the updated number of bombs remaining and the new states, the policy mapper gets the actions for each target from the offline values, sends these actions to the simulator, and so on. The online phase repeats in this loop until the final time step is reached. Figure 2-8 shows the MTD architecture. The first step is the greedy algorithm. For every target, the greedy algorithm phase uses the target's current state s and time step t to calculate the marginal reward for adding a bomb to the target's allocation. The target with the greatest marginal reward mA has its allocation incremented by one. Once a bomb is allocated to target x, that target's marginal reward, mrn 29 is Corresponding actions 1 4 Policy Actions sMapper s, t, m indexes Offline World Allocations Dynamic Programming Corresponding values, P 4 - s, t, mA indexes Greedy Algorithm States Figure 2-8: Architecture of the Meuleau et al. approach recalculated. 
When all bombs have been allocated, every target has an allocation. A vector of allocations is then passed to the policy mapper module. The policy mapper module in the figure uses the same s and t as the greedy algorithm used, but now uses each target's allocation from the allocation vector. The action corresponding to the maximum value for that target's s, t, and m is returned to the policy mapper, which then creates a vector holding an optimal action for every target. This vector is then passed to the world. The world, whether it is a simulator or a real life scenario, will perform the appropriate actions on each target. The states of these targets are changed by these actions according to the actions' transition models. The world then returns these states to the greedy algorithm, incrementing the time step by one. This loop repeats until the final time step is reached. The paper lists three different options for resource constraints. The first is the no resource constraints option, in which each target is completely decoupled, and there is no allocation involved, so only the offline part is necessary. Each target has a set of bombs that will not change over the course of the problem. In the second option, global constraints only, the only resource constraint observed is such that the total number of bombs allocated must not be more than the total number of bombs for the entire problem. The third alternative is the instantaneous constraints only 30 Current object values, resource marginal costs Initial Policies POMDP MASTER LP (1 per object type) available resources object constraints optimal policy for current costs Improving policies Quit when no improving policies are found Figure 2-9: Architecture of the Yost approach, from [9] option, in which there are a limited number of weapons that can be simultaneously delivered to any set of targets (i.e., plane capacity constraints). This thesis uses the second option, global constraints only, based on its simplicity and the possibility for interesting experiments. 2.3.2 Yost Yost looks at the problem of allocating resources in a partially observable world. However, all calculations are done offline. He solves a POMDP for every target, and allocates resources based on the POMDPs' output. Then his approach uses a linear program to determine if any resource constraints were violated. If there are resource constraint violations, the LP adjusts the costs appropriately and solves the POMDPs again, until the solution converges. Figure 2-9 shows Yost's solution method. It shows that an initial policy is passed into the Master LP, which then solves for constraint violations. The new updated rewards and costs are passed into a POMDP solver, which then calculates a new policy based on these costs. This policy goes back into the Master LP, which optimizes the costs, and so on, until the POMDP yields a policy that cannot be improved within the problem parameters. This is all done completely offline, so it does not apply to a dynamic scenario. 31 POMDP Solver Updated costs Corresponding values Actions Initial Information Lagrangian Relaxation World State Observations Figure 2-10: Architecture of the Castafion approach 2.3.3 Castafion Castafion does not deal with strike actions, but instead uses observation actions to classify targets. Each target can be one of several types, and the different observation actions have different costs and accuracies. He uses the observations to determine the next action based on the POMDP model, thus his problem is dynamic. 
He has two types of constraint limitations. The first is, again, the total resource constraint. He also considers instantaneous constraints, where he has limited re- sources at each time step. He uses Lagrangian relaxation to solve the resource constraint problems. The entire problem is to classify a large number of targets. However, he decouples the problem into a large number of smaller subproblems, in which he classifies each target. Then he uses resource constraints to loosely couple all targets. But by doing the POMDP computation on the smaller subproblems, he reduces the state space and is able to use POMDPs to determine optimal actions. Figure 2-10 depicts Castafion's approach. An initial information state is passed to the Lagrangian relaxation module. This in turn decouples the problems into one POMDP with common costs, rewards, and observation probabilities. Then the POMDP is solved, and the relaxation phase creates a set of observation actions based on the results. The world returns a set of observations and the relaxation phase then uses this to craft another decoupled POMDP, and so on until the final time step. 32 This approach is conducted entirely online. Castaion has efficiently reduced the number of POMDPs for all targets to one, but because the problem is dynamic, a new POMDP must be solved at every time step. His problem is one of classification, so the state of an object never changes. Thus, transition actions do not exist, and observation actions cause a target's belief state to change. 33 34 Chapter 3 Dynamic Completely Observable Implementation To analyze a new partially observable approach to the resource allocation problem, we begin by expanding the totally observable case. Though most of the ideas presented in this chapter are from Meuleau et al.'s work, it is necessary to understand them, as they are fundamental to the new approach presented in this thesis. The problem that this chapter addresses is one in which there are several targets to bomb and each is damaged independently. The problem could be modelled as an MDP with an extremely large state space. However, this model would be too large to solve with dynamic programming [7]. Thus, an individual MDP for each target is computed offline, then the solutions are integrated online. The online process is to make an overall allocation of total weapons to targets, then determine the number of bombs to drop for each target. The first round of weapons are deployed to the targets and the new states of the targets are determined. Bombs are then reallocated to targets, a second round of weapons are deployed, and so on, until the mission horizon is over. 35 S ={undamaged, damaged} .5 .. 0 [0 1 '0 501 0 Figure 3-1: A two-state target S ={undamaged, partially damaged, destroyed} T = .6 .3 .1 0 .7 .3j, R= 0 0 1 0 25 501 0 0 20 0 0 0 Figure 3-2: A three-state target 3.1 Differences to MTD The research presented in Meuleau et al.'s paper is complete, and the problem domain can be expanded to include partially observable states. However, other enhancements were made to the problem domain, including implementing and testing with multistate targets (defined as targets with three or more states), updating the damage model, and allowing for multiple target types. 3.1.1 Two-State vs. Multi-State Though Meuleau et al.'s model and calculations are of a general nature and can be used with multi-state targets, the paper only discusses a problem in which the targets are one type; and this target type has two states: alive and dead. 
However, in real-world problems, there will often be more than two states. A trivial example is a 4-span bridge in which the states range from 0% damaged to 100% damaged in 25% increments [9]. The implementation presented in this thesis can handle multiple states. It is simple enough to extend the model from two states to multiple states. All that is involved is adding a state to the S set, increasing the dimensions of the T matrix by one, and increasing the dimensions of the R matrix by one as well. For example, figure 3-1 presents a simple target type with two states: undamaged 36 Description State Si Undamaged 25% damaged 50% damaged 75% damaged Destroyed S2 S3 S4 S5 Table 3.1: Sample state descriptions and damaged. Figure 3-2 presents a target with three states: undamaged, partially damaged, and destroyed. The new T is a 3 x 3 matrix, and the R has more reward possibilities as well. 3.1.2 Damage Model Meuleau et al. use a noisy-or model, as described below, in which a single hit is sufficient to damage the target, and individual weapons' hit probabilities are independent. The state transition model they use for a two state target with states u undamaged and d = damaged is the following: T(s, a, s') = 0 if s = d and s' = u 1 if s = d and s' = d q if s = u and s' = u 1 if s = u and s' = d - The transition probability for a target from state s to state s' upon dropping a bombs is determined by the probability of missing, q = 1 - p, where p is the probability of a hit. To extend the model to multiple states, it is necessary to analyze what an action actually does to the target. Consider a target with five states, si through s5 , as shown in table 3.1. A state that is more damaged than a state si is said to be a "higher" state, while a state that is less damaged is a "lower" state. Each bomb causes a transition from one state to another based on its transition matrix T. Since the damage from each bomb is independent and not additive, when 37 multiple bombs are dropped on a target, each bomb provides a "possible transition" to a state. The actual transition is the maximum state of all possible transitions. Consider a target that is in s1. If a bombs are dropped, what is the probability that it will transition to S3? There are three possible results of this action: " Case 1: At least one of the a bombs provided a possible transition to a state greater than S3. If this situation occurs, the target will not transition to S3, no matter what possible transitions the other bombs provide, but will instead transition to the higher state. " Case 2: All a bombs provide possible transitions to lower states. Once again, if this situation occurs, the target will not transition to S3, but will transition to the maximum state dictated by the possible transitions. " Case 3: Neither of the above cases occurs. This is the only situation in which the target transitions to state s3. The extended damage model is generalized as follows. target transitions from state i to state j The probability that a given action a is: T(si, a, sj) = 1 - Pr(Case 1) - Pr(Case 2). (3.1) The probability of Case 1 is the sum of the transition probabilities for state si to all states higher than sj for action a: Is' Pr(Case 1) = T(si, a, 5m). ) (3.2) m=j+1 The probability of a single bomb triggering case 2 is the sum of the transition probabilities for state si to all states lower than sj for a = 1: j-1 T(si, 1, Sk), Pr(Case 21a = 1) = (3.3) k=1 where T(si, 1, sk) is given in T. 
The generalized form of equation 3.3 is the probability 38 that all a bombs dropped transition to a state less than sj: Pr(Case 2) r ZT(si, 1, sk) (3.4) _k=1I Finally, the probability that Case 3 occurs, that a target transitions from si to sj given action a is a combination of equations 3.1, 3.2, and 3.4: -1 T(si, a, sj) = 1 - Is -a 1 T(si, 1, s) - E T(s a, SM). (3.5) m=j+1 -k=1 Since equation 3.5 depends on previously calculated transition probabilities for Case 1, the damage model must be calculated using dynamic programming, starting at the highest state. Thus, T(sj, a, sisI) must be solved first for a given a, then T(si, a, sIS_11), and so on. 3.1.3 Multiple Target Types Any realistic battle scenario will include targets of different types, each with its own S, T, R, and A. Each of these target types has an associated MDP. Each one of these independent MDPs is solved using value iteration, and the optimal values and actions are stored separately from other target types'. In the resource allocation phase, each target will have its target type MDP checked for marginal rewards and optimal actions. Multiple MDPs now need to be solved to allow for multiple target types. 3.2 Implementation The following sections describe how the problem solution method described in Meuleau et al.'s paper was designed, implemented, and updated. 39 Resource Affoottion SOrrsponding actionsA poiy Data Actions World Simulator SAl- ions Structures ~s~ Atgcthm States 'ndexes Offline Dynamic Programming Target File World W File Figure 3-3: The completely observable implementation architecture 40 3.2.1 Architecture The architecture for the problem solution method is identical to the one discussed in section 2.3.1. Specific to the implementation, however, are the input files and data structures, which can be seen in relation to the entire architecture in figure 3-3. The input files are translated into data structures and used by both the offline and online parts of the implementation. The offline calculation loads in the target and world files (1) and produces an MDP solution data file for every target type (2). Next, the greedy algorithm loads in the target and world files (3), then begins a loop (4) in which it calculates the optimal allocation for each target. To do this, for each target i, the greedy algorithm looks up a value in the data structures, using a state s, a time step t, and an allocation m.A as an index. The greedy algorithm uses these values to create a vector of target allocations, which get passed to the policy mapper (5). For each target, the policy mapper uses the target's state, time index, and recently calculated allocation to get an optimal action (6). After calculating the best action for each target, the resource allocation phase passes a vector of actions to the simulator (7). The simulator takes in the actions for each target and outputs the new states for each target back to the resource allocation phase (8), and the online loop (4 - 8) repeats until the final time step is reached. 3.2.2 Modelling The entire description of a battle scenario to be solved by this technique can be found in the information in two data input types: a target file and a world file. All data extracted from these files will be used in the online and offline portions of the model. Target Files The target file defines the costs, states, and probabilities associated with a given target type. A sample target file is shown in figure 3-4. The first element defined in a target file is the state set. 
Following the %states 41 %states Undamaged Damaged %cost 1 %rewards 0 100 0 0 transProbs .8 .2 0 1 %end Figure 3-4: A sample target file separator, every line is used to describe a different state. The order of the states is significant, as the first state listed will be so, the second si, and so on. After the states, the %cost separator is used to indicate that the next line will contain the cost for dropping one bomb. This is actually a cost multiplier, as dropping more than one bomb is just the number of bombs multiplied by this cost. The %rewards separator is next, and this marks the beginning of the reward matrix. The matrix size must be ISI x ISI, where ISI is the number of states that were listed previously. The matrix value at (i, j) represents the reward obtained from a transition from si to sj. Note that in this problem, it is assumed that targets never transition from a more damaged state to a less damaged state, and so a default reward value of zero is used. This makes the reward matrices upper triangular. However, this model allows for negative (or positive, if so desired) rewards for a "backward" transition, as may occur when targets are repairable. The transition matrix is defined next using the separator %transProbs.This is another ISI x ISI matrix, as before, where the value at index (i, j) represents the 42 Target tb te A B C D E F 3 1 14 15 3 7 12 5 23 20 21 12 Table 3.2: Sample beginning and ending times of targets probability of a transition from si to sj if one bomb is dropped. The sum of a row of probabilities must equal 1. Note that once again, for the definition of the problem in this research, there are no "backward" transitions, so this matrix is upper triangular. This model also allows for targets transitioning from a higher damage state to a lower one. The probability for a transition from si to sj given an action of dropping more than one bomb is given according to the previously defined damage model. The file is terminated with the %eind separator. World Files The world file defines the time horizon, the resources, and the type, horizon, and state of each individual target. A sample world file is shown in figure 3-5. The first definition in a world file is the time horizon, as indicated by the %horizon separator, followed by an integer representing the total "mission" time horizon. When a scenario is defined by a world file as having a horizon H, the scenario is divided into H + 1 time steps, from 0 to H. The next definition is the total available resources, as indicated by the %resources separator, followed by an integer representing Al. This is the total resource constraint for the mission. After the resources, the %targets separator is listed. After that, there are one or more four-element sets. Each of these sets represent a target in the scenario. The first of the four elements is the target's begin horizon, tb which ranges from 0 to H - 1. This is when the target comes into "view". The next element is the end horizon, te, which ranges from tb + 1 to H. Table 3.2 lists several targets with various individual 43 %horizon 25 %resources 50 %targets 3 12 Meuleau Undamaged 1 5 Meuleau Undamaged 14 23 Meuleau Undamaged 15 20 Meuleau Undamaged 3 21 Meuleau Undamaged 7 12 Meuleau Undamaged %end Figure 3-5: A sample world file 44 t-- F -- 1 E D H--C - H- B -- i H 0 -5 A I 10 15 20 25 Figure 3-6: Timeline of sample targets horizons. Figure 3-6 depicts these targets graphically on a timeline of H 25. 
On this timeline, at any current time step, te, any target whose individual window ends at or is strictly to the left of t, has already passed through the scenario and will not return. Any target whose individual window begins at or exists during t, is in view, and is available for attacking. Any target whose individual window is strictly to the right of t, will be available in the future for attacking, but cannot be attacked immediately. The third element is the target type. This is a pointer to a target type, so a target file of the same name must exist. The fourth element is the starting state of the target. The starting state must be a valid state. Currently, upon initialization, the application checks for a valid state then ignores this value. targets are defined to start in the first state listed, however, the flexibility exists in this implementation to start in different states. At the end of every four-element set, the world file is checked for the %end separator, which signifies the end of the file. Generating world files is accomplished through interaction with the user. Users are queried for the scenario parameters: a total allocation M, a time horizon H, the number of targets N, the number of types of targets, Y, and the target type names. Then it creates a world file with the appropriate M and H, and N targets, each of which will be one of the entered target types with probability 1/Y. In addition, the targets will be given windows with random begin and end times within the time horizon. 45 3.2.3 Offline-MDP Calculation Each target has been defined to have a reward matrix, transition probabilities, states, and so on. Thus, different targets will have different value structures. The purpose of the offline calculations is to solve an MDP for each target type, which, given an state, a time horizon, and an allocation, returns an expected value and an optimal action associated with the value. The problem as defined in this thesis has a finite time horizon, and the following value iteration equation applies for each target, i: Vi(si, t, m) = max 1 T(si, a, s') [Ri(si, s') + Vi (s', t + 1, m - a)] - cia (3.6) a<m sesi This equation computes the value of the target i as the probabilistically weighted sum of the immediate reward for a transition to a new state s' and the value for being in state s' in the next time step, with a fewer bombs allocated, minus the total cost of dropping a bombs. This value is maximized over an action of dropping 0 to m bombs. Each maximized value V (si, t, m) will have an associated optimal action, a. Equation 3.6 is solved by beginning in the final time step, t = H. The values for Vi(s', H, m - a) are zero, since there is no expected reward for being in the final time step, regardless of allocation or state. Thus the values of V at t = H - 1 can be calculated, then used to calculate the values of V at H - 2, and so on, until t = 0. The solution is stored as value and action pairs indexed by allocation, time, and state. A sample data file for a two-state target with a horizon of 15 and an allocation of 10 is shown in figure 3-7. 3.2.4 Online-Resource Allocation and Simulation The resource allocation algorithm used in this research is a greedy algorithm. This algorithm begins by assigning all targets 0 bombs. Then for each target, it calculates the marginal reward for adding one bomb to the target's allocation. 
3.2.4 Online-Resource Allocation and Simulation

The resource allocation algorithm used in this research is a greedy algorithm. This algorithm begins by assigning all targets 0 bombs. Then for each target, it calculates the marginal reward for adding one bomb to the target's allocation. It does this by looking up the value in the offline results corresponding to that target's time index t, state s, and m = 1 allocation. It then subtracts the offline value corresponding to that target's t, s, and m = 0. This is the marginal reward for changing a target's allocation from 0 bombs to 1 bomb. Once all targets have had their marginal rewards calculated, the greedy algorithm allocates a bomb to the target with the maximum marginal reward. It then calculates the marginal reward for adding another bomb to that target. If there is another bomb left, the target with the maximum marginal reward is allocated a bomb, and so on until either all bombs are allocated or the marginal reward for every target is 0. The marginal rewards for target i are calculated according to the following equation:

Δi(si, m, t) = Vi(si, m + 1, t) - Vi(si, m, t)    (3.7)

Thus, given a state si, a time step t, and an allocation m, the marginal reward equals the difference between the expected reward at the current allocation plus one bomb and the expected reward at the current allocation. The Vi(si, m, t) and Vi(si, m + 1, t) values are retrieved from the data calculated in the offline phase.

After this greedy algorithm is complete, each target will have a certain number of bombs allocated to it. In the data structures calculated in the offline phase, the actions are paired with a set of a time index, a state, and an allocation. Thus, given each target's t, s, and recently calculated m, the optimal action a is determined directly from the data structures and put into a vector of actions for the first time step. This vector is then passed to the simulator, and the sum of these bombing actions is subtracted from the remaining bombs. The simulator takes these actions and, based on the probabilities calculated in the damage model, assigns a new state s' ∈ S to each target i, then returns a vector of these new states. After this is done once, the resource allocation algorithm runs for the next time step, but this time the total available weapons counter is decreased by the sum of actions from the previous step and the target states are updated. This loop continues until the time step is equal to H.
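The greedy loop just described can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the offline values have been loaded into an array indexed per target by state, time index, and allocation, and all names are illustrative.

import java.util.PriorityQueue;

// Minimal sketch of the greedy allocation step built on equation 3.7.
// value[i] holds target i's offline table laid out as [state][time][allocation].
public class GreedyAllocator {
    /** Returns bombs[i], the number of bombs allocated to target i. */
    static int[] allocate(double[][][][] value, int[] state, int[] timeIndex, int bombsLeft) {
        int n = state.length;
        int[] bombs = new int[n];
        // Priority queue of (marginal reward, target index), largest marginal reward first.
        PriorityQueue<double[]> pq =
            new PriorityQueue<>((x, y) -> Double.compare(y[0], x[0]));
        for (int i = 0; i < n; i++) {
            pq.add(new double[]{marginal(value[i], state[i], timeIndex[i], 0), i});
        }
        while (bombsLeft > 0 && !pq.isEmpty()) {
            double[] top = pq.poll();
            if (top[0] <= 0.0) break;          // no target gains from another bomb
            int i = (int) top[1];
            bombs[i]++;
            bombsLeft--;
            // Re-insert the target with the marginal reward of its next bomb.
            pq.add(new double[]{marginal(value[i], state[i], timeIndex[i], bombs[i]), i});
        }
        return bombs;
    }

    /** Equation 3.7: V(s, m+1, t) - V(s, m, t). */
    static double marginal(double[][][] v, int s, int t, int m) {
        if (t < 0 || t >= v[s].length || m + 1 >= v[s][t].length) {
            return 0.0;                        // window closed or table exhausted
        }
        return v[s][t][m + 1] - v[s][t][m];
    }
}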
One problem lies in calculating the valid t for equation 3.7. Each target exists in an individual window, but the application only knows tb and te for each target, and the current time step, tc. What value should be used to index the offline values? There are three cases.

Target window type 1 is the simplest, corresponding to te ≤ tc for the target. In this case, the target already "existed" and has now disappeared. This could be the case if it was a moving target that has moved into and out of range. The marginal reward is zero for this target, as it will never be possible to damage it again. The greedy algorithm will not even consider these targets, since there is no benefit to allocating a bomb to them, and the actions associated with these targets will be to drop zero bombs.

Target window type 2 corresponds to tb ≤ tc < te. In this case, the target exists, as the time step falls in the target's individual window. The t value used is equal to H - (te - tc). To understand this, it is important to remember that Vi is calculated from the final time step. te - tc corresponds to the number of time steps left before the target's window closes. Thus, since the values in the data files were calculated for a target whose window was of length H, the proper value for t is the value for the same target type with the same number of time steps left in the window. Section 3.3.1 discusses the optimization implications of this implementation.

Target window type 3 is when tc < tb. In this case, the target has not "come into view" yet. This could happen if a target is moving towards the attackers. It would be very bad to do the same thing as in case 1, since no bombs would ever be allocated to the target. For example, say this target has a reward of destruction of 1,000,000 and another target, which is of type 2, has a reward of 10. If there is only one bomb, using the naïve "type 3 = type 1" method, the greedy algorithm would allocate that bomb to the type 2 target. It might then drop that bomb and have none left to allocate to this target. To avoid this problem, it is noted that for a fixed s and m, as t decreases in equation 3.6, the values are nondecreasing. This corresponds to the idea that it is worth more to have the same allocation in the same state if there is more time left in the target's window. Thus the time index for a type 3 target is t = H - te, or the total horizon minus the end time. The action for a type 3 target, however, is always to drop zero bombs, since the target "does not exist" in the current time step, and all bombs would by definition miss.

[Figure 3-8: A timeline depicting the three types of target windows]

Figure 3-8 shows the progression of the types of a target over a horizon. The target "exists" in the white part of the figure, when it is type 2. In this region, the target is considered for the allocation of bombs, and its action is dependent on the offline values. In the left gray area, the target does not exist yet, but will in the future. The target is still significant to the problem as type 3, as it still needs to be considered for weapon allocation. However, since the target does not exist yet, the action for a type 3 target will always be to drop 0 bombs. Conversely, a type 1 target in the right gray area no longer needs to be considered for weapon allocation. Again, since the target does not exist, the action will always be to drop 0 bombs.

For the boundary cases, at tb and te, the target changes to the next class, as is shown in the figure. Determining the time indexes for the target types is simple. All target indexes begin at t = H - te, and every time step, the index is incremented by one. When t = H, the target window has passed, and the target is effectively removed from resource allocation consideration.
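A minimal sketch of this indexing rule follows, using the quantities defined above (H, tb, te, tc); the names and the choice of -1 as a "skip this target" marker are illustrative, not the thesis code.

// Minimal sketch of the time-index rule for the three window types.
public final class WindowIndex {
    /**
     * Returns the time index used to look up offline values, or -1 for a
     * type 1 target (window already closed), which is skipped entirely.
     */
    static int timeIndex(int H, int tb, int te, int tc) {
        if (tc >= te) return -1;          // type 1: window has passed
        if (tc < tb)  return H - te;      // type 3: not in view yet
        return H - (te - tc);             // type 2: in view
    }

    /** A type 3 target may receive an allocation, but its action is always "drop 0 bombs". */
    static boolean canBeAttacked(int tb, int te, int tc) {
        return tc >= tb && tc < te;       // only type 2 targets are attacked now
    }
}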
3.3 Implementation Optimization

The first version of the totally observable resource allocation method took a great deal of time to calculate the offline values. Optimization was not a luxury, but rather a necessity to make the problem more tractable. It is possible to decrease the total amount of calculation by exploiting certain aspects of the MDP.

3.3.1 Reducing the Number of MDPs Calculated

As mentioned in section 3.2.4, the value of two different targets of the same target type is the same if they have the same time index, meaning only one MDP calculation is required for each target type. At first, one MDP was calculated for each target. This took a great deal of time and computation, yet since the only thing that changes in the calculation of Vi between these MDPs is the t in equation 3.6, the calculations were duplicated. In fact, the calculations need only be done once, for a maximum horizon, H. Once the maximum-horizon MDP has been calculated, the online phase just needs to select t properly. Making this change reduced the number of MDPs from the number of targets to the number of target types. Though this does increase the size of the MDPs, in most realistic world scenarios, the increase in computation caused by the increased number of time steps is much less than the computation time required to calculate an MDP for every target.

It would be possible to calculate the MDP only for the maximum target horizon and then use those MDP values in the online phase. This optimizes the solution method for one scenario, but a larger horizon can be selected for compatibility with future scenarios. If a horizon of 100 is calculated, then any problem with a maximum target horizon of 100 or less will already have been calculated.

3.3.2 Reducing the Computational Complexity of MDPs

In the MDP calculation, for a fixed time and state, the values converge at a certain allocation. As m increases, the values increase until they reach a maximum at an allocation of m*. The values can never go down as m increases because there is no cost for allocating another bomb, only for dropping it. Thus, if it is determined that the marginal reward for dropping another bomb is less than the cost of a bomb, the optimal action is to drop no additional bombs, which yields a marginal increase in reward of zero. At a certain point, the cost does exceed the marginal reward. At this point, Vi(si, t, m*) = Vi(si, t, m* + 1) = v. When this happens, no matter how many more bombs are allocated, the reward will never go up. Thus, instead of continuing the calculations for m* < m ≤ M, it saves a large amount of computational time if v is simply copied into the data file for the appropriate indexes.
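A minimal sketch of this cut-off follows, assuming the per-allocation values come from the maximization in equation 3.6; valueForAllocation is an illustrative stand-in for that computation, not part of the thesis code.

// Minimal sketch of the allocation cut-off: once V(s, t, m*) equals V(s, t, m* + 1),
// the value for every larger allocation is the same, so it is copied instead of recomputed.
final class AllocationCutoff {
    static void fillValues(double[] value, int maxAllocation,
                           java.util.function.IntToDoubleFunction valueForAllocation) {
        int m = 0;
        value[0] = valueForAllocation.applyAsDouble(0);
        while (m < maxAllocation) {
            double next = valueForAllocation.applyAsDouble(m + 1);
            value[m + 1] = next;
            if (next == value[m]) {               // converged at m* = m
                for (int k = m + 2; k <= maxAllocation; k++) {
                    value[k] = next;              // copy v for all remaining allocations
                }
                return;
            }
            m++;
        }
    }
}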
3.4 Implementation Flexibility

This solution model is designed to be flexible, so that it can be extended to the partially observable case, and modular, so that the same code can be re-used. To this end, several components have been designed to be computed independently from the working model code. These components can be changed for different problems and different research.

The "noisy-or" damage model is adequate for this problem, but other damage models may make more sense in different applications. It is easy to conjecture a scenario where two bombs that would individually make a state transition from undamaged to partially damaged might, when considered together, make the target transition to destroyed. Because of this, the damage model is one of the flexible modules.

The greedy algorithm is not the optimal solution. We chose it because our goal is to keep almost everything the same, but change the problem from totally observable to partially observable. However, the resource allocation code is designed so that a different solution algorithm can easily be implemented.

One benefit of having the online and offline parts completely separate is that the offline part can be done once, then the online part can be run over and over to get averaged experimental results. Another benefit of this model is that battle planners can do complete calculations for different target types before a mission, creating a database, then use the appropriate values from this database in future missions. Since most of the calculation is done offline, the computational cost of the database calculation is separated from the mission planning. Thus, with a fully instantiated database, a battle scenario needs only to be modelled as a new world file, and the online computational costs will be negligible compared to the database construction. This will allow for faster response to dynamic situations on the battlefield.

3.5 Experimental Comparison

The purpose of creating a working implementation of the research in Meuleau et al. is to have something on which to base the partially observable approach. To this end, this section compares the results from the implementation created in this research with the results from the paper. All software in this thesis was implemented in Java on a 266 MHz computer with 192 MB of memory, running RedHat Linux 7.0.

Meuleau et al.'s paper shows the offline results for an MDP for a single-target problem. The target has two states, S = {undamaged, damaged}, an individual target horizon of 10, a single-bomb hit probability p of 0.25, a reward of 90 for destruction, and a bomb cost of 1. Figure 3-9 from their paper depicts the optimal number of weapons to send for each time step, given that the target is undamaged. The plot is monotonically increasing. This makes sense, since if the target will be around a long time, it is best to spread out the attacks. Given an extended period of time, the optimal policy would be to drop a bomb and determine whether it damaged the target; if it did not, drop another and check again. Spacing out the attacks prevents the waste of bombs. However, once it gets closer to the end of the horizon, it becomes more important to make sure that the target is destroyed, so more bombs should be used. If at any time the target is damaged, the optimal action is, of course, to drop no bombs, since there will never be a positive reward for doing so.
[Table 3.3: Optimal actions across time and allocation]

[Figure 3-9: Meuleau et al.'s graph of optimal actions across time]

This problem was run using the implementation presented in this chapter, and table 3.3 shows the results. A major difference between the data from this trial and the data presented in the paper is that this experiment was run over several different allocations. The data is displayed graphically in figure 3-10.

[Figure 3-10: Optimal actions across time and allocation]

The graph shows that, regardless of the allocation, the policy says to drop fewer bombs at first, then more bombs as the target gets closer to the end of its horizon. This coincides with both common sense and with the data from Meuleau et al.'s paper. The other result obtained is that the optimal policy at allocation 15 and above is the same as that in the paper. The graph shows that the policy converges to the infinite-resource optimal policy at this allocation. In addition, it is clear that 11 bombs is the maximum action that will be taken, regardless of how many bombs are available to drop. At this point, according to the damage model, the marginal reward for dropping one more bomb is less than the cost of that bomb. The expected reward E[r11] for dropping 11 bombs is

E[r11] = R(undamaged, damaged) (1 - q^11) = 90 (1 - 0.75^11) ≈ 86.1988,

and by the same equation, the expected reward E[r12] for dropping 12 bombs is 87.1491. The difference between these two values is 0.9503, which is less than the cost of 1 that an extra bomb would incur. Based on this trial and thorough checking, the completely observable implementation as described in this chapter works and produces the expected results.

Chapter 4

Dynamic Partially Observable Implementation

The previous chapter defined and discussed an approach towards, and implementation of, a completely observable resource allocation problem. Though this approach works well in ideal situations with perfect sensors, it is not very realistic. This chapter will discuss the extensions to the previously defined model, describe the implementation, and analyze the approach in relation to the completely observable case as well as in other interesting cases.

The problem that this chapter addresses is again one in which there are several targets to bomb and each is damaged independently. Similarly to the previous chapter, the problem could be modelled as a POMDP with an extremely large state space. However, even MDPs grow too fast to solve a single one that describes the entire problem, and POMDPs grow much faster than MDPs. Thus, using a single POMDP to solve the entire problem is unrealistic. Therefore, in this model, an individual POMDP for each target is computed offline and the solutions are integrated online. The online process is to make the overall allocation of weapons to targets and send out a round of weapons, much like the previous chapter.
But now it is necessary to get observations of which targets were destroyed, instead of knowing exactly which ones were. The beliefs about the states of these targets are recomputed and weapons are reallocated. Then either a round of weapons or an observation mission is sent out for each target, and this online process repeats until the mission is over.

4.1 Additions to the Completely Observable Model

The completely observable model used MDPs, which did not address incomplete state information. To make the change to a partially observable problem scenario, we must shift to using POMDPs. There are other required changes associated with this shift, including belief states, observation models, and a state estimator.

4.1.1 POMDPs

Equation 3.6 for calculating expected values no longer applies. In the completely observable case, the state space is discrete, so it is feasible to calculate all possible values based on all previous states weighted by their relative probabilities. In the partially observable case, however, the decision must be made over belief states, which form a continuous space. This continuous space causes the problems to grow much faster than in the completely observable case. Therefore, enumerating all possible values becomes intractable for anything other than very small problems.

This can easily be understood by considering a completely observable example with a simple two-state target which is either alive or dead. It is assumed that a bomb has a hit probability of 0.5 and a cost of 1, and for simplicity A = {drop 1, drop 0}. The reward for destroying the target is 10. Calculating the values for this trivial problem is simple. First, the immediate rewards can be calculated. For dropping one bomb, the immediate reward for s' = dead is the reward multiplied by the probability of destruction, minus the cost of the action: (0.5 x 10) - 1 = 4. The immediate reward for s' = alive is -1. For dropping zero bombs, this value is zero. These immediate rewards apply for all time steps. In the final time step, H, all rewards are zero. It is simple to use the value iteration equation 3.6 to calculate all possible values for each s and s' in t = H - 1, then use these results to calculate all possible values for t = H - 2, and so on.

However, once the effective state space becomes continuous, it is no longer simple to enumerate all possibilities. What if dropping a bomb hits a target with probability 0.5, but there is only a 30% chance of recognizing that the target is destroyed? The value iteration equation does not deal with this extra information. The solution is to use a POMDP formulation of the problem, for which several algorithms exist that can feasibly solve the partially observable problem.

4.1.2 Strike Actions vs. Sensor Actions

In the MDP case, all actions affected the actual state of the model. Each action had a transition model (also called the damage model) based on the initial transition matrix T. These actions are called strike actions. In the completely observable case, strike actions had an implicit observation model O which produced an actual state observation z = s with probability 1. However, in the partially observable model, all strike actions are defined to produce no information about the target. Therefore, a new class of actions, sensor actions, must be introduced. These actions are defined to have no effect on the state of a target, but instead they return information about the state of the target.
In a POMDP in general, an action will both have an effect on the state of a target and produce an observation of the state, but for simplicity in this model, this will not be the case. In this model, every action is either a strike action or a sensor action.

4.1.3 Belief State and State Estimator

Now that the shift has been made from complete to partial observability, the output from the simulator is no longer a definitive state. Instead, it outputs an observation based on the action taken and the actual state of the target. Therefore, at every time step, for every target, it is necessary to calculate a new belief state b. Basic probabilistic manipulation yields the equations necessary for this update, which will be covered in section 4.3.6.

At every time step, a new belief state must be created for each target, dependent on the previous belief state, the action taken, and the observation received, as shown in figure 4-1. The observation refers to the observation z received from the simulator after the action performed, a. The action transition model has an a priori probability that the target will change state, and the state estimator combines this information with the observation to update the prior belief state.

[Figure 4-1: The general state estimator]

The state estimator is split into two cases based on the above assumption that strike actions and sensor actions are mutually exclusive. For strike actions, the new belief state only depends on the previous belief state and the action taken, as seen in figure 4-2. Since a strike action yields no information from the simulator, the observation arc from figure 4-1 can be eliminated. Conversely, for sensor actions, the new belief state only depends on the previous belief state and the observation received, as seen in figure 4-3. This is because sensor actions never cause a target to change state, so the action arc can be eliminated from figure 4-1. These two cases are proved in section 4.3.6.

[Figure 4-2: The state estimator for a strike action]

[Figure 4-3: The state estimator for a sensor action]

4.2 The Partially Observable Approach

One way to solve the dynamic resource allocation problem would be to combine Meuleau et al.'s Markov Task Decomposition with Castanon and Yost's use of POMDPs. To do this, we will decouple each target, as in MTD, solving a POMDP for each target type. The results from these POMDPs will be used to determine an allocation and then an optimal action for each target. Then these actions will be executed, and the observations from each action will be used for future resource allocation. However, resource constraints must be applied to recouple the targets. Yost uses an LP solver to handle resource constraints, and Meuleau et al. use the value iteration equation, which takes into account that performing action a reduces the total resources by a bombs.

We used code developed by Cassandra [1] to solve POMDPs. The code takes in a POMDP definition input file and produces H output files of parsimonious vector sets, as described in appendix A. However, this code does not explicitly handle resource constraints.

4.2.1 Resource Constraint Problem

The problem with this "simple combination" approach is that the POMDP solver code was not designed to deal with consumable resources.
The only thing that prevents the policy from taking the most expensive actions is their cost, which means that every action is considered to be viable at every time step. This can lead to resource constraint violations in the resource allocation problem presented in this thesis. For example, if there are 10 bombs total for a mission with one target with a time horizon of 3, Cassandra's POMDP algorithm could say that the best action at every time step is to drop 10 bombs. This is likely if the cost of dropping bombs is low and the reward for destruction is high. However, this would mean that the policy says to drop 30 bombs total, which is a violation of the 10-bomb global resource constraint.

The solution to this problem is to increase the state space of the POMDP to describe not only the actual state of the target, but also the number of bombs allocated to the target. With careful selection of transition and reward matrices, the POMDP solver code will actually consider the allocation as part of the problem, and deterministically shift the allocation space while probabilistically calculating the real state space. This increases the size of the state space of the problem from |S| to |S| x (M + 1). That is, the number of states is now the state set crossed with the total number of resources allocated, including the zero allocation. For example, for a target with two states, alive and dead, and two total bombs allocated to it, the associated POMDP would have six states, S = {2alive, 2dead, 1alive, 1dead, 0alive, 0dead}.

There is a tradeoff associated with this approach, and that is the size (number of coefficients) of the vectors in the POMDP. The maximum number of vectors, |Γt|, in time step t is [9]:

|Γt| = |A| |Γ*t-1|^|Z|    (4.1)

Of course, many vectors in this set will be pruned out, resulting in a parsimonious set that is most likely much smaller than this worst case. Equation 4.1 shows that the number of vectors grows linearly with |A| and exponentially with |Z|, but not at all with |S|. This is not to say that increasing the size of the vectors does not increase the number of calculations required to solve a POMDP, as it increases the dimensional space of the vectors, but the algorithms to solve POMDPs work to prune vectors. Thus, since this approach does not add to the number of vectors, it has a minimal impact on solving for a parsimonious set, and thus does not affect the asymptotic running time of the incremental pruning solution algorithm.

The transition probabilities must be updated as well. Whereas before it was easy to list the hit probability of dropping one bomb in a two-state POMDP as, for example, T(alive, drop 1, dead) = 0.3, it becomes more complicated now. For every action from dropping zero to dropping M bombs, a transition probability must be declared, which is T(m alive, drop a, (m-a) dead) = 0.3, T(m alive, drop a, (m-a) alive) = 0.7, and so on. The transition model expansion is shown in figure 4-4.

[Figure 4-4: Expanded transition model for M = 3]

The white set of probabilities represents dropping 0 bombs, the light gray set represents dropping 1 bomb, the medium gray set represents dropping 2 bombs, and the dark gray set represents dropping 3 bombs. When an action is executed, all transition probabilities that are not in that action's set are set to zero. Consider the target starting in state 3alive, on which the action drop 2 is performed. The transition probabilities from state 3alive to all states but those defined in the medium gray box (1alive, 1dead) are set to zero. Thus, T(3alive, drop 2, 1alive) = 0.4 and T(3alive, drop 2, 1dead) = 0.6. This increases the number of declarations necessary in the input file, but the actual state transition probabilities have not changed.
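A minimal sketch of this expansion follows, assuming the damage model supplies baseT[a][s][s'] for dropping a bombs; the state indexing convention (allocation bin times |S| plus physical state) and all names are illustrative rather than the thesis code.

// Minimal sketch of the expanded transition model: base states are crossed with the
// remaining allocation, and a strike action of a bombs shifts the allocation from m
// to m - a deterministically while the physical state follows the damage model.
public class ExpandedModel {
    /**
     * baseT[a][s][s'] : damage-model transition probabilities for dropping a bombs.
     * Returns T[(m,s)][a][(m',s')] over the expanded state space of size |S| * (M + 1).
     * Expanded state index = m * numStates + s.
     */
    static double[][][] expand(double[][][] baseT, int maxAllocation) {
        int numStates = baseT[0].length;
        int expanded = numStates * (maxAllocation + 1);
        int numActions = maxAllocation + 1;          // drop 0 .. drop M
        double[][][] T = new double[expanded][numActions][expanded];
        for (int m = 0; m <= maxAllocation; m++) {
            for (int s = 0; s < numStates; s++) {
                int from = m * numStates + s;
                for (int a = 0; a <= maxAllocation; a++) {
                    if (a > m) continue;             // impossible action: see the limbo state below
                    for (int s2 = 0; s2 < numStates; s2++) {
                        int to = (m - a) * numStates + s2;
                        T[from][a][to] = baseT[a][s][s2];
                    }
                }
            }
        }
        return T;
    }
}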
4.2.2 Impossible Action Problem

Treating weapon allocation as part of a target's state creates another problem. Since every action is still allowable in every state, what happens if the state dictates that there are 5 bombs available for dropping, but the action to be considered is to drop 6 or more bombs? There are a couple of ways of addressing these impossible actions. The first is to transition to the same state with no reward. The problem with this is that a POMDP solves for the future by taking the immediate reward for that action given that state and adding it to the probabilistic future rewards for that target. However, it is likely that the future reward from that particular state is nonzero, assuming that the cost of dropping bombs does not always exceed the probabilistic reward for destroying a target. This means that the impossible action could be considered in a policy, which produces undesired results. At best, it adds possible vectors to be considered at each step, which grows the problem exponentially. At worst, it causes the policy to list an impossible action as the optimal choice.

Another way of addressing the problem is to transition to the zero-allocation, completely undamaged state, which, if defined properly, should have zero future reward. However, there are times when a transition to this state is valid, such as dropping as many bombs as are allocated and missing with all of them. This can lead to confusion, and is not a very consistent model.

To solve this problem, a new limbo state is introduced. Any time an impossible action is performed, the target transitions to this state, as can be seen in figure 4-5.

[Figure 4-5: Expanded transition model for M = 3 with the limbo state]

When an impossible action is attempted, the transition model only considers the black transition set. The limbo state is an absorbing state, as any action taken while in this state transitions the target back to limbo. An impossible action is defined to have a cost of c·a, where c is the cost per bomb and a is the number of bombs dropped. All rewards from this state are 0, and all observations from this state are nothing (discussed in the next section). Thus the POMDP solver will eliminate the vectors associated with these actions in each epoch, since the immediate reward is negative, the future reward is zero, and the belief state does not change. Not only does this solution save computation, but it also maintains resource constraints.
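Building on the expansion sketch above, the following is a minimal illustration of routing impossible actions to the limbo state; the index chosen for limbo and all names are illustrative assumptions.

// Minimal sketch of adding the limbo state: impossible actions (a > m) are routed
// to a single absorbing state with no reward and no information. The limbo state
// takes index |S| * (M + 1), the last slot in the expanded space.
final class LimboExpansion {
    static double[][][] expandWithLimbo(double[][][] baseT, int maxAllocation) {
        int numStates = baseT[0].length;
        int limbo = numStates * (maxAllocation + 1);
        int size = limbo + 1;
        int numActions = maxAllocation + 1;
        double[][][] T = new double[size][numActions][size];
        for (int m = 0; m <= maxAllocation; m++) {
            for (int s = 0; s < numStates; s++) {
                int from = m * numStates + s;
                for (int a = 0; a <= maxAllocation; a++) {
                    if (a > m) {
                        T[from][a][limbo] = 1.0;            // impossible action: go to limbo
                    } else {
                        for (int s2 = 0; s2 < numStates; s2++) {
                            T[from][a][(m - a) * numStates + s2] = baseT[a][s][s2];
                        }
                    }
                }
            }
        }
        for (int a = 0; a < numActions; a++) {
            T[limbo][a][limbo] = 1.0;                       // limbo is absorbing
        }
        return T;
    }
}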
4.2.3 Sensor Actions

In the MDP case, the state transition matrix for a strike action was nontrivial, whereas the implicit observation matrix for a strike action was the identity matrix. Thus, after every strike action, the state was determined, since the appropriate observation for the actual state resulted. However, this is unrealistic. In real-world scenarios, a pilot may see nothing but smoke from a dropped bomb. This research assumes that a strike action provides no observation. This can either be modelled as a nothing observation or as a uniform distribution across all observations, as the change in the belief state is independent of the observation in both cases. The mathematical reasoning behind this statement is discussed in section 4.3.6.

However, there must be a way to receive information about a target, given that none is gleaned from strike actions. Perhaps the commander of a military unit may send a UAV or a quick manned aircraft with powerful sensors to do reconnaissance. To model this, a new observation class of actions must be created. These actions are defined to have a nontrivial observation probability matrix, where the observation obtained is based on the new state of the target and the action taken, whereas the transition matrix is the identity matrix. Thus, any observation action does not change the state of the target, only the target's belief state.

Sensor actions have a cost associated with the action, but there will never be an immediate reward, since a reward is gained on a transition, and the sensor-class actions have an identity transition matrix. Conversely, strike actions have a cost equal to the cost of a bomb on the target multiplied by the number of bombs dropped, and may also have an immediate reward if the transition model so dictates.

4.2.4 Belief States

For the simple two-state "alive-dead" POMDP, the belief state is simply [b(alive) b(dead)], but now it must be expanded to include allocation. However, it is important to remember that the allocation is deterministic. Thus the nontrivial portion of the belief state has not changed in size; only the total size of the belief vector has. The belief state will be filled in on either side by anywhere from 0 to M "bins" of zero-belief probabilities. Each bin corresponds to a distribution across states for a given allocation. The convention used in this thesis is that the leftmost bin in the belief state is the maximum allocation, and the rightmost bin is the zero allocation. A generic belief state would be:

[ <bin M> <bin M-1> ... <bin 1> <bin 0> ]

To better understand this representation, consider a simple two-state alive-dead POMDP with a maximum allocation of six bombs. Initially, assuming the target is known to be alive, the belief state is:

[ <1 0> <0 0> <0 0> <0 0> <0 0> <0 0> <0 0> ]

The < and > are added for ease of understanding, but do not actually exist in the representation of the belief state. Assume the first action is to drop one bomb. Let Pat be the probability that the target is alive at time step t and Pdt be the probability that the target is dead. At time step 1, the belief state is:

[ <0 0> <Pa1 Pd1> <0 0> <0 0> <0 0> <0 0> <0 0> ]

Assuming the second action is to drop two bombs, the updated belief state is shifted two bins to the right, corresponding to having three bombs left:

[ <0 0> <0 0> <0 0> <Pa2 Pd2> <0 0> <0 0> <0 0> ]

Finally, assume the next action is to drop three bombs. This shifts the nonzero bin three places to the right, putting it in the zero-bomb bin:

[ <0 0> <0 0> <0 0> <0 0> <0 0> <0 0> <Pa3 Pd3> ]

It is clear that the values within the bins change probabilistically, but the movement across bins is deterministic. This is important, because it means the problem has not really changed from a simple two-state POMDP. The extra states that have been added simply serve to provide information to the implementation so that resource constraints will be honored.
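A minimal sketch of this binned representation and its deterministic shift follows; the names are illustrative, and the probabilistic update of the values inside a bin is handled separately by the state estimator described later.

// Minimal sketch of the binned belief vector: M + 1 bins of |S| entries each,
// with the leftmost bin corresponding to the maximum allocation.
public class BinnedBelief {
    final int numStates;
    final int maxAllocation;
    final double[] b;                 // length |S| * (M + 1)

    BinnedBelief(int numStates, int maxAllocation) {
        this.numStates = numStates;
        this.maxAllocation = maxAllocation;
        this.b = new double[numStates * (maxAllocation + 1)];
    }

    /** Index of the first entry of the bin holding allocation m (leftmost bin = allocation M). */
    int binStart(int m) {
        return (maxAllocation - m) * numStates;
    }

    /** Deterministically move the belief mass from bin m to bin m - a after dropping a bombs. */
    void shift(int m, int a) {
        int from = binStart(m);
        int to = binStart(m - a);
        for (int s = 0; s < numStates; s++) {
            b[to + s] = b[from + s];
            if (to != from) b[from + s] = 0.0;
        }
    }
}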
4.3 Implementation

So far we have presented a high-level description of the partially observable approach. This section describes the approach in more detail, discussing the architecture, modelling, and offline/online computation. Mathematical detail is provided where applicable.

4.3.1 Architecture

The architecture for the partially observable approach, shown in figure 4-6, is similar to that of the completely observable approach, but a few additions are necessary. The target and world input files now include new information required for partially observable models. The offline data structure is now a directory structure in which every POMDP solved, uniquely identified by its target type, time horizon, and allocation, has its own .alpha file (Cassandra's vector output file) for each epoch.

[Figure 4-6: The partially observable implementation architecture]

The online phase now requires a state estimator to determine a belief state from the observations received from the simulator. The estimator combines an observation, an action, and a belief state from the previous time step, and yields a new belief state. This estimated state is passed to the resource allocation module, which uses it to get the optimal allocation and actions for each target from the data files. The action for each target is then sent to the simulator. The resource allocation module now has to calculate the optimal allocations and actions using belief states instead of merely looking them up in a preprocessed data file, as in the completely observable approach. The simulator takes in a vector of actions and now outputs an observation vector instead of a state vector.

4.3.2 Modelling

The target modelling files for the MDP are augmented to address a POMDP implementation. A sample target file for this approach is shown in figure 4-7. Immediately after the %states section, a new observations section has been added. This section begins with %observations, and every subsequent line contains the name of an observation. These names make up the observation set Z. The nothing observation is not included in the target file, but is hard-coded, since that observation is common to all targets.

The next section, beginning with the %obsactions separator, is used to describe the names and the observation probabilities of the sensor actions. The first line after the separator is an integer, ao, representing the number of sensor actions. After that are ao sets of three elements: sensor name, cost, and observation matrix. The first element, the name, is added to the set A. After the initialization is done, A will contain M + 1 + ao actions. The next element is the cost, co,i, of sensor action i. Since transitions do not happen with sensor actions, the expected immediate reward for this particular sensor action is -co,i with probability 1. The final element, the observation matrix Oi, is an |S| x |Z| matrix, where the rows represent states a target has just transitioned to, and the columns represent observations.
%states
Undamaged
Damaged
%observations
Undamaged-obs
Damaged-obs
%obsactions
1
look 1
.8 .2
.2 .8
%cost
4
%rewards
0 20
0 0
%transProbs
.5 .5
0 1
%end

Figure 4-7: A sample partially observable target file

As an example, the probability of this target receiving observation number 3, given that it is in state 1 and sensor action 2 has been taken, is O2(1, 3). The remaining elements of the target file are loaded as before. The POMDP requires a new set of observation probabilities and observation actions, and the target file modelling now provides these.

4.3.3 Offline-POMDP Calculations

The partially observable resource allocation code is in Java, and the POMDP solver code is in C. They communicate through a text file. The use of text files to model the world allows the online and offline phases to be run separately, to get a series of online results from one offline computation.

The offline phase creates, for each target type, an input file for the POMDP solver. From the target type file, S, A, and Z are extracted and placed in the header part of the file, adding a limbo state. Then for each element in T, an appropriate transition entry for all non-impossible actions is created, while impossible actions are set to transition to the limbo state. This process is repeated for the observation matrices. A reward entry for every action is created, dependent on the expected reward for a transition minus the number of bombs dropped. The impossible actions are listed as well, as these actions will have a negative expected value and will be pruned out in the POMDP solution. Something interesting to note is that even though the limbo state has been added to the state space, its value in the solution file is actually never used online. Offline, the limbo state forces the POMDP solver to prune out vectors which transition to it. Since it is always zero by definition, it can be ignored in the online resource allocation section.

Once a file is created for each target type, the POMDP solver is run. The command line execution is:

pomdp-solve-4.0/pomdp-solve -save-all -horizon <H> -epsilon 1E-9 -o POMDPSols/<target type>_allocation<M>_horizon<H>-E-9/solution -P <target type>.POMDP

This command runs the POMDP solver code, saving all epochs, with horizon H and epsilon 10^-9. The input file has been saved previously as <target type>.POMDP. It is not necessary to specify incremental pruning as the POMDP solution algorithm, since Cassandra's code uses this algorithm by default. The .alpha files will be saved into the directory POMDPSols, in a subdirectory named by the target type, allocation, horizon, and epsilon, and will have a file prefix of solution. These text files are for future use by the online phase.

4.3.4 Online-Resource Allocation

Once the offline phase has been completed, the online phase begins. The online execution loop begins with the greedy algorithm, which calculates optimal allocations for all targets. Then the policy mapper determines the optimal actions for all targets based on their recently computed allocations. These actions affect the targets in the world. Finally, a state estimator takes the observations from the world, updates each target's belief state, then passes these belief states back to the greedy algorithm for the next time step.

Each target's belief state is initialized to the first actual state, allocation zero, with probability one. Thus, all bins but the zero-allocation bin are filled with |S| zeros, and the zero-allocation bin has a 1 in b(s0) and zeros for all other b(s).
In addition, a resources-remaining counter, w, is initialized to M. Subsequent iterations of the resource allocation phase will use an updated counter and also an updated belief state for each target from the state estimator, so this is the only time the nonzero portion of the belief state will be set to [ 1 0 ... 0 ].

For bombs 0 through w, the marginal reward for allocating that bomb is calculated for each target. To do this, it is first necessary to determine the time index of the target, t, which is the number of remaining time steps that the target will "exist". This is calculated by subtracting the current time step tc from the target's end time, te. If t <= 0, then the target's window has closed, and no bombs will be allocated to this target. If it is strictly positive, then the marginal reward can be extracted from the offline values.

For each target i, determining the marginal reward rm for allocating an additional bomb when m bombs are allocated is done by iterating across all vectors in the .alpha<ti> file. We define bm to be the belief state for target i in which the nontrivial allocation bin is m; on the initialization step, m is zero. This belief state is copied as belief state bm+1, which is the same belief state with an extra bomb allocated. Now, for each vector γ, the maximum expected value of having m bombs allocated is calculated by taking the dot product of γ and the bm belief state and selecting the maximum value. Similarly, the maximum expected value of having m + 1 bombs is calculated by taking the maximum of the dot products of every vector and bm+1. The marginal reward for allocating a bomb to this target is the difference between these two values:

rm = max_{γ ∈ Γ*} [γ · bm+1] - max_{γ ∈ Γ*} [γ · bm]

The calculation of the time index and marginal reward is performed once for each target. The resource allocation algorithm then selects the maximum of these values and assigns one bomb to that target. In this case, the nonzero portion of the belief state for this target shifts to the next higher bin. For example, the nonzero portion of the belief state for a target that is allocated its first bomb will move from the zero-allocation bin to the one-allocation bin. Also, assuming there are more bombs to be allocated, the new marginal reward for one more bomb will be calculated for this target with the new belief state and can be compared to all the other targets' marginal rewards, which have not changed. The resource allocation algorithm then repeats these steps until all bombs are allocated, or until all marginal rewards are zero.

After the resource allocation phase, the next step is to determine the appropriate actions corresponding to these allocations. This comes from the same .alpha<ti> file for every target as before. Since the greedy algorithm has shifted every target's nonzero belief state to the appropriate allocation bin, the optimal action ai is the action corresponding to the maximum value of the dot product of each vector γ and the belief state bi. The index of the vector with the maximum dot product is

γmax = argmax_{γ ∈ Γ*} [γ · bi],

and ai is set to the action associated with the γmax vector in the .alpha<ti> file. A vector of target actions, A, is created. Finally, the resource counter is decreased by the sum of the bombs dropped in the strike actions.
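A minimal sketch of how the online phase reads values and actions off a parsimonious vector set follows. AlphaVector is an illustrative container rather than Cassandra's file format, and the belief vectors are assumed to be in the binned form described earlier.

// Minimal sketch of policy mapping: the optimal value of a belief state is the
// maximum dot product over all alpha vectors, and the optimal action is the
// action attached to the maximizing vector.
import java.util.List;

public class PolicyMapper {
    static class AlphaVector {
        final double[] coeffs;   // one coefficient per expanded state
        final int action;        // action index stored with the vector
        AlphaVector(double[] coeffs, int action) { this.coeffs = coeffs; this.action = action; }
    }

    /** Returns the action of the vector maximizing gamma . b. */
    static int bestAction(List<AlphaVector> vectors, double[] belief) {
        double best = Double.NEGATIVE_INFINITY;
        int bestAction = 0;
        for (AlphaVector gamma : vectors) {
            double dot = 0.0;
            for (int s = 0; s < belief.length; s++) dot += gamma.coeffs[s] * belief[s];
            if (dot > best) { best = dot; bestAction = gamma.action; }
        }
        return bestAction;
    }

    /** Marginal reward of one more bomb: max dot product with b(m+1) minus max with b(m). */
    static double marginalReward(List<AlphaVector> vectors, double[] beliefM, double[] beliefMPlus1) {
        return bestValue(vectors, beliefMPlus1) - bestValue(vectors, beliefM);
    }

    static double bestValue(List<AlphaVector> vectors, double[] belief) {
        double best = Double.NEGATIVE_INFINITY;
        for (AlphaVector gamma : vectors) {
            double dot = 0.0;
            for (int s = 0; s < belief.length; s++) dot += gamma.coeffs[s] * belief[s];
            if (dot > best) best = dot;
        }
        return best;
    }
}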
4.3.5 Online-Simulator

The simulator is run using the action vector. For every target, the simulator updates the state of the target using the transition probabilities for that target's state and action, T(s, a, s') for all s'. Once the states have been updated, the simulator returns a vector of observations, Z. The observation returned for a given target is dependent on the action class. For a sensor action ao, the observation returned depends on the observation probabilities for that target's new state and action: O(s', ao, z). For a strike action, however, a uniform distribution across Z, excluding the nothing observation, is returned. This distribution corresponds to the nothing observation in the POMDP solver, as it does not provide any information as to the state of the target. This statement will be proved in the next section.

4.3.6 Online-State Estimator

The next step in the online phase is to update the belief states. For ease of understanding, this section focuses on the nonzero bin of the original belief state. For example, if a target is allocated two bombs, the 2-allocation bin will be nonzero, and this section discusses the changes in that bin. The state estimator only cares about the nonzero bin because it uses probabilistic rules to change the belief that the target is in a particular physical state, whereas after the action is taken, the nonzero probabilities shift to the zero-allocation bin with probability 1.

The belief state b is made up of |S| probabilities [b(s1) b(s2) ... b(s|S|)]. Given the above definition of the belief state, the updated belief state for a target, b', given an action a and an observation z, is provided by Bayes' theorem:

b'(s') = O(a, s', z) Σ_s T(s, a, s') b(s) / Σ_{s''} O(a, s'', z) Σ_s T(s, a, s'') b(s),    (4.2)

where O(a, s'', z) = Pr(z | s'', a). Intuitively, this makes sense. The probability that the target is in a given state s' is a function of the sum of the previous beliefs that the target was in some state s and transitioned to s' given a, multiplied by the probability of making the observation z given s' and a, and normalized over all possibilities from s to s''.

The previous section stated that a nothing observation is equivalent to a uniform distribution over all other observations. This statement can now be proved. A nothing observation occurs with probability 1 for all strike actions, regardless of state. Therefore, for a strike action a, equation 4.2 reduces to

b'_{a=strike, z=nothing}(s') = (1) Σ_s T(s, a, s') b(s) / Σ_{s''} (1) Σ_s T(s, a, s'') b(s),

or

b'_{a=strike, z=nothing}(s') = Σ_s T(s, a, s') b(s) / Σ_{s,s''} T(s, a, s'') b(s).    (4.3)

This is for an observation of nothing. Since no other observations can occur, the belief state for other observations need not be calculated. If every strike action produces the same observation, the updated belief state depends only on the previous belief state and the transition probabilities.

But the claim was that a nothing observation is equivalent to setting the probability across the other observations to be uniform. Let pz = 1/|Z'|, where Z' is the observation set without the nothing observation. Now equation 4.2 reduces to

b'_{a=strike, z=uniform}(s') = pz Σ_s T(s, a, s') b(s) / (pz Σ_{s,s''} T(s, a, s'') b(s)) = Σ_s T(s, a, s') b(s) / Σ_{s,s''} T(s, a, s'') b(s),

which is the same as equation 4.3.
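A minimal sketch of a state estimator implementing the general update of equation 4.2 follows; it covers both the strike-action case just derived (no information) and the sensor-action case discussed next. All names are illustrative, and the sketch assumes the observation z has nonzero probability under the model.

// Minimal sketch of the belief update in equation 4.2, applied to the nonzero bin.
// T[s][a][s'] and O[a][s'][z] are the target's transition and observation models.
public class StateEstimator {
    static double[] update(double[] b, int a, int z,
                           double[][][] T, double[][][] O, boolean strikeAction) {
        int n = b.length;
        double[] updated = new double[n];
        double norm = 0.0;
        for (int s2 = 0; s2 < n; s2++) {
            double predicted = 0.0;
            for (int s = 0; s < n; s++) {
                predicted += T[s][a][s2] * b[s];           // sum_s T(s, a, s') b(s)
            }
            double obs = strikeAction ? 1.0 : O[a][s2][z]; // strike: no information (equation 4.3)
            updated[s2] = obs * predicted;
            norm += updated[s2];
        }
        for (int s2 = 0; s2 < n; s2++) {
            updated[s2] /= norm;                           // normalize as in equation 4.2
        }
        return updated;
    }
}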
We now discuss sensor-action belief state updates. If a sensor action is taken, no state transition occurs, which can be modelled as an identity transition matrix. Thus, T(s, ao, s') = 1 if s = s', and 0 otherwise. Therefore,

Σ_s T(s, ao, s') b(s) = (1) b(s') + (0) Σ_{s ≠ s'} b(s) = b(s'),

and equation 4.2 can be simplified to

b'_{a=sensor}(s') = O(a, s', z) b(s') / Σ_{s''} O(a, s'', z) b(s'').

Similarly to the strike-action case, the sensor-action update depends only on the previous belief state and the observation probabilities for a given observation and action across all states.

After every simulation step, the state estimator updates the belief states for every target and passes these belief states to the resource allocation module. This loop continues until the final time step is reached, at which point the application returns the actual state of each target.

4.4 Implementation Optimization

Because the computational time of solving POMDPs can grow exponentially, the offline phase can be extremely computationally intensive. Equation 4.1 shows how the number of vectors in a POMDP grows with time. Since solving each epoch of an incremental pruning POMDP will often take more and more time as the number of vectors increases, optimizing the offline phase becomes a necessity. In addition, as the problem grows, the data files grow as well, so it is also useful to find ways to reduce the time needed to read data in the online phase.

4.4.1 Removing the Nothing Observation

Section 4.3.6 discussed how the nothing observation is equivalent to a uniform distribution across all other observations. In addition, section 4.2.1 showed that the vector set for every epoch in a POMDP solution grows exponentially in the size of the observation set Z. Thus, it is clear that the number of vectors in a POMDP solution epoch can be reduced by removing the nothing observation. The vectors created by the nothing observation do get pruned out fairly quickly, but because the incremental pruning algorithm is used in this implementation, removing the observation serves to eliminate an entire subset of vector sets across all a ∈ A.

4.4.2 Calculating Maximum Action

Remembering again that the number of vectors that get checked in a POMDP solution time step grows linearly with |A|, it would be helpful to reduce this number. Intuitively, there is no benefit to dropping additional bombs beyond some number, assuming all bombs have a positive cost, because the marginal reward for dropping a bomb is strictly decreasing. If the POMDP is limited to this maximum action, amax, it saves computational time while producing the same result.

To calculate this maximum strike action, some definitions are necessary. Let Er(s, a) be the expected reward (not including cost) for dropping a bombs if the target is in state s. This is defined as

Er(s, a) = Σ_{s' ∈ S} R(s, s') T(s, a, s').

To get the marginal reward ΔEr(s, a + 1) for dropping an additional bomb, Er(s, a) should be subtracted from Er(s, a + 1) as follows:

ΔEr(s, a + 1) = Er(s, a + 1) - Er(s, a)
             = Σ_{s' ∈ S} R(s, s') T(s, a + 1, s') - Σ_{s' ∈ S} R(s, s') T(s, a, s')
             = Σ_{s' ∈ S} R(s, s') [T(s, a + 1, s') - T(s, a, s')].

This is the marginal reward for being in a given state and dropping a + 1 bombs. However, this optimization requires knowing the maximum action, so the next step is to determine the maximum marginal reward possible for an action, ΔEr(a + 1), by taking the max over all states s ∈ S:

ΔEr(a + 1) = max_{s ∈ S} ΔEr(s, a + 1) = max_{s ∈ S} Σ_{s' ∈ S} R(s, s') [T(s, a + 1, s') - T(s, a, s')].

The final step is to iterate from a = 0 to a = M, calculating ΔEr(a + 1) and comparing it to the cost of a bomb. When the cost of a bomb exceeds ΔEr(a + 1), amax = a. It will never be worth it to drop more than amax bombs, since the marginal reward minus the cost of a bomb would be negative. As in the previous section, the vectors created by all actions that drop more than amax bombs would be pruned out in each time step of the solution algorithm, but this optimization serves to prevent that calculation from occurring in the first place.
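A minimal sketch of this iteration follows, assuming R and the multi-bomb transition probabilities come from the target model; the names are illustrative rather than the thesis code.

// Minimal sketch of the maximum-action calculation: iterate over a and stop when
// the best-case marginal reward of one more bomb falls below the bomb cost.
final class MaxAction {
    static int maxAction(double[][] R, double[][][] T, double bombCost, int maxBombs) {
        int numStates = R.length;
        for (int a = 0; a < maxBombs; a++) {
            // Delta E_r(a+1) = max over s of sum over s' of R(s,s') [T(s,a+1,s') - T(s,a,s')]
            double bestMarginal = Double.NEGATIVE_INFINITY;
            for (int s = 0; s < numStates; s++) {
                double marginal = 0.0;
                for (int s2 = 0; s2 < numStates; s2++) {
                    marginal += R[s][s2] * (T[a + 1][s][s2] - T[a][s][s2]);
                }
                bestMarginal = Math.max(bestMarginal, marginal);
            }
            if (bestMarginal < bombCost) {
                return a;                    // dropping more than a bombs can never pay off
            }
        }
        return maxBombs;                     // never became unprofitable within the allocation
    }
}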
4.4.3 Defining Maximum Allocation

As soon as a target has been allocated amax bombs, the marginal reward for adding another bomb will be zero, and the greedy algorithm will allocate no more weapons. Marginal reward can never be less than zero, since the drop-zero-weapons action has a zero reward and is always possible. The max-action calculation has determined the point at which the cost of a bomb outweighs the marginal reward for dropping another bomb. Thus, a target will never be allocated more bombs than amax, the number calculated in the max-action optimization. Therefore, a POMDP never needs to be calculated for more allocation bins than amax + 1. Limiting the POMDP to the maximum allocation, mmax, serves to reduce the number of states, which will in turn reduce the computation time for each vector. This may not work for different allocation algorithms, but it greatly reduces the size of the problem for a greedy algorithm.

4.4.4 One Target Type, One POMDP

Different targets of the same type differ in two respects: the number of time steps in the targets' windows and the number of bombs allocated to the targets. Every other aspect of the POMDP created from the target type is the same. The previous section discussed how to limit the state size in a POMDP by never increasing the allocation for a POMDP beyond the maximum, mmax. If two targets have a different number of time steps in their windows, one target will have a larger window. Let hl be the horizon for the target with the longer window, and hs be the horizon for the target with the shorter one. Assuming they both have the same allocation, mmax, the .alpha1 through .alpha<hs> files will be exactly the same. Thus, if the target with the longer window is computed, it is no longer necessary to compute the target with the shorter window.

Therefore, given this information, it is only necessary to solve one POMDP per target type. If the allocation is set to mmax and the horizon H is set to some large number, the solution data files will work for all targets of that target type up to an individual window of length H.

4.4.5 Maximum Target Type Horizon

The online phase converts the POMDP solution into appropriate marginal values in a data structure to speed up online computation. It is possible to do this completely offline and create a text file that will be read directly into a data structure. However, due to time constraints, we were not able to implement that optimization.
alpha<te> files need to be loaded for a given target type. This will eliminate H - maxt, type data structures from having to be loaded. Also, given that more vectors are generally required as the number of epochs increases, eliminating the necessity to load future epochs into memory saves computational time and memory space. 4.5 Experimental Results This section describes the experiments for the partially observable model. This model produces the same results as the one in the previous chapter when a completely observable target is used. This section also presents Monte Carlo simulation results for the new partially observable model. 4.5.1 Completely Observable Experiment The first experiment is to compare results from the partially observable approach with those from the completely observable case. A target for the partially observable model is given the same parameters as that in the experimental results section in the previous chapter. There, the target has two states, undamaged and damaged, a bomb costs 1, the reward for destruction is 90, and the damage probability p is 0.25. To model this case as a POMDP, we augment the state according to our transition model, 81 2.5 2 & 1.5 0 0 0.5 1 2 3 4 5 6 7 8 9 10 Time Figure 4-8: Optimal policy for M = 11 as previously described, but change the observation model. To make it completely observable, all strike actions for this experiment are set to produce the actual state observation with probability 1. Whether sensor actions are included or not, identical parsimonious sets are produced. This is because in a completely observable model, there is no extra information to be gained with sensor looks. Since they have a cost and take time to perform, they are pruned from the optimal policy. In our standard partially observable approach, updating the belief state for a target with an action of one bomb usually requires dropping a bomb in one time step then performing a sensor action to determine whether the bomb was successful in the next. This is called a "look-shoot-look" policy and given unlimited time and resources, it is the optimal policy for a target. In this experiment, however, every strike action is both a "look" and a "shoot" action, taking only one time step to perform. Since the parameters of the target observation model is implicitly the same as 82 those of the targets in the completely observable approach, the results are identical. This can be seen in figure 4-8, which shows the policy for an allocation of 11 bombs. The total allocation Al is 11, since dropping 11 bombs is the maximum action for this target, as defined in its damage model. Figure 3-10 shows the optimal action for a 10 step problem from allocation 0 to 25. The policy in figure 4-8 can be extracted from this graph, by selecting an action at each time step corresponding to the bombs remaining. Since the policy for any allocation in the completely observable experiment matches the results from the previous chapter's MDP model, our approach works for the completely observable case. 4.5.2 Monte Carlo Simulations This section compares the results for four different problem scenarios: complete observability, no sensor action, a realistic sensor action, and a perfect sensor action. All scenarios use a new two-state undamaged/damaged target with a bomb cost of 1, a sensor action look with a cost of 0.1, a reward for destruction of 10, and a damage probability p of 0.5. The total mission horizon H is 7. 
A scoring metric is defined for the online results, such that destruction of a target increases the score by 10, dropping a bomb decreases the score by 1, and looking decreases the score by 0.1. Five total experiments are run:

* Experiment 1 - No Sensor Scenario: This experiment performs 1000 trials of a single target without a sensor action. For the first four experiments, the total allocation M is 25 bombs, to ensure that the number of bombs does not limit the policy.

* Experiment 2 - Complete Observability Scenario: This experiment performs 1000 trials of a single target with complete observability. These are the scoring results from the experiment performed in the last section.

* Experiment 3 - Perfect Sensor Scenario: This experiment performs 1000 trials of a single target. The sensor action associated with this target has perfect accuracy.

* Experiment 4 - Realistic Sensor Scenario: This experiment performs 1000 trials of a single target with a realistic sensor action. This experiment most closely models a real-world scenario.

* Experiment 5 - Scoring Analysis: This experiment consists of four separate sets of trials. Each set of 1000 trials is a world with 100 targets from one of the four scenarios above. The total allocation M in these trials is 250, so that a shortage of bombs is a possibility. The experiment compares the relative results from the four sets of trials.

The first experiment is for a single target with no sensor action. Policies with no sensor actions can arise when the cost of a look action is expensive relative to the reward. Figure 4-9 shows the score for 1000 trials of a single-target scenario. As expected, this result is bimodal, and it indicates that the policy is to drop a total of three bombs. When the target is destroyed the score is 7, and when it is not destroyed the score is -3. The policy is to drop three bombs because this is the maximum action as determined by the damage model. Since there are no observations, the belief state is only affected by the a priori probabilities defined in the transition model, so after three bombs are dropped no further actions will be performed.

Figure 4-9: Score histogram of a single POMDP target with no sensor action

An interesting result observed in the no-observation case is that many different policies exist with the same value. The overall policy says to drop a total of three bombs, but the time at which each one is delivered is irrelevant. Dropping three in the last time step, H, is equivalent to dropping two in H and one in H - 1, which is equivalent to dropping one in H, one in H - 1, and one in H - 2, and so on. This unfortunately has the effect of increasing the size of the parsimonious set, since none of these vectors can be pruned. Even given this increase in vectors, the offline time for this experiment was on the order of one second.
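The bimodal shape can be checked with a few lines of simulation. The sketch below is not the thesis simulator; it simply drops the maximum action of three bombs and applies the scoring metric above, and it assumes each bomb independently damages the target with probability p = 0.5 (an assumption that is at least consistent with a max action of three, since a fourth bomb's marginal reward of 10 x 0.5^4 = 0.625 would be less than its cost of 1).

import java.util.Random;

// Sketch of the no-sensor trials: the policy is simply "drop three bombs" and
// the score follows the metric above (+10 for destruction, -1 per bomb, no looks).
// Independent per-bomb damage with p = 0.5 is an assumption, not a thesis detail.
final class NoSensorTrials {
    public static void main(String[] args) {
        final double p = 0.5;          // per-bomb damage probability
        final int trials = 1000;
        final Random rng = new Random();
        int destroyed = 0;

        for (int i = 0; i < trials; i++) {
            boolean dead = false;
            for (int bomb = 0; bomb < 3; bomb++) {   // all three bombs are dropped blindly
                dead = dead || rng.nextDouble() < p;
            }
            if (dead) {
                destroyed++;
            }
        }
        // Destroyed trials score 10 - 3 = 7; missed trials score -3.
        System.out.println("score  7: " + destroyed + " trials");
        System.out.println("score -3: " + (trials - destroyed) + " trials");
    }
}

Under that assumption, roughly 1 - 0.5^3 = 87.5% of the trials should land at score 7 and the rest at -3, which is consistent with the two spikes in figure 4-9.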
The next experiment is for a single, completely observable target. In this case the POMDP solver knows the state the target is in after every time step, and no sensor actions are necessary. The histogram for 1000 trials of this scenario in figure 4-10 shows seven values. The first one on the right is a successful destruction after dropping one bomb, the next one is for dropping two bombs, and so on. Two trials out of 1000 missed with nine bombs, for the only negative score, at -9. The reason that up to nine bombs can be dropped in this scenario, compared to fewer bombs in the previous experiment, is the extra knowledge of state after every strike action. If a target is undamaged after an attack, this is known with 100% certainty, so the policy can say to drop up to three bombs in the next time step. The offline time for this experiment was on the order of one second.

Figure 4-10: Score histogram of a single completely observable target

The next experiment is for a single target and a sensor with perfect accuracy. The result for 1000 trials is shown in figure 4-11. There are very few possible policies for this target. The policy is to drop one bomb and look; then, if the target is not destroyed, drop another bomb and look. If it is not destroyed, drop two bombs and look. The increase in the action is due to the closing target window. If it is still not destroyed, drop the maximum action of 3 bombs, then do nothing on the last time step. No action is performed in the last time step because the sensor action has a cost of 0.1, but it is not possible to act in the following time step, H + 1. Therefore, there is no point in looking, since there is no explicit reward for knowledge. In the H - 1 time step of the scenario, the problem can be considered to be a new horizon-two problem. The optimal policy is to drop three bombs over those two time steps, but a sensor action will never be used. Once again, there are multiple policies to drop three bombs in two time steps, but they all have the same value. The offline time for this experiment was also on the order of one second.

Figure 4-11: Score histogram of a single POMDP target with a perfect sensor action

The next experiment is for a single target with a realistic sensor action. The look action in this problem is defined to be 85% accurate. The result for 1000 trials is shown in figure 4-12. In this case, it is clear how many bombs it took to destroy the target. The first spike to the right of the graph is at 8.9, which corresponds to dropping one bomb and looking, but now it is possible for the sensors to be incorrect. This shows up on the graph at the next spike, at 8.6. This corresponds to two more look actions, to calculate the belief state more accurately. These results show many more negative values than the preceding graphs, which is expected since incorrect assumptions of destruction can occur. The offline time for this experiment was on the order of 17 hours.

Figure 4-12: Score histogram of a single POMDP target with a realistic sensor action
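For concreteness, the bookkeeping behind these extra looks is the standard POMDP belief update; the worked numbers below assume a symmetric 85%/15% observation model for the look action, which is an assumption about this experiment's sensor rather than a detail taken from its input file:

    b'(s') = O(z | s', look) * sum over s of T(s' | s, look) b(s), normalized over s'.

Starting from b(damaged) = 0.5 after one bomb is dropped on an undamaged target (p = 0.5), a look that reports "damaged" gives

    b'(damaged) = (0.85)(0.5) / [(0.85)(0.5) + (0.15)(0.5)] = 0.85,

and a second look that also reports "damaged" gives

    b''(damaged) = (0.85)(0.85) / [(0.85)(0.85) + (0.15)(0.15)] = 0.97 (approximately).

Each extra look sharpens the belief but costs 0.1, which is why trials that need additional looks before the policy commits show up as slightly lower scores in figure 4-12.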
Histograms for the scoring analysis experiment are shown in figures 4-13, 4-14, 4-15, and 4-16. In each case, the scoring curve has the same basic bell shape. However, the average score increases as the problem scenario is shifted from no sensor action, to a realistic sensor action, to a perfect sensor action, to a completely observable case. This is because, in the first three cases, more knowledge about the state of the target is available at every step with a more accurate sensor action. The completely observable case improves on the average score of the perfect sensor action because time steps are not wasted using a look. It is also interesting to note how spread out the scores are in figure 4-13, the no-sensor-action scenario. This is because each individual target has a bimodal distribution with a distance of 10. Thus, all possible scores occur at intervals of value 10.

Figure 4-13: Score histogram of 100 POMDP targets with no sensor action
Figure 4-14: Score histogram of 100 POMDP targets with a realistic sensor action
Figure 4-15: Score histogram of 100 POMDP targets with a perfect sensor action
Figure 4-16: Score histogram of 100 completely observable targets

Chapter 5

Conclusion

This thesis presents a new approach to allocating resources in a partially observable dynamic battle management scenario by combining POMDP algorithmic techniques with a prior completely observable resource allocation technique. The problem we present is to allocate weapon resources over several targets of varying types, given imperfect observations after an action is taken. The scenario is dynamic, in that values are computed offline and then resources are allocated via a greedy algorithm in an online loop.

The mathematical background behind completely observable and partially observable Markov decision processes was discussed, and its use was mentioned in three battle management approaches. A completely observable model by Meuleau et al. was then fully described and implemented as a starting point for our new partially observable approach.

One problem in changing the completely observable model to partial observability was that resource constraints were violated. To address this problem, we augmented the state space of the POMDP such that the allocation is included in the actual state of a target. The state becomes more descriptive, with deterministic and probabilistic elements. This increases the dimension of the vectors in the solution set, but does not increase the total possible number of vectors in the parsimonious set.

An experiment compared the new partially observable approach with the completely observable one, and the results show that the new model honors weapon constraints and produces the optimal policy in a completely observable case. Finally, Monte Carlo experiments were run on four different battle scenarios with various observation models, and their results were compared to show the relative importance of observation information.

The approach we took to observe resource constraints in a partially observable battle management world produced optimal policies, but at a significant cost. The time involved to compute the offline values is much higher than that of the completely observable approach.
Though the state space expansion does not increase the number of possible vectors in the next epoch's solution set, it does increase the number of vectors carried over to the next time step. This is because, with state space expansion, a vector can dominate in a projection onto one range of states and be dominated in the projection onto other states. Since the time required is much more significant, this approach works well when there is ample time for computation beforehand. The online phase is relatively unaffected by the state space expansion and is practical for resource allocation.

5.1 Thesis Contribution

The contribution that this thesis makes to the field is to introduce a way of coupling resource constraints to a POMDP without having to use an LP solver. This state space expansion does not cause the POMDP to become intractable, but only causes linear growth in the dimension of the vectors. It has minimal impact on the incremental pruning solution algorithm. Thus, this thesis presents a simple way of solving a dynamic partially observable resource allocation problem.

5.2 Future Work

There are several possible avenues for future work. One is to optimize the running time of the POMDP solver code to take advantage of the fact that the transition and observation models are very sparse. In addition, more realistic transition and observation models can be considered, as strike actions usually produce some observation in a real-world scenario. A new resource allocation algorithm could be created, as the greedy algorithm used in this thesis is not optimal.

Also, developing a more flexible POMDP solver is a natural extension to this research. If the action set in the POMDP model could be changed between every epoch, the impossible-action issue would be avoided. The value iteration concept of taking resource use into consideration when deciding future actions could be included in the POMDP solver, such that impossible actions are eliminated. This would preclude the need for state space expansion and keep the model of the POMDP small and efficient.

Appendix A

Cassandra's POMDP Software

Because the focus of this thesis is not to find better ways to solve POMDPs, we decided to use software developed by Anthony Cassandra to solve the POMDPs that our approach models. We focus on how to take the model of the problem from our input files to an output file that can be used by Cassandra's code to produce the offline values.

A.1 Header

The input file header, or preamble, defines the setup of the POMDP. The first line defines the discount factor for infinite time horizon problems, discount: (value). In such problems, the expected value at each time step is calculated by adding the future reward multiplied by the discount factor to the immediate reward. This biases the policy to act earlier rather than later, as the relative reward earned goes down with every time step. In an infinite horizon case, this would eventually cause the policy to converge to a set of actions dependent on a belief-state range, as the cost for dropping bombs will eventually exceed the marginal reward for acting. In a finite horizon problem, the discount factor can still be used to encourage acting earlier, but it is usually set to one.

The next element in the preamble is the values: [reward/cost] line. This defines whether the values listed later in the file are rewards (positive values) or costs (negative values).

The states: <list of states> element comes next.
This allows the user to define by name all the states that are possible in the POMDP. These state names are used later in the file to define probabilities and values. It is important to note the order in which the states are listed, as the output files of the POMDP solver will list several nodes which are arrays of numbers of length |S|.

Next is actions: <list of actions>, which allows the user to define A, all actions possible in the POMDP. Like the states descriptor, the names listed will be used later in the file for definitions. Also, the output file will not use the names of the actions, but rather their indices in the list, with the first action having an index of zero.

The final line of the preamble is observations: <list of observations>. This defines the POMDP's observation set Z. Like the previous two elements, the observations will be used to define probabilities and values, and only the index is important. For all three of the previous elements, the use of explicit names is meant for the user's ease of reading and understanding the input files.

A.2 Transition Probabilities

The section following the preamble describes the transition probabilities. There are several ways to enter the transition probabilities in the input file, but the simplest is to define a probability for each possible action, start state, and end state. The syntax for this is:

T: <a ∈ A> : <s ∈ S> : <s' ∈ S> (value)

It is also possible to use the wildcard operator, *, for any action, state, or observation, to make multiple definitions. If this is desired, the wildcard must be used first, since any transition, observation, or reward that is defined more than once will use the definition that comes last in the file.

A.3 Observation Probabilities

Defining the observation probabilities in the next section of the input file is very similar to defining the transition probabilities. In this case, however, an individual probability for an observation is listed based on an action and an end state, in the following way:

O: <a ∈ A> : <s' ∈ S> : <z ∈ Z> (value)

A.4 Rewards

In Cassandra's code, it is necessary to specify that the values defined in the rewards section are either costs or rewards. There is no way to specify both a particular cost for an action and a reward for a transition, so the costs must be factored in when listing them in the input file. The next section defines the values for a given action with a transition from one state to another and an observation, in the following format:

R: <a ∈ A> : <s ∈ S> : <s' ∈ S> : <z ∈ Z> (value)

The value listed for each entry is R(s, s') - cost(a).
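As an illustration of these three sections together, the fragment below sketches what the entries for a hypothetical two-state target might look like using the syntax above. The state, action, and observation names (undamaged, damaged, strike1, look, obs-undamaged, obs-damaged, obs-none), the 85% look accuracy, and the idea of a dedicated "no information" observation for strike actions are all illustrative assumptions, not lines taken from the thesis input files, and the allocation component of the augmented state is omitted for brevity.

T: look : undamaged : undamaged 1.0
T: look : damaged : damaged 1.0
T: strike1 : undamaged : undamaged 0.5
T: strike1 : undamaged : damaged 0.5
T: strike1 : damaged : damaged 1.0

O: look : undamaged : obs-undamaged 0.85
O: look : undamaged : obs-damaged 0.15
O: look : damaged : obs-damaged 0.85
O: look : damaged : obs-undamaged 0.15
O: strike1 : * : obs-none 1.0

R: strike1 : * : * : * -1
R: strike1 : undamaged : damaged : * 9
R: look : * : * : * -0.1

Here the strike reward lines follow the R(s, s') - cost(a) convention described above: the wildcard line charges the bomb cost of 1 for every strike, and the later, more specific line overrides it with 10 - 1 = 9 for the transition from undamaged to damaged, following the rule that the last definition in the file wins.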
A.5 Output files

Cassandra's POMDP solver produces a set of files he calls "alpha" files. A .alpha file for a given time step contains the parsimonious set of vectors that describe the value function over belief states. Each .alpha file depends on the previous one, such that .alpha3 depends on .alpha2, which depends on .alpha1. A .alpha file at time step t has n pairs of an action and a list of vector coefficients, where n is the size of the parsimonious set of vectors Γ*. The action is its index in A, from 0 to |A| - 1. The length of the vector coefficient list is |S|, and the i-th coefficient in the vector is the value for being in state s_i ∈ S.

If the save-all option is selected in the command-line execution, then the POMDP solver will save the results associated with every time step, or epoch, as a .alpha<X> file, where X is the epoch number. This number corresponds to the number of time steps the target has left in its individual window. The output file for the final time step in the horizon will have a .alpha1 extension, the second-to-last will have a .alpha2 extension, and so on.

A.6 Example - The Tiger Problem

A well-documented POMDP, presented in the paper by Kaelbling et al. [6], describes a simple scenario that Cassandra uses as an example for his code. This section defines the scenario and presents the input and output files, to aid in understanding the POMDP solver syntax.

The tiger problem places a person in front of two closed doors. Behind one is some reward and behind the other is a hungry tiger, with equal probability. If the door with the reward is opened, the person receives the reward, whereas if the door with the tiger is opened, the person receives a penalty. In either case, when a door is opened, the problem resets, and the reward and tiger are placed behind the two doors with equal probability again. The actions the person can take are to open the left door, open the right door, or listen at a door. Listening is not free, however, and it is not completely accurate. There is some probability that the tiger will be silent as the agent listens, and some probability that the person may falsely hear something behind the reward door.

The state space is defined as S = {tiger-left, tiger-right}. The action space is defined as A = {open-left, open-right, listen}. The reward for opening a reward door is +10, opening a tiger door is -100, and listening is -1. The possible observations are Z = {tiger-left, tiger-right}. There is an 85% chance of a correct observation. In addition, a discount factor of 0.95 is used, as the problem is modelled as an infinite horizon problem. Figure A-1 is the input file to the POMDP code, as described in the previous section.

discount: 0.95
values: reward
states: tiger-left tiger-right
actions: listen open-left open-right
observations: tiger-left tiger-right

T: listen
identity

T: open-left
uniform

T: open-right
uniform

O: listen
0.85 0.15
0.15 0.85

O: open-left
uniform

O: open-right
uniform

R: listen : * : * : * -1
R: open-left : tiger-left : * : * -100
R: open-left : tiger-right : * : * 10
R: open-right : tiger-left : * : * 10
R: open-right : tiger-right : * : * -100

Figure A-1: The input file for the tiger POMDP

The solution file is shown in Figure A-2. Every vector is represented by |S| coefficients. Each one of these vectors has an associated optimal action. In this case, a 0 corresponds to the first action defined in the problem, listen. A 1 is the open-left action and a 2 is the open-right action.

0
19.3713683559737184225468809 19.3713683559737184225468809
0
0.6908881394535828501801689 25.0049727346740340294672933
0
16.4934850146678506632724748 21.5418370968935839471214422
0
3.0147789375580762438744387 24.6956809390929521441648831
0
25.0049727346740340294672933 0.6908881394535828501801689
0
21.5418370968935839471214422 16.4934850146678506632724748
0
24.6956809390929521441648831 3.0147789375580762438744387
1
-81.5972000627708240472202306 28.4027999372291724000660906
2
28.4027999372291724000660906 -81.5972000627708240472202306

Figure A-2: The converged .alpha file for the tiger POMDP

This problem is simple enough to represent in a two-dimensional graphical format, shown in Figure A-3. Each numeric representation of a vector in the .alpha file in figure A-2 corresponds to a vector in this graph. The solid lines are listen actions, the dotted line is the open-left action, and the dot-dashed line is the open-right action. As shown, when the knowledge of the state is more certain (towards the left or right edges of the belief space), the optimal action is to open a door with a high expectation of reward. In the middle, "gray area" of the belief space, the optimal action is to listen, with a lower expected reward.

Figure A-3: The alpha vectors of the two-state tiger POMDP (x-axis: belief space, from tiger-left (0) to tiger-right (1))
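A quick check of the shape of this graph, using only the immediate rewards given above (the vectors themselves also fold in discounted future value): at the uniform belief, b(tiger-left) = 0.5, opening either door has immediate expected reward

    0.5(-100) + 0.5(10) = -45,

while listening costs only 1 and sharpens the belief. Near an edge, say b(tiger-left) = 0.05, opening the left door yields

    0.05(-100) + 0.95(10) = 4.5,

so opening a door overtakes listening once the belief is sufficiently certain, which is the three-region structure that figure A-3 shows.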
A.7 Running the POMDP

The POMDP solver runs with several command line options. The ones that are used in this thesis are the following:

* -horizon <int>: This option allows the user to specify the length of the horizon. If the policy converges on an epoch before the number chosen, the solver stops and outputs that epoch's result as the final solution.

* -epsilon <0-∞>: This option allows the user to set the precision of the pruning operation. The default value is 10^-9. Higher values will generate faster solutions, though they will be less accurate.

* -save-all: Normally, only the .alpha file of the final epoch is saved. However, if this option is selected, every epoch's .alpha file will be saved with the epoch number appended to the end. This becomes important for solving the online portion of the problem, which will be discussed later.

* -o <file-prefix>: This allows the user to specify the prefix for the .alpha files. This helps to keep the directories and files easy to read and understand, but it is only aesthetic and not integral to solving the problem.

* -p <pomdp-file>: This option tells the POMDP solver which file to use as the input to set up the POMDP.

* -method [incprune]: This option describes the POMDP solution method to be used. This thesis uses the default, incremental pruning.

When the POMDP solver is run, it checks the input file both for syntax correctness and for mathematical correctness (the probabilities of a transition or observation matrix must add up to 1). If the file is correct, it runs the POMDP.
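For example, a typical offline run for one target type might look like the line below; the executable name pomdp-solve and the file names are assumptions for illustration, not commands reproduced from the thesis, and the flags are the ones listed above:

pomdp-solve -p target.POMDP -horizon 10 -save-all -o solution -method incprune

With -save-all set, such a run should leave per-epoch files (solution.alpha1 through solution.alpha10, given the naming convention described above) for the online phase to load.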
Figure A-4 is a sample screenshot of the POMDP solver, showing the problem parameters, the number of vectors and the time taken for each epoch, and the total amount of time taken for the problem.

Value iteration parameters:
  POMDP file = POMDPVals/code.POMDP
  Initial values = default
  Horizon = 10.
  Stopping criteria = weak (delta = 1.000000e-09)
  VI Variation = normal
Optimization parameters:
  Domination check = true
  General Epsilon = 1.000000e-09
  LP Epsilon = 1.000000e-09
  Projection purging = normal-prune
  Q purge = normal-prune
  Use witness points = false
Algorithm parameters:
  Method = incprune
  Incremental Pruning method settings:
    IncPrune type = normal
Solutions files: Saving every epoch.
Initial policy has 1 vectors.
Epoch: 1...3 vectors. (0.01 secs.) (0.01 secs. total)
Epoch: 2...4 vectors. (0.10 secs.) (0.11 secs. total)
Epoch: 3...5 vectors. (0.19 secs.) (0.30 secs. total)
Epoch: 4...5 vectors. (0.28 secs.) (0.58 secs. total)
Epoch: 5...6 vectors. (0.30 secs.) (0.88 secs. total)
Epoch: 6...7 vectors. (0.48 secs.) (1.36 secs. total)
Epoch: 7...8 vectors. (0.65 secs.) (2.01 secs. total)
Epoch: 8...9 vectors. (0.96 secs.) (2.97 secs. total)
Epoch: 9...10 vectors. (1.20 secs.) (4.17 secs. total)
Epoch: 10...11 vectors. (1.74 secs.) (5.91 secs. total)
Solution found. See file:
  POMDPSols/code-allocation30_horizon10-E-9/solution.alpha
  POMDPSols/code-allocation30_horizon10-E-9/solution.pg
User time = 0 hrs., 0 mins, 5.57 secs. (= 5.57 secs)
System time = 0 hrs., 0 mins, 0.34 secs. (= 0.34 secs)
Total execution time = 0 hrs., 0 mins, 5.91 secs. (= 5.91 secs)
Proj-build time: 0.07 secs.
Proj-purge time: 0.71 secs.
Qa-build time: 4.12 secs.
Qa-merge time: 1.01 secs.
Total context time: 5.91 secs.

Figure A-4: A screenshot of the POMDP solver

A.8 Linear Program Solving

The POMDP solver comes with two different LP solver options. The first is a generic, unsupported LP solver package that is bundled with Cassandra's code. The second option is to use a commercial software package, CPLEX.

A.9 Porting Considerations

A problem arose when trying to integrate the original resource allocation software and Cassandra's POMDP solver. The implementation presented earlier in this thesis is written in Java in the Windows environment, and Cassandra's code is written in C in the UNIX/Linux environment. Since the POMDP solver can be invoked by a simple system call from any application, the language differences can be handled easily, but the platform incompatibility cannot. This incompatibility was addressed by porting the implementation code from Windows to Linux.

Bibliography

[1] Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD dissertation, Brown University, Department of Computer Science, May 1998.

[2] Anthony R. Cassandra. POMDPs for dummies. Online tutorial: http://www.cs.brown.edu/research/ai/pomdp/tutorial/, January 1999.

[3] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. Technical report, Brown University, 1994.

[4] Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence, 1997.

[5] David A. Castanon. Approximate dynamic programming for sensor management. In Proc. Conf. Decision and Control, 1997.

[6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

[7] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, and Thomas Dean. Solving very large weakly coupled Markov decision processes. American Association for Artificial Intelligence, 1998.

[8] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[9] Kirk A. Yost. Solution of Large-Scale Allocation Problems with Partially Observable Outcomes. PhD dissertation, Naval Postgraduate School, Department of Operations Research, September 1998.