A Partially Observable Approach to Allocating
Resources in a Dynamic Battle Scenario
by
Spencer James Firestone
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2002
© Spencer James Firestone, MMII. All rights reserved.
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis document
in whole or in part.
Author: Department of Electrical Engineering and Computer Science, May 24, 2002

Certified by: Richard Hildebrant
Principal Member of Technical Staff, Draper Laboratory
Technical Supervisor

Certified by: Leslie Pack Kaelbling
Professor of Computer Science and Engineering, MIT
Thesis Supervisor

Accepted by: Arthur C. Smith
Chairman, Department Committee on Graduate Students
A Partially Observable Approach to Allocating Resources in
a Dynamic Battle Scenario
by
Spencer James Firestone
Submitted to the Department of Electrical Engineering and Computer Science
on May 24, 2002, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
This thesis presents a new approach to allocating resources (weapons) in a partially
observable dynamic battle management scenario by combining partially observable
Markov decision process (POMDP) algorithmic techniques with an existing approach
for allocating resources when the state is completely observable. The existing approach computes values for target Markov decision processes offline, then uses these
values in an online loop to perform resource allocation and action assignment. The
state space of the POMDP is augmented in a novel way to address conservation of
resource constraints inherent to the problem. Though this state space augmentation
does not increase the total possible number of vectors in every time step, it does have
a significant impact on the offline running time. Different scenarios are constructed
and tested with the new model, and the results show the correctness of the model
and the relative importance of information.
Technical Supervisor: Richard Hildebrant
Title: Principal Member of Technical Staff, Draper Laboratory
Thesis Supervisor: Leslie Pack Kaelbling
Title: Professor of Computer Science and Engineering, MIT
Acknowledgments
This thesis was prepared at the Charles Stark Draper Laboratory, Inc., under Internal
Research and Development.
Publication of this report does not constitute approval by the Draper Laboratory
or any sponsor of the findings or conclusions contained herein. It is published for the
exchange and stimulation of ideas.
Permission is hereby granted by the author to the Massachusetts Institute of
Technology to reproduce any or all of this thesis.
Spencer Firestone
May 24, 2002
Contents
1 Introduction
    1.1 Motivation
    1.2 Problem Statement
    1.3 Problem Approaches
    1.4 Thesis Approach
    1.5 Thesis Roadmap
2 Background
    2.1 Markov Decision Processes
        2.1.1 MDP Model
        2.1.2 MDP Solution Method
    2.2 POMDP
        2.2.1 POMDP Model Extension
        2.2.2 POMDP Solutions
        2.2.3 POMDP Solution Algorithms
    2.3 Other Approaches Details
        2.3.1 Markov Task Decomposition
        2.3.2 Yost
        2.3.3 Castañon
3 Dynamic Completely Observable Implementation
    3.1 Differences to MTD
        3.1.1 Two-State vs. Multi-State
        3.1.2 Damage Model
        3.1.3 Multiple Target Types
    3.2 Implementation
        3.2.1 Architecture
        3.2.2 Modelling
        3.2.3 Offline-MDP Calculation
        3.2.4 Online-Resource Allocation and Simulation
    3.3 Implementation Optimization
        3.3.1 Reducing the Number of MDPs Calculated
        3.3.2 Reducing the Computational Complexity of MDPs
    3.4 Implementation Flexibility
    3.5 Experimental Comparison
4 Dynamic Partially Observable Implementation
    4.1 Additions to the Completely Observable Model
        4.1.1 POMDPs
        4.1.2 Strike Actions vs. Sensor Actions
        4.1.3 Belief State and State Estimator
    4.2 The Partially Observable Approach
        4.2.1 Resource Constraint Problem
        4.2.2 Impossible Action Problem
        4.2.3 Sensor Actions
        4.2.4 Belief States
    4.3 Implementation
        4.3.1 Architecture
        4.3.2 Modelling
        4.3.3 Offline-POMDP Calculations
        4.3.4 Online-Resource Allocation
        4.3.5 Online-Simulator
        4.3.6 Online-State Estimator
    4.4 Implementation Optimization
        4.4.1 Removing the Nothing Observation
        4.4.2 Calculating Maximum Action
        4.4.3 Defining Maximum Allocation
        4.4.4 One Target Type, One POMDP
        4.4.5 Maximum Target Type Horizon
    4.5 Experimental Results
        4.5.1 Completely Observable Experiment
        4.5.2 Monte Carlo Simulations
5 Conclusion
    5.1 Thesis Contribution
    5.2 Future Work
A Cassandra's POMDP Software
    A.1 Header
    A.2 Transition Probabilities
    A.3 Observation Probabilities
    A.4 Rewards
    A.5 Output files
    A.6 Example-The Tiger Problem
    A.7 Running the POMDP
    A.8 Linear Program Solving
    A.9 Porting Considerations
List of Figures
2-1 A sample Markov chain
2-2 A second Markov chain
2-3 A combination of the two Markov chains
2-4 A belief state update
2-5 A sample two-state POMDP vector set
2-6 The corresponding two-state POMDP parsimonious set
2-7 The dynamic programming exact POMDP solution method
2-8 Architecture of the Meuleau et al. approach
2-9 Architecture of the Yost approach, from [9]
2-10 Architecture of the Castañon approach
3-1 A two-state target
3-2 A three-state target
3-3 The completely observable implementation architecture
3-4 A sample target file
3-5 A sample world file
3-6 Timeline of sample targets
3-7 A sample data file
3-8 A timeline depicting the three types of target windows
3-9 Meuleau et al.'s graph of optimal actions across time
3-10 Optimal actions across time and allocation
4-1 The general state estimator
4-2 The state estimator for a strike action
4-3 The state estimator for a sensor action
4-4 Expanded transition model for M = 3
4-5 Expanded transition model for M = 3 with the limbo state
4-6 The partially observable implementation architecture
4-7 A sample partially observable target file
4-8 Optimal policy for M = 11
4-9 Score histogram of a single POMDP target with no sensor action
4-10 Score histogram of a single completely observable target
4-11 Score histogram of a single POMDP target with a perfect sensor action
4-12 Score histogram of a single POMDP target with a realistic sensor action
4-13 Score histogram of 100 POMDP targets with no sensor action
4-14 Score histogram of 100 POMDP targets with a realistic sensor action
4-15 Score histogram of 100 POMDP targets with a perfect sensor action
4-16 Score histogram of 100 completely observable targets
A-1 The input file for the tiger POMDP
A-2 The converged .alpha file for the tiger POMDP
A-3 The alpha vectors of the two-state tiger POMDP
A-4 A screenshot of the POMDP solver
Chapter 1
Introduction
Planning problems in which future actions are based on the current state of objects
are made difficult by uncertainty about that state. State knowledge
is often fundamentally uncertain, because objects are not necessarily in the same
location as the person or sensor making the observation. Problems of this type can
be found anywhere from business to the military [9, 5].
For an example of a business application planning problem, consider a scenario
where a multinational corporation produces a product. This product is either liked or
disliked by the general populace. The company has an idea of how well the product
is liked based on sales, but this knowledge is uncertain because it is impossible to
ask every person how satisfied they are. The company wishes to know whether they
should produce more of the product, change the product in some way, or give up on
the product altogether. Of course, an improper action could be costly to the company.
To become more certain, the company can perform information actions such as polls
or surveys, then use this knowledge to guide its actions.
Military applications are even clearer examples of how partial knowledge can
present problems.
In a bombing scenario example, there are several targets with
a "degree of damage" state, anywhere from "undamaged" to "damaged". The goal
is to damage the targets with a finite amount of resources over a period of time.
Dropping a bomb on a damaged target is a waste of resources, while assuming a
target is damaged when it is not can cost lives. Unfortunately, misclassifications are
frequent [9]. This thesis will focus on a more developed battle management scenario
in a military application.
1.1 Motivation
Battle management consists of allocating resources among many targets over a period
of time and/or assessing the state of these targets. However, current models of a
battle management world are fairly specific. Some models are concerned only with
bombing targets, while others focus on target assessment.
Some calculate a plan
before the mission, while others update the actions and allocations based on real-time information.
However, a more realistic model could be created by combining
components of these models.
This new model could then be used to create more
accurate and effective battle plans.
1.2 Problem Statement
In a combat mission, the objective of the military is to maximize target damage in the
shortest time with the lowest cost. This thesis will examine a battle scenario where
there are several targets to which weapons can be allocated. The general problem to
be investigated has the following characteristics:
" Objective: The goal of the problem is to maximize the reward attained from
damaging targets over a mission time horizon. The reward is defined as the
value of the damage done to the target less the cost of the weapons used.
" Resources: The resources in this research are weapons of a single type. For
each individual problem, there is a finite number of weapons to be allocated,
Al. Once a weapon is used, it is consumed, and some cost is associated with its
use.
" Targets and States: For each individual problem, there is a finite number of
targets to be attacked. Each target is a particular type, and there are one or
14
more different target types. Each target type has a number of states. There are
at least two states: undamaged and destroyed.
" Time Horizons: The battle scenario exists over a discrete finite time horizon.
Each of the H discrete steps in this horizon is called a time step, t. Each
individual target is available for attacking over an individual discrete finite
time horizon. Target transitions and rewards can only be attained when the
target exists.
" Actions: There are two classes of actions. A strike action consists of using zero
or more weapons on a target in a given time step. There is also a sensor class
of actions, which does not affect the target's state, but instead determines more
information about its state. Sensor class actions are only necessary in certain
models, and this will be discussed in the Observations subsection. Sensor class
and strike class actions are mutually exclusive and cannot be done at the same
time. There are N actions at every time step, where N is the number of targets.
" State Transitions: Each individual weapon has a probability of damaging the
target. Using multiple weapons on a target increases the probability that the
target will be damaged. Damage is characterized by a transition from one state
to another. It is assumed that targets cannot repair themselves, so targets can
only transition from a state to a more damaged state, if they transition at all.
" Allocation: Allocation of resources to targets is dividing the total number of
resources among the different targets. However, allocating x resources does not
imply that all x resources will be used at that time step.
" Resource Constraints: For simplicity, it is assumed that there is no constraint
that limits the number of weapons that can be allocated to any one target at any
time step. The only resource constraint is that the sum of all targets' weapon
allocations is limited to the total number of current resources available.
" Rewards: Rewards are attached to the transitions from one state to another.
Different target classes can have different reward values.
* Observations: After each action, an observation is made as to the state of
each target. There are two cases: definitive knowledge of the state of the target,
called complete or total observability, or probabilistic knowledge of the state of
the target, called partial observability.
- Total Observability: In the totally observable class of planning problems, after every action, the state of the targets is implicitly known with
certainty. The next actions can then be planned based on this knowledge.
Only strike actions are necessary in this class of problems.
- Partial Observability: In the partially observable class of planning problems, after every action, there is a probabilistic degree of certainty that a
target is in a given state. It is assumed that a strike action returns no information about the state of the target. In addition, a strike action is not
coupled with a sensor action, so to determine more accurate information
about a target, a sensor class action must be used.
1.3 Problem Approaches
There are current solution methods to battle planning problems in totally observable worlds [7], and solution methods in partially observable ones [9, 5]. However,
these solutions explore slightly different concepts and applications. There are other
differences in battle management solution approaches. Some solutions plan a policy
of actions determined without any real-time feedback [9]. When all calculations and
policies are determined before the mission begins, this is called offline planning. On
the other hand, some solutions dynamically change the policy based on observations
(completely accurate [7] or not [5]) from the world. When the process is iterative,
with observations from the real world or a simulator, this is called online planning.
Some solutions deal with strike actions [7], some focus on bomb damage assessment
(BDA) of targets, some do both [9], and others use related techniques to examine
different aspects of battle management [5].
This thesis will look at three seminal
papers which describe different ways to approach the battle management problem,
analyze them in some detail, and combine them into a more realistic model.
Meuleau et al. [7] examine a totally observable battle management world much
like the problem statement described above.
The model consists of an offline and
online portion, and considers only strike actions. The allocation algorithm used is
a simple greedy algorithm combined with Markov decision processes and dynamic
programming.
Yost's Ph.D. dissertation [9] executes much offline calculation and planning to
determine the best policy for allocation of weapons to damage targets and assess
their damage using sensors in a partially observable world. His allocation method is
a coupling of linear programming and partially observable Markov decision processes.
Castañon [5] looks at a different aspect of the battle management world. His
focus is on intelligence, surveillance, and reconnaissance (ISR), where target type
is identified.
This is closely related to determining the state of a target in a par-
tially observable world, which is the topic of this thesis. He uses online calculations
with partially observable Markov decision processes, dynamic programming, and Lagrangian relaxation to allocate sensor resources.
1.4 Thesis Approach
To establish a baseline for comparison, we begin by creating a simple totally observable model. The model will be almost identical to the one in Meuleau et al.'s paper,
with a few extensions and clarifications. The model will consist of both online and
offline phases. Examples similar to the paper will be run and compared to the paper's results. The model will be extended by expanding the initial implementation to
operate in a partially observable world with additional characteristics. Then we will
run experiments, analyze and compare results to prior work, and draw conclusions
from the experiments.
Another graduate student, Kin-Joe Sham, is concurrently working in the same
problem domain. However, he is focusing on improving the allocation algorithms and
adding additional constraints to make the model more realistic. We have collaborated
to implement the totally observable model, but from there, our extensions are separate, and the code we created diverges. Future work in this problem domain could
combine our two areas of research, as we shared code for the completely observable
implementation and started with the same model for our individual contributions.
1.5 Thesis Roadmap
The remainder of this thesis is laid out as follows: Chapter 2 gives background into
totally and partially observable Markov decision processes. It also describes the three
other papers in much more depth. Chapter 3 discusses the totally observable model
fashioned after Meuleau et al.'s research, including the differences between their approach and ours, additional considerations encountered while developing the model,
implementation optimizations, and experimental result comparisons. Chapter 4 develops the partially observable model, comparing it to the completely observable one
defined in the previous chapter. It discusses interesting new implementation optimizations and then presents experimental results and analysis of different real-world
scenarios.
Chapter 5 describes the conclusions drawn from this research and pos-
sible future extensions to the project. Appendix A contains a detailed description of
Cassandra's POMDP solver application.
Chapter 2
Background
Our work is founded on totally and partially observable Markov decision processes,
so we begin with a description of them. With the understanding gained from those
descriptions, the other approaches can be explained in more detail.
2.1 Markov Decision Processes
A Markov decision process (MDP) is used to model an agent interacting with the
world [6].
The agent will receive the state of the world as input, then generate
actions that modify the state of the world. The completely observable environment is
accessible, which means the agent's observation completely defines the current world
state [8].
A Markov decision process can be viewed as a combination of several similar
Markov chains with different transition probabilities. The sum of the probabilities on
the arcs out of each node must be 1. As an example, a sample Markov chain is shown
in Figure 2-1. In this chain, there are four states, a, b, c, and d, which are connected
with arcs labelled with the transition probability from state i to j, T_ij. Figure 2-2
displays these same four states, but with different transition probabilities. If figure 2-1 is considered to be a Markov chain for action a1 and figure 2-2 represents a chain
for action a2, the combination of the two is shown in figure 2-3. Now different paths
with different probabilities can be taken to get from one state to another by choosing
Figure 2-1: A sample Markov chain
Figure 2-2: A second Markov chain
Figure 2-3: A combination of the two Markov chains
different actions. If the states have reward values, it is now possible to maximize
reward received for transitions from state to state by choosing the appropriate
action with the highest expected reward. This is a simplification of an MDP.
2.1.1 MDP Model
An MDP model consists of four elements:
" S is a finite set of states of the world.
* A is a finite set of actions.
" T : S x A -
H(S) is a state transition function.
For every action a E A,
and state s E S, the transition function gives the probability that an object
will transition to state s' E S. This will be written as T(s, a, s'), where s is
the original state, a is the action performed on the object, and s' is the ending
state.
" R : S x S
-+
R is the reward function. For every state s E S, R is the reward
for a transition to state s' E S. This will be written as R(s, s').
2.1.2 MDP Solution Method
This thesis is concerned with problems of a finite horizon, defined as a fixed number of discrete time steps in the MDP. Thus the desired solution is one in which a set of optimal actions, or a policy, is found. An optimization criterion would be as follows:

max E[ Σ_{t=0}^{H-1} r_t ],

where r_t is the reward attained at time t. Since we assume that the time horizon is known for every target, this model is appropriate.
To maximize the reward in an MDP, value iteration is employed. Value iteration
is a dynamic programming solution to the reward maximization problem. The procedure calculates the expected utility, or value, of being in a given state s. To do
this, for every possible s' it adds the immediate reward for being in s to the expected
value of being in the new state s'. Then value iteration takes the sum of these values,
weighted by their action transition probabilities, for a given action a. The expected
value, V(s), is set to be the maximum value produced across all possible actions:

V(s) = max_a Σ_{s'∈S} T(s, a, s') [R(s, s') + V(s')].     (2.1)
A dynamic programming concept is used to apply equation 2.1 to multiple time
steps. By definition, the value of being in any state in the final time step, H, is
zero, since there cannot be a reward for acting on a target after the time horizon
has expired. Next, the H - 1 time step's values are calculated. For this iteration of
equation 2.1, V(s') refers to the expected value in the next time step, H, which is
zero. The expected value for every state is calculated in time step H - 1, then these
values are used in time step H - 2. The value iteration equation is adjusted for time:
V_t(s) = max_a Σ_{s'∈S} T(s, a, s') [R(s, s') + V_{t+1}(s')],

where V_H(s) = 0. In this manner, all possible expected values are calculated from the
last time step to the first using the previously calculated results. The final optimal
policy is determined by listing the maximizing action for each time step.
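A minimal sketch of this backward recursion follows; the nested-list layout (indexed by action, state, and time) is an assumption of this example, not the thesis implementation.

```python
def finite_horizon_value_iteration(S, A, T, R, H):
    """Backward value iteration for a finite-horizon MDP (sketch).

    Assumed layout: T[a][s][s2] is the probability of moving from s to s2
    under action a, R[s][s2] is the reward for that transition, and H is
    the horizon.  Returns V[t][s] and the maximizing action policy[t][s].
    """
    V = [[0.0] * len(S) for _ in range(H + 1)]   # V[H][s] = 0 by definition
    policy = [[None] * len(S) for _ in range(H)]
    for t in range(H - 1, -1, -1):               # t = H-1 down to 0
        for s in range(len(S)):
            best_value, best_action = float("-inf"), None
            for a in A:
                value = sum(T[a][s][s2] * (R[s][s2] + V[t + 1][s2])
                            for s2 in range(len(S)))
                if value > best_value:
                    best_value, best_action = value, a
            V[t][s] = best_value
            policy[t][s] = best_action
    return V, policy
```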
2.2 POMDP
The world is rarely completely observable. In reality, the exact state of an object may
not be known because of some uncertain element, such as faulty sensors, blurred eyeglasses, or wrong results in a poll, for example. This uncertainty must be addressed.
2.2.1 POMDP Model Extension
A POMDP model has three more elements than that of the MDP:
* b is the belief state, which is a probability distribution over the states.
Figure 2-4: A belief state update
Each element in the belief state b(s), for s E S contains the probability that
the world is in the corresponding state. The sum of all components of the belief
state is 1. A belief state is written as:
[ b(s_0)  b(s_1)  b(s_2)  ...  b(s_|S|) ]
* Z is a finite set of all possible observations.
* O : S × A → Π(Z) is the observation function, where O(s, a, z) is the probability of receiving observation z when action a is taken, resulting in state s.
In the totally observable case, after every action there is an implied observation
that the agent is in a particular state with probability 1. But since state knowledge is
now uncertain, each action must have an associated set of observation probabilities,
though every observation does not necessarily need to correspond to a particular state.
After every action, the belief state gets updated dependent on the previous belief
state, the transition probabilities associated with the action, and the observation
probabilities associated with the action and the new state [3], as shown in figure 2-4.
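Concretely, the update of figure 2-4 is the standard POMDP belief update: the new belief in each state s' weights the transition probabilities into s' by the old belief and by the probability of the observation actually received, then normalizes. A small sketch, with indexing conventions chosen only for this example:

```python
def update_belief(b, a, z, T, O):
    """Standard POMDP belief update: b'(s') ∝ O(s', a, z) * sum_s T(s, a, s') * b(s).

    Assumed layout for this sketch: b is a list of state probabilities,
    T[s][a][s2] the transition probability, O[s2][a][z] the observation
    probability.
    """
    n = len(b)
    new_b = [O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in range(n))
             for s2 in range(n)]
    norm = sum(new_b)            # Pr(z | a, b); assumed nonzero here
    return [p / norm for p in new_b]
```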
2.2.2 POMDP Solutions
Solving a POMDP is not as straightforward as the dynamic programming value iteration used to solve an MDP. Value functions at every time step are now represented as
a set of |S|-dimensional vectors. The set is defined as parsimonious if every vector in
the set dominates all other vectors at some point in the |S|-dimensional belief space.
A vector is dominated when another vector produces a higher value at every point.
Figure 2-5: A sample two-state POMDP vector set
Γ_t represents a set of vectors at time step t, γ represents an individual vector in the
set, and Γ*_t represents the parsimonious set at time t. The value of a target given a
belief state, V(b), is the maximum of the dot product of b and each γ in Γ*_t. There
will be one parsimonious set solution for each time step in the POMDP. Like MDPs,
POMDPs are solved from the final time step backwards, so there will be one Γ*_t for
each time step t in the POMDP, and the solutions build off of the previous solution, Γ*_{t-1}.
The value function of the parsimonious set is the set of vector segments that
comprise the maximum value for every point in the belief space. The value function
is always piecewise linear and convex [2].
Figure 2-5 shows a sample vector set for a two-state POMDP. With a two-state
POMDP, if the probability of being in one of the states is p, the probability of being
in the other state must be 1 - p. Therefore the entire space of belief states can be
represented as a line segment, and the solution can be depicted on a graph. In the
figure, the belief space is labelled with a 0 on the left and a 1 on the right. This is
the probability that the target is in state 1, dead. To the far left is the belief state
that the target is dead with probability 0, and thus alive with probability 1. To the
far right is the belief state that the target is dead with probability 1, and thus alive
Figure 2-6: The corresponding two-state POMDP parsimonious set
with probability 0. Each vector has an action associated with it. Distinct vectors
can have the same action, as the vector represents the value of taking a particular
action at the current time step, and a policy of actions in the future time steps. The
dashed vectors represent action a1 , the dotted vectors represent action a2 , and the
solid vectors represent action a3 .
There are six vectors in this set, but not all are useful. Both γ4 and γ6 are
completely dominated, so they are not included in the final optimal value function. It
is useful to note that there are two types of vector domination. The first is complete
domination by one other vector, shown in the figure as γ4 is completely dominated at
every point in the belief space by γ3. The second is piecewise domination, shown in
the figure as γ6 is dominated at various points in the belief space by γ1, γ2, γ3, and
γ5. Various solution algorithms make use of the differences between these two types
of domination to optimize computation time.
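A small illustration of both ideas — evaluating V(b) as a maximum dot product and stripping vectors that fail the cheap complete-domination check — using hypothetical vector lists rather than the thesis data structures:

```python
def value_of_belief(b, vectors):
    """V(b) is the maximum dot product of the belief b with any vector in the set."""
    return max(sum(bi * vi for bi, vi in zip(b, v)) for v in vectors)

def remove_completely_dominated(vectors):
    """Cheap pre-check: drop a vector if a single other vector is at least as
    good in every component (and better in at least one)."""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(j != i and w != v and
                        all(wk >= vk for wk, vk in zip(w, v))
                        for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept
```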
Figure 2-6 shows the resulting parsimonious set and the sections that the vectors
partition the belief space into. In this particular problem, the belief space has been
partitioned into four sections, with a1 being the optimal action for the first and
fourth sections, a2 producing the optimal value for the second section, and a3 being
Figure 2-7: The dynamic programming exact POMDP solution method
the optimal action for the third. The heavy line at the top of the graph represents
the value function across the belief space.
2.2.3 POMDP Solution Algorithms
How the solutions for each time step are created is dependent on the POMDP solution
algorithm used. However, this thesis does not focus on POMDP solution algorithms,
but rather uses them as a tool to produce a parsimonious set of vectors. This section will discuss general POMDP solution algorithms at a high level, and also the
incremental pruning algorithm used in this research. Cassandra's website [2] has an
excellent overview of many solution algorithms.
General Algorithms
There are two types of POMDP solution algorithms: exact and approximate. Exact
algorithms tend to be more computationally expensive, but produce more accurate
solutions. We chose among several exact dynamic programming algorithms.
The current general solution method for an exact DP algorithm uses value iteration
for all vectors. The algorithm path can be seen in figure 2-7. Every γ in the previously
calculated parsimonious set Γ*_{t-1} is transformed to a new vector given an action a and
an observation z, according to a transformation of (γ, a, z). These vectors are then
pruned, which means that all vectors in the set that are dominated are removed. This
produces the set of vectors Γ^{a,z}. This is done for all possible actions and observations.
Next, for a given action a, all observations are considered, and the cross-sum, ⊕,
of every Γ^{a,z} is calculated. This vector set is once again pruned, and this produces Γ^a.
This is done for all a ∈ A. Finally, the algorithms take the union of every Γ^a set,
purge those vectors, and produce the parsimonious set Γ*_t.
The purging step involves creating linear programs which determine whether a
vector is dominated by any other vector at some point. However, this is optimized by
performing a domination check first. For every vector, the domination check compares
it against every other vector to determine if a single other vector dominates it at every
point in the belief space. If this is the case, the vector is removed from the set, making
the LPs more manageable.
Several algorithms were considered for this research. In 1971, Sondik proposed
a complete enumeration algorithm, then later that year updated it to the One-Pass
algorithm. Cheng used less strict constraints in his Linear Support algorithm in 1988.
In 1994, Littman et al. came up with the widely used Witness algorithm, which is the
basis for the above discussion of a general POMDP solution method. Finally, in 1996,
Zhang and Liu came up with an improvement on the Witness algorithm, calling it
incremental pruning. This is currently one of the fastest algorithms for solving most
classes of problems [4].
Incremental Pruning
The incremental pruning algorithm optimizes the calculation of the Γ^a sets from the
Γ^{a,z} sets. The way to get the Γ^a sets is to take the cross sum of the action/observation
vector sets and prune the results, in the following manner:

Γ^a = purge( ⊕_{z∈Z} Γ^{a,z} ),

which is equivalent to:

Γ^a = purge( Γ^{a,z_1} ⊕ Γ^{a,z_2} ⊕ ... ⊕ Γ^{a,z_k} ),

where k = |Z|. Incremental pruning notes that this method takes the cross sum of all possible
vectors, creating a large set, then pruning this large set. However, this is more
efficiently done if the calculation is done as follows:

Γ^a = purge( ... purge( purge( Γ^{a,z_1} ⊕ Γ^{a,z_2} ) ⊕ Γ^{a,z_3} ) ... ⊕ Γ^{a,z_k} ).

A more detailed description of the algorithm is contained in Cassandra et al.'s published paper [4], which reviews and analyzes the incremental pruning algorithm in some depth.
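The difference between the two pruning orders can be sketched as follows; purge here is a placeholder for the LP-based pruning routine described above, and the code is an illustration rather than Cassandra's implementation:

```python
def cross_sum(set_a, set_b):
    """Cross sum of two vector sets: every pairwise component-wise sum."""
    return [[x + y for x, y in zip(u, v)] for u in set_a for v in set_b]

def gamma_a_batch(gamma_az, purge):
    """Naive order: cross-sum all observation sets first, purge once at the end."""
    result = gamma_az[0]
    for g in gamma_az[1:]:
        result = cross_sum(result, g)
    return purge(result)

def gamma_a_incremental(gamma_az, purge):
    """Incremental pruning: purge after every cross sum, so the intermediate
    sets stay small."""
    result = purge(gamma_az[0])
    for g in gamma_az[1:]:
        result = purge(cross_sum(result, g))
    return result
```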
2.3 Other Approaches Details
As mentioned before, three papers have looked at problems similar to the one this
thesis discusses. Each of these papers has had significant impact on the creation of
this model.
2.3.1 Markov Task Decomposition
Meuleau et al.'s paper focuses on solving a resource allocation problem in a completely
observable world.
Meuleau et al.
use a solution method they call Markov Task
Decomposition (MTD), in which there are two phases to solving the problem:
an
online and an offline phase, thus making it dynamic. The problem they choose to
solve is to optimally allocate resources (bombs) among several targets of the same
type such that at the end of the mission, the reward obtained is maximized. Each
28
target is accessible within a time window over the total mission time horizon, and
has two states: alive or dead. There is one type of action, a strike action, which is
to drop anywhere from 0 to M bombs on a target. Each bomb has a cost and an
associated probability of hitting the target, and the probability of a successful hit
goes up with the number of bombs dropped according to a noisy-or damage model,
which assumes each of the a bombs dropped has an independent chance of causing
the target to transition to the dead state.
The offline phase uses value iteration to calculate the maximum values for being
in a particular state given an allocation and a time step, for all states, allocations,
and time steps. The actions associated with these maximum values are stored for use
in online policy generation.
In the online phase, a greedy algorithm calculates the marginal reward of allocating
each remaining bomb to a target, using the previously calculated offline values. At the
end of this step, every target i will have an allocation mi. This vector of allocations
is passed to the next component of the online phase, the policy mapper. For each
target, the policy mapper looks up the optimal action for that target's time step
t_i, allocation m_i, and state s_i. Then the policy mapper has a vector of actions consisting of
one action for each target. These actions are passed to a simulator which models
real world transitions. Every action has a probabilistic effect on the state, and the
simulator calculates each target's new state, puts them into a vector, then passes this
state vector back to the greedy algorithm.
The greedy algorithm then calculates a new allocation based on the updated
number of bombs remaining and the new states, the policy mapper gets the actions
for each target from the offline values, sends these actions to the simulator, and so
on. The online phase repeats in this loop until the final time step is reached.
Figure 2-8 shows the MTD architecture. The first step is the greedy algorithm.
For every target, the greedy algorithm phase uses the target's current state s and time
step t to calculate the marginal reward for adding a bomb to the target's allocation.
The target with the greatest marginal reward has its allocation incremented by
one. Once a bomb is allocated to target x, that target's marginal reward, mr_x, is
recalculated.
Figure 2-8: Architecture of the Meuleau et al. approach

When all bombs have been allocated, every target has an allocation. A
vector of allocations is then passed to the policy mapper module.
The policy mapper module in the figure uses the same s and t as the greedy
algorithm used, but now uses each target's allocation from the allocation vector. The
action corresponding to the maximum value for that target's s, t, and m is returned
to the policy mapper, which then creates a vector holding an optimal action for every
target. This vector is then passed to the world.
The world, whether it is a simulator or a real life scenario, will perform the
appropriate actions on each target. The states of these targets are changed by these
actions according to the actions' transition models. The world then returns these
states to the greedy algorithm, incrementing the time step by one. This loop repeats
until the final time step is reached.
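A sketch of this greedy allocation step under the global resource constraint, assuming the offline values are exposed through a hypothetical function V(i, s, t, m):

```python
def greedy_allocation(targets, bombs_available, V):
    """Greedy marginal-reward allocation (sketch).

    targets is a list of (state, time_index) pairs; V(i, s, t, m) is assumed
    to return the offline value of target i in state s at time t with m
    bombs allocated.  Bombs are handed out one at a time to the target
    whose value increases the most.
    """
    alloc = [0] * len(targets)
    for _ in range(bombs_available):
        best_gain, best_i = 0.0, None
        for i, (s, t) in enumerate(targets):
            gain = V(i, s, t, alloc[i] + 1) - V(i, s, t, alloc[i])
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:      # no target benefits from another bomb
            break
        alloc[best_i] += 1
    return alloc
```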
The paper lists three different options for resource constraints. The first is the
no resource constraints option, in which each target is completely decoupled, and
there is no allocation involved, so only the offline part is necessary. Each target has
a set of bombs that will not change over the course of the problem. In the second
option, global constraints only, the only resource constraint observed is such that the
total number of bombs allocated must not be more than the total number of bombs
for the entire problem.
Figure 2-9: Architecture of the Yost approach, from [9]

The third alternative is the instantaneous constraints only
option, in which there are a limited number of weapons that can be simultaneously
delivered to any set of targets (i.e., plane capacity constraints). This thesis uses the
second option, global constraints only, based on its simplicity and the possibility for
interesting experiments.
2.3.2 Yost
Yost looks at the problem of allocating resources in a partially observable world.
However, all calculations are done offline. He solves a POMDP for every target, and
allocates resources based on the POMDPs' output. Then his approach uses a linear
program to determine if any resource constraints were violated. If there are resource
constraint violations, the LP adjusts the costs appropriately and solves the POMDPs
again, until the solution converges.
Figure 2-9 shows Yost's solution method. It shows that an initial policy is passed
into the Master LP, which then solves for constraint violations.
The new updated
rewards and costs are passed into a POMDP solver, which then calculates a new policy
based on these costs. This policy goes back into the Master LP, which optimizes the
costs, and so on, until the POMDP yields a policy that cannot be improved within
the problem parameters. This is all done completely offline, so it does not apply to a
dynamic scenario.
Figure 2-10: Architecture of the Castañon approach
2.3.3 Castañon
Castañon does not deal with strike actions, but instead uses observation actions to
classify targets. Each target can be one of several types, and the different observation
actions have different costs and accuracies. He uses the observations to determine the
next action based on the POMDP model, thus his problem is dynamic.
He has two types of constraint limitations. The first is, again, the total resource
constraint.
He also considers instantaneous constraints, where he has limited resources
at each time step. He uses Lagrangian relaxation to solve the resource constraint problems. The entire problem is to classify a large number of targets. However,
he decouples the problem into a large number of smaller subproblems, in which he
classifies each target. Then he uses resource constraints to loosely couple all targets.
But by doing the POMDP computation on the smaller subproblems, he reduces the
state space and is able to use POMDPs to determine optimal actions.
Figure 2-10 depicts Castañon's approach. An initial information state is passed
to the Lagrangian relaxation module.
This in turn decouples the problems into
one POMDP with common costs, rewards, and observation probabilities. Then the
POMDP is solved, and the relaxation phase creates a set of observation actions based
on the results. The world returns a set of observations and the relaxation phase then
uses this to craft another decoupled POMDP, and so on until the final time step.
This approach is conducted entirely online. Castañon has efficiently reduced the
number of POMDPs for all targets to one, but because the problem is dynamic, a
new POMDP must be solved at every time step. His problem is one of classification,
so the state of an object never changes. Thus, transition actions do not exist, and
observation actions cause a target's belief state to change.
Chapter 3
Dynamic Completely Observable Implementation
To analyze a new partially observable approach to the resource allocation problem, we
begin by expanding the totally observable case. Though most of the ideas presented
in this chapter are from Meuleau et al.'s work, it is necessary to understand them, as
they are fundamental to the new approach presented in this thesis.
The problem that this chapter addresses is one in which there are several targets
to bomb and each is damaged independently. The problem could be modelled as an
MDP with an extremely large state space. However, this model would be too large
to solve with dynamic programming [7]. Thus, an individual MDP for each target is
computed offline, then the solutions are integrated online. The online process is to
make an overall allocation of total weapons to targets, then determine the number
of bombs to drop for each target. The first round of weapons is deployed to the
targets and the new states of the targets are determined. Bombs are then reallocated
to targets, a second round of weapons is deployed, and so on, until the mission
horizon is over.
S = {undamaged, damaged}

T = [ .5  .5 ]      R = [ 0  50 ]
    [  0   1 ]          [ 0   0 ]

Figure 3-1: A two-state target
S = {undamaged, partially damaged, destroyed}

T = [ .6  .3  .1 ]      R = [ 0  25  50 ]
    [  0  .7  .3 ]          [ 0   0  20 ]
    [  0   0   1 ]          [ 0   0   0 ]

Figure 3-2: A three-state target
3.1 Differences to MTD
The research presented in Meuleau et al.'s paper is complete, and the problem domain
can be expanded to include partially observable states. However, other enhancements
were made to the problem domain, including implementing and testing with multi-state targets (defined as targets with three or more states), updating the damage
model, and allowing for multiple target types.
3.1.1 Two-State vs. Multi-State
Though Meuleau et al.'s model and calculations are of a general nature and can
be used with multi-state targets, the paper only discusses a problem in which the
targets are of one type, and this target type has two states: alive and dead. However,
in real-world problems, there will often be more than two states. A trivial example
is a 4-span bridge in which the states range from 0% damaged to 100% damaged in
25% increments [9]. The implementation presented in this thesis can handle multiple
states. It is simple enough to extend the model from two states to multiple states.
All that is involved is adding a state to the S set, increasing the dimensions of the
T matrix by one, and increasing the dimensions of the R matrix by one as well.
For example, figure 3-1 presents a simple target type with two states: undamaged
State    Description
s1       Undamaged
s2       25% damaged
s3       50% damaged
s4       75% damaged
s5       Destroyed

Table 3.1: Sample state descriptions
and damaged. Figure 3-2 presents a target with three states: undamaged, partially
damaged, and destroyed. The new T is a 3 x 3 matrix, and the R has more reward
possibilities as well.
3.1.2 Damage Model
Meuleau et al. use a noisy-or model, as described below, in which a single hit is
sufficient to damage the target, and individual weapons' hit probabilities are independent. The state transition model they use for a two-state target with states u = undamaged and d = damaged is the following:

T(s, a, s') =  0          if s = d and s' = u
               1          if s = d and s' = d
               q^a        if s = u and s' = u
               1 - q^a    if s = u and s' = d
The transition probability for a target from state s to state s' upon dropping a bombs
is determined by the probability of missing, q = 1 - p, where p is the probability of
a hit.
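For example, with a hypothetical single-bomb hit probability of p = 0.2 (so q = 0.8), dropping a = 3 bombs leaves an undamaged target undamaged with probability q^a = 0.8^3 = 0.512, and damages it with probability 1 - 0.512 = 0.488.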
To extend the model to multiple states, it is necessary to analyze what an action
actually does to the target. Consider a target with five states, si through s5 , as shown
in table 3.1. A state that is more damaged than a state si is said to be a "higher"
state, while a state that is less damaged is a "lower" state.
Each bomb causes a transition from one state to another based on its transition
matrix T. Since the damage from each bomb is independent and not additive, when
multiple bombs are dropped on a target, each bomb provides a "possible transition"
to a state. The actual transition is the maximum state of all possible transitions.
Consider a target that is in s1. If a bombs are dropped, what is the probability
that it will transition to s3? There are three possible results of this action:

* Case 1: At least one of the a bombs provided a possible transition to a state greater than s3. If this situation occurs, the target will not transition to s3, no matter what possible transitions the other bombs provide, but will instead transition to the higher state.

* Case 2: All a bombs provide possible transitions to lower states. Once again, if this situation occurs, the target will not transition to s3, but will transition to the maximum state dictated by the possible transitions.

* Case 3: Neither of the above cases occurs. This is the only situation in which the target transitions to state s3.
The extended damage model is generalized as follows. The probability that a target transitions from state i to state j given action a is:

T(s_i, a, s_j) = 1 - Pr(Case 1) - Pr(Case 2).     (3.1)

The probability of Case 1 is the sum of the transition probabilities for state s_i to all states higher than s_j for action a:

Pr(Case 1) = Σ_{m=j+1}^{|S|} T(s_i, a, s_m).     (3.2)

The probability of a single bomb triggering Case 2 is the sum of the transition probabilities for state s_i to all states lower than s_j for a = 1:

Pr(Case 2 | a = 1) = Σ_{k=1}^{j-1} T(s_i, 1, s_k),     (3.3)

where T(s_i, 1, s_k) is given in T. The generalized form of equation 3.3 is the probability that all a bombs dropped transition to a state less than s_j:

Pr(Case 2) = [ Σ_{k=1}^{j-1} T(s_i, 1, s_k) ]^a.     (3.4)

Finally, the probability that Case 3 occurs, that a target transitions from s_i to s_j given action a, is a combination of equations 3.1, 3.2, and 3.4:

T(s_i, a, s_j) = 1 - [ Σ_{k=1}^{j-1} T(s_i, 1, s_k) ]^a - Σ_{m=j+1}^{|S|} T(s_i, a, s_m).     (3.5)

Since equation 3.5 depends on previously calculated transition probabilities for Case 1, the damage model must be calculated using dynamic programming, starting at the highest state. Thus, T(s_i, a, s_|S|) must be solved first for a given a, then T(s_i, a, s_|S|-1), and so on.
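A sketch of this dynamic programming calculation, assuming the single-bomb matrix is given as a nested list (an illustrative structure, not the thesis implementation):

```python
def multi_bomb_transition(T1, a):
    """Extended damage model (equation 3.5), computed highest state first.

    Assumed layout: T1[i][j] is the single-bomb transition probability from
    state i to j (upper triangular: no backward transitions).  Returns
    Ta[i][j], the probability of ending in state j when a bombs are dropped.
    """
    n = len(T1)
    if a == 0:                        # dropping nothing leaves the state unchanged
        return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Ta = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n - 1, -1, -1):                       # most damaged state first
            case2 = sum(T1[i][k] for k in range(j)) ** a     # all a bombs land lower than j
            case1 = sum(Ta[i][m] for m in range(j + 1, n))   # something landed higher than j
            Ta[i][j] = 1.0 - case2 - case1
    return Ta
```

As a quick check against the two-state model above, multi_bomb_transition([[0.8, 0.2], [0.0, 1.0]], 2) gives a 0.36 probability of the undamaged-to-damaged transition, which is 1 - 0.8^2.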
3.1.3 Multiple Target Types
Any realistic battle scenario will include targets of different types, each with its own
S, T, R, and A.
Each of these target types has an associated MDP. Each one of
these independent MDPs is solved using value iteration, and the optimal values and
actions are stored separately from other target types'.
In the resource allocation
phase, each target will have its target type MDP checked for marginal rewards and
optimal actions. Multiple MDPs now need to be solved to allow for multiple target
types.
3.2 Implementation
The following sections describe how the problem solution method described in Meuleau
et al.'s paper was designed, implemented, and updated.
Figure 3-3: The completely observable implementation architecture
3.2.1 Architecture
The architecture for the problem solution method is identical to the one discussed in
section 2.3.1. Specific to the implementation, however, are the input files and data
structures, which can be seen in relation to the entire architecture in figure 3-3. The
input files are translated into data structures and used by both the offline and online
parts of the implementation.
The offline calculation loads in the target and world files (1) and produces an
MDP solution data file for every target type (2). Next, the greedy algorithm loads in
the target and world files (3), then begins a loop (4) in which it calculates the optimal
allocation for each target. To do this, for each target i, the greedy algorithm looks
up a value in the data structures, using a state s, a time step t, and an allocation m
as an index. The greedy algorithm uses these values to create a vector of target
allocations, which get passed to the policy mapper (5). For each target, the policy
mapper uses the target's state, time index, and recently calculated allocation to get
an optimal action (6). After calculating the best action for each target, the resource
allocation phase passes a vector of actions to the simulator (7). The simulator takes
in the actions for each target and outputs the new states for each target back to the
resource allocation phase (8), and the online loop (4 - 8) repeats until the final time
step is reached.
3.2.2 Modelling
The entire description of a battle scenario to be solved by this technique can be found
in the information in two data input types: a target file and a world file. All data
extracted from these files will be used in the online and offline portions of the model.
Target Files
The target file defines the costs, states, and probabilities associated with a given
target type. A sample target file is shown in figure 3-4.
The first element defined in a target file is the state set. Following the %states
%states
Undamaged
Damaged
%cost
1
%rewards
0 100
0 0
%transProbs
.8 .2
0 1
%end
Figure 3-4: A sample target file
separator, every line is used to describe a different state. The order of the states is
significant, as the first state listed will be so, the second si, and so on.
After the states, the %cost separator is used to indicate that the next line will
contain the cost for dropping one bomb. This is actually a cost multiplier, as dropping
more than one bomb is just the number of bombs multiplied by this cost.
The %rewards separator is next, and this marks the beginning of the reward
matrix. The matrix size must be |S| x |S|, where |S| is the number of states that
were listed previously. The matrix value at (i, j) represents the reward obtained from
a transition from si to sj. Note that in this problem, it is assumed that targets never
transition from a more damaged state to a less damaged state, and so a default reward
value of zero is used. This makes the reward matrices upper triangular.
However,
this model allows for negative (or positive, if so desired) rewards for a "backward"
transition, as may occur when targets are repairable.
The transition matrix is defined next using the separator %transProbs. This is
another |S| x |S| matrix, as before, where the value at index (i, j) represents the
Target    t_b    t_e
A         3      12
B         1      5
C         14     23
D         15     20
E         3      21
F         7      12

Table 3.2: Sample beginning and ending times of targets
probability of a transition from si to sj if one bomb is dropped. The sum of a row of
probabilities must equal 1. Note that once again, for the definition of the problem in
this research, there are no "backward" transitions, so this matrix is upper triangular.
This model does, however, allow for targets transitioning from a higher damage state to a lower
one. The probability for a transition from si to sj given an action of dropping more
than one bomb is given according to the previously defined damage model. The file
is terminated with the %end separator.
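A minimal parser for this file format might look like the following sketch; the returned dictionary layout is an assumption of the example, not the thesis code:

```python
def parse_target_file(path):
    """Parse a target file of the form described above:
    %states, %cost, %rewards, %transProbs, %end."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    target, section = {"states": [], "rewards": [], "transProbs": []}, None
    for line in lines:
        if line.startswith("%"):
            section = line[1:]
            if section == "end":
                break
            continue
        if section == "states":
            target["states"].append(line)
        elif section == "cost":
            target["cost"] = float(line)
        elif section in ("rewards", "transProbs"):
            target[section].append([float(x) for x in line.split()])
    return target
```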
World Files
The world file defines the time horizon, the resources, and the type, horizon, and
state of each individual target. A sample world file is shown in figure 3-5.
The first definition in a world file is the time horizon, as indicated by the %horizon
separator, followed by an integer representing the total "mission" time horizon. When
a scenario is defined by a world file as having a horizon H, the scenario is divided
into H + 1 time steps, from 0 to H.
The next definition is the total available resources, as indicated by the %resources
separator, followed by an integer representing M.
This is the total resource constraint
for the mission.
After the resources, the %targets separator is listed. After that, there are one or
more four-element sets. Each of these sets represents a target in the scenario. The first
of the four elements is the target's begin horizon, t_b, which ranges from 0 to H - 1.
This is when the target comes into "view". The next element is the end horizon, t_e,
which ranges from t_b + 1 to H. Table 3.2 lists several targets with various individual
%horizon
25
%resources
50
%targets
3
12
Meuleau
Undamaged
1
5
Meuleau
Undamaged
14
23
Meuleau
Undamaged
15
20
Meuleau
Undamaged
3
21
Meuleau
Undamaged
7
12
Meuleau
Undamaged
%end
Figure 3-5: A sample world file
Figure 3-6: Timeline of sample targets
horizons. Figure 3-6 depicts these targets graphically on a timeline of H = 25. On
this timeline, at any current time step, t_c, any target whose individual window ends
at or strictly to the left of t_c has already passed through the scenario and will
not return. Any target whose individual window begins at or exists during t_c is in
view, and is available for attacking. Any target whose individual window is strictly
to the right of t_c will be available in the future for attacking, but cannot be attacked
immediately.
The third element is the target type. This is a pointer to a target type, so a
target file of the same name must exist. The fourth element is the starting state of
the target. The starting state must be a valid state. Currently, upon initialization,
the application checks for a valid state, then ignores this value. Targets are defined to
start in the first state listed; however, the flexibility exists in this implementation to
start in different states. At the end of every four-element set, the world file is checked
for the %end separator, which signifies the end of the file.
Generating world files is accomplished through interaction with the user. The user
is queried for the scenario parameters: a total allocation M, a time horizon H, the
number of targets N, the number of target types Y, and the target type names.
The generator then creates a world file with the appropriate M and H, and N targets,
each of which is one of the entered target types with probability 1/Y. In addition,
each target is given a window with random begin and end times within the time
horizon.
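As a rough illustration of this generation step, the sketch below writes a world file in the format of figure 3-5. It is not the generator used in this research; the class and method names are hypothetical, and it assumes every target is written with the first state of its type file (here "Undamaged") as its starting state.

import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

// Hypothetical sketch of a world-file generator; not the thesis implementation.
public class WorldFileGenerator {
    public static void generate(String fileName, int m, int h, int n,
                                String[] typeNames) throws IOException {
        Random rand = new Random();
        StringBuilder sb = new StringBuilder();
        sb.append("%horizon\n").append(h).append("\n");
        sb.append("%resources\n").append(m).append("\n");
        sb.append("%targets\n");
        for (int i = 0; i < n; i++) {
            int tb = rand.nextInt(h);                   // begin horizon in [0, H-1]
            int te = tb + 1 + rand.nextInt(h - tb);     // end horizon in [tb+1, H]
            String type = typeNames[rand.nextInt(typeNames.length)]; // each type with probability 1/Y
            sb.append(tb).append("\n").append(te).append("\n");
            sb.append(type).append("\n").append("Undamaged\n");      // assumed starting state
        }
        sb.append("%end\n");
        try (FileWriter out = new FileWriter(fileName)) {
            out.write(sb.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        // For example, a scenario of the same size as the one in figure 3-5:
        generate("sample.world", 50, 25, 6, new String[] { "Meuleau" });
    }
}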
3.2.3
Offline-MDP Calculation
Each target has been defined to have a reward matrix, transition probabilities, states,
and so on. Thus, different targets will have different value structures. The purpose
of the offline calculations is to solve an MDP for each target type which, given a
state, a time horizon, and an allocation, returns an expected value and an optimal
action associated with that value. The problem as defined in this thesis has a finite
time horizon, and the following value iteration equation applies for each target i:
V_i(s_i, t, m) = max_{a ≤ m} [ Σ_{s' ∈ S_i} T(s_i, a, s') ( R_i(s_i, s') + V_i(s', t+1, m−a) ) − c_i a ]        (3.6)
This equation computes the value of the target i as the probabilistically weighted
sum of the immediate reward for a transition to a new state s' and the value for being
in state s' in the next time step, with a fewer bombs allocated, minus the total cost
of dropping a bombs. This value is maximized over an action of dropping 0 to m
bombs. Each maximized value Vi(si, t, m) will have an associated optimal action, a.
Equation 3.6 is solved by beginning in the final time step, t = H. The values for
Vi(s', H, m - a) are zero, since there is no expected reward for being in the final time
step, regardless of allocation or state. Thus the values of V at t = H - 1 can be
calculated, then used to calculate the values of V at H - 2, and so on, until t = 0.
The solution is stored as value and action pairs indexed by allocation, time, and
state. A sample data file for a two-state target with a horizon of 15 and an allocation
of 10 is shown in figure 3-7.
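The backward sweep just described can be sketched as follows. This is an illustrative reimplementation of equation 3.6, not the thesis code; it assumes the damage model has already been expanded into per-action transition matrices T[a][s][s'], that R[s][s'] is the reward matrix, and that cost is the per-bomb cost.

// Illustrative sketch of the offline value iteration of equation 3.6 for one
// target type (hypothetical array layout, not the thesis data structures).
public class OfflineMdp {
    // Returns V[t][s][m], the expected value at time t, state s, allocation m.
    public static double[][][] solve(double[][][] T, double[][] R,
                                     double cost, int H, int M) {
        int nStates = R.length;
        double[][][] V = new double[H + 1][nStates][M + 1]; // V at t = H is zero
        int[][][] bestAction = new int[H + 1][nStates][M + 1];

        for (int t = H - 1; t >= 0; t--) {
            for (int s = 0; s < nStates; s++) {
                for (int m = 0; m <= M; m++) {
                    double best = Double.NEGATIVE_INFINITY;
                    int bestA = 0;
                    for (int a = 0; a <= m; a++) {       // drop 0 through m bombs
                        double value = -cost * a;
                        for (int s2 = 0; s2 < nStates; s2++) {
                            value += T[a][s][s2] * (R[s][s2] + V[t + 1][s2][m - a]);
                        }
                        if (value > best) { best = value; bestA = a; }
                    }
                    V[t][s][m] = best;
                    bestAction[t][s][m] = bestA;  // stored alongside V in the data files
                }
            }
        }
        return V;
    }
}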
3.2.4
Online-Resource Allocation and Simulation
The resource allocation algorithm used in this research is a greedy algorithm. This
algorithm begins by assigning all targets 0 bombs. Then for each target, it calculates
the marginal reward for adding one bomb to the target's allocation. It does this by
looking up the value in the offline results corresponding to that target's time index t,
Figure 3-7: A sample data file (value V and optimal action a entries, indexed by state, time step, and allocation)
state s, and m = 1 allocation. It then subtracts the offline results value corresponding
to that target's t, s, and m = 0. This is the marginal reward for changing a target's
allocation from 0 bombs to 1 bomb.
Once all targets have had their marginal rewards calculated, the greedy algorithm
allocates a bomb to the target with the maximum marginal reward. It then calculates
the marginal reward for adding another bomb to that target. If there is another bomb
left, the target with the maximum marginal reward is allocated a bomb, and so on
until either all bombs are allocated or the marginal reward for every target is 0. The
marginal rewards for target i are calculated according to the following equation:
Δ_i(s_i, m, t) = V_i(s_i, m+1, t) − V_i(s_i, m, t)        (3.7)
Thus, given a state si, a time step t, and an allocation m, the marginal reward equals
the difference between the expected reward at the current allocation plus one bomb
and the expected reward at the current allocation. The Vi(si, m, t) and Vi(si, m + 1, t)
values are retrieved from the data calculated in the offline phase.
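The greedy loop could be sketched as below. This is illustrative only: V[i] is assumed to be target i's offline table indexed [time][state][allocation] as in the earlier sketch, and timeIndex[i] is assumed to have already been adjusted for the target's window as discussed later in this section (an index of H corresponds to a table slice of zeros, so such targets are never allocated a bomb).

// Illustrative sketch of the greedy resource allocation using equation 3.7.
public class GreedyAllocator {
    public static int[] greedyAllocate(double[][][][] V, int[] state,
                                       int[] timeIndex, int totalBombs) {
        int nTargets = state.length;
        int[] alloc = new int[nTargets];               // every target starts with 0 bombs
        for (int bomb = 0; bomb < totalBombs; bomb++) {
            int bestTarget = -1;
            double bestMarginal = 0.0;                 // only positive marginal rewards count
            for (int i = 0; i < nTargets; i++) {
                double[][] slice = V[i][timeIndex[i]]; // values at this target's time index
                int maxAlloc = slice[0].length - 1;
                if (alloc[i] >= maxAlloc) continue;
                double marginal = slice[state[i]][alloc[i] + 1] - slice[state[i]][alloc[i]];
                if (marginal > bestMarginal) { bestMarginal = marginal; bestTarget = i; }
            }
            if (bestTarget < 0) break;                 // all marginal rewards are zero
            alloc[bestTarget]++;
        }
        return alloc;
    }
}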
After this greedy algorithm is complete, each target will have a certain number
of bombs allocated to it. In the data structures calculated in the offline phase, the
actions are paired with a set of a time index, a state, and an allocation. Thus, given
each target's t, s, and recently calculated m, the optimal action a is determined
directly from the data structures and put into a vector of actions for the first time
step.
This vector is now passed to the simulator, and the sum of these bombing
actions is subtracted from the number of bombs remaining.
The simulator takes these
actions and, based on the probabilities calculated in the damage model, assigns a
new state s' E S to each target i, then returns a vector of these new states.
After this is done once, the resource allocation algorithm runs for the next time
step, but this time the total available weapons counter is decreased by the sum of
actions from the previous step and the target states are updated. This loop continues
until the time step is equal to H.
One problem lies in calculating the valid t for equation 3.7. Each target exists
in an individual window, but the application only knows tb and te for each target,
and the current time step, tc. What value should be used to index the offline values?
There are three cases.
Target window type 1 is the simplest, corresponding to te <= tc for the target. In
this case, the target already "existed", and has now disappeared. This could occur if it
was a moving target that has moved into and out of range. The marginal reward
is zero for this target, as it will never be possible to damage it again. The greedy
algorithm will not even consider these targets, since there is no benefit to allocating
a bomb to them, and the actions associated with these targets will be to drop zero
bombs.
Target window type 2 is when tb <= tc < te. In this case, the target exists, as the
current time step falls in the target's individual window. The t value used is equal to
H - (te - tc). To understand this, it is important to remember that Vi is calculated
backwards from the final time step. te - tc corresponds to the number of time steps
left before the target's window closes. Thus, since the values in the data files were
calculated for a target whose window was of length H, the proper value for t is the
value for the same target type with the same number of time steps left in the window. Section 3.3.1
discusses the optimization implications of this implementation.
Target window type 3 is when tc < tb. In this case, the target has not "come into
view" yet. This could happen if a target is moving towards the attackers. It would
be very bad to treat this case the same as case 1, since no bombs would ever be allocated to
the target. For example, say this target has a reward of destruction of 1,000,000 and
another target, which is of type 2, has a reward of 10. If there is only one bomb, using
the naive "type 3 = type 1" method, the greedy algorithm would allocate that bomb
to the type 2 target. Then it may drop that bomb, and not have any to allocate to
this target. To avoid this problem, it is noted that for a fixed s and m, as t decreases
in equation 3.6, the values are nondecreasing. This corresponds to the idea that it is
worth more to have the same allocation in the same state if there is more time left
in the target's window. Thus the time index for a type 3 target is t = H - te, or the
total horizon minus the end time. The action, however, for a type 3 target is always
Figure 3-8: A timeline depicting the three types of target windows
drop zero bombs, since the target "does not exist" in the current time step, and all
bombs would by definition miss.
Figure 3-8 shows the progression of the types of a target over a horizon. The
target "exists" in the white part of the figure, when it is type 2. In this section,
the target is considered for the allocation of bombs, and its action is dependent on
the offline values. In the left grey area, the target does not exist yet, but will in the
future. The target is still significant to the problem as type 3, as it still needs to be
considered for weapon allocation. However, since the target does not exist yet, the
action for a type 3 target will always be to drop 0 bombs. Conversely, a type 1 target
in the right gray area no longer needs to be considered for weapon allocation. Again,
since the target does not exist, the action will always be to drop 0 bombs. For the
boundary cases, at tb and te, the target changes to the next class, as is shown in the
figure.
Determining the target indexes for the target types is simple. All target indexes
begin at t = H - te, and every time step, the index is incremented by one. When
t = H, the target window has passed, and the target is effectively removed from
resource allocation consideration.
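This bookkeeping could be sketched as follows (hypothetical method names; tb, te, and tc are the target's begin horizon, end horizon, and the current time step, and H is the horizon used for the offline values).

// Illustrative sketch of the time-index bookkeeping for the three window types.
public class TargetWindows {
    public static int timeIndex(int te, int tc, int H) {
        // Starts at H - te when tc = 0 and increases by one each step;
        // once it reaches H (tc >= te), the window has passed (type 1).
        return Math.min(H, H - (te - tc));
    }

    public static boolean actionMustBeZero(int tb, int te, int tc) {
        // Types 1 and 3: the target is not currently in view, so the action is drop 0.
        return tc >= te || tc < tb;
    }
}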
3.3
Implementation Optimization
The first version of the totally observable resource allocation method took a great
deal of time to calculate the offline values. Optimization did not seem to be a luxury,
but rather a necessity to make the problem more tractable. It is possible to decrease
the total amount of calculation by exploiting certain aspects of the MDP.
3.3.1
Reducing the Number of MDPs Calculated
As mentioned in section 3.2.4, the value of two different targets of the same target
type is the same if they have the same time index, meaning only one MDP calculation
is required for each target type. At first, one MDP was calculated for each target.
This took a great deal of time and computation. However, since the only thing
that changes in the calculation of Vi between these MDPs is the t in equation 3.6, the
calculations were duplicated effort. They need only be done once, for
a maximum horizon, H. Once the maximum horizon MDP has been calculated, the
a maximum horizon, H. Once the maximum horizon MDP has been calculated, the
online phase just needs to select t properly. Making this change reduced the number
of MDPs from the number of targets to the number of target types. Though this
does increase the size of the MDPs, in most realistic world scenarios, the increase
in computation caused by the increased number of time steps is much less than the
computation time to calculate an MDP for every target.
It would be possible to only calculate the MDP for the maximum target horizon
and then use those MDP values in the online phase. This actually optimizes the
solution method for one scenario. But a large horizon can be selected for compatibility
with future scenarios. If a horizon of 100 is calculated, then any problem with a
maximum target horizon of 100 or less would have already been calculated.
3.3.2
Reducing the Computational Complexity of MDPs
In the MDP calculation, for a fixed time and state, the values converge as the
allocation increases. As m increases, the values increase until they reach a maximum at an
allocation m*. The values can never go down as m increases because there is no
cost for allocating another bomb, only for dropping it. Thus if it was determined that
the marginal reward for dropping another bomb is less than the cost of a bomb, the
action would be zero. This would be a marginal increase in reward of zero.
At a certain point, the cost does exceed the marginal reward. At this point,
Vi(si, t, m*) = Vi(si, t, m* + 1) = v. When this happens, no matter how many more
bombs are allocated, the reward will never go up. Thus, instead of continuing the
calculations for m* < m <= M, a large amount of computational time is saved by
copying v into the data file for the appropriate indexes.
3.4
Implementation Flexibility
This solution model is designed to be flexible, so that it can be extended to the
partially observable case, and modular, so that the same code can be re-used. To this
end, several components have been designed to be computed independently from the
working model code. These components can be changed for different problems and
different research.
The damage model used here, an extension of the "noisy-or" model, is adequate for
this problem, but other damage models may make more sense in different applications.
It is easy to conjecture
a scenario where two bombs that would individually make a state transition from
undamaged to partially damaged, when considered together, might make the target
transition to destroyed. Because of this, the damage model is one of the flexible
modules.
The greedy algorithm is not the optimal solution. We chose it because our goal is
to keep almost everything the same, but change the problem from totally observable
to partially observable. However, the resource allocation code is designed so that a
different solution algorithm can easily be implemented.
One benefit of having the online and offline parts completely separate is that
the offline part can be done once, then the online part can be done over and over
to get averaged experimental results. Another benefit to this model is that battle
planners can do complete calculations for different target types before a mission,
creating a database, then use the appropriate values from this database in future
missions.
Since most of the calculation is done offline, the computational cost of
the database calculation will be separated from the mission planning.
Thus, with
a fully instantiated database, a battle scenario needs only to be modelled as a new
world file, and the online computational costs will be negligible compared to the
database construction. This will allow for faster response to dynamic situations on
the battlefield.
3.5
Experimental Comparison
The purpose of creating a working implementation of the research in Meuleau et
al. is to have something on which to base the partially observable approach. To
this end, this section compares the results from the implementation created in this
research with the results from the paper. All software in this thesis was implemented
in Java on a 266 MHz computer with 192 MB of memory, running RedHat Linux 7.0.
Meuleau et al.'s paper shows the offline results for an MDP for a single target
problem.
The target has two states: S = {undamaged, damaged}, an individual
target horizon of 10, a single-bomb hit probability p of 0.25, a reward of 90 for
destruction, and a bomb cost of 1. Figure 3-9 from their paper depicts the optimal
number of weapons to send for each time step, given that the target is undamaged.
The plot is monotonically increasing. This makes sense, since if the target will be
around a long time, it is best to spread out the attacks. Given an extended period
of time, the optimal policy would be to drop a bomb and determine if it damaged
the target. If it did not, drop another and then determine if it damaged the target.
Spacing out the attacks prevents the waste of bombs. However, once it gets closer to
the end of the horizon, it becomes more important to make sure that the target is
destroyed, so more bombs should be used. If at any time the target is damaged, the
optimal action is, of course, to drop no bombs, since there will never be a positive
reward for doing so.
                      Time
Allocation   0  1  2  3  4  5  6  7  8  9
     0       0  0  0  0  0  0  0  0  0  0
     1       0  0  0  0  0  0  0  0  0  1
     2       0  0  0  0  0  0  0  0  1  2
     3       0  0  0  0  0  0  0  1  1  3
     4       0  0  0  0  0  0  1  1  2  4
     5       0  0  0  0  0  1  1  1  2  5
     6       0  0  0  0  1  1  1  2  2  6
     7       0  0  0  1  1  1  1  2  3  7
     8       0  0  1  1  1  1  1  2  3  8
     9       0  1  1  1  1  1  2  2  3  9
    10       1  1  1  1  1  1  2  2  4  10
    11       1  1  1  1  1  1  2  2  4  11
    12       1  1  1  1  1  1  2  2  4  11
    13       1  1  1  1  1  1  2  3  4  11
    14       1  1  1  1  1  1  2  3  5  11
    15       1  1  1  1  1  2  2  3  5  11
    16       1  1  1  1  1  2  2  3  5  11

Table 3.3: Optimal actions across time and allocation
Figure 3-9: Meuleau et al.'s graph of optimal actions across time
This problem was run using the implementation presented in this chapter, and
table 3.3 shows the results. A major difference between the data from this trial and
the data presented in the paper is that this experiment was run over several different
allocations. The data is displayed graphically in figure 3-10.
The graph shows that, regardless of the allocation, the policy is to drop
fewer bombs at first, then more bombs as the target gets closer to the end of its
horizon. This coincides with both common sense and with the data from Meuleau et
al.'s paper. The other result obtained is that the optimal policy at allocations of 15 and
above is the same as that in the paper; at these allocations, the resource-constrained
policy converges to the paper's infinite-resource optimal policy.
In addition, it is clear that 11 bombs is the maximum action that will be taken,
regardless of how many bombs are available to drop. At this point, according to the
damage model the marginal reward for dropping one more bomb is less than the cost
Figure 3-10: Optimal actions across time and allocation
of that bomb. The expected reward E_r11 for dropping 11 bombs (not including the
cost of the bombs) is

E_r11 = R(undamaged, damaged) × (1 − q^11) = 90 × (1 − 0.75^11) = 86.1988,

and by the same equation, the expected reward E_r12 is 87.1491. The difference
between these two values is 0.9503, which is less than the cost of 1 that an extra bomb
would incur.
Based on this trial and thorough checking, the completely observable implementation as described in this chapter works and produces expected results.
Chapter 4
Dynamic Partially Observable
Implementation
The previous chapter defined and discussed an approach towards, and implementation
of, a completely observable resource allocation problem. Though this approach works
well in ideal situations with perfect sensors, it is not very realistic. This chapter will
discuss the extensions to the previously defined model, describe the implementation,
and analyze the approach in relation to the completely observable case as well as in
other interesting cases.
The problem that this chapter addresses is again one in which there are several targets to bomb and each is damaged independently. Similarly to the previous chapter,
the problem could be modelled as a POMDP with an extremely large state space.
However, even MDPs grow too fast to solve a single one that describes the entire
problem, and POMDPs grow much faster than MDPs. Thus, using a single POMDP
to solve the entire problem is unrealistic.
Therefore, in this model, an individual
POMDP for each target is computed offline and the solutions are integrated online.
The online process is to make the overall allocation of weapons to targets and send
out a round of weapons, much like the previous chapter. But now it is necessary to
get observations of which targets were destroyed, instead of knowing exactly which
ones were. The beliefs of the states of these targets are recomputed and weapons are
reallocated. Then either a round of weapons or an observation mission is sent out
for each target, and this online process repeats until the mission is over.
4.1
Additions to the Completely Observable Model
The completely observable model used MDPs, which did not address incomplete state
information. To make a change to a partially observable problem scenario, we must
shift to using POMDPs. There are other required changes associated with this shift,
including belief states, observation models, and a state estimator.
4.1.1
POMDPs
Equation 3.6 for calculating expected values no longer applies. In the completely
observable case, the state space is discrete, so it is feasible to calculate all possible
values based on all previous states weighted by their relative probabilities. In the
partially observable case, however, the effective state space (the space of belief states)
is continuous, which causes the problem to grow much faster than in the completely
observable case. Therefore, enumerating all
possible values becomes intractable for anything other than very small problems.
This can easily be understood by considering a completely observable example
with a simple two-state target which is either alive or dead. It is assumed that a
bomb has a hit probability of 0.5 and a cost of 1, and for simplicity A = {drop 1,
drop 0}. The reward for destroying the target is 10.
Calculating the values for this trivial problem is simple. First, the immediate
rewards can be calculated. For dropping one bomb, the immediate reward for s' =
dead is the reward multiplied by the probability of destruction, minus the cost for the
action: (0.5 * 10) - 1 = 4. The immediate reward for s' = alive is -1. For dropping
zero bombs, this value is zero. These immediate rewards apply for all time steps.
In the final time step, H, all rewards are zero. It is simple to use the value iteration
equation 3.6 to calculate all possible values for each s and s' in t = H - 1, then use
these results to calculate all possible values for t = H - 2, and so on.
However, once the state space becomes continuous it is no longer simple to enumerate all possibilities. What if dropping a bomb hits a target with probability 0.5,
but there is only a 30% chance of recognizing that the target is destroyed? The value
iteration equation does not deal with this extra information. The solution is to use
a POMDP formulation for the problem, for which several algorithms exist which can
feasibly solve the partially observable problem.
4.1.2
Strike Actions vs. Sensor Actions
In the MDP case, all actions affected the actual state of the model. Each action
had a transition model (also called the damage model) based on the initial transition
matrix T. These actions are called strike actions. In the completely observable case,
strike actions had an implicit observation model 0 which produced an actual state
observation z = s with probability 1.
However, in the partially observable model, all strike actions are defined to produce
no information about the target. Therefore, a new class of actions, sensor actions,
must be introduced. These actions are defined to have no effect on the state of a
target, but instead they will return information of the state of the target.
In a POMDP in general, an action will have an effect on the state of a target and
produce an observation of the state, but for simplicity in this model, this will not be
the case. In this model, every action is either a strike action or a sensor action.
4.1.3
Belief State and State Estimator
Now that the shift has been made from complete to partial observability, the output
from the simulator is no longer a definitive state. Instead, it outputs an observation
based on the action taken and the actual state of the target. Therefore, at every
time step, for every target, it is necessary to calculate a new belief state b. Basic
probabilistic manipulation yields the equations necessary for this update, which will
be covered in section 4.3.6.
At every time step, a new belief state must be created for each target, dependent
on previous belief state, the action taken, and the observation received, shown in
figure 4-1. The observation refers to the observation z received from the simulator
Figure 4-1: The general state estimator
Figure 4-2: The state estimator for a strike action
after the action performed, a. The action transition model has an a priori probability
that the target will change state, and the state estimator combines this information
with the observation to update the prior belief state.
The state estimator is split into two cases based on the above assumption that
strike actions and sensor actions are mutually exclusive. For strike actions, the new
belief state only depends on the previous belief state and the action taken, as seen
in figure 4-2.
Since a strike action yields no information from the simulator, the
observation arc from figure 4-1 can be eliminated. Conversely, for sensor actions,
the new belief state only depends on the previous belief state and the observation
received, as seen in figure 4-3. This is because the sensor actions never cause a target
to change state, so the action arc can be eliminated from figure 4-1. These two cases
are proved in section 4.3.6.
Figure 4-3: The state estimator for a sensor action
4.2
The Partially Observable Approach
One way to solve the dynamic resource allocation problem would be to combine
Meuleau et al.'s Markov Task Decomposition with Castanon and Yost's use of POMDPs.
To do this, we will decouple each target, as in MTD, solving a POMDP for each target
type. The results from these POMDPs will be used to determine an allocation and
then optimal action for each target. Then these actions will be executed, and the
observations from each action will be used for future resource allocation. However,
resource constraints must be applied to recouple the targets. Yost uses an LP solver
to handle resource constraints, and Meuleau et al. use the value iteration equation,
which takes into account that performing action a reduces the total resources by a
bombs. We used code developed by Cassandra [1] to solve POMDPs. The code takes
in a POMDP definition input file and produces H output files of parsimonious vector sets, as described in appendix A. However, this code does not explicitly handle
resource constraints.
4.2.1
Resource Constraint Problem
The problem with this "simple combination" approach is that the POMDP solver code
was not designed to deal with consumable resources. The only thing that prevents
the policy from taking the most expensive actions is the cost alone, which means that
every action is considered to be viable at every time step. This can lead to resource
constraint violations in the resource allocation problem presented in this thesis. For
example, if there are 10 bombs total for a mission with one target with a time horizon
of 3, Cassandra's POMDP algorithm could say that the best action at every time step
is to drop 10 bombs. This is likely if the cost for dropping bombs is low and the reward
for destruction is high. However, this would mean that the policy says to drop 30
bombs total, which is a violation of the 10 bomb global resource constraint.
The solution to this problem is to increase the state space of the POMDP to
describe not only the actual state of the target, but also the number of bombs allocated to the target. With careful selection of transition and reward matrices, the
POMDP solver code will actually consider the allocation as part of the problem, and
deterministically shift the allocation space while probabilistically calculating the real
state space.
This increases the size of the state space of the problem from |S| to |S| × (M + 1).
That is, the number of states is now the size of the state set multiplied by the number
of possible allocations, including the zero allocation. For example, for a target with
two states, alive and dead, and two total bombs allocated to it, the associated POMDP
would have six states, S = {2alive, 2dead, 1alive, 1dead, 0alive, 0dead}.
There is a tradeoff associated with this approach, and that is the size (number of
coefficients) of the vectors in the POMDP. The maximum number of vectors, |Γ_t|, in
time step t is [9]:

|Γ_t| = |A| |Γ*_{t−1}|^{|Z|}        (4.1)

Of course, many vectors in this set will be pruned out, resulting in a parsimonious
set that is most likely much smaller than this worst case.
Equation 4.1 shows that the number of vectors grows linearly with |A| and exponentially
with |Z|, but not at all with |S|. This is not to say that increasing the
size of the vectors does not increase the number of calculations required to solve a
POMDP, as it increases the dimensional space of the vectors, but the algorithms to
solve POMDPs work to prune vectors. Thus, since this approach does not add to the
number of vectors, it has a minimal impact on solving for a parsimonious set, and
therefore does not affect the asymptotic running time of the incremental pruning
solution algorithm.
The transition probabilities must be updated as well. Whereas before, it was easy
to list the hit probability of dropping one bomb in a two-state POMDP as T(alive,
drop 1, dead) = 0.3, for example, it becomes more complicated now. Now, for every
action from dropping zero to dropping M bombs, a transition probability must be
declared which is T(m alive, drop a, (m-a) dead) = 0.3, T(m alive, drop a, (m-a)
alive) = 0.7, and so on.
The transition model expansion is shown in figure 4-4.

Figure 4-4: Expanded transition model for M = 3
The white set of probabilities represents dropping 0 bombs, the light gray set represents
dropping 1 bomb, the medium gray set represents dropping 2 bombs, and the dark
gray set represents dropping 3 bombs. When an action is executed, all transition
probabilities that are not in that action's set are set to zero. Consider a target starting
in state 3alive on which the action drop 2 is performed. The transition probabilities
from state 3alive to all states except those defined in the medium gray box (1alive,
1dead) are set to zero. Thus, T(3alive, drop 2, 1alive) = 0.4 and T(3alive, drop 2,
1dead) = 0.6. This increases the number of declarations necessary in the input file,
but the actual state transition probabilities have not changed.
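This expansion could be sketched as follows. The code is illustrative, not the input-file generation used in this research; baseT[a][s][s'] is assumed to hold the damage-model probabilities for dropping a bombs, and one extra state index is reserved at the end for the limbo state introduced in the next section.

// Illustrative sketch of the allocation-and-state expansion of figure 4-4.
public class TransitionExpansion {
    public static double[][][] expandTransitions(double[][][] baseT, int M) {
        int nStates = baseT[0].length;
        int nAug = (M + 1) * nStates + 1;                  // last index reserved for the limbo state
        double[][][] T = new double[M + 1][nAug][nAug];    // T[a][augFrom][augTo]
        for (int a = 0; a <= M; a++) {
            for (int m = 0; m <= M; m++) {
                for (int s = 0; s < nStates; s++) {
                    int from = (M - m) * nStates + s;      // bin m, physical state s
                    if (a > m) continue;                   // impossible action: see section 4.2.2
                    for (int s2 = 0; s2 < nStates; s2++) {
                        int to = (M - (m - a)) * nStates + s2;
                        T[a][from][to] = baseT[a][s][s2];  // same probabilities, shifted bin
                    }
                }
            }
        }
        return T;
    }
}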
4.2.2
Impossible Action Problem
Treating weapon allocation as part of a target's state creates another problem. Since
every action is still allowable in every state, what happens if the state dictates that
there are 5 bombs available for dropping, but the action to be considered is drop 6
or more bombs? There are a couple of ways of addressing these impossible actions.
The first is to transition to the same state with no reward.
The problem with
this is that a POMDP solves for the future by taking the immediate reward for
that action given that state and adding it to the probabilistic future rewards for
that target. However, it is likely that the future reward from that particular state
is nonzero, assuming that the cost of dropping bombs does not always exceed the
probabilistic reward for destroying a target. This means that the impossible action
could be considered in a policy, which produces undesired results. At best, it adds
possible vectors to be considered at each step, which grows the problem exponentially.
At worst, it causes the policy to list an impossible action as the optimal choice.
Another way of addressing the problem is to transition to the zero allocation, completely undamaged state, which if defined properly should have zero future reward.
However, there are times that a transition to this state is valid, such as dropping as
many bombs as allocated and missing with all of them. This can lead to confusion,
and is not a very consistent model.
To solve this problem, a new limbo state is introduced. Any time an impossible
Figure 4-5: Expanded transition model for M = 3 with the limbo state
action is performed, the target transitions to this state, as can be seen in figure 4-5. When an impossible action is attempted, the transition model only considers the
black transition set. The limbo state is an absorbing state, as any action taken while
in this state transitions the target back to limbo. An impossible action is defined
to have a cost of ca, where c is the cost per bomb and a is the number of bombs
dropped. All rewards from this state are 0, and all observations from this state are
nothing (discussed in the next section). Thus the POMDP will eliminate the vector
associated with these actions in each epoch, since the immediate reward is negative,
the future reward will be zero, and the belief state does not change. Not only does
this solution save computation, but it also maintains resource constraints.
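Continuing the earlier transition-expansion sketch, which reserved the last state index, the limbo rows could be filled in as follows (illustrative only, not the thesis implementation).

// Illustrative sketch of the limbo entries of figure 4-5.
public class LimboExpansion {
    public static void addLimbo(double[][][] T, int M, int nStates) {
        int limbo = (M + 1) * nStates;             // the reserved last index
        for (int a = 0; a <= M; a++) {
            for (int m = 0; m <= M; m++) {
                for (int s = 0; s < nStates; s++) {
                    if (a > m) {                   // impossible action from bin m
                        int from = (M - m) * nStates + s;
                        T[a][from][limbo] = 1.0;   // transition to limbo with probability 1
                    }
                }
            }
            T[a][limbo][limbo] = 1.0;              // limbo is absorbing under every action
        }
    }
}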
4.2.3
Sensor Actions
In the MDP case, the state transition matrix for a strike action was nontrivial, whereas
the implicit observation matrix for a strike action was the identity matrix. Thus, after
every strike action, the state was determined, since the appropriate observation for
the actual state resulted.
However, this is unrealistic. In real-world scenarios, a pilot may see nothing but
smoke from a dropped bomb. This research assumes that a strike action provides no
observation. This can either be modelled as a nothing observation, or as a uniform
distribution across all observations, as the change in the belief state is independent
of the observation in both cases. The mathematical reasoning behind this statement
will be discussed in section 4.3.6.
However, there must be a way to receive information about a target, given that
none is gleaned from strike actions. Perhaps the commander of a military unit may
send a UAV or quick manned aircraft with powerful sensors to do reconnaissance. To
model this, a new observation class of actions must be created. These are defined
to have a nontrivial observation probability matrix, where an observation obtained
is based on the new state of the target and the action taken, whereas the transition
matrix is the identity matrix. Thus, any observation action does not change the state
of the target, only the target's belief state.
Sensor actions have a cost associated with the action, but there will never be
an immediate reward, since a reward is gained on a transition, and the sensor class
actions have an identity transition matrix. Conversely, strike actions have a cost
equal to the cost of a bomb multiplied by the number of bombs dropped, and may
also have an immediate reward if the transition model so dictates.
4.2.4
Belief States
For the simple two-state "alive-dead" POMDP, the belief state is simply [b(alive)
b(dead)], but now it must be expanded to include allocation. However, it is important to remember that the allocation is deterministic. Thus the nontrivial belief state
has not changed in size, only the actual size of the belief vector. The belief state will
be filled in on either side by anywhere from 0 to M "bins" of zero-belief probabilities.
Each bin corresponds to a distribution across states for a given allocation. The convention used in this thesis is that the leftmost bin in the belief state is the maximum
allocation, and the rightmost bin is the zero allocation. A generic belief state would
be:
[ <bin M> <bin M-1> ... <bin 1> <bin 0> ]
To better understand this representation, consider a simple two-state alive-dead
POMDP with a maximum allocation of six bombs. Initially, assuming the target is
known to be alive, the belief state is:
[ <1 0> <0 0> <0 0> <0 0> <0 0> <0 0> <0 0> ]
The < and > are added for ease of understanding, but do not actually exist in the
representation of the belief state. Assume the first action is to drop one bomb. Let
Pat be the probability that the target is alive at time step t and Pdt the probability
that the target is dead. At time step 1, the belief state is:
[ <0 0> <Pa1 Pd1> <0 0> <0 0> <0 0> <0 0> <0 0> ]
Assuming the second action is to drop two bombs, the updated belief space is
shifted two bins to the right, corresponding to having three bombs left:
[ <0 0> <0 0> <0 0> <Pa2 Pd2> <0 0> <0 0> <0 0> ]
Finally, assume the next action is to drop three bombs. This shifts the nonzero
bin three to the right, putting it in the zero-bomb bin:
[ <0 0> <0 0> <0 0> <0 0> <0 0> <0 0> <Pa3 Pd3> ]
It is clear that the values within the bins change probabilistically, but the movement
across bins is deterministic. This is important, because it means the problem has not
really changed from a simple two-state POMDP. The extra states that have been
added simply serve to provide information to the implementation so that resource
constraints will be honored.
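A sketch of this representation, using the bin convention above, is shown below (illustrative helper methods, not the thesis data structures). The initial belief here corresponds to the initialization used in the online phase of section 4.3.4, where no bombs have yet been allocated.

// Illustrative sketch of the binned belief state of section 4.2.4.
// The vector has (M+1) bins of |S| entries; bin M is leftmost, bin 0 rightmost.
public class BeliefBins {
    public static double[] initialBelief(int nStates, int M) {
        double[] b = new double[(M + 1) * nStates];
        b[M * nStates] = 1.0;     // zero-allocation bin, first state, probability 1
        return b;
    }

    // Dropping a bombs shifts the nonzero bin a positions toward the zero-allocation
    // bin; the probabilities inside the bin are updated separately by the state estimator.
    public static double[] shiftBins(double[] b, int nStates, int a) {
        double[] shifted = new double[b.length];
        for (int i = 0; i + a * nStates < b.length; i++) {
            shifted[i + a * nStates] = b[i];
        }
        return shifted;
    }
}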
4.3
Implementation
So far we have presented a high-level description of the partially observable approach.
This section describes the approach in more detail, discussing the architecture, modelling, and offline/online computation. Mathematical detail is provided where applicable.
4.3.1
Architecture
The architecture for the partially observable approach, shown in figure 4-6, is similar
to that of the completely observable approach, but a few additions are necessary.
The target and world input files now include new information required for partially observable models. The offline data structure is now a directory structure in
which every POMDP solved, uniquely identified by its target type, time horizon, and
allocation, has its own .alpha file (Cassandra's vector output file) for each epoch.
Figure 4-6: The partially observable implementation architecture
The online phase now requires a state estimator to determine a belief state from
the observations received from the simulator. The estimator combines an observation,
action, and a belief state from the previous time step, and yields a new belief state.
This estimated state is passed to the resource allocation module, which uses it to get
the optimal allocation and actions for each target from the data files. The action for
each target is then sent to the simulator.
The resource allocation module now has to calculate the optimal allocations and
actions using belief states instead of merely looking them up in a preprocessed data
file, as in the completely observable approach. The simulator takes in a vector of
actions and now outputs an observation vector instead of a state vector.
4.3.2
Modelling
The target modelling files for the MDP are augmented to address a POMDP implementation. A sample target file for this approach is shown in figure 4-7. Immediately
after the %states section, a new observations section has been added. This section
begins with %observations, and every subsequent line contains the name of an observation. These names make up the observation set Z. The nothing observation is
not included in the target file, but is hard-coded, since that observation is common
to all targets.
The next section, beginning with the %obsactions separator, is used to describe
the names and the observation probabilities of the sensor actions. The first line after
the separator is an integer, ao, representing the number of sensor actions. After that
are ao sets of three elements: sensor name, cost, and observation matrix.
The first element, the name, is added to the set A.
After the initialization is
done, A will contain M + 1 + ao actions. The next element is the cost, c_oi, of
sensor action i. Since transitions do not happen with sensor actions, the expected
immediate reward for this particular sensor action is -c_oi with probability 1. The
final element, the observation matrix Oi, is an |S| x |Z| matrix, where the rows represent
states a target has just transitioned to, and the columns represent observations. As
an example, the probability of this target observing observation number 3 given it is
%states
Undamaged
Damaged
%observations
Undamaged-obs
Damaged-obs
%obsactions
1
look
1
.8 .2
.2 .8
%cost
4
%rewards
0 20
0 0
%transProbs
.5 .5
0 1
%end
Figure 4-7: A sample partially observable target file
in state 1 after taking observation action 2 is O2(1, 3).
The remaining elements from the target file are loaded as before. The POMDP
requires a new set of observation probabilities and observation actions, and the target
file modelling now provides these.
4.3.3
Offline-POMDP Calculations
The partially observable resource allocation code is in Java, and the POMDP solver
code is in C. They communicate through a text file. The use of text files to model the
world allows online and offline phases to be run separately, to get a series of online
results with one offline computation.
The offline phase creates for each target type an input file for the POMDP solver.
From the target type file, S, A, and Z are extracted and placed in the header part of
the file, adding a limbo state. Then for each element in T, an appropriate transition
entry for all non-impossible actions is created, while impossible actions are set to
transition to the limbo state. This process is repeated for the observation matrices. A
reward entry for every action is created, dependent on expected reward for a transition
minus the number of bombs dropped. The impossible actions are listed as well, as
these actions will have a negative expected value and will be pruned out in the
POMDP solution.
Something interesting to note is that even though the limbo state has been added
to the state space, its value in the solution file is actually never used online. Offline,
the limbo state forces the POMDP solver to prune out vectors which transition to it.
Since it is always zero by definition, it can be ignored in the online resource allocation
section.
Once a file is created for each target type, the POMDP solver is run. The command line execution is:
pomdp-solve-4.0/pomdp-solve -save-all -horizon <H> -epsilon 1E-9
-o POMDPSols/<target type>_allocation<M>_horizon<H>-E-9/solution
-P <target type>.POMDP
This command runs the POMDP solver code, saving all epochs, with horizon H,
and epsilon 10^-9. The input file has been saved previously as <target type>.POMDP.
It is not necessary to specify incremental pruning as the POMDP solution algorithm,
since Cassandra's code uses this algorithm as a default.
The .alpha files will be saved into the directory POMDPSols, subdirectory listed
by its allocation, horizon, and epsilon, and will have a file prefix of solution. These
text files are for future use by the online phase.
4.3.4
Online-Resource Allocation
Once the offline phase has been completed, the online phase begins.
The online
execution loop begins with the greedy algorithm, which calculates optimal allocations
for all targets. Then the policy mapper determines the optimal actions for all targets
based on their recently computed allocations. These actions affect the targets in the
world. Finally a state estimator takes the observations from the world, updates each
target's belief state, then passes these belief states back to the greedy algorithm for
the next time step.
Each target's belief state is initialized to the first actual state, allocation zero, with
probability one. Thus, all bins but the zero-allocation bin are filled with |S| zeros,
and the zero-allocation bin has a 1 in b(s0) and zeros for all other b(s). In addition,
a resources remaining counter, w, is initialized to M. Subsequent iterations of the
resource allocation phase will use an updated counter and also an updated belief state
for each target from the state estimator, so this is the only time the nonzero portion
of the belief state will be set to [ 1 0 ... 0 ].
For bombs 0 through w, the marginal reward for allocating that bomb is calculated
for each target. To do this, it is first necessary to determine the time index of the
target, t, which is the number of remaining time steps that the target will "exist".
This is calculated by subtracting the current time step tc from the target's end time,
te. If t <= 0, then the target's window has closed, and no bombs will be allocated to
this target. If it is strictly positive, then the marginal reward can be extracted from
the offline values.
For each target i, determining the marginal reward r_i^m for adding a bomb when m
bombs are allocated is done by iterating across all vectors in the .alpha<t_i> file. We
define b_i^m to be a belief state for target i in which the non-trivial allocation bin is
bin m; on the initialization step, m is zero. This belief state is copied as belief state
b_i^{m+1}, which is the same belief state with an extra bomb allocated. Now, for each
vector γ, the maximum expected value of having m bombs allocated is calculated by
taking the dot product of γ with the belief state b_i^m and selecting the maximum over
all vectors. Similarly, the maximum expected value of having m + 1 bombs is the
maximum of the dot products of every vector with b_i^{m+1}. The marginal reward for
allocating a bomb to this target is the difference between these two values:

r_i^m = max_{γ ∈ Γ*} [ γ · b_i^{m+1} ] − max_{γ ∈ Γ*} [ γ · b_i^m ].
The calculation of the time index and marginal reward will be performed once
for each target. The resource allocation algorithm will then select the maximum of
these values and assign one bomb to that target. In this case, the nonzero portion
of the belief state for this target shifts to the next higher bin. For example, the
nonzero portion of the belief state for a target that is allocated its first bomb will
move from the zero-allocation bin to the one-allocation bin. Also, assuming there are
more bombs to be allocated, the new marginal reward for one bomb will be calculated
for this target with the new belief state and can be compared to all the other targets'
marginal rewards, which have not changed. The resource allocation algorithm then
repeats these steps until all bombs are allocated, or until all marginal rewards are
zero.
After the resource allocation phase, the next step is to determine the appropriate
actions corresponding to these allocations. This comes from the same .alpha<ti>
file for every target as before. Since the greedy algorithm has shifted every target's
nonzero belief state to the appropriate allocation bin, the optimal action a_i is the
action corresponding to the maximum value of the dot product of each vector γ and
the belief state b_i. The index of the vector with the maximum dot product is

γ_max = argmax_{γ ∈ Γ*} [ γ · b_i ],

and a_i is set to the action associated with the γ_max vector in the .alpha<t_i> file. A
vector of target actions, A, is created. Finally, the resource counter is decreased by
the sum of the bombs dropped in the strike actions.
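These dot-product computations could be sketched as follows (illustrative code; vectors[k] and actions[k] are assumed to have been parsed from the .alpha file for the target's time index).

// Illustrative sketch of the alpha-vector computations of section 4.3.4.
public class AlphaVectors {
    public static double maxValue(double[][] vectors, double[] belief) {
        double best = Double.NEGATIVE_INFINITY;
        for (double[] gamma : vectors) {
            double dot = 0.0;
            for (int s = 0; s < belief.length; s++) dot += gamma[s] * belief[s];
            best = Math.max(best, dot);
        }
        return best;
    }

    // Marginal reward of allocating one more bomb: the belief with its nonzero
    // bin at allocation m+1 versus the belief at allocation m.
    public static double marginalReward(double[][] vectors,
                                        double[] beliefM, double[] beliefMPlus1) {
        return maxValue(vectors, beliefMPlus1) - maxValue(vectors, beliefM);
    }

    // Optimal action: the action attached to the vector with the maximum dot product.
    public static int optimalAction(double[][] vectors, int[] actions, double[] belief) {
        int argmax = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < vectors.length; k++) {
            double dot = 0.0;
            for (int s = 0; s < belief.length; s++) dot += vectors[k][s] * belief[s];
            if (dot > best) { best = dot; argmax = k; }
        }
        return actions[argmax];
    }
}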
4.3.5
Online-Simulator
The simulator is run using the action vector. For every target, the simulator updates
the state of the object using the transition probabilities for that target's state and
action T(s, a, s') for all s'.
Once the states have been updated, the simulator now returns a vector of observations, Z. The observation returned for a given target is dependent on the action
class. For a sensor action, a_o, the observation returned will depend on the observation probabilities for that target's new state and action: O(s', a_o, z). For a strike
action, however, a uniform distribution across Z, excluding the nothing observation,
is returned. This distribution corresponds to the nothing observation in the POMDP
solver, as it does not provide any information as to the state of the object. This
statement will be proved in the next section.
4.3.6
Online-State Estimator
The next step in the online phase is to update the belief states. For ease of understanding, this section will focus on the nonzero bin of the original belief state. For
example, if a target is allocated two bombs, the 2-allocation bin will be nonzero,
and this section will discuss the changes in that bin. The state estimator only cares
about the nonzero bin because it uses probabilistic rules to change the belief that the
target is in a particular physical state, whereas after the action is taken, the nonzero
probabilities will shift to the zero-allocation bin with probability 1. The belief state
b is made up of |S| probabilities [ b(s1) b(s2) ... b(s|S|) ].
Given the above definition of the belief state, the updated belief state for a target,
b_z^a(s'), given an action a and an observation z, is provided by Bayes' Theorem:

b_z^a(s') = O(a, s', z) Σ_s T(s, a, s') b(s) / [ Σ_{s''} O(a, s'', z) Σ_s T(s, a, s'') b(s) ],        (4.2)

where O(a, s'', z) = Pr(z | s'', a).
Intuitively, this makes sense. The probability that the target is in a given state
s' is a function of the sum of the previous belief that the target was in a state and
transitioned to s' given a, multiplied by the probability of making the observation z
given s' and a, and normalized by all possibilities from s to s'.
The previous section stated that a nothing observation is equivalent to a uniform
distribution of all other observations. This statement can now be proved. A nothing observation occurs with probability 1 for all strike actions, regardless of state.
Therefore, for a strike action a, equation 4.2 reduces to

b^{a=strike}_{z=nothing}(s') = (1) Σ_s T(s, a, s') b(s) / [ Σ_{s''} (1) Σ_s T(s, a, s'') b(s) ],

or

b^{a=strike}_{z=nothing}(s') = Σ_s T(s, a, s') b(s) / Σ_{s,s''} T(s, a, s'') b(s).        (4.3)
This is for an observation of nothing. Since no other observations can occur,
the belief state for other observations need not be calculated. If every strike action
produces the same observation, the updated belief state will only depend on the
previous belief state and the transition probabilities.
But the claim was that a new nothing observation is equivalent to setting the
probability across other observations to be uniform. Let p_z = 1/|Z'|, where Z' does
not contain the nothing observation. Now equation 4.2 reduces to

b^{a=strike}_{z=uniform}(s') = p_z Σ_s T(s, a, s') b(s) / [ Σ_{s''} p_z Σ_s T(s, a, s'') b(s) ]
                     = Σ_s T(s, a, s') b(s) / Σ_{s,s''} T(s, a, s'') b(s),

which is the same as equation 4.3.
We now discuss sensor action belief state updates. If a sensor action is taken, no
state transition occurs, which can be modelled as an identity transition matrix. Thus,

T(s, a_o, s') = 1 if s = s', and 0 otherwise.

Therefore,

Σ_s T(s, a_o, s') b(s) = (1) b(s') + (0) Σ_{s ≠ s'} b(s) = b(s'),

and equation 4.2 simplifies to

b_z^{a=observation}(s') = O(a_o, s', z) b(s') / Σ_{s''} O(a_o, s'', z) b(s'').
Similarly to the strike action case, the sensor actions are dependent only on the
previous belief state and the observation probabilities for a given observation and
action across all states.
After every simulation step, the state estimator updates the belief states for every
target and passes these belief states to the resource allocation module. This loop
continues until the final time step is reached, at which point the application returns
the actual state of each target.
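The two update cases could be sketched as follows (illustrative code operating on the nonzero bin only; T[a][s][s'] and O[a][s'][z] are the target's transition and observation probabilities, indexed by action).

// Illustrative sketch of the belief update of equation 4.2, split into the
// strike case (equation 4.3) and the sensor case derived above.
public class BeliefUpdate {
    public static double[] updateStrike(double[] b, double[][][] T, int a) {
        int n = b.length;
        double[] next = new double[n];
        double norm = 0.0;
        for (int s2 = 0; s2 < n; s2++) {
            for (int s = 0; s < n; s++) next[s2] += T[a][s][s2] * b[s];
            norm += next[s2];
        }
        for (int s2 = 0; s2 < n; s2++) next[s2] /= norm;   // normalization of equation 4.3
        return next;
    }

    public static double[] updateSensor(double[] b, double[][][] O, int a, int z) {
        int n = b.length;
        double[] next = new double[n];
        double norm = 0.0;
        for (int s = 0; s < n; s++) {
            next[s] = O[a][s][z] * b[s];   // no state transition for sensor actions
            norm += next[s];
        }
        for (int s = 0; s < n; s++) next[s] /= norm;
        return next;
    }
}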
4.4
Implementation Optimization
Because the computational time of solving POMDPs can grow exponentially, the
offline phase can be extremely computationally intensive. Equation 4.1 shows how
the number of vectors in a POMDP grows with time. Since solving each epoch of
an incremental pruning POMDP will often take more and more time as the number
of vectors increases, optimizing the offline phase becomes a necessity. In addition, as
the problem grows, the data files grow as well, so it is also useful to find ways to
reduce the time required to read data in the online phase.
4.4.1
Removing the Nothing Observation
Section 4.3.6 discussed how the nothing observation is equivalent to a uniform
distribution across all other observations. In addition, section 4.2.1 showed that the
vector set for every epoch in a POMDP solution grows exponentially in the size of
the observation set Z. Thus, it is clear that the number of vectors in a POMDP
solution epoch can be reduced by removing the nothing observation. The vectors
created by the nothing observation do get pruned out fairly quickly, but because the
Incremental Pruning algorithm is used in this implementation, removing the
observation eliminates an entire subset of Γ sets across all a ∈ A.
4.4.2
Calculating Maximum Action
Remembering again that the number of vectors that get checked in a POMDP
solution time step grows linearly with |A|, it would be helpful to reduce this number.
Intuitively, there is no benefit to dropping additional bombs beyond some number,
assuming all bombs have a positive cost, because the marginal reward for dropping a
bomb is strictly decreasing. If the POMDP is limited to this maximum action a_max,
it would save computational time, while producing the same result.
To calculate this maximum strike action, some definitions are necessary. Let
E_r(s, a) be the expected reward (not including cost) for dropping a bombs if the
target is in state s. This is defined as:

E_r(s, a) = Σ_{s' ∈ S} R(s, s') T(s, a, s').
To get the marginal reward ΔE_r(s, a + 1) for dropping an additional bomb, E_r(s, a)
should be subtracted from E_r(s, a + 1) as follows:

ΔE_r(s, a + 1) = E_r(s, a + 1) − E_r(s, a)
              = Σ_{s' ∈ S} R(s, s') T(s, a + 1, s') − Σ_{s' ∈ S} R(s, s') T(s, a, s')
              = Σ_{s' ∈ S} R(s, s') [ T(s, a + 1, s') − T(s, a, s') ].

This is the marginal reward for being in a given state s and dropping a + 1 bombs.
However, this optimization requires knowing the maximum action, so the next step
is to determine the maximum marginal reward possible for an action, ΔE_r(a + 1), by
taking the max over all states s ∈ S:

ΔE_r(a + 1) = max_{s ∈ S} ΔE_r(s, a + 1)
            = max_{s ∈ S} ( Σ_{s' ∈ S} R(s, s') [ T(s, a + 1, s') − T(s, a, s') ] ).

The final step is to iterate from a = 0 to a = M, calculating ΔE_r(a + 1) and
comparing it to the cost of a bomb. When the cost of a bomb exceeds ΔE_r(a + 1),
a_max = a. It will never be worth it to drop more than a_max bombs, since the marginal
reward minus the cost of a bomb would be negative.

As in the previous section, the vectors created by all actions that drop more than
a_max bombs would be pruned out in each time step of the solution algorithm, but this
optimization serves to prevent that calculation from occurring in the first place.
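This calculation could be sketched as follows (illustrative; T[a][s][s'] and R[s][s'] are a single target type's per-action transition matrices and reward matrix, and cost is the per-bomb cost).

// Illustrative sketch of the maximum-action calculation of section 4.4.2.
public class MaxAction {
    public static int maxAction(double[][][] T, double[][] R, double cost, int M) {
        int nStates = R.length;
        for (int a = 0; a < M; a++) {
            double maxMarginal = Double.NEGATIVE_INFINITY;
            for (int s = 0; s < nStates; s++) {
                double marginal = 0.0;                 // Delta E_r(s, a + 1)
                for (int s2 = 0; s2 < nStates; s2++) {
                    marginal += R[s][s2] * (T[a + 1][s][s2] - T[a][s][s2]);
                }
                maxMarginal = Math.max(maxMarginal, marginal);
            }
            if (maxMarginal < cost) {
                return a;    // dropping more than a bombs can never pay for itself
            }
        }
        return M;
    }
}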
4.4.3
Defining Maximum Allocation
As soon as a target has been allocated a_max bombs, the marginal reward for adding
another bomb will be zero and the greedy algorithm will allocate no more weapons.
The marginal reward can never be less than zero, since the drop-zero-weapons action
has a zero reward and is always possible. The max-action calculation has determined
at which point the cost for a bomb outweighs the marginal reward for dropping another
bomb. Thus, a target will never be allocated more bombs than a_max, the number
calculated in the max-action optimization. Therefore, a POMDP never needs to be
calculated for more than a_max + 1 allocation bins. Limiting the POMDP to the
maximum allocation, m_max, serves to reduce the number of states, which will in turn
reduce the computation time for each vector. This may not work for different allocation
algorithms, but it greatly reduces the size of the problem for a greedy algorithm.
4.4.4
One Target Type, One POMDP
Different targets of the same type differ in two respects: the number of time steps in
the targets' windows and the number of bombs allocated to the target. Every other
aspect of the POMDP created with the target type is the same. The previous section
discussed how to limit the state size in a POMDP, by never increasing the allocation
for a POMDP beyond the maximum, m_max.
If two targets have a different number of time steps in their windows, one target
will have a larger window. Let h_l be the horizon for the target with the longer
window, and h_s be the horizon for the target with the shorter one. Assuming they
both have the same allocation, m_max, the .alpha1 through .alpha<h_s> files will be
exactly the same. Thus, if the target with the longer window is computed, it is no
longer necessary to compute the target with the shorter window.
Therefore, given this information, it is only necessary to solve one POMDP per
target type. If the allocation is set to m_max and the horizon H is set to some large
number, the solution data files will work for all targets of that target type up to an
individual window of length H.
4.4.5
Maximum Target Type Horizon
The online phase converts the POMDP solution into appropriate marginal values in
a data structure to speed up online computation. It is possible to do this completely
offline and create a text file that will be read directly into a data structure. However,
due to time constraints, we were not able to implement that optimization. Instead, the
vectors are converted into a data structure entirely in the online phase. This involves
iterating through each vector in each file, determining marginal rewards for adding a
bomb to an allocation, then determining an optimal action given an allocation and
storing these in the appropriate place. This can be quite computationally intensive,
given that there could be a large number of vectors, and a large number of time steps.
To minimize the time required to load these files into a data structure, at every step, as long as the target window is not of length 1, the online phase uses the .alpha<t_e - t_c> file. Thus, only the .alpha1 through .alpha<t_e> files need to be loaded for a given target type. This eliminates H - max_type t_e data structures from having to be loaded. Also, given that more vectors are generally required as the number of epochs increases, eliminating the necessity to load future epochs into memory saves computational time and memory space.
4.5 Experimental Results
This section describes the experiments for the partially observable model. This model
produces the same results as the one in the previous chapter when a completely
observable target is used. This section also presents Monte Carlo simulation results
for the new partially observable model.
4.5.1 Completely Observable Experiment
The first experiment is to compare results from the partially observable approach
with those from the completely observable case. A target for the partially observable
model is given the same parameters as that in the experimental results section in the
previous chapter. There, the target has two states, undamaged and damaged, a bomb
costs 1, the reward for destruction is 90, and the damage probability p is 0.25. To
model this case as a POMDP, we augment the state according to our transition model,
as previously described, but change the observation model. To make it completely observable, all strike actions for this experiment are set to produce the actual state observation with probability 1.

Figure 4-8: Optimal policy for M = 11
Whether sensor actions are included or not, identical parsimonious sets are produced. This is because in a completely observable model, there is no extra information
to be gained with sensor looks. Since they have a cost and take time to perform, they
are pruned from the optimal policy.
In our standard partially observable approach, updating the belief state for a
target with an action of one bomb usually requires dropping a bomb in one time
step then performing a sensor action to determine whether the bomb was successful
in the next. This is called a "look-shoot-look" policy, and given unlimited time and
resources, it is the optimal policy for a target. In this experiment, however, every
strike action is both a "look" and a "shoot" action, taking only one time step to
perform.
Since the parameters of the target observation model are implicitly the same as those of the targets in the completely observable approach, the results are identical.
This can be seen in figure 4-8, which shows the policy for an allocation of 11 bombs.
The total allocation M is 11, since dropping 11 bombs is the maximum action for
this target, as defined in its damage model. Figure 3-10 shows the optimal action for
a 10 step problem from allocation 0 to 25. The policy in figure 4-8 can be extracted
from this graph, by selecting an action at each time step corresponding to the bombs
remaining. Since the policy for any allocation in the completely observable experiment
matches the results from the previous chapter's MDP model, our approach works for
the completely observable case.
4.5.2 Monte Carlo Simulations
This section compares the results for four different problem scenarios: complete observability, no sensor action, a realistic sensor action, and a perfect sensor action. All
scenarios use a new two-state undamaged/damaged target with a bomb cost of 1, a
sensor action look with a cost of 0.1, a reward for destruction of 10, and a damage
probability p of 0.5. The total mission horizon H is 7. A scoring metric is defined
for the online results, such that destruction of a target increases the score by 10,
dropping a bomb decreases the score by 1, and looking decreases the score by 0.1.
Five total experiments are run:

• Experiment 1-No Sensor Scenario: This experiment performs 1000 trials of a single target without a sensor action. For the first four experiments, the total allocation M is 25 bombs, to ensure that the number of bombs does not limit the policy.

• Experiment 2-Complete Observability Scenario: This experiment performs 1000 trials of a single target with complete observability. These are the scoring results from the experiment performed in the last section.

• Experiment 3-Perfect Sensor Scenario: This experiment performs 1000 trials of a single target. The sensor action associated with this target has perfect accuracy.

• Experiment 4-Realistic Sensor Scenario: This experiment performs 1000 trials of a single target with a realistic sensor action. This experiment most closely models a real-world scenario.

• Experiment 5-Scoring Analysis: This experiment consists of four separate sets of trials. Each set of 1000 trials is a world with 100 targets from one of the four scenarios above. The total allocation M in these trials is 250, so that a shortage of bombs is a possibility. The experiment compares the relative results from the four sets of trials.

Figure 4-9: Score histogram of a single POMDP target with no sensor action
The first experiment is for a single target with no sensor action. Policies with
no sensor actions can arise when the cost of a look action is high relative to
the reward. Figure 4-9 shows the score for 1000 trials of a single target scenario. As
expected, this result is bimodal, and indicates that the policy is to drop a total of three
bombs. When the target is destroyed the score is 7, and when it isn't destroyed the
score is -3. The policy is to drop three bombs because this is the maximum action as
determined by the damage model. Since there are no observations, the belief state is
only affected by the a priori probabilities defined in the transition model, so after three bombs are dropped no further actions will be performed.

Figure 4-10: Score histogram of a single completely observable target
An interesting result observed in the no observation case is that many different
policies exist with the same value. The overall policy says to drop a total of three
bombs, but the time at which each one is delivered is irrelevant. Dropping three in the last time step, H, is equivalent to dropping two in H and one in H - 1, which is
equivalent to dropping one in H, one in H - 1, and one in H - 2, and so on. This
unfortunately has the effect of increasing the size of the parsimonious set, since none
of these vectors can be pruned. Even given this increase in vectors, the offline time
for this experiment was on the order of one second.
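The no-sensor result can be reproduced with a few lines of simulation. The sketch below is our own illustration, not the thesis simulator: it scores 1000 trials of the open-loop "drop three bombs" policy under the scoring metric defined above (+10 for destruction, -1 per bomb), with an assumed per-bomb damage probability p = 0.5.

import java.util.Random;

// Minimal Monte Carlo sketch of the no-sensor experiment.  Each trial scores
// +7 if the target is destroyed and -3 if it survives, matching the bimodal
// histogram in figure 4-9.
public final class NoSensorTrials {

    public static void main(String[] args) {
        Random rng = new Random();
        double p = 0.5;                 // per-bomb damage probability
        int trials = 1000;
        int destroyedCount = 0;
        double totalScore = 0.0;

        for (int i = 0; i < trials; i++) {
            boolean destroyed = false;
            double score = 0.0;
            for (int bomb = 0; bomb < 3; bomb++) {
                score -= 1.0;           // every bomb costs 1, hit or miss
                if (!destroyed && rng.nextDouble() < p) {
                    destroyed = true;
                }
            }
            if (destroyed) {
                score += 10.0;          // reward for destruction
                destroyedCount++;
            }
            totalScore += score;
        }
        System.out.printf("destroyed %d of %d targets, mean score %.2f%n",
                          destroyedCount, trials, totalScore / trials);
    }
}

With p = 0.5 and three bombs, a trial should destroy the target with probability 1 - 0.5^3 = 0.875, so roughly 87% of trials are expected to land at +7 and the rest at -3.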
The next experiment is for a single, completely observable target. In this case the
POMDP solver knows the state the target is in after every time step, and no sensor
actions are necessary.
The histogram for 1000 trials of this scenario in figure 4-10 shows seven values. The first one on the right is a successful destruction after
dropping one bomb, the next one is for dropping two bombs, and so on. Two trials
out of 1000 missed with nine bombs, for the only negative score, at -9. Compared to
fewer bombs in the previous experiment, the reason that up to nine bombs can be
dropped in this scenario is the extra knowledge of state after every strike action. If a
85
500
.
............
......
....
.....
.....
......
...
..
....
.
...
...
450
400
350
...................
......
- .5
.
. ....
....
...
...........
.. ...
11.............
.. ..
......................
E.
..
I
............
.......
....
..
......
I...
. .I..
..
...
....
.....
..
.
..
........
j' 300
C
i
C.
..
250
U.: 200
..-----------
150
5ii
IBM
100
50
0
-9 -8 -7 -6 -5 -4 -3 -2 -1
1
2
3
4
5
6
7
8
9
Score
action
Figure 4-11: Score histogram of a single POMDP target with a perfect sensor
target is undamaged after an attack, this is known with 100% certainty, so the policy
for this
could say to drop up to three bombs for the next time step. The offline time
experiment was on the order of one second.
The next experiment is for a single target and a sensor with perfect accuracy. The result for 1000 trials is shown in figure 4-11. There are very few possible policies for this target. The policy is to drop one bomb and look, then if it is not destroyed, drop another bomb and look. If it is not destroyed, drop two bombs and look. The increase in the action is due to the closing target window. If it is still not destroyed, drop the maximum action of 3 bombs, then do nothing on the last time step.

No action is performed in the last time step because the sensor action has a cost of 0.1, but it is not possible to act in the following time step H + 1. Therefore, there is no point in looking, since there is no explicit reward for knowledge. In the H - 1 time step in the scenario, the problem can be considered to be a new horizon-two problem. The optimal policy is to drop three bombs over those two time steps, but a sensor action will never be used. Once again, there are multiple policies to drop three bombs in two time steps, but they all have the same value. The offline time for this experiment was also on the order of one second.
Figure 4-12: Score histogram of a single POMDP target with a realistic sensor action
The next experiment is for a single target with a realistic sensor action. The look
action in this problem is defined to be 85% accurate. The result for 1000 trials is
shown in figure 4-12. In this case, it is clear how many bombs it took to destroy
the target. The first spike to the right of the graph is at 8.9, which corresponds to
dropping one bomb and looking, but now it is possible for the sensors to be incorrect.
This shows up on the graph at the next spike, at 8.6. This corresponds to two more
look actions, to calculate the belief state more accurately. These results show many
more negative values than the preceding graphs, which is expected since incorrect
assumptions of destruction can occur. The offline time for this experiment was on
the order of 17 hours.
Histograms for the scoring analysis experiment are shown in figures 4-13, 4-14, 4-15, and 4-16. In each case, the scoring curve has the same basic bell shape. However,
the average score increases as the problem scenario is shifted from no sensor action,
to a realistic sensor action, to a perfect sensor action, to a completely observable
case. This is because in the first three cases, more knowledge about the state of the
target is available at every step with a more accurate sensor action. The completely
observable case improves on the average score from the perfect sensor action because
time steps are not wasted using a look.

Figure 4-13: Score histogram of 100 POMDP targets with no sensor action

Figure 4-14: Score histogram of 100 POMDP targets with a realistic sensor action

Figure 4-15: Score histogram of 100 POMDP targets with a perfect sensor action

Figure 4-16: Score histogram of 100 completely observable targets
It is also interesting to note how spaced out the scores are in figure 4-13, the
no sensor action scenario.
This is because each individual target has a bimodal
distribution with a distance of 10. Thus, all possible scores occur at intervals of value
10.
Chapter 5
Conclusion
This thesis presents a new approach to allocating resources in a partially observable
dynamic battle management scenario by combining POMDP algorithmic techniques
and a prior completely observable resource allocation technique.
The problem we
present is to allocate weapon resources over several targets of varying types, given
imperfect observations after an action is taken.
The scenario is dynamic, in that
values are computed offline and then resources are allocated via a greedy algorithm
in an online loop.
The mathematical background behind completely observable and partially observable Markov decision processes was discussed, and its use was mentioned in three
battle management approaches.
A completely observable model by Meuleau et al.
was then fully described and implemented as a starting point for our new partially
observable approach.
One problem in changing the completely observable model to partial observability
was that resource constraints were violated. To address this problem, we augmented
the state space in a POMDP such that the allocation is included in the actual state
of a target. The state becomes more descriptive with deterministic and probabilistic
elements. This increases the dimension of the vectors in the solution set, but does
not increase the total possible number of vectors in the parsimonious set.
An experiment compared the new partially observable approach with the completely observable one, and the results show that the new model honors weapon
constraints and produces the optimal policy in a completely observable case. Finally,
Monte Carlo experiments were run on four different battle scenarios with various observation models, and their results were compared to show the relative importance of
observation information.
The approach we took to observe resource constraints in a partially observable
battle management world produced optimal policies, but at a significant cost. The
time involved to compute the offline values is much higher than that of the completely
observable approach. Though the state space expansion does not increase the number
of possible vectors in the next epoch's solution set, it does increase the number of
vectors carried over to the next time step. This is because with state space expansion,
a vector can dominate in a projection onto a range of states, and be dominated in
the projection to other states.
Since the time required is much more significant, this approach works well when
there is ample time for computation beforehand. The online phase is relatively unaffected by the state space expansion and is practical for resource allocation.
5.1 Thesis Contribution
The contribution that this thesis makes to the field is to introduce a way of coupling
resource constraints to a POMDP without having to use an LP solver. This state
space expansion does not cause the POMDP to become intractable, but only causes
linear growth in the dimension of the vectors. It has minimal impact on the incremental pruning solution algorithm. Thus, this thesis presents a simple way of solving
a dynamic partially observable resource allocation problem.
5.2 Future Work
There are several possible avenues for future work. One is to optimize the running
time of the POMDP solver code to take advantage of the fact that the transition and
observation models are very sparse. In addition, more realistic transition and observation models can be considered, as strike actions usually produce some observation
in a real-world scenario. A new resource allocation algorithm could be created, as the
greedy algorithm used in this thesis is not optimal.
Also, developing a more flexible POMDP solver is a natural extension to this research. If the action set in the POMDP model could be changed between every epoch, the impossible-action issue would be avoided. The value iteration concept of taking resource use into consideration when deciding future actions could be included in the POMDP solver, such that impossible actions are eliminated. This would preclude the need for state space expansion, and keep the model of the POMDP small
and efficient.
Appendix A
Cassandra's POMDP Software
Because the focus of this thesis is not to find better ways to solve POMDPs, we
decided to use software developed by Anthony Cassandra to solve the POMDPs that
our approach models. We focus on how to take the model of the problem from our
input files to an output file that can be used by Cassandra's code to produce the
offline values.
A.1 Header
The input file header, or preamble, defines the setup of the POMDP. The first line
defines the discount factor for infinite time horizon problems: discount: (value).
In such problems, the expected value at each time step is calculated by adding the
future reward multiplied by the discount factor to the immediate reward. This biases
the policy to act earlier rather than later, as the relative reward earned goes down
with every time step.
In an infinite horizon case, this would eventually cause the
policy to converge to a set of actions dependent on a belief state range, as the cost
for dropping bombs will eventually exceed the marginal reward for acting. In a finite
horizon problem, the discount factor can still be used to encourage acting earlier, but
is usually set to one.
The next element in the preamble is the values: [reward/cost] line. This defines whether the values listed later in the file are rewards (positive values) or costs (negative values).
The states: <list of states> element comes next. This allows the user to define by name all the states that are possible in the POMDP. These state names are used later in the file to define probabilities and values. It is important to note the order in which the states are listed, as the output files of the POMDP will list several nodes which are arrays of numbers of length |S|.

Next is actions: <list of actions>, which allows the user to define A, all actions possible in the POMDP. Like the states descriptor, the names listed will be used later in the file for definitions. Also, the output file will not use the names of the actions, but rather their indices in the list, with the first action having an index of zero.
The final line of the preamble is observations: <list of observations>.
This defines the POMDP's observation set Z. Like the previous two elements, the
observations will be used to define probabilities and values, and only the index is
important. For all three of the previous elements, the use of explicit names is meant
for the user's ease of reading and understanding the input files.
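For instance, a preamble for a hypothetical two-state target model might read as follows. The state, action, and observation names and the values shown are our own illustration, not taken from the thesis input files.

discount: 1.0
values: reward
states: undamaged damaged
actions: bomb look
observations: obs-undamaged obs-damaged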
A.2 Transition Probabilities
The section following the preamble is the section describing the transition probabilities. There are several ways to enter the transition probabilities in the input file, but
the simplest way is to define a probability for each possible action, start state, and
end state. The syntax for this is:
T: <a ∈ A> : <s ∈ S> : <s' ∈ S> (value)
It is also possible to use the wildcard operator, *, for any action, state, or observation, to make multiple definitions. If this is desired, the wildcard must be used
first, since any transition, observation, or reward that is defined more than once will
use the definition that comes last in the file.
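For example, a strike action on the hypothetical target above, with an assumed 0.25 chance of damaging an undamaged target (numbers chosen only to show the syntax), could be entered one triple per line:

T: bomb : undamaged : damaged 0.25
T: bomb : undamaged : undamaged 0.75
T: bomb : damaged : damaged 1.0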
A.3 Observation Probabilities
Defining the observation probabilities in the next section of the input file is very
similar to defining the transition probabilities. In this case, however, an individual
probability for an observation is listed based on an action and an end state, in the
following way:
O: <a ∈ A> : <s' ∈ S> : <z ∈ Z> (value)
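Continuing the hypothetical example, an 85%-accurate sensor look (an assumed figure, used here only to illustrate the syntax) would be entered as:

O: look : damaged : obs-damaged 0.85
O: look : damaged : obs-undamaged 0.15
O: look : undamaged : obs-undamaged 0.85
O: look : undamaged : obs-damaged 0.15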
A.4 Rewards
In Cassandra's code, it is necessary to specify that the values defined in the rewards
section are either costs or rewards. There is no way to specify both a particular
cost for an action and a reward for a transition, so the costs must be factored in
when listing them in the input file. The next section defines the values for a given
action with a transition from one state to another and an observation, in the following
format:
R: <a ∈ A> : <s ∈ S> : <s' ∈ S> : <z ∈ Z> (value)
The value listed for each entry is R(s, s') - cost(a).
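For the hypothetical target, a destruction reward of 90, a bomb cost of 1, and a look cost of 0.1 (illustrative numbers, not the thesis model) would therefore be entered with the cost already subtracted:

R: bomb : undamaged : damaged : * 89
R: bomb : undamaged : undamaged : * -1
R: look : * : * : * -0.1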
A.5 Output Files
Cassandra's POMDP solver produces a set of files he calls "alpha" files. A .alpha file for a given time step contains the parsimonious set of vectors that describe the value function over belief states. Each .alpha file depends on the previous one, such that .alpha3 depends on .alpha2, which depends on .alpha1.

A .alpha file at time step t has n pairs of an action and a list of vector coefficients, where n is the size of the parsimonious set of vectors Γ*. The action is the index in A, from 0 to |A| - 1. The length of the vector coefficient list is |S|, and the ith coefficient in the vector is the value for being in state s_i ∈ S.
If the save-all option is selected in the command-line execution, then the POMDP
solver will save the results associated with every time step, or epoch, as a .alpha<X>
file, where X is the epoch number. This number corresponds to the number of time
steps the target has left in its individual window. The output file for the final time step in the horizon will have a .alpha1 extension, the second to last will have a .alpha2 extension, and so on.
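As a minimal sketch of reading one such file, the class below parses the whitespace-delimited layout described above into (action, vector) pairs. The class and method names are our own and are not part of Cassandra's distribution.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Reads a single .alpha file: n pairs of an action index and |S| coefficients. */
public final class AlphaFileReader {

    /** One vector of the parsimonious set with its associated optimal action. */
    public static final class AlphaVector {
        public final int action;
        public final double[] coefficients;
        AlphaVector(int action, double[] coefficients) {
            this.action = action;
            this.coefficients = coefficients;
        }
    }

    public static List<AlphaVector> read(String path, int numStates)
            throws IOException {
        List<AlphaVector> vectors = new ArrayList<>();
        try (Scanner in = new Scanner(Files.newBufferedReader(Paths.get(path)))) {
            while (in.hasNext()) {
                // Action index, then |S| vector coefficients.
                int action = Integer.parseInt(in.next());
                double[] coeffs = new double[numStates];
                for (int i = 0; i < numStates; i++) {
                    coeffs[i] = Double.parseDouble(in.next());
                }
                vectors.add(new AlphaVector(action, coeffs));
            }
        }
        return vectors;
    }
}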
A.6 Example-The Tiger Problem
A well documented POMDP, presented in the paper by Kaelbling et al. [6], describes
a simple scenario that Cassandra uses as an example for his code. This section will
define the scenario and present the input and output files, to aid in the understanding
of the POMDP solver syntax.
The tiger problem places a person in front of two closed doors. Behind one is
some reward and behind the other is a hungry tiger, with equal probability. If the
door with the reward is opened, the person receives the reward, whereas if the door
with the tiger is opened, the person receives a penalty. In either case, when a door
is opened, the problem resets, and the reward and tiger are placed behind the two
doors with equal probability again. The actions the person can take are to open the
left door, open the right door, or listen at a door. Listening is not free, however, and
it is not completely accurate. There is some probability that the tiger will be silent
as the agent listens, and some probability that the person may falsely hear something
behind the reward door.
The state space is defined as S = {tiger-left, tiger-right}. The action space is
defined as A = {open-left, open-right, listen}. The reward for opening a reward door
is +10, opening a tiger door is -100, and listening is -1.
The possible observations
are Z = {tiger-left, tiger-right}. There is an 85% chance for a correct observation.
In addition, a discount factor of 0.95 is used, as it is modelled as an infinite horizon
problem. Figure A-1 is the input file to the POMDP code, as described in the previous
section.
discount: 0.95
values: reward
states: tiger-left tiger-right
actions: listen open-left open-right
observations: tiger-left tiger-right

T:listen
identity

T:open-left
uniform

T:open-right
uniform

O:listen
0.85 0.15
0.15 0.85

O:open-left
uniform

O:open-right
uniform

R:listen : * : * : * -1
R:open-left : tiger-left : * : * -100
R:open-left : tiger-right : * : * 10
R:open-right : tiger-left : * : * 10
R:open-right : tiger-right : * : * -100

Figure A-1: The input file for the tiger POMDP
0
19.3713683559737184225468809 19.3713683559737184225468809

0
0.6908881394535828501801689 25.0049727346740340294672933

0
16.4934850146678506632724748 21.5418370968935839471214422

0
3.0147789375580762438744387 24.6956809390929521441648831

0
25.0049727346740340294672933 0.6908881394535828501801689

0
21.5418370968935839471214422 16.4934850146678506632724748

0
24.6956809390929521441648831 3.0147789375580762438744387

1
-81.5972000627708240472202306 28.4027999372291724000660906

2
28.4027999372291724000660906 -81.5972000627708240472202306

Figure A-2: The converged .alpha file for the tiger POMDP
Figure A-3: The alpha vectors of the two-state tiger POMDP, plotted over the belief space from tiger-left (0) to tiger-right (1)
The solution file is shown in Figure A-2. Every vector is represented by |S| coefficients. Each one of these vectors has an associated optimal action. In this case,
a 0 corresponds to the first action defined in the problem, listen. A 1 is the open-left
action and a 2 is the open-right action.
This problem is simple enough to represent in a two-dimensional graphical format,
shown in Figure A-3. Each numeric representation of a vector in the .alpha file in
figure A-2 corresponds to a vector in this graph. The solid lines are listen actions, the
dotted line is the open-left action, and the dot-dashed line is the open-right action.
As shown, when the knowledge of the state is more certain (towards the left or right
edges of the belief space) the optimal action is to open a door with a high expectation
for a reward. In the middle, "gray area" of the belief space, the optimal action is to
listen, with a lower expected reward.
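Acting on a belief state with these vectors reduces to a max over dot products: the optimal action at belief b is the action attached to the vector maximizing α · b. A minimal sketch is shown below, reusing the hypothetical AlphaVector class from the previous sketch; it is our own illustration, not part of Cassandra's code.

import java.util.List;

/** Looks up the optimal action for a belief state from a set of alpha vectors. */
public final class PolicyLookup {

    public static int bestAction(List<AlphaFileReader.AlphaVector> vectors,
                                 double[] belief) {
        double bestValue = Double.NEGATIVE_INFINITY;
        int bestAction = -1;
        for (AlphaFileReader.AlphaVector v : vectors) {
            // Dot product of the alpha vector with the belief state.
            double value = 0.0;
            for (int s = 0; s < belief.length; s++) {
                value += v.coefficients[s] * belief[s];
            }
            if (value > bestValue) {
                bestValue = value;
                bestAction = v.action;   // index into the action list
            }
        }
        return bestAction;
    }
}

For the converged tiger solution above, the uniform belief (0.5, 0.5) selects a listen vector, while beliefs near either corner of the belief space select one of the open-door vectors, matching Figure A-3.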
A.7 Running the POMDP
The POMDP solver runs with several command line options. The ones that are used in this thesis are the following:

• -horizon <int>: This option allows the user to specify the length of the horizon. If the policy does converge on an epoch before the number chosen, the solver stops, and outputs that epoch's result as the final solution.

• -epsilon <0-∞>: This option allows the user to set the precision of the pruning operation. The default value is 10^-9. Higher values will generate faster solutions, though they will be less accurate.

• -save-all: Normally, only the .alpha file of the final epoch is saved. However, if this option is selected, every epoch's .alpha file will be saved with the epoch number appended to the end. This becomes important for solving the online portion of the problem, which will be discussed later.

• -o <file-prefix>: This allows the user to specify the prefix for the .alpha files. This helps to keep the directories and files easy to read and understand, but is only aesthetic, and not integral to solving the problem.

• -p <pomdp-file>: This option tells the POMDP solver which file to use for the input to set up the POMDP.

• -method [incprune]: This option describes the POMDP solution method to be used. This thesis uses the default, incremental pruning.
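Putting these options together, an invocation for a ten-step problem might look like the line below. The file names are placeholders, and the exact executable name and flag spellings should be checked against the particular distribution of Cassandra's code being used.

pomdp-solve -p target.POMDP -horizon 10 -save-all -o solutions/target -method incprune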
When the POMDP solver is run, it checks the input file both for syntax correctness and for mathematical correctness (the probabilities in a transition or observation matrix must add up to 1). If the file is correct, it runs the POMDP. Figure A-4 is a sample
screenshot of the POMDP solver, showing the problem parameters, the number of
vectors and time taken for each epoch, and the total amount of time taken for the
problem.
A.8 Linear Program Solving
The POMDP solver comes with two different LP solver options. The first is a generic,
unsupported LP solver package which is bundled with Cassandra's code. The second
option is to use a commercial software package, CPLEX.
Value iteration parameters:
POMDP file = POMDPVals/code.POMDP
Initial values = default
Horizon = 10.
Stopping criteria = weak (delta = 1.000000e-09)
VI Variation = normal
Optimization parameters:
Domination check = true
General Epsilon = 1.000000e-09
LP Epsilon = 1.000000e-09
Projection purging = normal-prune
Q purge = normal-prune
Use witness points = false
Algorithm parameters:
Method = incprune
Incremental Pruning method settings:
IncPrune type = normal
Solutions files:
Saving every epoch.
Initial policy has 1 vectors.
Epoch: 1...3 vectors. (0.01 secs.) (0.01 secs. total)
Epoch: 2...4 vectors. (0.10 secs.) (0.11 secs. total)
Epoch: 3...5 vectors. (0.19 secs.) (0.30 secs. total)
Epoch: 4...5 vectors. (0.28 secs.) (0.58 secs. total)
Epoch: 5...6 vectors. (0.30 secs.) (0.88 secs. total)
Epoch: 6...7 vectors. (0.48 secs.) (1.36 secs. total)
Epoch: 7...8 vectors. (0.65 secs.) (2.01 secs. total)
Epoch: 8...9 vectors. (0.96 secs.) (2.97 secs. total)
Epoch: 9...10 vectors. (1.20 secs.) (4.17 secs. total)
Epoch: 10...11 vectors. (1.74 secs.) (5.91 secs. total)
Solution found. See file:
POMDPSols/code-allocation30_horizon10-E-9/solution.alpha
POMDPSols/code-allocation30_horizon10-E-9/solution.pg
User time = 0 hrs., 0 mins, 5.57 secs. (= 5.57 secs)
System time = 0 hrs., 0 mins, 0.34 secs. (= 0.34 secs)
Total execution time = 0 hrs., 0 mins, 5.91 secs. (= 5.91 secs)
Proj-build time: 0.07 secs.
Proj-purge time: 0.71 secs.
Qa-build time: 4.12 secs.
Qa-merge time: 1.01 secs.
Total context time: 5.91 secs.

Figure A-4: A screenshot of the POMDP solver
A.9 Porting Considerations
A problem arose when trying to integrate the original resource allocation software
and Cassandra's POMDP solver. The implementation presented earlier in this thesis
is written in Java in the Windows environment, and Cassandra's code is written in C
in the UNIX/Linux environment. Since the POMDP solver can be used by a simple
system call from any application, the language differences can be handled easily, but
the platform incompatibility cannot. This incompatibility was addressed by porting
the implementation code from Windows to Linux.
Bibliography
[1] Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD dissertation, Brown University, Department of Computer Science, May 1998.

[2] Anthony R. Cassandra. POMDPs for dummies. Online tutorial: http://www.cs.brown.edu/research/ai/pomdp/tutorial/, January 1999.

[3] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. Technical report, Brown University, 1994.

[4] Anthony R. Cassandra, Michael L. Littman, and Nevin L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence, 1997.

[5] David A. Castanon. Approximate dynamic programming for sensor management. In Proc. Conf. Decision and Control, 1997.

[6] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

[7] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, and Thomas Dean. Solving very large weakly coupled Markov decision processes. American Association for Artificial Intelligence, 1998.

[8] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.

[9] Kirk A. Yost. Solution of Large-Scale Allocation Problems with Partially Observable Outcomes. PhD dissertation, Naval Postgraduate School, Department of Operations Research, September 1998.