Mind-Theoretic Planning for Social Robots
by
Sigurður Örn Aðalgeirsson
MSc. Media Arts and Sciences, MIT (2009)
BSc. Electrical and Computer Engineering, University of Iceland
(2007)
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Media Arts and Sciences
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author: Signature redacted
Program in Media Arts and Sciences
May 2, 2014
Certified by: Signature redacted
Dr. Cynthia Breazeal
Associate Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Supervisor
Accepted by: Signature redacted
Pattie Maes
Associate Academic Head
Program in Media Arts and Sciences
Mind-Theoretic Planning for Social Robots
by
Sigurður Örn Aðalgeirsson
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
on May 2, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Media Arts and Sciences
Abstract
As robots move out of factory floors and into human environments, out from safe
barricaded workstations to operating in close proximity with people, they will increasingly be expected to understand and coordinate with basic aspects of human
behavior. If they are to become useful and productive participants in human-robot
teams, they will require effective methods of modeling their human counterparts in
order to better coordinate and cooperate with them.
Theory of Mind (ToM) is defined as people's ability to reason about others' behavior in terms of their internal states, such as beliefs and desires. Having a ToM
allows an individual to understand the observed behavior of others, based not only on
directly observable perceptual features but also on an understanding of underlying mental states; this understanding allows the individual to anticipate and better react to
future actions. In this thesis a Mind-Theoretic Planning (MTP) system is presented
which attempts to provide robots with some of the basic ToM abilities that people
rely on for coordinating and interacting with others.
The MTP system frames the problem of mind-theoretic reasoning as a planning
problem with mixed observability. A predictive forward model of others' behavior
is computed by creating a set of mental state situations (MSS), each composed of
stacks of Markov Decision Process (MDP) models whose solutions provide approximations of anticipated rational actions and reactions of that agent. This forward
model, in addition to a perceptual-range limiting observation function, is combined
into a Partially Observable MDP (POMDP). The presented MTP approach increases
computational efficiency by taking advantage of approximation methods offered by a
novel POMDP solver B3RTDP, as well as by leveraging value functions at various levels
of the MSS as heuristics for value functions at higher levels.
For the purpose of creating an efficient MTP system, a novel general-purpose online POMDP solver, B3RTDP, was developed. This planner extends the Real-Time
Dynamic Programming (RTDP) approach to solving POMDPs. By using a bounded
value function representation, we are able to apply a novel approach to pruning the
belief-action search graph and to maintain a Convergence Frontier, a novel mechanism
for taking advantage of early action convergence, which can greatly improve RTDP's
search time.
Lastly, an online video game was developed for the purpose of evaluating the MTP
system by having people complete tasks in a virtual environment with a simulated
robotic assistant. A human subject study was performed to assess both the objective behavioral differences in performance of the human-robot teams, as well as the
subjective attitudinal differences in how people perceived agents with varying MTP
capabilities. We demonstrate that providing agents with mind-theoretic capabilities
can significantly improve the efficiency of human-robot teamwork in certain domains
and suggest that it may also positively influence humans' subjective perception of
their robotic teammates.
Thesis Supervisor: Dr. Cynthia Breazeal
Title: Associate Professor of Media Arts and Sciences, Program in Media Arts and
Sciences
Mind-Theoretic Planning for Social Robots
by
Sigurður Örn Aðalgeirsson
The following people served as readers for this thesis:
Thesis Reader: Signature redacted
Dr. Julie Shah
Assistant Professor of Aeronautics and Astronautics
Massachusetts Institute of Technology

Thesis Reader: Signature redacted
Dr. Leila Takayama
Senior User Experience Researcher
Google[x]
Acknowledgments
I am so grateful for all the help and support I have received from friends, family, and
colleagues during my time here at MIT.
First of all I would like to thank my advisor, Professor Cynthia Breazeal, for
taking a chance on me all those years ago and admitting me into her awesome
group. Since then, she has always supported me in all of the things I have wanted
to explore and learn about as well as provided me with very insightful advice and
guidance in my research. She has trusted me to have a great level of autonomy as
a researcher while simultaneously making herself available to discuss ideas and share
wisdom when requested. One couldn't ask for a better research group to be a part of,
where people are as concerned with the success of their fellow grad students as they
are with their own. I want to thank all of my predecessors in the Personal Robots
Group for creating the legacy of the group and leaving a wonderfully collaborative
and helpful group culture which I have done my best to pass on. The great group
atmosphere is in no little part due to our administrator Polly Guggenheim; she is
the beating heart of this group and a surrogate mother to us all (watch out for her
affectionate/bone-crushing jabs).
Of the students that were present for my "formative" years as a grad student, I
particularly want to acknowledge my good friends Matthew Berlin and Jesse Gray
for their friendship and general helpfulness with everything. Any ability I have to
problem-solve and think abstractly about programming was acquired in constant attempts to keep up with those guys.
My friend Philipp Robbel has been equally
important in helping me through discussions about my research and comparing different approaches to solving problems as he has been in helping me forget all of that and
simply have fun and enjoy life.
I have had countless discussions with my office mate Nick DePalma about both
of our research, which have often been very productive and helpful in airing out and
developing ideas. Jin Joo Lee has also listened to me talk about my research more
times than I care to count, and not only did she listen but actually diligently proofread
my proposal as well as this thesis and provided very insightful and valuable feedback
and assistance. I am particularly thankful for her help with designing the human
subject study for this thesis.
A benefit of working in a great research group at MIT is that it attracts incredibly
intelligent and capable post-doctoral researchers to work with us. I have learned a
lot by getting to work with both Sonia Chernova and Brad Knox. I am inspired by
Brad's style of continuous learning and self-improvement. I feel that he applies his
incredibly critical and rigorous academic thinking equally to his research as well as
his personal life. I aspire to attain his wisdom and critical thought and can only hope
that I will handle it with the same level of casual humility and humor that he has.
I would like to thank my general examination committee as well as my thesis
committee. Professors Rosalind Picard and Rebecca Saxe spent a significant amount
of their limited time to help me understand the relevant literature and hone in on
my final dissertation topic, while serving on my general examination committee. My
research was influenced by Leila Takayama's work well before I asked her to serve
on my thesis committee. I have attended several workshops she has organized and
read her papers, in fact the evaluation task of my master's thesis was adapted from
the one she used in her dissertation work. Conversations with her were absolutely
invaluable to the development of my thesis and in particular the evaluation part of
the work. No less valuable than her actual contribution to the thesis work has been her personal
support throughout the process, which has been just incredible.
About six years ago when I took Professor Brian Williams's class on Cognitive
Robotics, one of his grad students, Julie Shah, gave a great guest lecture in the class.
That was the first time I was thoroughly impressed with her work and devotion to
creating systems to support human-robot teamwork, but far from the last. I was
incredibly excited when she agreed to be on my committee, and her guidance and
support with the more computational parts of my thesis was invaluable.
I thank my Icelandic friends Hössi and Siggi Pétur (my namesake) for helping me
to take my mind off my research with various shenanigans and for helping me not
forget the mother tongue. My dear girlfriend and love Nancy deserves much more
praise than I could write down here, both for her direct involvement in both my M.S.
and PhD dissertation work, and for her endless patience and support, especially in
this last year.
Lastly, I am so grateful for my wonderful family for all of their love and support
throughout the years.
Mamma, Villi pabbi and Alli pabbi have spent my whole
lifetime preparing me, motivating and inspiring me to do whatever I want, and I am
so very thankful for all that they have given me. The same goes for my grandparents
and all of my siblings; it is a rare privilege to have such a wonderful family to lean
on and learn from. Ég elska ykkur öll! (I love you all!)
My research has been funded by Media Lab Consortia, and both MURI8 and
MURI6 grants from the Office of Naval Research.
Contents
Abstract

1 Introduction
    1.1 Motivations
    1.2 A Mind-Theoretic Robot Planning System
        1.2.1 Proposed System
    1.3 Research Questions
    1.4 Overview of This Document

2 Autonomous Planning
    2.1 Introduction
    2.2 Background
        2.2.1 Classical Planning
        2.2.2 Decision Theoretic Planning
    2.3 POMDP Planning Algorithms
    2.4 Real-Time Dynamic Programming
        2.4.1 RTDP
        2.4.2 Extensions to RTDP
    2.5 Belief Branch and Bound Real-Time Dynamic Programming
        2.5.1 RTDP-Bel
        2.5.2 Bounded Belief Value Function
        2.5.3 Calculating Action Selection Convergence
        2.5.4 Convergence Frontier
        2.5.5 Belief Branch and Bound RTDP
    2.6 Results
        2.6.1 Rocksample
        2.6.2 Tag
    2.7 Discussion and Future Work
        2.7.1 Discussion of Results
        2.7.2 Future Work

3 Mind Theoretic Reasoning
    3.1 Introduction
        3.1.1 Mind-Theoretic Planning
        3.1.2 Overview of Chapter
    3.2 Background
        3.2.1 Theory of Mind
        3.2.2 Internal Representation of ToM
        3.2.3 False Beliefs
        3.2.4 Mental State
        3.2.5 Knowledge Representation
        3.2.6 Bayesian Networks
    3.3 Overview of Related Research
        3.3.1 ToM for Humanoid Robots
        3.3.2 Polyscheme and ACT-R/E
        3.3.3 ToM Modeling Using Markov Random Fields
        3.3.4 Plan Recognition in Belief-Space
        3.3.5 Inferring Beliefs using Bayesian Plan Inversion
        3.3.6 Game-Theoretic Recursive Reasoning
        3.3.7 Fluency and Shared Mental Models
        3.3.8 Perspective Taking and Planning with Beliefs
        3.3.9 Belief Space Planning for Sidekicks
    3.4 Earlier Approaches to Problem
        3.4.1 Deterministic Mind Theoretic Planning
        3.4.2 The Belief Action Graph
    3.5 Mind Theoretic Planning
        3.5.1 Definitions of Base States and Actions
        3.5.2 Types of Mental States
        3.5.3 Inputs to the Mind Theoretic Planner
        3.5.4 Mental State as Enumeration of Goals and False Beliefs
        3.5.5 Action Prediction
        3.5.6 POMDP Problem Formulation
        3.5.7 Putting it All Together
        3.5.8 Demonstrative Examples

4 Evaluation of Mind Theoretic Reasoning
    4.1 Simulator
        4.1.1 Different Simulators
        4.1.2 On-line Video Game for User Studies
    4.2 Human Subject Study
        4.2.1 Hypotheses
        4.2.2 Experimental Design
        4.2.3 Metrics
        4.2.4 Exclusion Criteria
    4.3 Study Results
        4.3.1 Statistical Analysis Methods
        4.3.2 Behavioral Data
        4.3.3 Attitudinal Data
    4.4 Discussion
        4.4.1 Task Efficiency
        4.4.2 Team Fluency
        4.4.3 Attitudinal Data
        4.4.4 Informal Indications from Open-Ended Responses
        4.4.5 Funny Comments

5 Conclusion
    5.1 Thesis Contributions
    5.2 Recommended Future Work
        5.2.1 Planning with personal preferences
        5.2.2 Planning for more agents
        5.2.3 State-space abstractions and factoring domains into MTP and Non-MTP
        5.2.4 Follow-up user study

Appendices

A Existing Planning Algorithms
    A.1 Various Planning Algorithms

B Study Material
    B.1 Questionnaire
    B.2 Study Pages

References
List of Figures
2-1   (a) Demonstrates how the state tree can be traversed by selecting actions and transition links to successor states according to the transition function T(s, a, s'). (b) Shows how traversing the belief tree is similar to traversing the state tree except that when an action is taken in a belief b we use equation 2.4 to determine the "belief transition probability" to the successor beliefs, through observation probabilities, which can be calculated with equation 2.3.

2-2   Demonstrates how the transition function for the discounted Tiger POMDP is transformed into a Goal POMDP. From: (Bonet & Geffner, 2009).

2-3

2-4   Shows the Q boundaries for two example actions. The value of the true Q*(a) is uniformly distributed between the bounds for both actions.

2-5   In addition to the Q distributions, the probability function Pr(q < Q*(a')) is plotted. This function always evaluates to the probability mass of the Q(a') function that exists between q and QH(a'), which for uniform distributions is a piecewise linear function of a particular shape.

2-6   Finally this figure shows the function whose integral is our quantity of interest Pr(Q*(a) < Q*(a')). This integral will always simply be the sum of rectangle and triangle areas for two uniform Q distributions.

2-7   Demonstrates how action choice can converge over a belief, creating effectively a frontier of reachable successive beliefs with associated probabilities. This effect can be taken advantage of to shorten planning.

2-8   Shows the ADR of B3RTDP in the RockSample_7_8 domain. The algorithm was run with D = 10 and α = 0.75 and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs.

2-9   Shows the ADR of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D. ADR is plotted with error bars showing 95% confidence intervals calculated from 20 runs of the algorithms.

2-10  Shows the convergence time of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D. ADR is plotted with error bars showing a 95% confidence interval calculated from 20 runs of the algorithms. We can see that the convergence time of B3RTDP increases both with higher discretization as well as a higher requirement of action convergence before pruning. This is an intuitive result as the algorithm also garners more ADR from the domain in those scenarios.

2-11  Shows the ADR of B3RTDP in the Tag domain. The algorithm was run with D = 15 and α = 0.65 and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs. We can see that B3RTDP converges at around 370 ms; at that time SARSOP is far from convergence but has started to produce very good ADR values.

3-1   Demonstrates the recursive nature of ToM. Adapted from (Dunbar, 2005).

3-2   A classic false belief situation involving the characters Sally and Anne (Image courtesy of (Frith, 1989)).

3-3   Shows the inter-connectivity of the MRF network. Each agent is represented with an observation vector y_i and a state vector x_i. From (Butterfield et al., 2009).

3-4   An example visual stimulus that participants would be shown in (Baker et al., 2009). An agent travels along the dotted line and pauses at the points marked with a (+) sign. At that point participants are asked to rate how likely the agent is to have each marked goal (A, B and C). In (b) the participants were asked to retroactively guess the goal of the agent at a previous timepoint.

3-5   An example of an expanded tree structure of pay-off matrices as perceived by P when choosing whether to execute action A or B. With probability p1, P thinks that Q views the payoff matrix in a certain way (represented by a different matrix) and with probability (1 - p1) in another. This recursion continues until no more knowledge exists, in which case a real value is attributed to each action (0.5 in the uninformed case) (Durfee, 1999).

3-6   A view of the ABB RoboStudio Virtual Environment during task execution. The human controls the white robot (Nikolaidis & Shah, 2013).

3-7   Shows the trajectory by which data flows and decisions get made within the "Self-as-Simulator" architecture. Demonstrates how the robot's own behavior generation mechanisms are used to reason about observed behavior of others as well as performing perspective taking. From (Breazeal et al., 2009).

3-8   A robot hypothesizes about how the mental state of a human observer would get updated if it would proceed to take a certain sequence of motor actions. This becomes a search through the space of motor actions that the robot could take which gets terminated when a sequence is found that achieves the robot's goals as well as the mental state goals about other agents (Gray & Breazeal, 2012).

3-9   (a) Shows the game with a simulated human player in the upper right corner and a POMCoP sidekick in the lower left. (b) Shows comparative results of steps taken to achieve the goal between a QMDP planner and different configurations of POMCoP.

3-10  (a) A demonstrative example of a simple BAG. (b) An example of a BAG instantiated for an actual navigation problem.

3-11  Shows how a goal situation is composed of stacks of predictive MDP models for each agent. Each model contains a value function, a transition function and a resulting policy. Each transition function takes into account predictions from lower level policies for the actions of the other agent. Value functions are initialized by heuristics that are extracted from the optimal state values of the level below; this speeds up planning significantly. Since every level of the stack depends on lower levels, special care needs to be taken for the lowest level. In the MTP system, we have chosen to solve a joint centralized planning problem as if one central entity was controlling both agents to optimally achieve both of their goals, since this is a good and optimistic approximation of perfect collaborative behavior.

3-12  This figure shows an example MTP action prediction system with two goal hypotheses and a single false belief hypothesis (in addition to the true belief), resulting in four distinct mental situations. An action prediction from any level can be queried by following the enumerated steps in section 3.5.5 under "Mental State Situation".

3-13  Shows two states that would produce the same observation because they are indistinguishable within the perceptual field of the robot (which is limited by field of view and line of sight in this domain). If the other agent would move slightly into the white space in the state on the right, then the observation function would produce a different observation.

3-14  Shows an example scenario where the robot knows that the human's goal is to exit the room, and the robot also knows the location of the exit. The robot is uncertain about whether the human knows the location of the exit and therefore creates two false belief hypotheses, one representing the true state and another representing a false alternative.

3-15  Shows the stage where the human is one action away from learning what the true state of the world is.

3-16  When the human agent has turned left, it will expect to see either the exit or the wall depending on its mental state. In the false belief state where it expects to see the exit, a special observation is also expected since in this state the agent should be able to perceive the error of its false belief. Since this observation will actually never be emitted by the MTP system, the belief update will attribute zero probability to any state in the subsequent POMDP belief where that observation was expected.

3-17  Shows the complete MTP system on an example problem with two goal hypotheses and one false belief hypothesis. On top sits a POMDP with an observation function that produces perceptually limited observations with the addition of specialized false belief observations when appropriate. The POMDP transition function is deterministic in the action effects of the robot but uses the lower level mental state situations to predict which actions the other agent is likely to take and models the effects of those stochastically. The figure also shows how the value functions at lower levels serve as initialization heuristics for higher-level value functions. The value function of the highest level of the robot's predictive stack is used as an initialization to the QMDP heuristic for the POMDP value function.

3-18  Shows the configuration of the environment of this example. Gray areas represent obstacles.

3-19  Simulation at t = 0. The robot can perceive the human but is initially uncertain of their mental state.

3-20  Simulation at t = 11. The robot has moved out of the human's way but did not see if they moved east or west. The robot maintains both hypotheses with slightly higher probability of the false belief since the human did not immediately turn east at t = 1.

3-21  Simulation at t = 20. The robot now expects that if the human originally held the false belief then they would have perceived its error by now and is confident that they currently hold the true belief. The robot expects that if the human originally held the false belief then it should pass by the robot's visual field in the next few time steps. Notice how the robot has been waiting for that perception to happen or not happen (indicating that the human held the true belief the whole time) before it proceeds to move to its goal.

3-22  Simulation at t = 28. Finally once the robot has seen the human pass by, it proceeds to follow it and subsequently both agents accomplish their goals. Even if the robot had not seen the human pass by, eventually once it would become sure enough, it would proceed in exactly the same manner thinking that the human had originally held the true belief.

3-23  (a) and (c) refer to the simulation at t = 0, (b) and (d) refer to the simulation at t = 1. We can see that initially the robot is completely uncertain about the mental state of the human but after seeing that the human took no action, it assumes that goals 5 and 6 are most likely (the ones that the robot is currently blocking access to).

3-24  (a) and (c) refer to the simulation at t = 13, (b) and (d) refer to the simulation at t = 18. Once the robot has retreated to allow the human to pursue the two goal hypotheses that are most likely, it chooses to make one goal accessible. If the human does not pursue that goal given the opportunity, the robot assumes that the other one is more likely and creates a passageway for it to pursue.

3-25  (a) and (c) refer to the simulation at t = 28, (b) and (d) refer to the simulation at t = 36. If the human moves away while the robot cannot perceive it, the robot uses its different goal hypotheses to predict the most likely location of the human. The robot then proceeds to look for the human in the most likely locations. In this case, its first guess was correct and by using our Mind-Theoretic Reasoning techniques, it was able to find the human immediately.

4-1   A snapshot from the USARSim simulator which was used to simulate urban search and rescue problems. On the right, we can see a probabilistic Carmen occupancy map created by using only simulated laser scans and simulated odometry from the robot.

4-2   Screenshots from the (a) Java Monkey Engine™ and (b) Unity3D simulators that were developed to evaluate our robot systems.

4-3   Shows the video game from the perspective of the human user. The character can navigate to adjacent grids, interact with objects on tables, and push boxes around. The view of the world is always limited to what the character can currently perceive, so the user needs to rotate the character and move it around to perceive more of the world.

4-4   These are the objects in the world that can be picked up and applied to the engine base. To fully assemble an engine, two engine blocks and an air filter need to be placed on the engine base. After placing each item, a tool needs to be applied before the next item can be placed. The type of tool needed is visualized with a small hovering tool-tip over the engine.

4-5   This figure demonstrates the sequence of items and tools that need to be brought to and applied to an engine base to successfully assemble it.

4-6   This figure demonstrates the connectedness of different components to create the on-line study environment used in the evaluation. A user signs up for the study either by going to the sign-up website (possibly because they received a recruitment email or saw an advertisement) or because they are an Amazon MTurk user and accepted to play. The web server assigns the user to a study condition and finds a game server that is not currently busy and assigns it to the user. The game server initializes the Unity game and puppeteers the robot character according to the condition of the study assigned to the user. The game state, character actions, and environment are synchronized between the game server and the user's browser using a free cloud multiplayer service called Photon™. Study data is comprised of both the behavioral data in the game logs as well as the post-game questionnaire data provided by the Survey Monkey service.

4-7   Shows the mean task completion times of all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).

4-8   Shows the mean number of actions taken by both agents over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).

4-9   Mean action intervals of participants across all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).

4-10  Shows the mean rates of change in action intervals averaged over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).

4-11  Shows the mean participant functional delay ratios across rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).

4-12  Shows the mean rates of change in participant functional delays averaged over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002, error bars indicate a 95% confidence interval).
List of Tables
2.1   Results from RockSample_7_8 for SARSOP and B3RTDP in various different configurations. We can see that B3RTDP confidently outperforms SARSOP both in reward obtained from the domain and convergence time (* means that the algorithm had not converged but was stopped to evaluate policy). The ADR value is provided with a 95% confidence interval.

2.2   Results from Tag for SARSOP and B3RTDP in various different configurations (* means that the algorithm had not converged but was stopped to evaluate policy). ADR stands for Adjusted Discounted Reward and is displayed with 95% confidence bounds.

4.1   Task 1 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.2   Task 2 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.3   Task 1 completion time in milliseconds.

4.4   Task 2 completion time in milliseconds.

4.5   Task 1 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.6   Task 2 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.7   Task 1 total number of actions.

4.8   Task 2 total number of actions.

4.9   Task 1 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.10  Task 2 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.11  Task 1 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.12  Task 2 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.13  Task 1 human action interval rate of change.

4.14  Task 2 human action interval rate of change.

4.15  Task 1 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.16  Task 2 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.17  Task 1 human functional delay ratio.

4.18  Task 2 human functional delay ratio.

4.19  Task 1 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

4.20  Task 2 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
List of Algorithms
1   Convergence Frontier

2   The Belief Branch and Bound RTDP (B3RTDP) algorithm

3   Subroutines of the B3RTDP algorithm

4   Pseudocode for a simplified deterministic approach to MTP

5   Pseudocode for constructing the transition functions T_{h/r}, at levels l > 0, of the goal situations. Note that the subscript h/r denotes that this works for either agent's predictive stacks, but the order of h/r versus r/h marks that if one refers to the human then the other refers to the robot and vice versa.

6   Pseudocode for constructing the transition function T_POMDP for the MTP POMDP.

7   Pseudocode for the GRAPHPLAN algorithm. The algorithm operates in two steps, graph creation and plan extraction. The EXTRACTPLAN algorithm is a level-by-level backward chaining search algorithm that can make efficient use of mutex relations within the graph.

8   The RTDP algorithm interleaves planning with execution to find the optimal value function over the relevant states relatively quickly.

9   The BRTDP algorithm. Uses a bounded value function and a search heuristic that is driven by information gain.

10  The RTDP-Bel algorithm from (Geffner & Bonet, 1998) and (Bonet & Geffner, 2009).
Chapter 1
Introduction
1.1 Motivations
As robots move out of factory floors and into human environments, out from safe
barricaded workstations to operating in close proximity with people, they will increasingly be expected to understand and be able to coordinate with basic aspects of
human behavior. If they are to really become useful and productive participants in
human-robot teams, they will require good methods of modeling their human counterparts in order to be able to better coordinate and cooperate with them.
One of the ways that people reason about other people's behavior is to explain and
describe the actions of others in terms of their presumed intentions or goals (Blakemore & Decety, 2001). In effect, people are constantly performing plan recognition
when observing actions of others to better understand the underlying reasons for their
behavior. It has been argued that our human obsession with teleological interpretation (the explanation of phenomena by their ultimate purpose) of actions stems from
its importance for both on-line action prediction and social learning, enabling
an agent to learn about new affordances of actions or resources (Csibra & Gergely,
2007).
This ability is observed at a very early age; an experiment was conducted
which showed that infants begin reasoning about observed actions in a goal-directed
way near the time that they gain control over those actions themselves, or as early
as nine months old (Woodward, 2009).
When people work closely with each other they often naturally reach a high level
of fluent coordination. Explicit planning, verbal communication, training, and experience can speed this process but are not necessary as people can achieve fluent joint
action using non-verbal behaviors such as attention cueing, autonomous mimicry,
mentalizing and anticipation, and more (Sebanz et al., 2006). In this thesis we plan
to endow robots with some of the core capacities required for achieving such autonomously coordinated behavior by developing methods that begin to provide robots
with a basic understanding of how actions are predicated on beliefs and directed towards goals, and how to leverage that information for planning.
A great example of the utility of mental state reasoning, and how we employ it
naturally, is that of the piano mover.
A piano mover's task is very complicated,
not only because of the challenging geometric problem of moving a large, heavy,
and yet delicate object through a cluttered environment, but because of the intense
need for tight coordination with other movers.
With many hands on the piano,
each mover needs to constantly reason about the environment as they perceive it,
which parts of it the others might be able to perceive and what they are attempting
to accomplish, predict their behavior and react to changes, and make sure they
all move in synchrony. The fact that the human brain can perform such challenging
tasks with relative ease supports the theory that this computation is so important for
our survival that it has been allotted dedicated neural circuitry in the brain (mirror
neurons) (Gallese & Goldman, 1998).
The piano mover's challenge is one where coordination and mental reasoning is
particularly important, but even the challenge of navigating crowded environments
can present interesting problems. When we plan to move through such environments,
we need to avoid running into others, which calls for constant mental state reasoning
and behavior anticipation.
For example, when passing someone on a sidewalk we
intuitively give individuals who we believe are unaware of us a wider berth than
others, since we can anticipate no cooperation from them (as they are unaware of
us) and in fact they might do something completely unexpected like turn around or
change direction at any moment.
Lastly, mental state reasoning can be used as an extra "sensor" in the environment.
If one truly understands how behavior is based on beliefs and desires formed about
the environment, additional information about the environment can be indirectly
inferred from the behavior of others. An example of this is the bicyclist that cannot
see if they can cross an intersection because of a visual obstruction.
If that cyclist
were to see that another person, whose line of sight is not obstructed, moves into the
intersection with a baby stroller, then the cyclist can draw the conclusion that no
traffic is oncoming since that person would not rationally take those actions if they
believed that a car was coming and their goal was to cross the street unharmed.
In light of the apparent importance of mental state reasoning and its useful application to human coordination and interaction, we believe that it is a crucial capability
for a robot to have if it is to be a useful teammate in human-robot teams. This thesis
presents an approach that makes progress on solving this problem.
1.2 A Mind-Theoretic Robot Planning System
A robotic agent that can reason about people in terms of their mental states needs to
possess a range of different capabilities. The following is a "wish list" of capabilities
for a mind-theoretic agent.
Action prediction: A mind-theoretic agent needs to have a way of predicting
which actions the other agent will take. This prediction will serve to help the robot
anticipate future changes in the environment, which can help the robot avoid damages
that might occur as well as exploit opportunities that get created.
Means-end understanding of action: In addition to being able to predict
future actions of the other agent, the robot needs to understand on what basis that
prediction was built and how it should be adapted as the environment or the agent's
beliefs change.
Recursive mental state reasoning: Mental state reasoning is inherently a
recursive process. As we think about the thoughts and beliefs of others, do we take
into account that they might have beliefs about us? And if so, how deeply should
we recurse? Should we reason about the beliefs of the other concerning the degree to
which they think we are reasoning about their beliefs?
Information seeking behavior: An agent reasoning about mental states of
others and using them to predict their behavior needs to understand the value of
information. Specifically it should understand that there is value in being certain
about which mental state the other has, so it may improve the prediction of their
actions. Such an agent would need to understand that it might be worth taking a few
actions simply to gather information before starting to take task-directed actions.
Hedging against uncertainty: Once the agent has the capacity to anticipate
possible different future configurations of the environment, as caused by different
predictions of others' actions based on their mental states, the agent should be able
to hedge against uncertainty in those predictions.
Planning to manipulate mental states: Lastly, once agents are reasoning
about each other's beliefs and using them to predict actions and plan, they should be
able to have goals that relate to mental states. There are plenty of examples for this
type of behavior in games and sports. Usually in games, the goal is to achieve some
task for your team while trying to block the other team from achieving theirs. This
often involves selectively sharing information with your team members while denying
it to the opposing team. A mental-state goal might therefore be to take actions in
a way that maximally informs your team about your intentions and the state of the
world while simultaneously trying to hide them from the opposing team, causing them
to have false beliefs about the world, which might disadvantage them in the game.
1.2.1 Proposed System
In this thesis we present a system that attempts to accomplish most of the aforementioned desired features for a mind-theoretic agent system. The Mind-Theoretic
Planner (MTP) presented is able to create predictions of others' actions based on
what they believe about the environment and what goals they have. These predictions in turn take into account what actions the other agent expects of other agents
and how they will react. These predictions are leveraged to create a predictive forward model of the world, which includes how the world state will be affected as a
function of the mental states of the others. This forward model is used in conjunction
with a perceptual observation model of the world to produce mind-theoretic agent
behavior that seeks to better understand the mental states of others in order to better
predict state changes and to produce better plans.
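
To make this concrete, the following is a minimal, self-contained Python sketch of the general idea only, not of the actual MTP implementation described in Chapter 3: a probability distribution over mental-state hypotheses is updated from the other agent's observed action, and the weighted per-hypothesis policies are mixed into a predictive forward model of their next action. The hypothesis names, toy policies, and states are invented for illustration.

# Illustrative sketch only (not the thesis implementation): maintain a belief over
# mental-state hypotheses, update it from an observed action, and mix the
# per-hypothesis policies into a forward model of the other agent's next action.

class Hypothesis:
    """A hypothesized mental state: a name and a policy mapping state -> action distribution."""
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy

def update_belief(belief, hypotheses, state, observed_action):
    """Bayesian update of P(hypothesis | observed action)."""
    unnormalized = {}
    for h in hypotheses:
        likelihood = h.policy(state).get(observed_action, 1e-6)
        unnormalized[h.name] = belief[h.name] * likelihood
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

def predict_next_action(belief, hypotheses, state):
    """Forward model: mixture of the per-hypothesis action predictions."""
    mixture = {}
    for h in hypotheses:
        for action, p in h.policy(state).items():
            mixture[action] = mixture.get(action, 0.0) + belief[h.name] * p
    return mixture

# Toy usage: two goal hypotheses for a human standing in a corridor.
hypotheses = [
    Hypothesis("goal=exit-east", lambda s: {"move_east": 0.9, "wait": 0.1}),
    Hypothesis("goal=exit-west", lambda s: {"move_west": 0.9, "wait": 0.1}),
]
belief = {"goal=exit-east": 0.5, "goal=exit-west": 0.5}
belief = update_belief(belief, hypotheses, "corridor", "move_east")
print(predict_next_action(belief, hypotheses, "corridor"))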
1.3 Research Questions
The research questions we are interested in investigating with this work concern both
the technical feasibility of making a mind-theoretic planning system, as well as the
objective and subjective difference that system would have on the performance and
attitudes of human-robot teams.
1. How can mind-theoretic reasoning abilities be encoded in the formalisms of
autonomous planning and reasoning?
2. To calculate near real-time solutions to realistic problems in Human-Robot Interaction (HRI) domains, what kinds of approximations and heuristics can be
applied in the MTP method?
3. What are the strengths and weaknesses of the MTP approach with respect to
the task performance and subjective experience of a mixed human-agent team?
The following questions concern comparisons between a person's subjective attitudes towards the competencies of its autonomous teammate when it uses the MTP
system as opposed to other autonomous systems.
4. Will teaming with an MTP system, as opposed to a different autonomous system, influence how people judge an autonomous partner in a human-agent interaction?
5. More specifically, how will an MTP system influence perceptions of the partner
as being engaging, likeable, capable, intelligent, or team-oriented?
1.4 Overview of This Document
This thesis is mainly separated into three different chapters. Chapter 2 presents a
novel Partially Observable Markov Decision Process planner called B3RTDP. This
planner is capable of producing approximate solutions to belief planning problems
which is crucial to the implementation of the mind-theoretic system presented in this
thesis. Chapter 3 presents the implementation of the MTP system, and lastly Chapter
4 covers the human subject study that was performed to evaluate the MTP system.
Chapter 1: Introduction
The current chapter presents motivations for the mind-theoretic reasoning problem
along with the research questions for the presented work and this overview.
Chapter 2: Autonomous planning
This chapter introduces some of the basic methods and representations within the
autonomous planning literature that are relevant to this thesis, specifically the Real-Time Dynamic Programming method (Barto et al., 1995) and several extensions that
solve Markov Decision Process planning problems. Belief planning is introduced and a few
solution techniques discussed. The novel B3RTDP algorithm is presented and all of its
approximation methods detailed. Lastly, B3RTDP is evaluated on known benchmark
problems against a state-of-the-art planning algorithm.
Chapter 3: Mind-Theoretic Reasoning
In this chapter, the motivations and some of the psychological concepts underlying
mental-state reasoning are presented. Existing work on problems in this domain are
discussed and compared, and our own previous approaches to solving the problem
briefly presented.
The MTP system is presented in detail with all of its internal
mechanisms and representations explained. Lastly, some demonstrative examples of
the MTP system in action are presented.
Chapter 4: Evaluation of Mind-Theoretic Reasoning
This chapter introduces an on-line video game and simulator that was developed to
evaluate the MTP system. A user study is presented in which people interact with
a virtual agent in a task-oriented environment. The experimental conditions of the
study were different methods to control the agent, two of which were the MTP system
in different configurations. Results from the study are presented and discussed.
Chapter 5: Conclusions
This chapter discusses the impact of this work along with the research contributions
of this thesis. Some future directions for the work are also discussed.
Chapter 2
Autonomous Planning
2.1 Introduction
In this section I will introduce the research field of autonomous planning as well
as provide background on some important algorithms and representations that are
frequently used in that field.
I will introduce classical planning, both plan-space
and state-space methods, as well as decision theoretic planning based on Markov
Decision Processes, both directly observable (MDPs) and partially observable (POMDPs). I'll
introduce a novel POMDP solver algorithm named Belief Branch and Bound Real-Time Dynamic Programming (B3RTDP), which extends an existing RTDP approach.
B3RTDP employs a bounded value function representation and uses a novel pruning
technique, Confidence Action Pruning, which allows pruning actions from the search
tree before they become provably dominated by other actions, and a Convergence
Frontier, which serves to speed up search time. I present empirical results showing
that B3RTDP can outperform a state-of-the-art planning system named SARSOP
both in convergence time and total adjusted discounted reward on two well-known
POMDP benchmark problems.
2.2 Background
Planning is a term that is commonly used in many fields related to AI and Robotics.
In the most general case, it describes a process of determining which action should
be taken at a given time which transforms the world state in a way that is conducive
to help satisfy a goal criterion. This process can be implemented in many different
ways, and we will discuss a few of them in this section.
States and actions are two of the most important representational concepts in
autonomous planning. Most planning systems deal with state either explicitly or
implicitly. It serves as a description of the environment at a given time and how it
is encoded can have an incredible effect on the complexity of the planning task (Ghallab
et al., 2004). Generally it is important that only the features of the environment that
are significant to solving the planning task should be encoded in the state, but equally
no important feature can be missing. State-space "explosion" is a term that has been
used for when the set of states needed to solve a planning problem grows so large
that it either becomes unmanageable by the planning algorithm or even too large to
store in memory. Several planning approaches attempt to use factored state-spaces
(Boutilier et al., 2000) or state-space abstractions (Dearden & Boutilier, 1997) to
reduce the negative effect that a large state-space has on the planning process. The
ultimate goal of planning is to figure out the best action to take at any given time,
where actions can be thought of as operators on the states. Each action a is defined
by a function that transforms a state s into a different state s' or even a set of states
in the case of probabilistic planning. This function is generally referred to as the
transition function.
2.2.1 Classical Planning
Classical Planning (CP) systems generally refer to ones that solve a restricted planning problem that satisfies the following simplifying assumptions (Ghallab et al.,
2004):
1. State-space is finite and discretely represented as a set of literals that hold true
in a state.
2. States are always fully observable.
3. Environment is static and deterministic (only planning actions can affect state
and they do it in a predictable and deterministic manner).
4. Actions are instantaneous in time.
5. Actions are described by three sets of literals.
(a) A set of precondition literals that need to hold true in the current state
for the action to be applicable.
(b) A set of "add-effects" which will be added to the current state literals
should the action be taken.
(c) A set of "del-effects" which will be removed from the current state literals
should the action be taken.
6. A plan consists of a linearly-ordered sequence of actions.
Classical planning domains are fully described by Σ = {A, P}, where P represents
a set of lifted logical predicates and A a set of lifted actions. The term lifted here refers
to an un-instantiated variable, for example: IsHolding(?human, ?object) is a lifted logical predicate which could be grounded over a set of humans and objects to produce a
list of grounded literals like this one: IsHolding(John, redball). CP planning problems are fully described by their domain, a set of grounded world objects, an initial
state, and a logical goal state description (a conjunctive set of grounded predicates
that need to hold true in a state to qualify): {A, P, O, I, G}. A standard has
been developed for representing CP domains and problems as well as various planning features and functionality in the Planning Domain Definition Language (PDDL)
(Ghallab et al., 1998).
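
As a concrete (and purely illustrative) rendering of these concepts, the following Python sketch encodes lifted predicates grounded over world objects and an action defined by precondition, add, and del literal sets. The PickUp action, the objects, and the literals are made-up examples in the spirit of the IsHolding(John, redball) literal above; this is not code from any particular planner.

from itertools import product

def ground(predicate, *object_sets):
    """Ground a lifted predicate over sets of world objects."""
    return {(predicate,) + args for args in product(*object_sets)}

class Action:
    """A grounded CP action described by precondition, add, and del literal sets."""
    def __init__(self, name, pre, add, delete):
        self.name, self.pre, self.add, self.delete = name, pre, add, delete

    def applicable(self, state):
        return self.pre <= state              # all precondition literals hold in the state

    def apply(self, state):
        return (state - self.delete) | self.add

humans, objects = {"John"}, {"redball"}
holding = ground("IsHolding", humans, objects)   # {("IsHolding", "John", "redball")}

pickup = Action(
    name="PickUp(John, redball)",
    pre={("HandEmpty", "John"), ("On", "redball", "table")},
    add={("IsHolding", "John", "redball")},
    delete={("HandEmpty", "John"), ("On", "redball", "table")},
)

state = {("HandEmpty", "John"), ("On", "redball", "table")}
if pickup.applicable(state):
    state = pickup.apply(state)               # state now contains the IsHolding literal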
Plan-space planning
CP systems generally fall into one of two categories: state-space planners and plan-space planners. Originally, plan-space systems such as Partial-Order Planners (Barrett & Weld, 1994) and Hierarchical Task Network (HTN) planners (Nau et al., 2003)
were considered faster and more efficient than their state-space counterparts.
Plan-space planners search over a space of partial plans (note that this space is
infinitely large), constantly attempting to refine the current plan and resolve flaws and
unsatisfied constraints. These planners tend to produce short plans very quickly but
cannot guarantee their optimality. They can naturally take advantage of hierarchical
structures in the action space (such as macro actions composed of a fixed sequence of
regular actions) and have therefore been favored by the game development community
for a long time.
Examples of state-space planners
With the advent of Graphplan (Blum, 1995) and subsequent systems that used its
compact and efficient graph-based state representation, state-space planners became
more scalable to realistic domains. The GraphPlan algorithm operates on a structure called the planning graph, which consists of sequential temporal layers of state
variables and action variables in the following fashion:

    {S0, A0, S1, A1, S2, A2, ...}
The algorithm proceeds in two interleaved steps until termination: graph expansion
and plan extraction (see Algorithm 7).
After each graph expansion, the state liter-
als at the current level are inspected for mutex relations. These basically represent
constraints on which state literals can truly "co-exist" at any given time. Once we
find all the goal literals non-mutexed in a level, we can attempt to extract a plan.
The EXTRACTPLAN subroutine is implemented as a backward chaining search algorithm that takes advantage of the pre-calculated mutex relations and uses heuristics
available from the graph.
Modern CP systems are often based on heuristic search, and their performance is
critically impacted by the quality and efficiency of that heuristic calculation. Domain
independent heuristics are desirable but often hard to come by. A popular usage
of the GraphPlan algorithm is to produce exactly such a heuristic.
The first level
where the goal literals appear non-mutexed is theoretically the shortest possible
plan length at which a certain goal might be reached by a valid plan. This level is
thus a theoretical minimum for the plan length and can therefore serve as an admissible heuristic for another search algorithm. A "tighter" heuristic can be extracted
if one also performs the EXTRACTPLAN routine but omits the del-effects of actions
(significantly reducing the number of mutexes in the graph and therefore simplifying the
planning problem). Several heuristic-based search algorithms take advantage of these
heuristics such as the FF planner (Hoffmann & Nebel, 2011) and FastDownward
(Helmert, 2006).
2.2.2
Decision Theoretic Planning
Markov Decision Processes
Markov Decision Processes (MDPs) (Bellman, 1957a) have long been a favored problem representation among AI researchers, especially in the Reinforcement Learning and Probabilistic Planning communities. The model is based on the assumption that all information relevant to solving a planning problem can be encoded in a state, and furthermore that no element of the domain dynamics (transition probabilities, rewards, etc.) should ever depend on any state history other than the single previous state. This is referred to as the Markovian Property. A fully specified
MDP is represented by:
" S: A finite set of states.
* A: A finite set of actions.
* T(s, a, s'): A transition function that defines the transition distributions Pr(s'ls,a).
" C(s, a) or R(s, a): A cost or reward function.
* 7: A discount factor.
The representation of an MDP can encode either action costs or action rewards as positive quantities; this creates no significant difference except for
whether or not to use a min or max operator in the Bellman value update calculation
(equation 2.2) and how to interpret upper and lower bounds of value functions. For
reward-based domains the upper is the "greedy" boundary which should be initialized
to an admissible heuristic whereas for cost-based domains the opposite is true. In
this document we will always refer to cost-based domains unless otherwise specified.
The transition function T encodes the dynamics of the environment. It specifies
how each action will (possibly stochastically) transition the current state to the next
one. The cost/reward function C/R can be used to encode the goal of the planning
task or more generally to specify states and/or actions that are desirable.
A solution to an MDP is called an action policy and is often denoted by the symbol π. It represents a mapping from a state to an action that should be taken in that state, π : S → A. The optimal action policy is often referred to as π*, and it is the policy that maximizes the Expected Future Reward for acting in this domain.
The following equations are called the Bellman equations, which recursively define
the value of a state as a function of the cost of greedily choosing an action and an
expectation over the successive state values. The solution to these equations can be
found via Dynamic Programming.
Q(s, a) := C(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s')     (2.1)

V(s) := min_{a∈A} Q(s, a)     (2.2)

(For reward-based domains, equation 2.1 would use R instead of C and equation 2.2 would use a max operator instead of min.)
The optimal action policy can therefore be defined as always choosing the action with the lowest Q value: π*(s) := argmin_{a∈A} Q(s, a).
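As a concrete illustration of equations 2.1 and 2.2 (not code from this thesis, whose planner is implemented in Java), a minimal cost-based value-iteration sketch in Python with dense NumPy arrays is shown below; the array shapes and names are assumptions:

import numpy as np

def value_iteration(T, C, gamma=0.95, eps=1e-6):
    """Dynamic-programming solution of the Bellman equations (2.1)-(2.2).

    T[s, a, s'] = Pr(s' | s, a) and C[s, a] is the action cost; returns the
    value function V and a greedy (cost-minimizing) policy.
    """
    V = np.zeros(T.shape[0])
    while True:
        Q = C + gamma * T @ V          # equation (2.1) for every state/action pair
        V_new = Q.min(axis=1)          # equation (2.2): min over actions (cost-based)
        if np.max(np.abs(V_new - V)) < eps:
            break
        V = V_new
    return V, Q.argmin(axis=1)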
Partially Observable Markov Decision Processes
MDPs still play an important role in the autonomous planning literature and are
a sufficient representation for a large host of problems.
An important limitation
of MDPs is that they can only represent uncertainty in the transition function but
assume that the planning agent can always perfectly sense its state at any time.
Partially Observable Markov Decision Processes (POMDPs) (Kaelbling et al., 1998)
are an extension to the MDP model that is capable of representing both transitional
uncertainty as well as observational uncertainty (this can be thought of as "actuator
noise" and "sensor noise"). To fully represent a POMDP model, in addition to the
aforementioned MDP parameters S, A, T, C/R and -y we need to specify:
* O: A set of observations.
* Ω(a, s', o): An observation function dictating probability distributions over observations given an action and a resulting state, Pr(o | a, s').
In a partially observable domain, the agent cannot directly observe its state and therefore needs to perform state estimation simultaneously with planning. A planning agent represents its uncertainty about its current state as a probability distribution over possible states, which we will refer to as a belief b, where b(s) := Pr(s | b).
Equation 2.3 demonstrates how the state estimation proceeds to update the belief b given that an action a is taken and observation o is received. To calculate b_a^o(s'), we sum up all possible transitions from any state s with non-zero probability in b to s', weighted by T(s, a, s') and b(s). That sum is then multiplied by the probability of observing o when taking a and landing in s', or Ω(a, s', o). Finally, this quantity is divided by a normalization factor that can be calculated with equation 2.4 but is not needed if we perform the belief update for all observations o ∈ O, since it will simply be the normalization factor that makes all of the numerators of equation 2.3 sum to one.
b_a^o(s') = Ω(a, s', o) Σ_{s∈S} T(s, a, s') b(s) / Pr(o | b, a)     (2.3)

Pr(o | b, a) = Σ_{s'∈S} Ω(a, s', o) Σ_{s∈S} T(s, a, s') b(s)     (2.4)
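As a concrete illustration (not code from this thesis), the belief update of equations 2.3 and 2.4 can be sketched in Python with dense NumPy arrays; the array shapes below are assumptions:

import numpy as np

def belief_update(b, a, o, T, Omega):
    """Belief update of equations (2.3) and (2.4).

    b     : belief vector over states, shape (|S|,)
    T     : T[s, a, s'] = Pr(s' | s, a)
    Omega : Omega[a, s', o] = Pr(o | a, s')
    Returns (b_a_o, Pr(o | b, a)); assumes the observation has non-zero probability.
    """
    predicted = b @ T[:, a, :]                 # sum_s T(s, a, s') b(s), one entry per s'
    unnormalized = Omega[a, :, o] * predicted  # multiply by Pr(o | a, s')
    pr_o = unnormalized.sum()                  # equation (2.4)
    return unnormalized / pr_o, pr_o           # equation (2.3)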
2.3
POMDP Planning Algorithms
Algorithms exist to solve for the optimal value function of POMDPs (Sondik, 1971), but this is rarely a good idea, as the belief space is infinitely large and only a small subset of it is relevant to the planning problem. A discovery by Sondik about the value function's piece-wise linear and convex properties led to a popular value function representation which consists of maintaining a set of |S|-dimensional α vectors, each representing a hyperplane over the state space. This representation is named after its author Sondik, and many algorithms, both exact and approximate, take advantage of it.
The Heuristic Search Value Iteration (HSVI) algorithm (Smith & Simmons, 2004) extends the ideas of employing heuristic search from (Geffner & Bonet, 1998) and combines them with the Sondik value function representation, but with upper and lower bound estimates. It employs an information-seeking observation sampling technique akin to that of BRTDP (introduced below), aimed at minimizing excess uncertainty. The Point-based Value Iteration (PBVI) algorithm does not use a bounded value function but introduced the novel concept of maintaining a finite set of relevant belief points and only performing value updates for those sampled beliefs.
Lastly, the SARSOP algorithm combines the techniques of HSVI and PBVI, performing updates to its bounded Sondik-style value function over a finite set of sampled belief points. Additionally, SARSOP uses a novel observation sampling method which uses a simple learning technique to predict which beliefs should have higher sampling probabilities in order to close the value gap on the initial belief faster.
2.4
Real-Time Dynamic Programming
Several methods exist for solving MDPs, most of which are based on learning the value function V : S → ℝ over the state space by solving the Bellman equations (equations 2.1 and 2.2) using dynamic programming.
Algorithms such as Value Iteration (Bellman, 1957b) and Policy Iteration (Howard,
1960) are successive approximation methods that can solve this problem by either
explicitly initializing V arbitrarily and then iteratively performing Bellman value updates or implicitly by iteratively improving an arbitrarily initialized policy.
2.4.1
RTDP
Real-Time Dynamic Programming (RTDP) (Barto et al., 1995) is a family of algorithms that perform asynchronous updates to the value function by combining simulated greedy action selection with Bellman updates. This approach leads to more
focused updates in a part of the state space that is relevant to the optimal action
policy.
The RTDP algorithm in its native form operates on a special case of MDPs called
Stochastic Shortest Path (SSP) problems that are the subset of all MDPs which have
absorbing terminal goal states and strictly positive action costs. Even though these
constraints seem like they would limit RTDP's applicability to general MDPs they
really do not as there are methods to transform general MDPs to SSP MDPs. Bonet
and Geffner have shown how this can be applied to POMDPs, and it is trivial to
apply their method to MDPs (Bonet & Geffner, 2009).
The basic RTDP algorithm (Algorithm 8) repeatedly simulates acting on what is currently the best estimate of the optimal greedy policy, while simultaneously updating state values, until it either finds the goal state or hits a depth limit. The value function can be initialized arbitrarily, but if it is initialized to an admissible heuristic then it can be shown that, under the assumption that the goal is reachable with positive probability from every state, repeated iterations of RTDP will yield the optimal value function V(s) = V*(s) for all relevant states.
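A single RTDP trial can be sketched as follows (illustrative Python, not the thesis implementation; it assumes the same dense-array MDP representation as the value-iteration sketch above and a set of absorbing goal state indices):

import numpy as np

def rtdp_trial(s0, goal, T, C, V, gamma=1.0, max_depth=200, rng=np.random):
    """One RTDP trial: greedy simulation with Bellman updates along the way.

    T[s, a, s'], C[s, a] and the mutable value array V follow the conventions
    above; V should be initialized to an admissible (lower-bound) heuristic.
    """
    s = s0
    for _ in range(max_depth):
        if s in goal:
            break
        Q = C[s] + gamma * T[s] @ V             # Q(s, a) for every action
        a = int(Q.argmin())                     # greedy action (cost-based)
        V[s] = Q[a]                             # Bellman update at the visited state
        s = int(rng.choice(len(V), p=T[s, a]))  # simulate the stochastic transition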
2.4.2
Extensions to RTDP
Several extensions have been proposed to improve the RTDP algorithm. These extensions generally attempt to improve convergence time by focusing updates on "fruitful"
parts of the state space.
Labeling solved states
Labeled-RTDP (Bonet & Geffner, 2003) (LRTDP) introduced a method of marking
states as solved once their values and those of all states reachable from them had
converged. Solved states would subsequently be avoided in RTDP's exploration of the state space. This effectively improves convergence time by creating a conceptual boundary of solved states which initially only contains the goal states, and
successively expanding the boundary out toward the initial state with every iteration
of the algorithm. Each iteration of the algorithm in turn is more efficient as it needs
to travel a shorter distance to meet the boundary.
Bounding the value function
In the previous approaches discussed above, a single value function is maintained. If
initialized to an admissible heuristic for the problem, the value function will represent
a lower bound of the true optimal value function. The following extensions of RTDP
also include an upper bound which should be initialized such that:
V_L(s) ≤ V*(s) ≤ V_H(s),   with   V_L(s) = V_H(s) = 0, ∀s ∈ G
We also define the following specifications of equation 2.1:
Q_L(s, a) = C(s, a) + γ Σ_{s'∈S} T(s, a, s') V_L(s')

Q_H(s, a) = C(s, a) + γ Σ_{s'∈S} T(s, a, s') V_H(s')
Bounded-RTDP (BRTDP) (McMahan et al., 2005) takes advantage of this bounded value function representation in two ways: for search iteration termination and for search heuristic guidance (Algorithm 9). Each trial iteration follows greedy action selection according to the lower value function boundary (as in RTDP) but performs value updates on both boundaries. Each iteration is terminated when the expected value gap of the next states becomes smaller than a certain fraction (defined by the parameter τ) of the value gap of the initial state s_I. This effectively achieves the same effect as the LRTDP termination criterion, except that this boundary of "solved" states is dynamic and moves further away from the initial state as its value becomes more certain. Lastly, BRTDP samples the next state to explore not from the transition function but rather from a distribution created by the value gaps at subsequent states weighted by their transition probabilities. This equates to a search heuristic that is motivated to seek uncertainty in the value function in order to quickly "collapse" its boundaries onto the optimal value V*.
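The BRTDP-style successor sampling and trial termination test can be sketched as follows (illustrative Python; VL and VU are arrays holding the lower and upper value bounds, and the parameter names mirror those used above):

import numpy as np

def sample_next_state(s, a, T, VL, VU, s0, tau, rng=np.random):
    """BRTDP-style successor sampling with gap-based trial termination.

    Successors are drawn in proportion to T(s, a, s') times their value gap
    VU(s') - VL(s'); returns None to signal trial termination when the expected
    gap drops below 1/tau of the gap at the trial-initial state s0.
    """
    gaps = T[s, a] * (VU - VL)
    B = gaps.sum()
    if B < (VU[s0] - VL[s0]) / tau:
        return None
    return int(rng.choice(len(gaps), p=gaps / B))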
Several algorithms have been proposed to improve RTDP using a bounded value function, each providing different search exploration heuristics and iteration
termination criteria. These generally attempt to focus exploration onto states that are
likely to have large contributions toward learning the optimal value function (Sanner
et al., 2009) and (Smith & Simmons, 2006).
2.5
Belief Branch and Bound Real-Time Dynamic
Programming
In this section we present a novel planning algorithm for POMDPs called Belief Branch and Bound Real-Time Dynamic Programming (B3RTDP). The algorithm extends the RTDP-Bel system with a bounded value function, Branch and Bound style search tree pruning, and influences from existing extensions to the original RTDP algorithm for MDPs.
2.5.1
RTDP-Bel
Geffner and Bonet proposed an extension to the RTDP algorithm (which was introduced for MDPs in Section 2.4.1), called RTDP-Bel, which is able to handle partially observable domains (Geffner & Bonet, 1998). The two most significant differences between RTDP and RTDP-Bel are what types of graphs the algorithms search over and how they store the value function.
RTDP searches a graph composed of states, a selection of actions, and stochastic
transitions into other states. RTDP-Bel searches a graph of beliefs, a selection of
actions, and stochastic transitions, through observations and associated probabilities,
into other beliefs. The two graph structures are depicted in figure 2-1.
The second and more significant contribution of this work is in the value function representation. Implementing a value function over beliefs is much more challenging than over states: even for domains with a finite number of states, the belief space is infinitely large, as the probability mass of a belief can be distributed arbitrarily over the finite states. One of the most commonly used representations (attributed to Sondik (Sondik, 1971)) maintains a set of α vectors, each of dimension |S|, where V(b) = max_α α · b. RTDP-Bel instead uses a function-approximation scheme which discretizes the beliefs and stores their values in a hash-table that uses the discretized belief as key.
b̂(s) = ceil(D · b(s))     (2.5)
An RTDP-Bel value function is therefore defined as:

V(b) = h(b̂)   if b̂ ∉ HASHTABLE
V(b) = HASHTABLE(b̂)   otherwise     (2.6)

The calculation of action Q values is adjusted to use the discretized value function:

Q(b, a) = c(b, a) + γ Σ_{o∈O} Pr(o | b, a) V(b_a^o)     (2.7)

where Pr(o | b, a) is calculated with Equation 2.4 and b_a^o with Equation 2.3.
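The discretized hash-table value function can be sketched as follows (illustrative Python, not the thesis's Java implementation; beliefs are represented here as sparse dictionaries from state to probability):

import math

class DiscretizedValueFunction:
    """RTDP-Bel style hash-table value function (a sketch of equations 2.5-2.6).

    Beliefs are sparse dictionaries mapping states to probabilities; the
    discretized belief serves as the hash-table key and unseen keys fall back
    to the supplied heuristic.
    """

    def __init__(self, D, heuristic):
        self.D = D                  # discretization factor of equation (2.5)
        self.h = heuristic          # callable: belief dict -> heuristic value
        self.table = {}

    def _key(self, b):
        # equation (2.5): each probability is discretized to ceil(D * b(s))
        return tuple(sorted((s, math.ceil(self.D * p)) for s, p in b.items() if p > 0))

    def value(self, b):
        return self.table.get(self._key(b), self.h(b))

    def update(self, b, v):
        self.table[self._key(b)] = v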
Figure 2-1: (a) Demonstrates how the state tree can be traversed by selecting actions and transition links to successor states according to the transition function T(s, a, s'). (b) Shows how traversing the belief tree is similar to traversing the state tree, except that when an action is taken in a belief b we use equation 2.4 to determine the "belief transition probability" to the successor beliefs, through observation probabilities, which can be calculated with equation 2.3.
Transformation from General POMDP to Goal POMDP
As previously discussed in section 2.4.1, the RTDP algorithm operates on so-called
Stochastic Shortest Path MDP problems.
Similarly, RTDP-Bel operates on Goal
POMDPs which satisfy the following criteria (only listing differences from general
POMDPs for brevity):
1. All action costs are strictly positive.
2. A set of goal states exist that are:
(a) absorbing and terminal.
(b) fully observable, upon entering them a unique goal observation is emitted.
These constraints seem on first sight quite restrictive and would threaten to limit
RTDP-Bel to only be applicable to a small subset of all possible POMDP problems.
This is not the case in reality as general POMDPs can be transformed to a goal
POMDP without much effort. The transformation is explained in detail in (Bonet &
Geffner, 2009) but basically proceeds as follows:
1. The highest positive reward in the discounted POMDP is identified and a constant C is defined as C := max_{s,a} R(s, a) + 1.
2. A "fake" goal state g is constructed along with a new unique observation o_g.
3. The observation function is modified to include the goal observation: Ω(a, g, o_g) := 1.
4. A cost function is defined such that C(s, a) := C − R(s, a) and C(g, a) := 0.
5. A new transition function is formulated that introduces a probabilistic transition to the goal state from any state with probability 1 − γ, where γ is the discount factor of the discounted POMDP: T_new(s, a, ·) := γ · T_old(s, a, ·), with the addition that T_new(s, a, g) := 1 − γ. (A code sketch of this transformation follows below.)
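The following Python sketch illustrates the steps above for dense-array POMDPs (an illustration only, not the implementation used in this thesis; the array shapes and index conventions are assumptions):

import numpy as np

def discounted_to_goal_pomdp(T, R, Omega, gamma):
    """Sketch of the discounted-to-goal POMDP transformation (Bonet & Geffner, 2009).

    T[s, a, s'] is the transition function, R[s, a] the reward function and
    Omega[a, s', o] the observation function of the discounted POMDP.
    Returns (T_new, C_new, Omega_new) with one absorbing goal state and one
    unique goal observation appended as the last state/observation indices.
    """
    S, A, _ = T.shape
    O = Omega.shape[2]
    g, og = S, O                               # index of the new goal state / observation

    C_new = np.zeros((S + 1, A))
    C_new[:S] = (R.max() + 1.0) - R            # C(s, a) := C - R(s, a) with C = max R + 1

    T_new = np.zeros((S + 1, A, S + 1))
    T_new[:S, :, :S] = gamma * T               # T_new(s, a, .) := gamma * T_old(s, a, .)
    T_new[:S, :, g] = 1.0 - gamma              # ... plus T_new(s, a, g) := 1 - gamma
    T_new[g, :, g] = 1.0                       # the goal state is absorbing

    Omega_new = np.zeros((A, S + 1, O + 1))
    Omega_new[:, :S, :O] = Omega
    Omega_new[:, g, og] = 1.0                  # entering the goal emits the goal observation
    return T_new, C_new, Omega_new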
Figure 2-2: Demonstrates how the transition function for the discounted Tiger POMDP is transformed into a Goal POMDP. From: (Bonet & Geffner, 2009)
2.5.2
Bounded Belief Value Function
As was previously mentioned, B3RTDP maintains a bounded value function over beliefs. In the following discussion, I will refer to these boundaries as separate value functions V_L(b) and V_H(b), but the implementation actually stores a two-dimensional vector of values for each discretized belief in the hash-table (see equation 2.6), and so only requires a single lookup operation to retrieve the values of both boundaries.
It is desirable to initialize the lower bound of the value function to an admissible
heuristic for the planning problem. This requirement needs to hold for an optimality
guarantee.
It is easy to convince oneself of why this is: imagine that at belief b_1, all successor beliefs of taking the optimal action a_1 have been improperly assigned inadmissible heuristic values (that are too high). This will result in an artificially high Q(b_1, a_1) value, causing the search algorithm to choose to take a_2 instead. If we assume that the successor beliefs of action a_2 were initialized to an admissible heuristic, then after some number of iterations we can expect to have learned the true value of Q*(b_1, a_2); but if that value is still lower than the (incorrect) Q(b_1, a_1), then we will never actually choose to explore a_1 and never learn that it is in fact the optimal action to take.
It is equally desirable to initialize the upper boundary VH(b) to a value that
overestimates the cost of getting to the goal. This becomes evident when we start
talking about the bounding nature of the B 3 RTDP algorithm, namely that it will
prune actions whose values are dominated with a certain confidence threshold. For
that calculation, we require that the upper boundary be an admissible upper heuristic
for the problem.
In the previous section, we discussed a method of transforming discounted POMDPs into goal POMDPs. This transformation is particularly useful because when working with goal POMDPs we get theoretical boundaries on the value function for free. Namely, there is no action that has zero or negative cost, which means that no belief, other than goal beliefs, has a zero or negative value. This means that the heuristic h_L(b) = 0 is an admissible (although not very informative) heuristic. Similarly, we know that since the domain is effectively discounted (through the artificial transition with probability 1 − γ to the artificial goal state), the absolute worst action policy an agent could follow would be to repeatedly take the action with the highest cost. Because of the discounted nature of the domain, this (bad) policy has a finite expected value of max_{s,a} C(s, a)/(1 − γ), which provides a theoretical upper bound. This value bound is called the Blind Action value and was introduced by (Hauskrecht, 2000).
Even though it is important that the heuristics (both lower and upper) be admissible, and it is nice that we can have guaranteed "blind" admissible values, it is
still desirable that the heuristics be informative and provide tighter value bounds.
Uninformed heuristics can require exhaustive exploration to learn which parts of the
belief space are "fruitful". An informed heuristic can quickly guide a search algorithm
towards a solution that can be incrementally improved. This is especially important
for RTDP-based algorithms as they effectively search for regions of good values and
then propagate those values back to the initial search node through Bellman updates.
What this effectively means is that the sooner the algorithm finds the "good" part of
the belief space, the quicker it will converge.
Since the lower boundary of the value function is used for exploration, the informative quality of the lower heuristic plays a much bigger role in the convergence time
of the algorithm.
Domain-dependent heuristics can be hand-coded by domain experts, which is a
nice option to have when the user of the system possesses a lot of domain knowledge
that could be leveraged to solve the problem.
In lieu of good domain-dependent
heuristics, we require methods to extract domain-independent heuristics that are
guaranteed to be admissible.
There are several different ways to obtain admissible lower heuristics from a problem domain. The most common method is called the QMDP approach and was introduced by (Littman et al., 1995). This approach ignores the observation model of the POMDP and simply solves the MDP problem defined by the specified transition and cost/reward models. This problem can be solved with any MDP solver, much faster than the full POMDP, and provides a heuristic value for each state in the domain, which can be combined into a belief heuristic as such: h(b) = Σ_{s∈S} h(s) b(s). The QMDP heuristic provides an admissible lower bound for the POMDP problem as it solves a strictly easier, fully observable MDP problem. This heuristic tends to work well in many different domains, but it generally fails to provide a good heuristic in information-seeking domains since it completely ignores the observation function.
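The QMDP heuristic can be sketched as follows (illustrative Python, not the thesis code; it reuses plain value iteration over the underlying MDP and returns a function that scores beliefs):

import numpy as np

def qmdp_heuristic(T, C, gamma=0.95, eps=1e-6):
    """QMDP lower-bound heuristic: solve the underlying MDP, ignore observations.

    T[s, a, s'] and C[s, a] are dense arrays; the returned callable scores a
    belief vector b as h(b) = sum_s h(s) b(s).
    """
    V = np.zeros(T.shape[0])
    while True:                                # plain value iteration (cost-based)
        V_new = (C + gamma * T @ V).min(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            break
        V = V_new
    return lambda b: float(b @ V)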
Another admissible domain-independent heuristic is the Fast Informed Bound (FIB), which was developed by Hauskrecht (Hauskrecht, 2000). This heuristic is provably more informative than QMDP as it incorporates the observation model of the domain. This more informative heuristic does come at a higher cost of O(|A|²|S|²|O|), whereas QMDP has the complexity of regular Value Iteration, or O(|A||S|²).
There also exist methods to improve the upper bound of the value function. One method is to use a point-based approximate POMDP solver to approximately solve the actual POMDP problem, as long as the approximation is strictly over-estimating (Ross et al., 2008). This is a very costly operation but can be worth it for certain domains.
The B3RTDP algorithm can be initialized to use any of the above or other heuristic strategies, but we have empirically found its performance satisfactory when initialized with the QMDP heuristic as the lower bound and the blind worst-action policy heuristic as its upper bound.
2.5.3
Calculating Action Selection Convergence
A central component of the B 3 RTDP algorithm is determining when the search tree
over beliefs and actions can be pruned. This pruning will lead to faster subsequent
Bellman value updates over the belief in question, less memory to store the search
tree, and quicker greedy policy calculation.
Traditionally, Branch and Bound algorithms only prune search nodes when their
bounds are provably dominated by the bounds of a different search node (therefore
making traversal of the node a sub-optimal choice). To mitigate the large belief space
that POMDP models can generate we experiment with pruning actions before they
are actually provably dominated. Figure 2-3 demonstrates how the search algorithm
might find itself in a position where it could be quite certain (within some threshold)
that one action dominates another but it could still cost many search iterations to
make absolutely sure.
We use the assumption that the true value of a belief is uniformly distributed between the upper and lower bounds of its value. Notice that the following calculations could be carried out for other types of value distributions, but the uniform distribution is both easier to calculate in closed form and appropriate, since we really do not have evidence to support a choice of a differently shaped distribution.
Figure 2-3: (a) Demonstrates example Q boundaries for three actions. Action a_3 is the greedy action to choose and it seems likely that its Q value will dominate that of a_2. (b) Beliefs and actions relevant to the figure on the left.
For readability we introduce the following shorthand notation:
Q"
=zmin(QH (a), QH (a'))
Q74a= max(QL(a), QL(a'))
G(a) = QH (a) - QL(a)
If we assume that the true value of a belief is uniformly distributed between its
bounds, then the action
Q
values are also uniformly distributed and the following
holds:
Pr(q = Q*(a)) =
{
1
G(a)
QL(a) < q <
0
otherwise
56
QH(a)
(2.8)
Pr(q < Q*(a) | q) = 1   if q < Q_L(a)
Pr(q < Q*(a) | q) = (Q_H(a) − q) / G(a)   if Q_L(a) ≤ q ≤ Q_H(a)
Pr(q < Q*(a) | q) = 0   if q > Q_H(a)     (2.9)

We are interested in knowing the probability that one action's Q value is lower than another's at any given time during the runtime of the algorithm, so that we can determine whether or not to discard the latter. This is a crucial operation for the bounding portion of the algorithm. The quantity we are interested in is therefore Pr(Q*(a) < Q*(a')), which is evaluated when deciding whether we can prune action a' because its Q value is dominated by that of a.
We start by noticing that there are two special cases that can be quickly determined, in which the quantity of interest is either 0 or 1. If Q_H(a) ≤ Q_L(a'), then all of the probability mass of Q*(a) is guaranteed to be below that of Q*(a') and therefore Pr(Q*(a) < Q*(a')) = 1. By the same rationale we have Pr(Q*(a) < Q*(a')) = 0 when Q_H(a') ≤ Q_L(a).

Pr(Q*(a) < Q*(a')) = 1   if Q_H(a) ≤ Q_L(a')
Pr(Q*(a) < Q*(a')) = 0   if Q_H(a') ≤ Q_L(a)
Pr(Q*(a) < Q*(a')) = equation 2.11   otherwise     (2.10)

And we carry out the following calculation:
Pr(Q*(a) < Q*(a'))

{We begin by applying the law of total probability}

= ∫ Pr(q < Q*(a') | q) Pr(q = Q*(a)) dq

{We apply equation 2.8}

= (1/G(a)) ∫_{Q_L(a)}^{Q_H(a)} Pr(q < Q*(a') | q) dq

{We apply equation 2.9 and split the integral into three intervals}

= (1/G(a)) [ ∫_{Q_L(a)}^{Q_L(a')} 1 dq + ∫_{Q_L(a')}^{Q_H^min} (Q_H(a') − q)/G(a') dq + ∫_{Q_H^min}^{Q_H(a)} 0 dq ]

= (Q_L(a') − Q_L(a))/G(a) + (2 Q_H(a')(Q_H^min − Q_L(a')) − (Q_H^min)² + (Q_L(a'))²) / (2 G(a) G(a'))     (2.11)
This calculation can also be demonstrated graphically for a deeper intuitive understanding. Figures 2-4, 2-5 and 2-6 demonstrate how this calculation equates to
finding the area under rectangles and triangles and can be done quite efficiently.
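The closed-form result of equation 2.11, together with the special cases of equation 2.10, can be sketched as a single function (illustrative Python; the argument names are ours, not from the thesis):

def prob_dominates(QL_a, QH_a, QL_b, QH_b):
    """Pr(Q*(a) < Q*(a')) under the uniform-bounds assumption (eqs. 2.10-2.11).

    (QL_a, QH_a) are the bounds of the greedy action a; (QL_b, QH_b) those of
    the candidate action a' that may be pruned.
    """
    if QH_a <= QL_b:                   # all mass of Q*(a) lies below Q*(a')
        return 1.0
    if QH_b <= QL_a:                   # all mass of Q*(a') lies below Q*(a)
        return 0.0
    G_a, G_b = QH_a - QL_a, QH_b - QL_b
    q_hi = min(QH_a, QH_b)             # Q_H^min
    q_lo = max(QL_a, QL_b)             # Q_L^max
    # Region where Pr(q < Q*(a') | q) = 1, weighted by the density 1/G_a.
    p = max(0.0, QL_b - QL_a) / G_a
    # Overlap region: integrate (QH_b - q) / G_b against the density 1/G_a.
    p += (QH_b * (q_hi - q_lo) - 0.5 * (q_hi ** 2 - q_lo ** 2)) / (G_a * G_b)
    return p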
Figure 2-4: Shows the Q boundaries for two example actions. The value of the true
Q*(a) is uniformly distributed between the bounds for both actions.
Figure 2-5: In addition to the Q distributions, the probability function Pr(q < Q*(a') | q) is plotted. This function always evaluates to the probability mass of the Q(a') distribution that lies between q and Q_H(a'), which for uniform distributions is a piecewise linear function of a particular shape.
Figure 2-6: Finally, this figure shows the function whose integral is our quantity of interest, Pr(Q*(a) < Q*(a')). This integral will always simply be the sum of rectangle and triangle areas for two uniform Q distributions.

2.5.4
Convergence Frontier
The Convergence Frontier (CF) is a concept created to take advantage of early action
convergence close to the initial search node. An intuitive understanding of it can be
gained by thinking about when the action choice for a belief has converged to only one
action that dominates all others. At that point, actually simulating action selection is
unnecessary as it will always evaluate to this converged action. This can be the case
for several successive beliefs. Figure 2-7 demonstrates how action choice can converge
over the initial belief as well as some of its successor beliefs, effectively extending the
CF further out whenever the action policy over a belief within it converges. When
planning for a given POMDP problem, the usefulness of the convergence frontier
depends on the domain-dependent difficulty of choosing actions early on.
The CF is initialized to only contain the initial belief with probability one. Whenever the action policy converges to one best action over any belief in the CF, that
belief is removed and the successor beliefs of taking that action are added with their
respective observation probabilities weighted by the original CF probability of the
originating belief. Sometimes the value function converges over a belief before the
Figure 2-7: Demonstrates how action choice can converge over a belief, creating
effectively a frontier of reachable successive beliefs with associated probabilities. This
effect can be taken advantage of to shorten planning.
action policy has converged. In this case, we simply remove this belief from the CF (effectively reducing the total CF probability sum below one), as the value function has been successfully learned at that node. This presents two separate termination criteria: (1) the total probability of the CF falls below a threshold, and (2) the total probability-weighted value gap of all beliefs in the CF falls below a threshold.
Pseudocode for the UPDATECF routine is provided in Algorithm 1. The routine iterates through every belief b currently in the frontier; in line 5 it checks whether the value function has collapsed over b, and if so b is simply removed (which reduces the total probability of the frontier since no subsequent beliefs are added for that node). In line 10 the routine checks whether only one action is left to be taken for b (the action policy has converged); if so, b is removed from the frontier and all successor beliefs of taking the converged action from b are added, with their respective observation probabilities multiplied by the probability of b in the frontier.
The short SHOULDTERMINATECF routine is also defined in Algorithm 1. It simply dictates that the algorithm should terminate when either of two conditions is satisfied. The first condition activates when the total probability of all beliefs in the frontier goes below a threshold φ. This means that when acting on the optimal policy, starting at the initial belief, it is sufficiently unlikely that any frontier belief is experienced. The second condition activates when the probability-weighted value gap of the frontier goes below the threshold ε; this means that the value function has been sufficiently learned over all the beliefs in the frontier.
Lastly, the SAMPLECF routine in Algorithm 1 simply creates a probability distribution by normalizing the probability-weighted value gaps of the beliefs in the frontier. In line 33 this distribution is sampled and the corresponding belief returned.
Algorithm 1: Convergence Frontier
 1  UPDATECF (c : ConvergenceFrontier)
 2      foreach b ∈ c do
 4          // If value of belief in c becomes certain, we remove it
 5          if V_H(b) − V_L(b) < ε then
 7              c.REMOVE(b);
 9          // If action policy has converged over belief, we add all successor beliefs
10          else if SIZE(A(b)) = 1 then
12              a := PICKACTION(A(b));
13              foreach o ∈ O | Pr(o | b, a) > 0 do
14                  if b_a^o ∉ c then
16                      c.APPEND(b_a^o);
18                  c.prob(b_a^o) := c.prob(b_a^o) + c.prob(b) · Pr(o | b, a);
20              c.REMOVE(b);

23  SHOULDTERMINATECF (c : ConvergenceFrontier)
24      // We terminate when either the total probability of the CF falls below a
        // threshold or when the probability weighted value gap does
25      return (Σ_{b∈c} c.prob(b) < φ) ∨ (Σ_{b∈c} c.prob(b)(V_H(b) − V_L(b)) < ε);

27  SAMPLECF (c : ConvergenceFrontier)
29      // We sample a belief from the frontier seeking high value uncertainty
31      ∀b ∈ c, g(b) := c.prob(b)(V_H(b) − V_L(b));
32      G := Σ_{b∈c} g(b);
33      return b' ~ g(·)/G;
2.5.5
Belief Branch and Bound RTDP
In the previous sections we have introduced the relevant concepts that we now combine into a novel POMDP planning system called Belief Branch and Bound RTDP (B3RTDP). The algorithm extends the RTDP-Bel (Geffner & Bonet, 1998) system but uses a bounded value function. It follows a belief exploration sampling strategy similar to the one BRTDP (McMahan et al., 2005) uses for MDPs, except adapted to operate on beliefs rather than states. This exploration heuristic chooses to expand the next belief node that has the highest promise of reduction in uncertainty. This approach realizes a search algorithm that is motivated to "seek" areas where information about the value function can be gained. An interesting side effect of this search strategy is that the algorithm never visits beliefs whose values are known (V_L(b) = V_H(b)), such as the goal belief, because there is nothing to be gained by visiting them. Finally, B3RTDP is a Branch and Bound algorithm, meaning that it leverages its upper and lower bounds to determine when certain actions should be pruned out of the search tree.
The B3RTDP system is described in detail in Algorithms 2 and 3, but we will informally describe its general execution here. Initially, the upper bound on the value function V_H is initialized to a Blind Policy value (Hauskrecht, 2000), which can generally be easily determined from the problem parameters (namely the discount factor γ and the reward/cost function R/C). The lower bound of the value function is initialized to an admissible search heuristic such as QMDP, which can be efficiently calculated by solving the MDP that underlies the POMDP in question while ignoring the observation model.
The initial belief b_I is added to the Convergence Frontier (CF, discussed in section 2.5.4) with probability one in line 5. Until convergence, as determined by the SHOULDTERMINATECF routine in Algorithm 1, the B3RTDP algorithm samples an initial belief for the trial, b_T, from the CF in line 8, performs a B3RTDPTRIAL, and finally updates the CF in line 12.
Each B3RTDPTRIAL in Algorithm 2 initializes a stack of visited beliefs in line 19 and proceeds to execute a loop until either the maximum search depth is reached or the termination criterion (discussed below) is met. On every iteration we push the current belief onto the stack, and the currently "best" action to take according to the lower boundary Q_L is found in line 28. To find this action we need to calculate the Q_L values for all actions, so we might as well use them to perform a Bellman value update on the current belief in line 32 (we also perform a Bellman update for the upper bound of the value function). We then select the next successor belief to explore, using the PICKNEXTBELIEF routine, in line 38. Once the loop terminates, we perform Bellman updates on both value boundaries (lines 55 and 57) and action pruning (line 51) on all the beliefs we visited in this trial, in reverse order. This is done both to propagate improved value boundaries back to the initial belief and to make successive search iterations more efficient.
The PICKNEXTBELIEF routine in Algorithm 3 starts by creating a vector containing the observation-probability-weighted value gaps of the successor beliefs of taking action a, in line 3. The sum of the values of this vector is called G and is used both for normalization and for determining termination (line 8). If G, which is the expected value gap of the next belief to be experienced, is lower than a certain portion (defined by τ) of the value gap at the trial's initial belief, then the trial is terminated. Otherwise we sample a value from the normalized vector and return the associated belief.
Lastly, the PRUNEACTIONS routine in Algorithm 3 simply iterates through the set of all actions available at the current belief, calculates the probability of each being dominated by the currently best action (line 19), and removes it from the set if that probability is higher than a threshold α.
Algorithm 2: The Belief Branch and Bound RTDP (B3RTDP) algorithm
 1  B3RTDP (b_I : Belief)
 3      // Initialize the Convergence Frontier to b_I
 5      INITIALIZECF(c, b_I);
 6      while !SHOULDTERMINATECF(c) do
 8          b := SAMPLECF(c);
10          B3RTDPTRIAL(b);
12          UPDATECF(c);
14      return GREEDYUPPERVALUEPOLICY();

15  B3RTDPTRIAL (b_T : Belief)
17      // Maintain a stack of visited beliefs
19      trace.CLEAR();
21      b := b_T;
22      while (trace.SIZE() < MAX_depth) ∧ (b ≠ ∅) do
24          trace.PUSH(b);
26          // Pick action greedily from lower Q boundary
28          a := argmin_{a'∈A_b} Q_L(b, a');
30          // Perform Bellman updates for both boundaries
32          V_L(b) := Q_L(b, a);
34          V_H(b) := min_{a'∈A_b} Q_H(b, a');
36          // Sample next belief to explore (see alg. 3)
38          b := PICKNEXTBELIEF(b, b_T, a);
40      // Value update and prune visited beliefs in reverse order
41      while trace.SIZE() > 0 do
43          b := trace.POP();
45          // Pick action greedily from lower Q boundary
47          a := argmin_{a'∈A_b} Q_L(b, a');
49          // Prune dominated actions (see alg. 3)
51          PRUNEACTIONS(b, a, A_b);
53          // Perform Bellman updates for both boundaries
55          V_L(b) := Q_L(b, a);
57          V_H(b) := min_{a'∈A_b} Q_H(b, a');
The following are the important parameters for B3RTDP, along with a discussion of their impact on the algorithm:

* D: Belief discretization factor. Determines how densely the belief space is clustered for value updates. If set too low, belief clustering might merge beliefs that should receive very different values and negatively impact the planning result. If set too high, the value hash-table will grow very large and many updates will be required to learn the true values of beliefs. Typical range: [5, 20]. See equation 2.6.
* α: Action convergence probability threshold. This parameter determines when it is appropriate to prune an action from the selection at a given belief. If Q*(b, a_1) dominates Q*(b, a_2) with probability higher than α, then a_2 is pruned. Typical range: [0.65, 1]. See equation 2.11.

* ε: Minimum value gap. This threshold dictates that the search algorithm has converged when V_H(b_I) − V_L(b_I) < ε, or when the probability-weighted value gap of the Convergence Frontier beliefs is below ε. Typical range: [0.0001, 0.1].

* φ: Minimum Convergence Frontier probability. This provides a secondary termination criterion for the algorithm. When the total probability of the CF falls below φ, the algorithm terminates. Typical range: [0.0001, 0.01].

* τ: Trial termination ratio. This parameter is used to determine whether a search trial should be terminated. When the search arrives at a belief where the expected value gap of the successor beliefs is lower than a ratio, measured by τ, of the value gap at the trial's initial belief, then the iteration is terminated and value updates are propagated backwards to the trial's initial belief. Typical range: [5, 100].

The parameters that have the biggest impact on the efficiency of B3RTDP are D and α. For all of the evaluations and future discussion we will use the values ε = 0.01, φ = 0.001 and τ = 10, and show results with varying values of D and α.
Algorithm 3: Subroutines of the B3RTDP algorithm
 1  PICKNEXTBELIEF (b, b_T : Belief, a : Action)
 3      ∀o ∈ O, g(b_a^o) := Pr(o | b, a)(V_H(b_a^o) − V_L(b_a^o));
 5      G := Σ_o g(b_a^o);
 7      // Terminate search iteration when the expected value uncertainty at the
        // current transition is lower than some portion of the trial-initial
        // belief value uncertainty
 8      if G < (V_H(b_T) − V_L(b_T)) / τ then
10          return ∅;
12      // Sample next belief according to probability weighted value uncertainty
14      return b' ~ g(·)/G;

15  PRUNEACTIONS (b : Belief, a_best : Action, A_b : ActionSet)
16      foreach a ∈ A_b | a ≠ a_best do
18          // If probability that a_best dominates a is higher than α (eqn. 2.11)
            // then remove a
19          if Pr(Q*(b, a_best) < Q*(b, a)) > α then
21              A_b := A_b \ a;
2.6
Results
In this section we present evaluation results for the B3RTDP POMDP planning algorithm on two well-known evaluation domains. We have chosen to include evaluation results for a state-of-the-art POMDP planner called SARSOP (Kurniawati et al., 2008), which is a popular belief point-based heuristic search algorithm. The following points are good to keep in mind when comparing performance between the SARSOP planner and B3RTDP:
1. Because of the RTDP-Bel hashtable value function implementation, B 3 RTDP
consumes significantly more memory than SARSOP.
2. SARSOP takes advantage of the factored structure in the problem domains.
This makes for a significantly more efficient belief update calculation. There
is no reason why B3 RTDP could not do the same but it simply hasn't been
implemented yet (see section 2.7).
3. Even though both algorithms are evaluated on the same machine, SARSOP is implemented in C++, which compiles natively for the machine, whereas this implementation of B3RTDP is written in Java™. Its performance can therefore suffer from the level of virtualization provided by the JVM.
4. In the standard implementation, SARSOP uses the Fast Informed Bound lower
heuristic (see section 2.5.2). SARSOP is similar to B3 RTDP in that as a search
algorithm it can be provided with any number of different heuristics to initialize
its value function so we modified it to use the QMDP heuristic so that its results
would be more comparable with those of B3 RTDP . In actuality, the difference
between using the two heuristics was not very noticeable for these evaluations.
To evaluate the B3RTDP algorithm, we have chosen to use two commonly used POMDP problems called Rocksample and Tag. We will show the anytime performance of B3RTDP on these domains, that is, how well a policy performs when the algorithm is stopped at an arbitrary time and the policy is extracted. We will also compare convergence times using the Average Discounted Reward (ADR) measure, which measures how much discounted reward one could expect to garner from a problem by acting on a policy produced by the algorithm. All reported times are on-line planning times; we do not include the time to calculate the QMDP heuristic for either system, as there exist many different ways to solve MDPs and this does not fall into the domain of the contributions made by either POMDP planner.
2.6.1
Rocksample
Rocksample was introduced by Smith and Simmons to evaluate their algorithm Heuristic Search Value Iteration (HSVI) (Smith & Simmons, 2004). In this domain, a robotic rover on Mars navigates a grid-world of fixed width and height. Certain (known) grid locations contain rocks which the robot wants to sample. Each rock can have either the value good or bad. If the rover is in a grid location that has a rock, it can sample it and receive a reward of 10 if the rock is good (in which case sampling it makes the rock turn bad) or -10 if the rock was bad. The rover can sense any rock in the domain from any grid location with a sense_i action, which returns an observation about the rock's value stochastically, such that the observation is more accurate the closer the rover is to the rock when it senses it. The rover also receives a reward of 10 for entering a terminal grid location on the east side of the map. The rover's location is always fully observable, and the rock locations are static and fully observable, but the rock values are initially unknown and only partially observable through the sense actions. In RockSample_n_k, the world is of size n × n and there are k rocks.
The robot can choose from the actions move_north, move_south, move_east, move_west, sample, and sense_1, ..., sense_k (one sense action per rock).
Algorithm                   ADR            Time [ms]
SARSOP                      21.28 ± 0.60   100000*
SARSOP                      20.35 ± 0.58   1000*
B3RTDP (D=15, α=0.95)       21.47 ± 0.02   1300
B3RTDP (D=15, α=0.75)       21.45 ± 0.02   1241
B3RTDP (D=10, α=0.75)       21.03 ± 0.39   508

Table 2.1: Results from RockSample_7_8 for SARSOP and B3RTDP in various configurations. We can see that B3RTDP confidently outperforms SARSOP both in reward obtained from the domain and in convergence time (* means that the algorithm had not converged but was stopped to evaluate its policy). The ADR values are provided with 95% confidence intervals.
Figure 2-8: Shows the ADR of B3RTDP in the RockSample_7_8 domain. The algorithm was run with D = 10 and α = 0.75, and ADR is plotted against planning time with error bars showing 95% confidence intervals calculated from 50 runs. The plot also marks B3RTDP's convergence time and the SARSOP result at 1000 ms.
Algorithm                   ADR            Time [ms]
SARSOP                      -5.57 ± 0.52   100000*
SARSOP                      -6.38 ± 0.52   1000*
B3RTDP (D=20, α=0.85)       -5.43 ± 0.09   72137
B3RTDP (D=15, α=0.65)       -5.88 ± 0.08   4950
B3RTDP (D=10, α=0.95)       -6.13 ± 0.28   1476
B3RTDP (D=10, α=0.65)       -6.28 ± 0.35   680

Table 2.2: Results from Tag for SARSOP and B3RTDP in various configurations (* means that the algorithm had not converged but was stopped to evaluate its policy). ADR stands for Average Discounted Reward and is displayed with 95% confidence bounds.
2.6.2
Tag
The Tag domain was introduced by Pineau, Gordon and Thrun to evaluate their Point-Based Value Iteration algorithm (Pineau et al., 2003). In this domain, a robot and a human move around a grid-world with a known configuration. The robot's position is fully observable at all times, but the human's position can only be observed when the two occupy the same grid cell. The robot chooses among the following actions: move_north, move_south, move_east, move_west and tag, and receives a reward of -1 for every move action, a reward of -10 for the tag action if it is not in the same grid cell as the human, and a reward of 10 if it is (which leads to a terminal state). For every move action, the human moves away from the robot's position in a stochastic but predictable way.
Figure 2-9: Shows the ADR of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D (D = 10, 15, 20). ADR is plotted with error bars showing 95% confidence intervals calculated from 20 runs of the algorithm.
Figure 2-10: Shows the convergence time of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D (D = 10, 15, 20). Convergence time is plotted with error bars showing a 95% confidence interval calculated from 20 runs of the algorithm. We can see that the convergence time of B3RTDP increases both with higher discretization and with a higher requirement of action convergence before pruning. This is an intuitive result, as the algorithm also garners more ADR from the domain in those scenarios.
Figure 2-11: Shows the ADR of B3RTDP in the Tag domain. The algorithm was run with D = 15 and α = 0.65, and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs. We can see that B3RTDP converges at around 370 ms; at that time SARSOP is far from convergence but has started to produce very good ADR values.
2.7
Discussion and Future Work
2.7.1
Discussion of Results
As we can see from Tables 2.1 and 2.2, B3RTDP can outperform the SARSOP algorithm both in the Average Discounted Reward (ADR) measure and in convergence time. We can also see from the graphs in Figures 2-8 and 2-11 that the anytime behavior of the algorithm is quite good: if the planner were stopped at any time before convergence, it could produce a policy that returns decent reward.
We know that these benefits are largely due to the following factors:
1. The belief clustering which is inherent in the discretization scheme of our value
function representation. This benefit comes at the cost of memory.
2. The action pruning that significantly improves convergence time and is enabled
by the boundedness of the value function.
3. The search exploration heuristic which is guided by seeking uncertainty and
learning the value function rapidly.
It is satisfying to see such positive results, but it should be mentioned that two parameters of the B3RTDP algorithm most heavily impact its performance, namely D and α. These parameters have quite domain-dependent implications and should be reconsidered for each new domain the algorithm is run on. We show in our results how the performance of the planner varies both in ADR and in convergence time as a function of these parameters on the two domains.
2.7.2
Future Work
During the development of B3RTDP we identified several areas where it could be improved with further research and development.
Much of the current running time of the algorithm can be attributed to the belief update calculation of equation 2.3. This update is computationally expensive to carry out, or O(|O||S|²) in the worst case for each action (this can be mitigated by using sparse vector math when beliefs do not have very high entropy). Many POMDP problems have factored structure which can be leveraged. This structure means that a state is described by a set of variables, each having its own transition, observation and reward functions. Factored transition functions are traditionally represented as Dynamic Bayes Nets (DBNs) and can yield a significant reduction both in the memory required to store the transition matrix and in the computational complexity of the belief update. This benefit is gained if the inter-variable dependence of the transition DBNs is not too complex.
RTDP-based algorithms would
clearly benefit greatly from taking advantage of factored structure, possibly even
more so than other algorithms as the hash table value function representation might
be implemented more efficiently.
To improve convergence of search-based dynamic programming algorithms, it is
desirable to spend most of the value updates on "relevant" and important beliefs.
If the point of planning is to learn the true value of the successor beliefs of the
initial belief so that a good action choice can be made, then we should prioritize
the exploration and value updating of beliefs whose values will make the greatest
contribution to that learning. B 3 RTDP already does this to some degree but could
do more. The SARSOP algorithm (Kurniawati et al., 2008) uses a learning strategy
where it bins beliefs by discretized features such as the upper value bound and belief
entropy. Then, this algorithm uses the average value in the bin as a prediction of the
value of a new belief when determining whether to expand it. Focused RTDP (Smith
& Simmons, 2006) also attempts to predict how useful certain states are to "good"
policies and focuses or prioritizes state value updates and tree expansion towards such
states. B 3 RTDP could take advantage of the many existing strategies to further focus
the belief expansion.
Chapter 3
Mind Theoretic Reasoning
3.1
Introduction
In this chapter, we address the main challenge of producing robotic teammate behavior that approaches the way a human teammate can dynamically anticipate, react to, and interact with another human teammate. When people work together, they can reach a natural level of fluency and coordination that stems not just from understanding the task at hand but also from understanding how the other reacts to the environment, what part of the task they are working on at any given time, and what they might or might not know about the environment.
Castelfranchi wrote about modeling social action for AI agents and had some insightful thoughts that are relevant here (Castelfranchi, 1998):
"Anticipatory coordination: The ability to understand and anticipate
the interference of other events with our actions and goals is highly
adaptive. This is important both for negative interference and the avoidance of damages, and for positive interference and the exploitation of
opportunities." -Castelfranchi
As Castelfranchi points out, anticipating future events by modeling the social environment can not only help to avoid damages but can also present opportunities that can be exploited. Let's take an example: say that an agent knows that to accomplish its goal, a certain sub-task needs to be completed, and that the same sub-task is needed for the successful completion of another agent's goal. The agent can choose to either complete that sub-task, and do so in a way that makes it obvious to the other agent that it was completed in order to spare them the work, or it can spare itself the work by exploiting the opportunity presented by knowing that the other is likely to complete it later.
"Influencing: The explicit representation of the agents' minds in terms of
beliefs, intentions, etc., allows for reasoning about them, and even
more importantly it allows for the explicit influencing of others,
trying to change their behavior (via changing their goals/beliefs)..."
-
Castelfranchi
Not only can agents that model their social environment passively avoid damages
and exploit opportunities, but they can also actively predict how they might manipulate the other to produce a more favorable outcome. This distinction has also been
called weak social action vs. strong social action. This can clearly be advantageous in
many scenarios, but particularly in collaborative settings such as human-robot teamwork. A robot should be able to model its teammate and leverage that model not only to passively exploit predicted actions but also to actively manipulate, for example by making sure the teammate is aware of relevant features in the environment that they might not have been aware of otherwise.
3.1.1
Mind-Theoretic Planning
In this chapter, we present the development of a Mind-Theoretic Planning (MTP) system. We believe that it is crucial for a robot or an autonomous agent that is charged with collaborating with people in human-robot teams to have some understanding of human mental states and to be able to take them into account when reasoning about and planning in the world. We strive to accomplish the following design goals with such a system. It should be able to:
" reason about the following types of unobserved mental states of its human
teammate:
- possible false beliefs they might have about the environment
- possible goals or desires they have
* predict future actions based on what their mental state is
" plan its own actions in a way that:
79
- seeks to better know which mental state the other has if useful
- seeks to correct false beliefs of other if useful
- exploits opportunities created by prediction of actions
- avoids damages anticipated by predicted actions
- accomplishes the goals of human-robot team
The rest of this chapter is dedicated to introducing the relevant background concepts, reviewing existing literature, demonstrating the earlier approaches which led to the MTP system, presenting the final version of the MTP system, and finally presenting the methods and results of evaluating it.
3.1.2
Overview of Chapter
Background
In this section, a few of the important underlying psychological concepts of mental
state reasoning are presented and their relevance to this thesis discussed.
Overview of Related Research
This section will outline some important existing research in the areas of computational mind-theoretic reasoning, human-robot teamwork and belief space planning.
Earlier Approaches to Problem
Here we will give a brief synopsis of a couple of approaches that we experimented with
to implement the MTP system. This section simply serves to show how we arrived
at the final formulation, which is presented in the next section. Much more detail on
these earlier approaches is provided in the appendix of this thesis.
Mind Theoretic Planning
This section details the final implementation of the MTP system using a variety of
Markov Decision Process models both fully and partially observable.
Evaluation
In this section, we describe the simulators and the on-line game environment that were developed to evaluate the MTP system. We also describe the experimental setup and results of a user study that demonstrates the capabilities of the MTP system and evaluates its impact on human-robot teamwork.
3.2
Background
In this section, we introduce some of the psychological concepts that underlie human mental state reasoning and discuss their relevance to our MTP system. We also introduce basic concepts from logic, knowledge representation, and probabilistic reasoning. Note that we provide no background on the autonomous planning literature here, as that was covered in chapter 2 (section 2.2).
3.2.1
Theory of Mind
Theory of Mind (ToM) is the term that has been coined for our ability to reason about other people's behavior in terms of their internal states (such as beliefs and desires). Having a ToM allows an individual to understand and predict others' behaviors based not only on directly observable perceptual features but also on knowledge about the other person: what they have done in the past, what they know about their environment, what types of relationships they have with others that are involved, and more. Reasoning about their thoughts, beliefs, and desires can also include what beliefs they hold about you, or even about your beliefs. This can become a recursive process, as demonstrated in figure 3-1, although it seems unlikely that a human reasoner would take more than about two recursive steps, corresponding to what would be level four in the figure (not shown: "Jill believes that Jack knows that she is mind-reading him").
This skill not only helps us understand others' behavior better and make more accurate predictions about their possible future actions, but it also supports traits that are important for our society to function. These include developing and feeling empathy and compassion for other people and their suffering (Christian Keysers, 2008), which has been empirically tested and confirmed with fMRI studies (Singer et al., 2004) and (Singer et al., 2006).
Figure 3-1: Demonstrates the recursive nature of ToM, with panels for Level 1 (mental states), Level 2 (mind-reading), and Level 3 (embedded mind-reading, the first recursion). Adapted from (Dunbar, 2005).
3.2.2 Internal Representation of ToM
The internal mechanisms that are used to understand, interpret, and reason about others' behavior have also become a topic of much debate and discussion in the ToM community. The debate has been about whether people actually use a naive theory of psychology (Theory-Theory or TT) to make inferences or predictions about human behavior, or whether they feed the perceptual features they believe the observed person perceives through their own cognitive mechanisms and then read the suppressed outputs of those mechanisms to reason about that person's mental states (Simulation-Theory or ST). Many researchers prefer the TT account (Saxe, 2005) (Gopnik & Wellman, 1992), or at least a hybrid account that explains mind-reading as a combination of simulation and theorizing, whereby a theory might govern how the perceptual inputs should be transformed before being fed into the cognitive mechanisms, and possibly also how to interpret the results of the simulation (Saxe, 2005) (Goldman, 2006) (Nichols & Stich, 2003).
3.2.3 False Beliefs
The ability to attribute false beliefs to others has become recognized as a fundamental
capability that a ToM affords. Understanding that another might hold a belief about
a certain state of the world that is incorrect (or at least different from your own)
is called understanding false belief. False belief tasks have become a benchmark for
the development of ToM in children, as an understanding of it is believed to be an
important developmental milestone. Failure to attribute false belief to another by a
certain age could be considered a sign of some cognitive developmental deficiencies
such as autism (Perner et al., 1989). A classic false belief task is the Sally-Anne task
which was originally proposed by (Wimmer & Perner, 1983). In this task, a child is shown
a cartoon strip (see figure 3-2 for all of the strips). In this story, Sally puts the ball
in the basket and leaves. Then, Anne moves the ball from the basket to the box and
leaves. After seeing the strip, the child is asked where Sally will start looking for the
ball. If the child has developed a ToM, she will understand that Sally did not witness
when Anne moved the ball and will therefore indicate that Sally will look where she
thinks the ball is, in the basket.
How we come to develop ToM and when is still a matter of some contention. Some
researchers claim that children interpret others' behavior in terms of their individual
beliefs and desires from birth (Onishi & Baillargeon, 2005), but most studies agree
that children are able to pass the false belief task at around the age of 3-5 years
(Wellman et al., 2001).
3.2.4 Mental State
The term "mental state" is fairly ambiguous in common usage and requires further specification to be meaningful. Several theories in human psychology and philosophy use this term to describe non-physical properties of our being: things that cannot be reduced to physical or biological states (Wikipedia, 2014).
Figure 3-2: A classic false belief situation involving the characters Sally and Anne (Image courtesy of (Frith, 1989)).
Common interpretations include: beliefs, desires, intentions, judgements, preferences
and even thoughts and emotional states such as happiness and sadness. In this thesis, we focus on two important types of mental states, namely propositional attitudes (beliefs) and desires (goals). These two categories of mental states cover a broad spectrum of concepts and are particularly helpful for predicting the behaviors of others
(Schiffer, 2012). If one understands what another person believes to be true and false
about the world in addition to knowing what that person desires or wants to achieve,
then the only piece missing to fully predict their behavior is a model of how they will
choose to bring about the changes needed to transition the world from what their
beliefs say it is to what their goals say they want it to be. This can of course be
a challenging task and is further complicated with uncertainty about how predicted
behavior is affected by the behavior of other agents, perception of unknown features
of the environment, and possible goal switching or re-prioritizing, just to name a few.
3.2.5 Knowledge Representation
We now switch gears slightly and discuss different means of reasoning about and
representing knowledge.
Much classical work in Artificial Intelligence has dealt with logical reasoning and inference with symbolic representations such as First Order Logic (Russell & Norvig, 2003). First-order systems provide a formal way to manipulate truth statements about objects and their relations to other objects, and to derive or infer the truth-values of statements that do not already exist in the knowledge base. It is now widely accepted in the field of AI that logical reasoning alone is not sufficient for making useful decisions in the real world because knowledge is almost never certain and our models of the world are inaccurate. Three sources of uncertainty that intelligent systems need to be able to cope with are (Korb & Nicholson, 2004):
1. Ignorance: Even certain knowledge is limited.
2. Physical indeterminism: There are many sources of randomness in the real
world.
3. Vagueness: Our models cannot be specific for every possible input or outcome.
The caveat is that we also know that learning and reasoning with fully unconstrained knowledge, or with no bias for focusing on relevant concepts, is intractable, and it also does not quite fit with our intuition for how human intelligence might work. We believe that a truly intelligent system should know when to use purely deterministic logical reasoning methods and when to take uncertainty into account.
Probabilistic Representation
Probabilistic information is often represented as a set of stochastic variables with
either discrete or continuous domains. The value of a stochastic variable can depend
on values of other variables in the set through a Conditional Probability Distribution (CPD) or Conditional Probability Table (CPT). In this discussion, we will only
consider variables with discrete domains.
Computing any quantity of interest in this set of variables is in the general case very computationally expensive; for example, finding the marginal probability that a variable takes a particular value can mean summing over 2^(N-1) states for N binary variables (Barber, 2011). In order to be able to reason efficiently about the set of variables, we need to constrain their interactions. In particular, we want to be able to leverage inherent independencies between variables and make independence assumptions when the computational gain of doing so does not come at too severe a cost in accuracy.
3.2.6 Bayesian Networks
Bayesian Networks (BNs) have become a widely accepted method to represent probability distributions because they can make the independence relations that exist in a distribution explicit, help in visualizing the distribution, and support efficient inference. BNs are Directed Acyclic Graphs, which means that they prohibit the possibility of traversing from a variable in the network, along the directed arcs, and arriving back at that variable. A BN is fully defined by a list of variables and their CPTs, but it is often useful to think of them only in terms of their inter-connectivity and then parameterize the generation of their CPTs. This parameterization is domain dependent but often takes the form of logic gates, where the truth value of a binary variable could for example be defined as a (possibly noisy) OR gate of its parent values.
Reasoning with Bayesian Networks
Making a query to a BN means to either request the probability distribution of a variable (possibly given some evidence) or a sample from that distribution. One might be interested in the marginal likelihood of a variable, sometimes called "model evidence," which means the likelihood that a variable takes a value given only the network it belongs to. Another quantity of interest is the posterior probability of a variable, which is the probability that a variable will take a value given the observed values of other variables. Four types of reasoning can be done with BNs (Korb & Nicholson, 2004):
1. Diagnostic: An effect is observed and a query is made about its cause
2. Predictive: A cause is observed and a query is made about its effects
3. Inter-causal: A cause and one of its effects are observed and a query is made
about other causes of that effect
4. Combined: A node's cause and effect are observed and a query is made about
the node
Any of these types of reasoning requires probabilistic inference in the BN. This inference can take several forms, and there are a multitude of both exact and approximate inference techniques; see (Guo & Hsu, 2002) for an exhaustive review. For reasonably large BNs, approximate inference algorithms are a sensible choice. A family of sampling-based algorithms called Markov Chain Monte Carlo (MCMC) is of particular use. This family includes the Gibbs sampling, Metropolis-Hastings, and Hamiltonian dynamics methods, all of which are different ways to create an estimation of the target distribution that we wish to sample from.
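The sketch below illustrates the basic Gibbs-sampling step on a toy collider network A -> C <- B with C observed, alternately resampling each unobserved variable from its conditional distribution; the network, its parameters, and the noisy-OR form of P(C | A, B) are assumptions chosen only to keep the example small.

import random

P_A, P_B = 0.3, 0.5
def p_c(a, b):                      # P(C = 1 | A, B), a simple noisy-OR
    return 1.0 - (1 - 0.8 * a) * (1 - 0.6 * b)

def gibbs(iters=50000, seed=0):
    """Approximate P(A = 1 | C = 1) by alternately resampling A and B."""
    rng = random.Random(seed)
    a, b = 0, 0
    count_a1 = 0
    for _ in range(iters):
        # Resample A from P(A | B, C = 1), up to normalization.
        w1, w0 = P_A * p_c(1, b), (1 - P_A) * p_c(0, b)
        a = 1 if rng.random() < w1 / (w1 + w0) else 0
        # Resample B from P(B | A, C = 1), up to normalization.
        w1, w0 = P_B * p_c(a, 1), (1 - P_B) * p_c(a, 0)
        b = 1 if rng.random() < w1 / (w1 + w0) else 0
        count_a1 += a
    return count_a1 / iters

print(gibbs())  # estimate of P(A = 1 | C = 1)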
Some novel and interesting approaches to represent and sample from BNs that have relevance to this thesis are (1) methods that exploit Context Specific Independence, where a CPT can be represented more efficiently with rules rather than a table (Boutilier et al., 1996), (Poole, 1997), and (2) Probabilistic Programming approaches, where programs, rather than tables, govern a variable's CPD (Gerstenberg & Goodman, 2011).
3.3 Overview of Related Research
This section describes some of the prior work in the field of reasoning about agents' beliefs, goals, and plans. Much progress has been made, and some very interesting results gathered, but many challenges still remain unaddressed. Notably, not much work exists that can reason explicitly about false beliefs (discussed in section 3.2.3). Furthermore, much of the work describes systems that are limited to either (1) estimating beliefs given observed actions, (2) predicting actions given estimated goals, or (3) planning actions that take others' actions into account. Limited work exists that tries to combine these methods in a holistic approach to a mind-theoretic agent system. In this thesis, we intend to frame and make considerable progress on some of these problems.
3.3.1 ToM for Humanoid Robots
Brian Scassellati was one of the first authors to discuss the possibility of endowing robots with ToM capabilities (Scassellati, 2002). In this work, he compared different models of ToM, introduced by Leslie (Leslie, 1994) and Baron-Cohen (Baron-Cohen, 1995), with respect to their applicability to robots.
Scassellati discussed how Leslie's model breaks the world down into three categories of agency: mechanical agency, actional agency, and attitudinal agency. The theory further suggests that a Theory of Body mechanism (ToBY) supports people's understanding of physical objects and mechanical causality. ToBY's inputs are theorized to be two-fold: one a three-dimensional object-centered representation of the world which has been processed by high-level cognition, and another which is more directly perception-based and focuses on motion. In the course of cognitive development, a Theory of Mind Mechanism (ToMM) emerges. This mechanism deals with the laws of agents, specifically their attitudes, goals, and desires. The model proposes a hierarchical segmentation of the world into a super-class of "background," of which "objects" that obey physical laws of movement and mechanics are a subclass, which can in turn be further classified as "agents." The realm of "objects" is governed by the ToBY mechanism, whereas "agents" are processed using the ToMM. This seems like a sensible organization of objects in the world as far as ToM is concerned.
By the author's account, Baron-Cohen's model of ToM breaks perceptual stimuli into two categories of interest: the first is concerned with objects (perceived through any modality) that appear to be self-propelled, and the second contains all objects within the visual stimuli that appear to have eye-like shapes. In Baron-Cohen's model of ToM there are four distinct important modules: the Intentionality Detector (ID), the Eye-Direction Detector (EDD), the Shared Attention Mechanism (SAM), and the Theory of Mind Mechanism (ToMM). ToM-related cognition filters perceptual features through the ID and EDD to the SAM and finally to the ToMM module.
Scassellati also presented implementations of some of the lower-level cognitive or perceptual modules required for robots to possess these models of ToM, namely a module for differentiating inanimate from animate objects as well as automatic gaze following.
3.3.2 Polyscheme and ACT-R/E
Trafton et al. analyzed multiple hours of audio recordings of astronauts performing a training task and showed that about 25% to 31% of the utterances involved perspective-taking (such as "on your left side" or "come straight down from where you are," etc.) (Trafton et al., 2005). Their interpretation of the results is that it is vital for robots to be able to take the perspective of the human in order to provide effective interactions.
The authors then proceeded to implement a robotic perspective-taking system based on the Polyscheme architecture. Their system is a symbolic reasoning and planning system that integrates multiple representations and inference techniques to produce intelligent behavior.
Trafton's lab has explored how to make robots more efficient teammates by enabling the robots to model their human teammates using what they call a "cognitively plausible" system that mentally simulates the decision-making process of the teammate (Adams et al., 2008). Their rationale for modeling the other agent for the sake of the quality of teamwork is that it would reduce the amount of monitoring required. We are not sure that this reasoning holds, as the simulation of the teammate can only be as good as the inputs into its computation. We assume that for the simulation of the teammate's decision-making strategy to be as efficient as possible, all available data should be provided to the simulation via as much monitoring as the task allows. Therefore, we think a better rationale for performing ToM modeling of other agents for teamwork is to reduce the communication required rather than the monitoring, as communication can be redundant if all agents can monitor and understand the behavior of their teammates.
Their implementation extended the ACT-R cognitive architecture to handle embodied agents by incorporating spatial reasoning and navigation modules (ACT-R/E). The researchers evaluated their work using a computer simulation of a
warehouse patrol problem with two patrollers, one robot and one human. When an
alarm is heard, the agents need to make it to two guard stations. The problem they
focused on is which agent should go to which station. They experimented with two
strategies that the robot could pursue:
1. Self-centered strategy: The agent simply moves to the closest guard station
2. Collaborative strategy: The agent attempts to predict the teammate's choice
of stations and selects the other one
A concern that immediately surfaces is that even when the robot selects the more sophisticated strategy of predicting the teammate's choice of stations, it makes the simplifying assumption that the human would select the less sophisticated strategy of simply choosing the closest station. This might be an oversimplification of the intricate socio-cognitive capacities of the human's decision-making process.
Finally, the authors report that in simulation, the robot and simulated human agent performed fewer steps when the robot followed the "collaborative strategy" than when it followed the "self-centered strategy." They claim that some actual human-subject experiments were performed using an iRobot B21r and that the results were very similar. It would be interesting to know how the human experiments were similar or different, as it is possible that the humans were performing a much more sophisticated modeling of the robot's decision making than the robot was of them, and would therefore always compensate for the robot's actions, effectively rendering the robot's station choice irrelevant. It is even possible that in the non-simulated case, having the robot choose the "self-centered strategy" could be more efficient, as the human is likely to be able to move much faster than the robot and therefore the distance from the robot to its station would become the bottleneck of the system.
3.3.3 ToM Modeling Using Markov Random Fields
Butterfield, Jenkins et al. performed experiments with Markov Random Fields (MRFs) as a probabilistic representation for ToM modeling (Butterfield et al., 2009). In their setup of the MRF, each agent was represented by two vector-valued random variables: x_i for the agent's internal state, representing intentions, beliefs, and desires, and y_i for the agent's perception of the physical world, such as the presence or absence of certain objects. The agent models the team by a network of these variable pairs, one for each teammate, as can be seen in figure 3-3. Each agent's state vector x_i is conditioned on its perception vector y_i through the local evidence function φ(x_i, y_i), as well as on all of the other agents' state vectors through the compatibility function ψ(x_i, x_j).
Figure 3-3: Shows the inter-connectivity of the MRF network. Each agent is represented with an observation vector y_i and a state vector x_i. From (Butterfield et al., 2009).

This setup was then adapted to coordinate action selection in a multi-robot scenario by using a Belief Propagation algorithm where any robot's communication is
restricted to a limited set of "neighbors." The result of performing the BP algorithm
is effectively an action posterior for each agent, which they can sample to execute
actions within a joint task of the team.
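The following is a rough sketch of the kind of message-passing computation involved, written for a fully connected pairwise MRF; the table-based phi and psi inputs and the synchronous update schedule are assumptions for illustration and do not reproduce the authors' implementation.

import itertools
import math

def belief_propagation(n_agents, n_values, phi, psi, iters=20):
    """Synchronous loopy BP over a fully connected pairwise MRF.

    phi[i][x]             -- local evidence for agent i taking value x
    psi[(i, j)][(xi, xj)] -- pairwise compatibility between agents i and j
    Returns a normalized belief (an "action posterior") per agent.
    """
    msgs = {(i, j): [1.0] * n_values
            for i, j in itertools.permutations(range(n_agents), 2)}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            m = []
            for xj in range(n_values):
                s = 0.0
                for xi in range(n_values):
                    incoming = math.prod(msgs[(k, i)][xi]
                                         for k in range(n_agents) if k not in (i, j))
                    s += phi[i][xi] * psi[(i, j)][(xi, xj)] * incoming
                m.append(s)
            z = sum(m)
            new[(i, j)] = [v / z for v in m]
        msgs = new
    beliefs = []
    for j in range(n_agents):
        b = [phi[j][xj] * math.prod(msgs[(k, j)][xj]
                                    for k in range(n_agents) if k != j)
             for xj in range(n_values)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs

# Two agents, two candidate actions, mild preference for matching actions.
phi = {0: [0.7, 0.3], 1: [0.4, 0.6]}
psi = {(i, j): {(a, b): (1.5 if a == b else 1.0) for a in range(2) for b in range(2)}
       for i in range(2) for j in range(2) if i != j}
print(belief_propagation(2, 2, phi, psi))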
The MRF framework has been evaluated and has shown promise in being able to correctly explain known patterns in ToM tasks performed by children, given appropriate local evidence and compatibility functions. The framework has well-founded support for probabilistic manipulation of data and shows promise as an action selection mechanism in a multi-robot team task, but its capacity to model ToM in humans seems limited, or at least to operate at a very low level of cognition.
3.3.4 Plan Recognition in Belief-Space
The purpose of plan recognition systems is to infer the goals and plans of an agent
given observations of its actions, which makes them very relevant to the presented
work. A harder version of the plan recognition problem is to try to infer which beliefs
an agent needs to hold so that its observed actions are likely to be a part of a rational
plan towards its estimated goal; (Pynadath & Marsella, 2004) and (Ito et al., 2007)
developed a computational model for this kind of abductive reasoning for agents
in a multi-agent domain. This model is effectively a plan-recognition system that
models the beliefs, goals and actions of other agents with a POMDP formulation (see
background section 2.2.2). They calculate optimal policies for each agent to achieve
each of its goals and then perform a maximum likelihood search for the POMDP
policy that gives the highest value for the observed sequence of actions. This gives
the observer an estimate of the belief state of the observed agent as well as what their
most likely goal might be.
This approach lacks the representational power to reason about cases where other agents might be acting on false beliefs about the world (i.e., the false-belief problem from section 3.2.3), and it also does not model in an explicit way how agents' actions depend on each other or how agents could anticipate or react to each others' behaviors. These are important features of the presented work, which differentiate it substantially from theirs.
3.3.5 Inferring Beliefs using Bayesian Plan Inversion
This section describes a class of work where researchers speculate about the internal mechanisms that are employed by humans to predict and reason about the behavior of other agents. Many of the experiments are conducted in such a way that human observers watch an animation of an agent travelling in a grid-like world and are asked to make predictions about its future behavior at various time points. The animated agent simply plays out a pre-described motion as deterministically designed by the researcher. The researcher then creates several graphical models that employ different inference strategies and compares the prediction results of these models with those of the human observers. The relative similarities between behavioral predictions made by the models and the humans provide evidence that the models capture some aspects of the kinds of probabilistic behavioral inference that human minds might employ.
Tauber et al. performed an experiment where they showed that people use Line-of-Sight (LoS) cues as well as Field-of-View (FoV) to assign belief states to observed
agents and then use these beliefs to reason about the agents' behaviors (Tauber &
Steyvers, 2011). Their experiment showed that the graphical model that only used
the agent's LoS and FoV created predictions that were much more similar to humans'
predictions than models that used X-Ray vision (could see through obstacles but
were bound by FoV), only proximity, or were all-knowing. This suggests that if a
mind-theoretic agent wishes to accurately predict the actions of other agents (human
or autonomous) it would benefit from reasoning about their beliefs and perspective.
Figure 3-4: An example of the visual stimuli that participants were shown in (Baker et al., 2009). An agent travels along the dotted line and pauses at the points marked with a (+) sign. At each such judgment point, participants are asked to rate how likely the agent is to have each marked goal (A, B, and C). In (b) the participants were asked to retroactively guess the goal of the agent at a previous timepoint.
Baker et al. illustrate how Bayesian Inverse Planning (BIP) can be used in conjunction with the rationality principle (the expectation that agents will act optimally or rationally towards achieving their goals) to formalize the concepts of "ideal observer" or "ideal inference agent," terms that are used heavily in the cognitive science literature (Baker et al., 2009). This approach is especially useful for rational goal inference, as demonstrated in figure 3-4. Baker et al.'s inference model varied the richness of the goal representations (allowing agents to switch their goals during an episode, as well as to have sub-goals) by adjusting different prior probability models over goals. The researchers showed good correspondence between model
predictions and human judgments.
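A minimal sketch of the core inverse-planning computation is given below: a posterior over goal hypotheses given a partial state-action trajectory, assuming a softmax-rational action likelihood derived from per-goal Q-values. The softmax choice model and the toy Q-values are simplifying assumptions and not necessarily the exact likelihood used in the cited work.

import math

def inverse_planning_posterior(trajectory, Q, prior, beta=2.0):
    """Sketch of Bayesian inverse planning for goal inference.

    trajectory -- observed (state, action) pairs up to the judgment point
    Q[g]       -- dict mapping (state, action) to the value of acting toward goal g
    prior      -- dict mapping goal g to its prior probability
    """
    log_post = {}
    for g, p in prior.items():
        ll = math.log(p)
        for s, a in trajectory:
            acts = [a2 for (s2, a2) in Q[g] if s2 == s]
            logz = math.log(sum(math.exp(beta * Q[g][(s, a2)]) for a2 in acts))
            ll += beta * Q[g][(s, a)] - logz       # softmax-rational action likelihood
        log_post[g] = ll
    m = max(log_post.values())
    w = {g: math.exp(v - m) for g, v in log_post.items()}
    z = sum(w.values())
    return {g: v / z for g, v in w.items()}

# Tiny example: in state "s", action "right" is better under goal A, "up" under goal B.
Q = {"A": {("s", "right"): 1.0, ("s", "up"): 0.0},
     "B": {("s", "right"): 0.0, ("s", "up"): 1.0}}
print(inverse_planning_posterior([("s", "right")], Q, {"A": 0.5, "B": 0.5}))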
Baker's work was extended to make predictions about social judgments (Ullman
et al., 2010). In this scenario, two agents move around a 2D maze, which contains
a boulder that can block access through the grid and two goal-objects (a flower and
a tree).
One agent is small and the other is large. The large agent can push the
small agent and the boulder around. Participants are exposed to animated episodes
of the agents moving around in the maze and are asked to rate the goals of "flower,"
"tree," "help," or "hinder" (where "help" or "hinder" refer to whether the large agent
is trying to help or hinder the smaller agent). The predictions of the BIP model were
compared to human judgments as well as a simple movement-cue based model. The
cue-based model surprisingly generated predictions more similar to human judgments
about simple object-based goals than the BIP model did. But the cue-based model was not able to capture the more complex social goals (helping and hindering), which the BIP model predicted quite well.
More recently, this work was extended to represent what the authors call a "Bayesian
Theory of Mind" (Baker et al., 2011) (Jara-ettinger et al., 2012), a system of Bayesian
reasoning to infer mental states from actions. This work combines the ideas in the
previously discussed work of (Baker et al., 2009) and (Ullman et al., 2010) to create
a computational system that can generate plausible explanations for observed agent
behavior using Bayesian Inverse Planning in conjunction with belief and rationality
modeling with preferences. In this work, the authors present a model that can predict
the preferences of a college student looking for their favorite food truck. The results
show that the model can predict the agent's preferences in a compelling way.
Again, this work provides a computational model that can generate predictions
of agent behavior that closely resemble human judgments but does not address the
question of how to act in this domain or how an agent might manipulate the beliefs
of others. It does provide a principled way to think about and frame this problem
and shows that Bayesian inference provides a lot of flexibility to represent different
goal representations and agent preferences.
3.3.6 Game-Theoretic Recursive Reasoning
As figure 3-1 suggests, reasoning about Theory of Mind is inherently a recursive
process. Depending on the kinds of modeling one wants to perform in this domain,
recursion should be addressed to some degree. The following work discusses a way to
handle recursion gracefully and make an informed decision about how deep the model
should reach.
The Recursive Modeling Method (RMM) gained some popularity in the field of Decision-Theoretic Agents in multi-agent systems (Vidal & Durfee, 1995). This method effectively reasons about pay-off matrices, which dictate each agent's preference for choosing an action given the action choices of all other agents. RMM arranges these matrices in a tree structure with a probability distribution over every set of child matrices (see figure 3-5). The method assumes that those pay-off matrices are provided by some external entity (presumably a planning system) or are derived from statistics of observations of actions taken previously. How to produce these matrices is actually not well explained in the paper. The authors admit that they are pre-calculated per domain and that simplifying assumptions were used, such as maintaining no history and having all mental states always be immediately derivable from the instantaneous physical situation. The authors chose to focus on showing how to effectively use dynamic programming methods to solve the problem of finding the best action to take given the pay-off matrix hierarchy. Their approach is especially useful for reasoning about when not to expend more computation on going deeper into the recursive hierarchy of matrices and is therefore able to provide a good compromise between deep social reasoning and computational cost. They call this Limited Rationality and argue that agents' reasoning capabilities can often be outstripped by the amount of data available, which makes it necessary to meta-reason about when and what to reason about (Durfee, 1999).
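To make the dynamic-programming idea concrete, here is a rough sketch of evaluating such a hierarchy for a two-action game; the nested-dictionary node structure, the role-swapping convention for child matrices, and the uniform fallback at the bottom of the recursion are all assumptions made for illustration, not the published algorithm.

def predict_other(node):
    """Distribution over the other agent's actions at this node."""
    n = len(node["payoff"][0])
    if not node["children"]:
        return [1.0 / n] * n          # no deeper knowledge: assume a uniform choice (0.5 each for 2 actions)
    dist = [0.0] * n
    for weight, child in node["children"]:
        # In the child node, roles are swapped: the other agent picks its best action there.
        values = evaluate(child)
        best = max(range(n), key=lambda a: values[a])
        dist[best] += weight
    return dist

def evaluate(node):
    """Expected payoff of each of this node's own actions."""
    other = predict_other(node)
    return [sum(p * node["payoff"][a][b] for b, p in enumerate(other))
            for a in range(len(node["payoff"]))]

# Example: P's payoffs for actions A/B vs Q's A/B, with one level of recursion.
leaf = {"payoff": [[1, 0], [0, 2]], "children": []}
root = {"payoff": [[2, 0], [1, 3]], "children": [(1.0, leaf)]}
print(evaluate(root))  # expected payoff of P's actions A and B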
Figure 3-5: An example of an expanded tree structure of pay-off matrices as perceived by P when choosing whether to execute action A or B. With probability p_i, P thinks that Q views the payoff matrix in a certain way (represented by a different matrix) and with probability (1 - p_i) in another. This recursion continues until no more knowledge exists, in which case a real value is attributed to each action (0.5 in the uninformed case) (Durfee, 1999).
RMM was extended to incorporate Influence Diagrams (IDs), which are able to approximate optimal policies in Simple Sequential Bayesian Games (SSBGs) (Sondberg-Jeppesen & Jensen, 2010). Framing the problem as an SSBG places it in a game-theoretic framework and therefore should produce policies that tend to model, and be adaptive to, the other agent's policy. An insight that the authors apply in their implementation to approximate the solution is to experiment with both removing arcs in their ID and adding them, and to observe the effects these changes have on the policy. This is equivalent to assuming that the agent knows "less" or "more" about the variables in the domain, either of which could make the computation less difficult according to the authors' claims.
Finally it is worth mentioning the work of (Zettlemoyer et al., 2009) in this context
as they have formalized the computation of infinitely recursive belief management in
multi-agent POMDP problems. They developed an algorithm that uses finite belief
representation for infinitely nested beliefs and showed that in some cases these can be
computed in constant time per filtering step. In more complex cases, the algorithm
can prune low probability trajectories to produce approximately correct results.
3.3.7 Fluency and Shared Mental Models
Researchers have pursued many different ways of modeling "other" agents in multi-agent scenarios; here we will mention some notable work where the concept of the "other," including their behavior patterns, preferences, beliefs, and goals, is represented in different and interesting ways.
Human-robot teamwork, whether in research or industry, tends to be implemented as a rigid turn-taking process to avoid dealing with the many issues that arise with concurrent action execution and modeling teammates. In the Human-Robot Interaction (HRI) domain, prior systems achieve some of the seamless coordination and fluency that human teams tend to display when they have been jointly trained and cooperate in harmony. (Hoffman & Breazeal, 2007) developed an adaptive agent system capable of adjusting to the behavior of a human partner in a simulated domain of jointly assembling a cart from its parts. This system maintained statistics of world
transitions given the actions of the robot and employed a naive Bayesian estimate for the transition distribution. It would then consider four alternative proto-action sequences: <pick up a tool and use it>, <return a tool and return to workbench>, <return a tool, pick up a new tool and return to workbench>, and <do nothing>, and perform an optimization to reduce the expected cost of performing any of those sequences given the statistical estimates of how the world state would be advanced as a result.
An experiment compared this approach to a purely reactive system and demonstrated that in the best case there was a very significant difference in the objective measure of time-of-task-completion, but when averaged over all episodes there was actually no significant effect. The experiment's post-study questionnaire showed a significant difference in the perceived contribution of the robot to the team's fluency and success. This work demonstrates how powerful an effect a flexible and adaptive autonomous agent can have on a teammate's subjective experience of the agent's competencies. This work does not focus on the planning aspect of the problem but informs how behavioral statistics could be collected and exploited to create more adaptive and fluent robot teammates.
Figure 3-6: A view of the ABB RoboStudio Virtual Environment during task execution. The human controls the white robot (Nikolaidis & Shah, 2013).
Nikolaidis et al. took a completely different approach to a similar problem in their
work (Nikolaidis & Shah, 2013). They strived to achieve a "Shared Mental Model"
(SMM) between the human and its robot teammate by formulating the problem of
joint action selection of the team as a specialized MDP (see background section 2.2.2).
101
In this formulation, the robot's uncertainty about the human action is encoded in the probability distributions of the transition function T of the MDP model, and similarly the human's expectation of the robot's action is encoded in the reward function R. These functions are learned through an interactive training phase between the teammates, where roles are switched to gather data to estimate both functions. When the SMM has converged and the robot uses the optimal policy of its learned transition and reward functions, the algorithm is really selecting an action whose effects, compounded with the expected response of the human (encoded in T), will lead to a good world state as measured by the accumulated reward of visited states, which reflects how well the robot matches the expectations of the human. This method was shown to produce excellent results both on objective measures of fluency, such as concurrent motion and human idle time, as well as on subjective measures.
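A minimal sketch of how such a learned model could be turned into a policy is shown below: standard value iteration over a transition function T that folds in the expected human response and a reward function R that scores agreement with human expectations. The dictionary-based model format is an assumption for illustration; it is not the cited system's implementation.

def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    """T[s][a] is a dict {s': prob} (includes the expected human response);
    R[s][a] scores how well the robot action matches the human's expectation."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
              for s in states}
    return V, policy

# Tiny example: two states, robot action "wait" or "handover"; the transition for
# "handover" folds in an assumed 90% chance that the human accepts the part.
S, A = ["idle", "done"], ["wait", "handover"]
T = {"idle": {"wait": {"idle": 1.0}, "handover": {"done": 0.9, "idle": 0.1}},
     "done": {"wait": {"done": 1.0}, "handover": {"done": 1.0}}}
R = {"idle": {"wait": -1.0, "handover": 2.0}, "done": {"wait": 0.0, "handover": 0.0}}
V, pi = value_iteration(S, A, T, R)
print(pi["idle"])  # "handover"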
3.3.8 Perspective Taking and Planning with Beliefs
"Perspective taking" has been shown to be an important cognitive ability that humans
use in various socio-cognitive as well as spatio-cognitive tasks.
This activity can
provide the perspective taker with a more intuitive understanding of the other agent's
situation and is often useful in disambiguating communication or providing support
for behavior. Perspective-taking was integrated as a core function into the cognitive
architecture Polyscheme and was used to help the robot better understand human
instruction (Trafton et al., 2005).
The idea of using perspective-taking for robots was taken one step further by
Breazeal et al. in their behavioral architecture for a robot which maintains an almost
fully replicated version of all of its reasoning mechanisms for every observed human
(Breazeal et al., 2009). These replicated behavior systems operate in parallel with that
of the robot and process the robot's sensor stream as filtered through a perspective
filter appropriate for the human's FoV and perceptual range (see Figure 3-7).

Figure 3-7: Shows the trajectory by which data flows and decisions get made within the "Self-as-Simulator" architecture. Demonstrates how the robot's own behavior generation mechanisms are used to reason about the observed behavior of others as well as for performing perspective taking. From (Breazeal et al., 2009).

This
"Self As Simulator" system implements perspective taking at a very deep level of its
behavioral architecture and was shown to improve learning sessions between a human
and a robot by disambiguating demonstrations (Berlin et al., 2006).
In his thesis work, Jesse Gray employed the behavioral architecture from (Breazeal
et al., 2009) to create a robotic "Theory of Mind" system that can realize goals which
exist both in the real world as well as in the mental states and beliefs of others (Gray
& Breazeal, 2012), effectively implementing Castelfranchi's strong social action. This
work leverages the embodied aspect of robots (as opposed to virtual agents) and how it connects the physical world and its observable or occludable features with the hidden mental states of other agents. The system is able to come up with very high-resolution motion plans for fairly short time horizons, which execute useful actions in the world while attempting to maintain certain beliefs in another agent's mental state. In a demonstration of this system, the robot transports an object from one location to another while either hiding that object from the human or making sure it was observable by them, depending on study conditions. This was done to satisfy different mental-state goals for the human while achieving the real-world goal of transporting the object.
This research is relevant to many of the components of the presented thesis work, but it approaches the problem from a different direction and uses very different representations and algorithms. The presented thesis is informed by the efforts made in Gray's work but is not a direct continuation of it. In fact, the presented system is essentially complementary to Gray's and could employ it to find high-resolution solutions to sub-problems in a larger plan. The presented work will use more abstract action and goal representations, which will allow it to be applied to problems with longer time horizons and broader domains. Another differentiation is that in this work we will frame the problem using probability theory, which provides more robustness to different action outcomes and unanticipated human behavior. This could come at a cost of increased computational complexity and size of representation, which can be alleviated by the use of heuristics and approximate algorithms.

Furthermore, the presented work emphasizes predicting the future actions of others and how they might influence the state of the world as a function of time. This is different from Gray's work, where more consideration is given to how the future actions of the robot will influence the mental state of agents that are relatively static in time.
3.3.9 Belief Space Planning for Sidekicks
The existing work that is most similar to the work presented in this thesis is the dissertation of Owen Macindoe (Macindoe et al., 2012). Macindoe presented a system called Partially Observable Monte Carlo Cooperative Planning (POMCoP), which is a POMDP planning system based on Silver and Veness' Monte-Carlo planning framework POMCP (Silver & Veness, 2010). POMCoP performs belief space planning in a fully observable world with deterministic transitions where
the only unobservable variable is the agent model. Agent models are basically action policies, which are provided to the POMCoP system as input.

Figure 3-8: A robot hypothesizes about how the mental state of a human observer would get updated if it were to take a certain sequence of motor actions. This becomes a search through the space of motor actions that the robot could take, which terminates when a sequence is found that achieves the robot's goals as well as the mental state goals about other agents (Gray & Breazeal, 2012).
This work was evaluated with a simulated game where the actions of a human player were simulated with a noisy A* path planner. It would have been more satisfying to see the system evaluated with real human users to see how people perceive video game sidekicks that perform belief space planning over several possible hypothetical player models.
Figure 3-9: (a) Shows the game with a simulated human player in the upper right corner and a POMCoP sidekick in the lower left. (b) Shows comparative results of the number of steps taken to achieve the goal for a QMDP planner and different configurations of POMCoP (POMCoP, POMCoP-NC, and POMCoP-MD).
This work proceeds from similar motivations as the presented thesis work (in addition to framing the problem as a POMDP). After reading the work carefully and meeting with the authors to discuss it, we have determined the main points that differentiate the two contributions. Firstly, POMCoP does not consider how the human predictive policies are generated but expects them to be provided as inputs. The generation of the human predictive policies is one of the core contributions of the presented work, in particular how those predictive policies take into account their own prediction of the actions and reactions of the other. POMCoP also assumes perfect observability of the state but takes the actions of the other agent as its observation, whereas the presented work produces a unique observation for states that are indistinguishable from the robot's perspective. This difference is not so significant, as we are sure that POMCoP would only need slight modification to incorporate this, and POMCoP's perception model could easily be implemented in the presented system. Lastly, and very importantly, POMCoP does not take into account how the human agent might be acting on false beliefs about the environment but instead assumes that they can perfectly observe the true state at all times. This is another important contribution of the presented work.
3.4 Earlier Approaches to Problem
Here we will give a brief synopsis of a couple of approaches that we experimented with
to implement the MTP system. This section simply serves to show how we arrived
at the final formulation which is presented in the next section 3.5.
3.4.1 Deterministic Mind Theoretic Planning
Our initial idea for implementing MTP was to create a system that used Classical
Planning (CP) methods (see section 2.2.1) to create plans from the perspective of the
other agent which could then be used as predictions of how the other agent will act
and be incorporated in a plan for the robot (also generated by CP methods). This
approach was meant to benefit from the relative speed of certain CP methods while
producing plans that incorporate anticipation of the other's actions.
This approach is described generally with pseudocode in Algorithm 4. An off-the-shelf PDDL-capable BASEPLANNER is used to solve sub-problems of the MTP domain. This planner is used to create simple predictions of how agents will act. Those predictions are simulated and iteratively reconstructed when found to be inconsistent. Once a prediction is complete, the planning domain, along with the prediction, is compiled into a Quasi-Dynamic domain which enforces the action choices of the others while allowing flexible action choice by the planning agent. This problem is finally solved with the BASEPLANNER or any other CP system, which produces a plan for the robot that takes into account a prediction of the actions of others and their effects.
This approach has been demonstrated to successfully create plans which take into
account predictions of others' actions, possibly based on false beliefs about the environment and reactions to perception. Algorithm 4 can be used to create plans for the robot for every false belief or goal hypothesis for the human, but this implementation still leaves many features of a mind-theoretic planner to be desired.
Algorithm 4: Pseudocode for a simplified deterministic approach to MTP

DeterministicMTP:
    foreach agent i do
        Use BASEPLANNER to create an initial plan hypothesis p_i from the (possibly
        false) initial belief state of agent i to its goal
    Align each plan p_i temporally to form P
    foreach timestep t in P do
        Simulate executing all agents' actions at time t
        Make note of when each agent perceives the value of new state literals
        if agent i's action a cannot be taken then
            Backtrack to time t' when agent i first perceived the offending state literal
            Use BASEPLANNER to create a re-plan p'_i for agent i from t'
            Replace agent i's actions after t' in P with p'_i
            Bring the simulation back from t to t'
        if agent i perceives the error of its false belief then
            Use BASEPLANNER to re-plan for agent i from t using the true state
            Replace agent i's actions after t in P with the new plan
    Remove the actions taken by the robot from P
    Compile the planning domain and P into a Quasi-Dynamic domain QD
    Use BASEPLANNER to solve QD and create a plan p_r for the robot which takes
    into account the static prediction of the others' actions
    return p_r
One missing feature is robustness to randomness or stochasticity in the action choice of the human; the next section introduces an approach that was designed to overcome this limitation.
3.4.2 The Belief Action Graph
To increase the robustness of the deterministic MTP approach presented above, we designed a system that stores the state literals and actions, which were previously fully determined, as stochastic variables in a Bayesian Network (BN) (see section 3.2.6). The Belief Action Graph (BAG) is a BN containing stochastic variables for state literals and actions. Action predictions are created in the same way as in deterministic MTP except that the BAG is used in place of the predictive P structure. Each action's pre-conditions and effects on the state variables are modeled as dependencies in the BN, and the Conditional Probability Tables (CPTs) are generated using rules which ensure that actions cannot be taken unless pre-conditions are met. To represent agent beliefs, each state variable has a belief variable associated with each agent; these belief variables take the same value as the actual state variable only when the corresponding agent can perceive that part of the state. Similarly, an agent will act based on the values of its belief variables, but the success of those actions is determined by the values of the "true" corresponding state variables. An instructional view of the BAG as well as an example of an actual BAG representation of a navigation problem can be seen in figure 3-10.
The BAG is constructed in the following manner:
1. In the first layer of the BAG (t = 0), create a probability variable for every state variable in s^0. Set the prior probabilities of the state variables according to the confidence in their assignments.

2. For each agent a_i, create a goal variable G_i with Domain(G_i) = [1, P_i], where P_i is the number of goal hypotheses for a_i. Similarly, create an initial belief variable I_i with Domain(I_i) = [1, Q_i], where Q_i is the number of hypotheses for a_i's initial belief. Set the prior probabilities for G_i and I_i according to the confidence in those hypotheses.

3. For each agent a_i, create a belief variable (b_j^0)_i for every state variable s_j^0 and add s_j^0 as the corresponding belief variable's parent. Connect each (b_j^0)_i to I_i and set P((b_j^0)_i | s_j^0, I_i) according to what that agent's initial belief dictates.

4. Create a layer for t + 1: for each state variable s_j^t, create s_j^{t+1} and add s_j^t as its parent. Similarly, for each agent a_i and variable j, create (b_j^{t+1})_i and add (b_j^t)_i and the corresponding state variable as its parents.

5. For each agent a_i and each plan candidate m in [1, M_i] of that agent, create a variable for the action (p_i^t)_m and add the agent's goal variable G_i as its parent. Find all variables in s^t and b^t that correspond to the action's preconditions pre((p_i^t)_m) and add them to its parents. Then find all variables in s^{t+1} and b^{t+1} that correspond to the action's effects add((p_i^t)_m) and del((p_i^t)_m), and add the action variable to their set of parents.

6. Increment t and repeat steps 4) through 6) until there are no more actions in the plan candidates for the agents (a simplified sketch of this wiring is shown below).
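The following much-simplified sketch wires up the layered structure described above, recording only variables and parent links for a fixed horizon; CPT generation, perception modeling, and the re-planning loop are omitted, and the naming conventions and plan-candidate format are assumptions for illustration.

import itertools

def build_bag(state_vars, agents, plan_candidates, horizon):
    """Return a dict mapping variable name -> list of parent variable names."""
    parents = {}
    # Layer t = 0: true state variables plus per-agent goal/initial-belief variables.
    for v in state_vars:
        parents[f"s0_{v}"] = []
    for ag in agents:
        parents[f"G_{ag}"] = []
        parents[f"I_{ag}"] = []
        for v in state_vars:
            parents[f"b0_{v}_{ag}"] = [f"s0_{v}", f"I_{ag}"]
    # Layers t = 1 .. horizon.
    for t in range(horizon):
        for v in state_vars:
            parents[f"s{t+1}_{v}"] = [f"s{t}_{v}"]
        for ag in agents:
            for v in state_vars:
                parents[f"b{t+1}_{v}_{ag}"] = [f"b{t}_{v}_{ag}", f"s{t+1}_{v}"]
            for m, plan in enumerate(plan_candidates[ag]):
                if t < len(plan):
                    act = f"p{t}_{ag}_{m}"
                    pre, add, delete = plan[t]   # action = (preconds, add effects, del effects)
                    parents[act] = ([f"G_{ag}"]
                                    + [f"s{t}_{v}" for v in pre]
                                    + [f"b{t}_{v}_{ag}" for v in pre])
                    for v in itertools.chain(add, delete):
                        parents[f"s{t+1}_{v}"].append(act)
                        parents[f"b{t+1}_{v}_{ag}"].append(act)
    return parents

# Tiny example: one agent, one state variable, a one-step plan that sets "at_goal".
bag = build_bag(["at_goal"], ["human"],
                {"human": [[([], ["at_goal"], [])]]}, horizon=1)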
Once the initial BAG has been constructed, several iterations are performed of
evaluating the network using standard BN inference mechanisms, detecting either
failed actions or perceptive events that might alter action choice, re-planning and
inserting new partial plans into the BAG.
The BAG approach can be used to improve the robustness of the deterministic
MTP approach but still lacks many fundamental capabilities that we would expect
from a mind-theoretic planning system. It also requires frequent, and fairly computationally expensive, inference operations.

Figure 3-10: (a) A demonstrative example of a simple BAG. (b) An example of a BAG instantiated for an actual navigation problem.

Even though this implementation can
compute plans that can anticipate the actions of others and both avoid their negative effects and exploit the positive ones, it suffers from the fundamental limitation of not being able to produce behavior that pro-actively seeks to learn or disambiguate between the mental states of the other. Similarly, it lacks a principled method for
probabilistically reasoning about which mental state the other truly has and therefore
what set of actions should be anticipated and how to hedge against that uncertainty.
These limitations led us to make a heavier commitment to probabilistic representations and adopt the Markov Decision Process formalisms for our state and action
representations.
3.5 Mind Theoretic Planning
In the previous sections, we described earlier attempts at implementing a system to perform mind-theoretic planning. In pursuing those projects and learning more about the challenges involved, we developed better instincts about how to approach the problem, which led to the formulation presented in this chapter. The most significant insights employed in this solution are that the behavior of others can be predicted by knowing how they place value on states in the world, and that this knowledge can then be leveraged in our own forward model of the environment for planning purposes. This implies the need for value functions, policies, and transition functions.

The approach taken in this work is to predict the actions of others by reasoning about possible beliefs they may have about the environment and goals they wish to achieve. We will construct mechanisms called goal situations and mental situations that can be used to predict which actions the other agent is likely to take in any state. These predictions will be leveraged by the transition function of a specialized POMDP that relies on a customized observation function to help with the mental state inference of the other. The product of the POMDP planning process will be an action policy for the robot that produces behavior conducive to learning about the beliefs and goals of the other agent, attempts to correct false beliefs if useful, and finally assists in the completion of the determined goals.
A serious effort is expended to make these calculations tractable. A heuristic-based search algorithm, B3RTDP (see section 2.5), was developed mainly for this purpose and is employed by the system. Value functions at various levels of the system are initialized with heuristics extracted from previous calculations wherever possible, and several approximation techniques offered by the B3RTDP system are taken advantage of (such as action convergence and belief discretization).
3.5.1 Definitions of Base States and Actions
In this section, the notation and formalisms that are used to specify the mechanism
of the MTP system will be defined. We begin by defining the basic building blocks
of our MTP system. These are the actual representations of the real and physical (as
opposed to mental) features and dynamics of the environment in which mind-theoretic
reasoning about others is being performed. These functions and representations are
domain-dependent and would be crafted for any environment, as is common practice
in the automatic planning literature. We define the following:
* Sb: The actual state representation for the physical problem being solved. Any state s ∈ Sb should contain any relevant information about the environment and, importantly, about all of the agents (for example their locations, orientations, etc.)

* {Ah, Ar}: Sets of deterministic actions that are available to the human and the robot (note that different agents might have different capabilities)

* Tb: Since all actions in Ah/r are deterministic, the transition function is not very interesting. A more useful notation to have defined is the resulting state when taking action a from state s. We will refer to actions as functions a : Sb → Sb such that a(s) = s'

* Cb: The action cost function simply defines the base cost of expended resources for any action. Since we are using the Stochastic Shortest Path (SSP) model of MDPs, this function needs to be strictly positive
When encoding the base state space Sb, care should be taken to keep it as small
as possible while still encoding all of the relevant features of the world required for
the task at hand. More specifically, when encoding a state representation that will
be used for mind-theoretic reasoning, it is important to encode any feature of the
agents' configurations that could be useful for mental state reasoning. For example,
in a task where navigation is important and the type of mental state reasoning being
performed is goal inference, then observing agent orientation might be an important
visual cue which might otherwise not need to be encoded.
The base actions in Ah/r should simply encode the actual "real-world" effects of taking that action and contain no information about the anticipated behavior of other agents. For example, the action MoveForward(i) should only affect the location of agent i and change no other feature of the state, and PickUp(i,k) should only affect the part of the state that refers to what agent i is holding.
These base states and actions are treated as basic building blocks by the MTP system, which combines them in different ways to construct specialized transition and observation functions.
3.5.2 Types of Mental States
As was previously discussed in the background section (see Section 3.2.4), we will
focus only on two kinds of mental states, namely beliefs and desires which we will
refer to as false beliefs and goals.
Goal Mental States
The goal mental state is one that fits particularly well with the existing planning
metaphor of MDPs and POMDPs, especially the more restricted Stochastic Shortest
Path versions of those models. An agent's goal hypothesis will simply be a boolean
function over the base state space, evaluating to true when the goal is satisfied in
the state and false otherwise.
For the human we define gj(s) to be its j-th goal
hypothesis.
gj : Sb → {true, false}    (3.1)

gj(s) = true if the agent's j-th goal is satisfied in s, and false otherwise    (3.2)
False Belief Mental States
The belief mental state requires a little more adaptation to the MDP metaphor.
We will restrict ourselves to only representing belief mental states that refer to the
physical state space of the world as opposed to other possibilities such as reasoning
about agent beliefs about actions and their effects, other agents and their capabilities
and so on.
We will define a false belief fk to represent two concepts: (1) mapping the true
state to the false state and (2) dictating in what true state the error of the false belief
can be perceived by the agent holding it. These two functions are defined as follows:
convertToFalse : Sb → Sb    (3.3)

canPerceiveFalse : Sb → {true, false}    (3.4)

convertToFalse(fk, st) = sf, where sf is the "false" version of the true state st for the agent's k-th false belief

canPerceiveFalse(fk, st) = true if the error of the agent's k-th false belief can be perceived by the agent in st, and false otherwise    (3.5)
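The two hypothesis types above can be summarized as plain functions over the base state space, as in the following sketch; the dictionary-based state and the concrete door example are illustrative assumptions, not part of the thesis implementation.

from typing import Callable, Dict, NamedTuple

State = Dict[str, object]
GoalFn = Callable[[State], bool]                      # g_j : Sb -> {true, false}

class FalseBelief(NamedTuple):
    convert_to_false: Callable[[State], State]        # maps the true state st to the false state sf
    can_perceive_false: Callable[[State], bool]       # can the error be perceived in st?

# Goal hypothesis g0: the human wants to reach the kitchen.
g0: GoalFn = lambda s: s["human_room"] == "kitchen"

# False-belief hypothesis f1: the human believes door D is open (it is closed);
# the error becomes perceivable once the human is in the hallway next to the door.
f1 = FalseBelief(
    convert_to_false=lambda s: {**s, "door_D": "open"},
    can_perceive_false=lambda s: s["human_room"] == "hallway",
)

true_state = {"human_room": "hallway", "door_D": "closed"}
print(g0(true_state), f1.convert_to_false(true_state), f1.can_perceive_false(true_state))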
3.5.3 Inputs to the Mind Theoretic Planner
As previously discussed, the MTP system attempts to predict the behavior of other agents, based on their beliefs and goals, and to plan actions that aid both in better understanding which goal the other agents have and in accomplishing it. Since the MTP system should be relatively domain-independent, it requires the domain model as input (as discussed in Section 3.5.1). The types of mental states that the MTP system is concerned with are false beliefs and goals (the formal descriptions of which are given in Section 3.5.2 above).
The MTP system
requires as input distributions over hypotheses of both which goal the other agent
might have as well as which initial false beliefs it might hold (one of which will be the
true belief). Lastly the MTP system requires a perception function, which dictates
what features of a state any given agent can perceive. This basically provides the
robot with an understanding of how its perception (as well as that of other agents)
of the environment is limited by its sensors' ranges and other limitations. This will
allow the robot to incorporate action sequences in its plans whose goal is simply to
bring parts of the state space into the perceptual range of the robot so it can make
judgements about how to better proceed. For example, these could be features which
best distinguish between different possible mental states of the other agents.
The more significant inputs to the MTP system are the following:

* S_b, {A_h, A_r}, C: The base state space, action spaces, and cost function
* s_init ∈ S_b: The initial base state
* Pr(g_{1:G}): A distribution over G goal function hypotheses for the other agent
* Pr(f_{1:F}): A distribution over F false belief function hypotheses for the other agent
* canPerceive(h/r, v, s_b): A boolean perception function which specifies what features v of state s_b can be perceived by either h/r
The system also has various tuning parameters which can be specific to the particular structures that will be introduced in the subsequent sections:
* α: Probability that should be assigned to predicting random action choice by the other agent
* β: Dictates how much preference should be placed on predicted actions from higher levels in the predictive stacks of the goal situations versus the lower levels
* L: The number of predictive levels that should be used in the goal situations
* Any parameters needed for the B3RTDP POMDP solver and the BRTDP MDP solver
3.5.4
Mental State as Enumeration of Goals and False Beliefs
Given our existing representations of false belief functions and goals of the human
agent, we define a mental state index to be simply the index of any combinatorial
assignment of false beliefs and goals to the other agent. As an example, in a domain
where we have two goal hypotheses g_0 and g_1 and one false belief hypothesis f_1 (in addition to the NULL or true belief f_t), then we have the following set of mental state indices:
Goal    False Belief    Mental State Index
g_0     f_t             0
g_0     f_1             1
g_1     f_t             2
g_1     f_1             3
We are able to retrieve either the goal function or false belief from the mental state index by simply using integer division and the modulus operator. For a given mental state index m, the goal function is picked out by g_{floor(m/F)} and the false belief function is found using f_{(m % F)}. These functions simply provide a way to go from having the mental state index to having the actual goal and false belief that it represents.
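A small sketch of this index arithmetic (illustrative Python, not the thesis code):

    def decode_mental_state(m, goals, false_beliefs):
        # goals: [g_0, ..., g_{G-1}]; false_beliefs: [f_t, f_1, ..., f_{F-1}] (true belief first).
        F = len(false_beliefs)
        return goals[m // F], false_beliefs[m % F]

    # With two goals and F = 2 (the true belief f_t plus one false belief f_1),
    # index 3 decodes to (g_1, f_1), matching the last row of the table above.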
3.5.5
Action Prediction
Goal Situation
We define a goal situation for each of the other agent's goal hypotheses g_j (see Figure 3-11). The purpose of this structure is to predict the other agent's actions given certain information about its goal.
Each goal situation contains a stack of predictive models for each agent. A stack is composed of levels, the number of which is a parameter for the MTP system. Each level l contains a value function V^l_{h/r}, a transition function T^l_{h/r}, and an action policy π^l_{h/r}. The transition functions are constructed according to Algorithm 5. Note that action predictions from the higher levels in the stack are given higher probability than predictions from lower levels. This is achieved with a discounting factor β (when β = 1 all prediction levels get assigned the same probability; when it is close to zero only the highest level gets assigned any value).
We define the following function to retrieve a predictive policy for the other agent's action selection at level l from a particular goal situation:

getPredictivePolicy(g_j, l) = π^l_h    (3.6)
We also define the following simple function to determine whether the goal is accomplished in a goal situation:

goal(g_j, s_b) = true if the goal encoded by g_j is satisfied in s_b, false otherwise    (3.7)

Lastly we define a function to extract the state values of the highest-level robot MDP; this will be used later to provide heuristic values to a POMDP solver:

getRobotValueFunction(g_j) = V^L_r    (3.8)
Algorithm 5 creates a transition distribution from each state s for every action a
such that the agent's action a's effects are achieved in every possible outcome with
full certainty. In addition to the effects of action a, several possible actions of the
other agent are predicted and their outcomes added to the transition with varying
probability weights. Some probability is assigned to the other agent taking any of
its actions randomly, but higher probability is assigned to actions from predictive
policies of lower levels than the current one.
The algorithm is initiated to compute either the robot's or the human's transition
function and it iterates over each state s and each applicable action a of the agent
in question (lines 5 and 6). Line 8 asserts that every transition from s by taking a
has zero probability initially. The deterministic successor state s' is retrieved from
the base transition function T_b in line 12. The algorithm then proceeds to iterate
over all actions of the other agent that are applicable in s', retrieve their successive
deterministic states s" and attribute some minimum probability to those outcomes
in the transition function in line 19. It then iterates through each lower level of the
goal situation, picks out the predicted action of the other agent in that level in line
24 and assigns probability to the deterministic outcome state of that action (giving
higher probability to action outcomes from higher levels) in line 28. It finally makes
sure that every row in the transition function is normalized in line 30.
Lowest Level Prediction
As was discussed earlier, each goal situation defines a predictive stack of MDP models
for each agent where the transition function at every level references the predictive
policies of all lower levels from the other agent's stack. Obviously that can only be
done for levels that have any other levels beneath them. Special care needs to be taken
for the lowest level of the predictive stacks, especially since the behavior produced by
the predictive policies of that level "seeds" the predictive stack with an over-simplified prediction that gets improved and made more sophisticated with every level that is added.
Algorithm 5: Pseudocode for constructing the transition functions T^l_{h/r}, at levels l > 0, of the goal situations. Note that the subscript h/r denotes that this works for either agent's predictive stacks but the order of h/r versus r/h marks that if one refers to the human then the other refers to the robot and vice versa.

 2  Input: α ∈ [0, 1] Probability assigned to random action choice
 4  Input: β ∈ ]0, 1] Preference factor for predicted actions from higher levels
 5  foreach s ∈ S_b do
 6      foreach a ∈ A_{h/r} do
 8          Initialize all entries in T^l_{h/r}(s, a, :) to 0
            // Apply our own base action to the state
12          s' = a(s)
            // Assign small probability to random action choice by the other agent
15          foreach a' ∈ A_{r/h} do
17              s'' = a'(s')
19              T^l_{h/r}(s, a, s'') = α
            // Pick predicted action of other agent from every lower level and
            // apply with preference for higher levels
22          foreach l' ∈ [0, l[ do
24              a' = π^{l'}_{r/h}(s')
26              s'' = a'(s')
28              T^l_{h/r}(s, a, s'') += β^(l − 1 − l')
30          Normalize T^l_{h/r}(s, a, :)
[Figure 3-11 graphic: the robot and human predictive stacks of a goal situation, with predicted actions exchanged between levels and heuristic values passed upward from the joint value function V_joint.]
Figure 3-11: Shows how a goal situation is composed of stacks of predictive MDP
models for each agent. Each model contains a value function, a transition function
and a resulting policy. Each transition function takes into account predictions from
lower level policies for the actions of the other agent. Value functions are initialized
by heuristics that are extracted from the optimal state values from the level below,
this speeds up planning significantly. Since every level of the stack depends on lower
levels, special care needs to be taken for the lowest level. In the MTP system, we
have chosen to solve a joint centralized planning problem as if one central entity was
controlling both agents to optimally achieve both of their goals, since this is a good
and optimistic approximation of perfect collaborative behavior.
Because of the intended collaborative nature of the MTP system we thought it
appropriate to make the simplifying assumption in the lowest level that every agent
should act optimally with respect to the joint set of goals for all agents and with
perfect information about which actions the other will take. This is equivalent to
making the simplifying assumption that there exists perfect trust, benevolence, and communication between the agents. This is an optimistic simplification: in the real world each agent generally acts greedily with respect to achieving its own goals, has no certain knowledge of the others' actions (and often not even information about their state), and communication is often limited or costly.
We implement this simplified prediction with a centralized planner that has access
to the actions of all the agents and uses them to calculate a joint value function. This
planner then defines a greedy policy for each agent that chooses greedily from the
joint value function using the restricted set of only that agent's actions.
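A minimal sketch of this idea (assuming a deterministic base transition model and a dictionary-backed joint value function; this is not the thesis implementation):

    def greedy_policy_from_joint_value(v_joint, agent_actions, transition, cost):
        # v_joint:       dict mapping state -> joint expected cost-to-go
        # agent_actions: state -> list of this agent's applicable actions only
        # transition:    (state, action) -> deterministic successor state
        # cost:          (state, action) -> immediate cost
        def policy(state):
            return min(agent_actions(state),
                       key=lambda a: cost(state, a) + v_joint[transition(state, a)])
        return policy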
Mental State Situation
We now define a mental state situation (MSS) to be a complete assignment of a goal
hypothesis and a false belief hypothesis (which can be the "true" false belief) from
their respective input distributions (see Figure 3-12).
This means that for an MTP
system there will be as many MSSs as there are mental states. An MSS can be queried
for what action should be expected from the other agent (from any of its predictive
levels). It will produce the action predicted by inputting the appropriate false projection of the state to the requested level's policy of the other agent's predictive stack
from the appropriate goal situation. This is performed in the following sequence:
1. The appropriate false belief is picked out using the mental state index by taking the modulus: f = f_{(m % F)}

2. The incoming state is transformed to the false state via Equation 3.3: s_f = convertToFalse(f, s)

3. The appropriate goal is picked out using the mental state index: g = g_{floor(m/F)}

4. The action policy at level l of the other agent's predictive stack in goal situation g is selected (Equation 3.6) and its prediction from the false state is returned: π = getPredictivePolicy(g, l), a = π(s_f)
This provides us with a method for predicting actions of an agent if we know with
certainty the false belief that they hold and the goal state that they desire.
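The four steps above could be packaged as follows (a hypothetical sketch; goal_situations and false_beliefs are assumed containers mirroring the structures defined earlier):

    def predict_other_action(s_b, m, level, goal_situations, false_beliefs, F):
        # goal_situations: list indexed by goal; each exposes get_predictive_policy(level)
        # false_beliefs:   list [f_t, f_1, ...]; F includes the true belief f_t
        f = false_beliefs[m % F]                 # step 1: pick the false belief
        s_f = f.convert_to_false(s_b)            # step 2: project to the false state (eq. 3.3)
        g = goal_situations[m // F]              # step 3: pick the goal situation
        policy = g.get_predictive_policy(level)  # step 4: policy at the requested level (eq. 3.6)
        return policy(s_f)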
3.5.6
POMDP Problem Formulation
Up until now, we have discussed how agent actions can be predicted given that we know their mental state (what they believe and what they desire); this is useful for coordinating our actions with those of the other agent and for anticipating which changes to expect in the state as a consequence of their actions. But we do not actually know the mental state of others with certainty, and we therefore need a way to reason probabilistically about which mental state others hold and to act not only in coordination with them but in a way that can:
1. Seek better knowledge of their mental state
2. Having identified that an agent holds a false belief, seek to help correct it if
useful
3. Take actions that are good with respect to all currently possible mental states
while avoiding taking ones that are detrimental to any of them
4. Weigh the expected benefits of the above activities against the cost of taking actions, in a principled way

Figure 3-12: This figure shows an example MTP action prediction system with two goal hypotheses and a single false belief hypothesis (in addition to the true belief), resulting in four distinct mental state situations. An action prediction from any level can be queried by following the enumerated steps in Section 3.5.5 under "Mental State Situation".
Partially Observable Markov Decision Processes (POMDPs) (described in Section
2.2.2) are a natural choice for this problem as they operate on probability distributions
over possible states rather than states themselves. These probability distributions are
called beliefs and should not be confused with the type of mental state that we have
been calling false beliefs, which really refer to a misconception about the state of the
world rather than a distribution over possible states. So if we say that an agent has
some particular false belief about the world, then we are stating that agent thinks
with full certainty that the state of the world is in some false configuration. Another
agent might be unsure about which false belief that first agent has and therefore
maintains a belief over all possible mental states of that agent.
Because POMDPs do not assume that the true world state can be directly observed
(otherwise there would never be any uncertainty about the state and beliefs would
be unnecessary), they require that actions emit observations when taken.
These
observations can depend on the state that the action transitioned to as well as the
action itself. This is what gives POMDPs the expressiveness to produce actions that
seek information about the world. This is an important feature for a mind-theoretic
agent as it can often be advantageous to act with the specific purpose of trying to
learn the true mental state of the other agent to be able to better predict their future
actions.
Beliefs Over Augmented States
Up until now, we have been referring informally to base states and mental state
indices. As was discussed in Section 3.5.4, a mental state index simply refers to a
unique combination of a false belief and goal of the other agent. Once the mental
state index is known with full certainty, then what false belief the other agent holds
and their goal are also fully known.
We now define a new augmented state s^m_b, which is simply the combination of those two types of states (base state s_b and mental state index m):

s^m_b := {s_b, m}    (3.9)
This augmented state s^m_b will now serve as our state representation for the subsequent sections of this chapter. We have also defined a special mental state index which does not correspond to any false belief or goal index but is simply used to represent the absence of goal-directed behavior. This is useful when interacting with agents that are people, as they do not always act on explicit goals but sometimes just wander or explore. This "extra" mental state, which we will denote by the index m_∅, serves as a first-order approximation of recognizing that behavior.
The POMDP beliefs are defined as probability distributions over these augmented mental states. The initial belief is constructed to contain one state per mental state index, each initialized with the same base state s_init. The initial probability of any mental state is defined by the product of the probabilities of the false belief and goal corresponding to the mental state index from their respective input distributions. As an example, this is what the initial belief looks like for an MTP with three goal hypotheses and one false belief hypothesis:

{s_init, m=0} : Pr(g_0) · Pr(f_t)
{s_init, m=1} : Pr(g_0) · Pr(f_1)
{s_init, m=2} : Pr(g_1) · Pr(f_t)
{s_init, m=3} : Pr(g_1) · Pr(f_1)
{s_init, m=4} : Pr(g_2) · Pr(f_t)
{s_init, m=5} : Pr(g_2) · Pr(f_1)
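A sketch of this construction (illustrative Python; the priors and the s_init placeholder are made up for the example):

    def initial_belief(s_init, goal_prior, false_belief_prior):
        # goal_prior:         [Pr(g_0), ..., Pr(g_{G-1})]
        # false_belief_prior: [Pr(f_t), Pr(f_1), ...] (true belief first)
        # Returns a dict mapping augmented states (s_init, m) to probabilities.
        F = len(false_belief_prior)
        belief = {}
        for j, p_g in enumerate(goal_prior):
            for k, p_f in enumerate(false_belief_prior):
                belief[(s_init, j * F + k)] = p_g * p_f
        return belief

    # Three goal hypotheses and one false belief plus the true belief (F = 2)
    # yield the six-entry initial belief listed above.
    b0 = initial_belief("s_init", [0.4, 0.4, 0.2], [0.5, 0.5])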
Transition Function
The transition function for the MTP POMDP is used to model our prediction of how
the other agent will act in any mental state. We have already created and described
exactly the mechanism for such prediction in Section 3.5.5 as a mental state situation
which takes a base state and mental state index (the exact contents of an augmented state s^m_b), converts the base state through the appropriate false belief transformation, and
picks out the predicted action from any level of the appropriate goal situation. This
process is explained graphically in Figure 3-12.
The construction of T^POMDP is similar to the construction of the transition functions of goal situations (which are explained in Algorithm 5), except in this case the
state is augmented with the mental state index so we need to extract the correct mental state situation to predict the other agent's action. The construction is explained
in detail in Algorithm 6.

Algorithm 6: Pseudocode for constructing the transition function T^POMDP for the MTP POMDP.

Input: α ∈ [0, 1] Probability assigned to random action choice
Input: β ∈ ]0, 1] Preference factor for predicted actions from higher levels
foreach s^m ∈ S^m do
    foreach a ∈ A_r do
        Initialize all entries in T^POMDP(s^m, a, :) to 0
        // Apply our own base action to the base state
        s'_b = a(s_b)
        // Assign small probability to random action choice by the other agent
        foreach a' ∈ A_h do
            s''_b = a'(s'_b)
            T^POMDP(s^m, a, {s''_b, m}) = α
        // If we have a goal-oriented mental state index
        if m ≠ m_∅ then
            // Use the appropriate false belief conversion (eq. 3.3)
            s'_f = convertToFalse(f_{(m % F)}, s'_b)
            foreach l' ∈ [0, L[ do
                // Find the appropriate goal situation given m and get its policy for level l' (eq. 3.6)
                π = getPredictivePolicy(g_{floor(m/F)}, l')
                // We pick the action given the false state
                a' = π(s'_f)
                // And then apply the action to the true state
                s''_b = a'(s'_b)
                T^POMDP(s^m, a, {s''_b, m}) += β^(L − 1 − l')
        Normalize T^POMDP(s^m, a, :)
Observation Function
The observation function of the MTP system serves two purposes. Firstly, it models
the perceptual perspective of the robot by generating a unique observation for any
unique configuration of the features of a state that are currently perceivable to the
robot. This is visualized in Figure 3-13. Secondly, the observation function is used
to expel false belief hypotheses about the other agent once it should have perceived
the error of the false belief in the true state.
The MTP system is agnostic to which kind of perception is needed for different
kinds of problems.
It only requires as input a boolean function describing what
features of the state can be observed by a particular agent given the true state.
Even though the system accepts any such perceptual function, it is useful to think
about a domain where agents perceive the environment with a camera and perceptual
availability is limited by the field of view and line of sight. We will use this scenario
for demonstrative purposes.
Figure 3-13: Shows two states that would produce the same observation because they
are indistinguishable within the perceptual field of the robot (which is limited by
field of view and line of sight in this domain). If the other agent would move slightly
into the white space in the state on the right, then the observation function would
produce a different observation.
We chose to make the MTP system use a deterministic observation function. This
design decision was made to increase the tractability of the problem. Given how we
use the observation function, the space of possible observations will be fairly large for
any kind of interesting domain. For example, if sensor noise was also being modeled by
the function, then that would require significantly more computational effort. Since
dealing with sensor noise is not the main goal of this work but reasoning about other's
mental states is, we decided to spend our computational effort on the more relevant
parts of the problem.
Lastly, we have chosen to have the observation function depend only on the resulting state from taking an action and be independent of both the originating state and the action itself. This reduces the observation function to a mapping from states to natural-number observation labels, where states that are indistinguishable within the perceptual range of the robot are assigned the same number, and distinguishable states are not. This is visualized in Figure 3-13.
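One way to realize such a deterministic observation mapping is sketched below (illustrative only; perceivable_features stands in for a domain-specific function, assumed here, that returns the tuple of state features the robot can currently see):

    class PerceptualObservationFunction:
        def __init__(self, perceivable_features):
            self._perceivable = perceivable_features  # state -> hashable tuple of visible features
            self._labels = {}                         # perceived feature tuple -> observation number

        def observe(self, state):
            key = self._perceivable(state)
            if key not in self._labels:
                self._labels[key] = len(self._labels)  # assign the next unused label
            return self._labels[key]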
The second role of the observation function is to expel false belief states from the
POMDP belief when the other agent is able to perceive the error of the false belief.
We encode the observation function to emit a special unique observation when the
mental state is such that the mental state index indicates that the other agent holds
a false belief, and the error of that false belief can be perceived by it in the true base
state. This is formalized as follows:
∀ i ∈ [1, F] :  O(s^m) = o_f   if (m % F = i) ∧ canPerceiveFalse(f_i, s_t)    (3.10)
In actuality, these special false belief observations o_f will never be emitted, which
forces the Bayesian belief update (see equation 2.3) to assign zero probability to any
state in a belief that should have emitted this observation (namely the false belief
states).
To demonstrate this, we will use an example scenario where the robot and a human
are in a room and the robot knows that the human wants to exit the room as well
as the location of the exit. The robot is uncertain whether the human knows the
location of the exit so it generates two false belief hypotheses (one being the true
belief) with equal probability (see Figure 3-14).
In this demonstrative scenario the human might take several actions to move
Figure 3-14: Shows an example scenario where the robot knows that the human's
goal is to exit the room, and the robot also knows the location of the exit. The
robot is uncertain about whether the human knows the location of the exit and therefore creates two false belief hypotheses, one representing the true state and another representing a false alternative.
towards the exit until it is facing north and only one turn action away from learning
the truth about the location of the exit. At this point the robot holds a belief with
two mental states, one that predicts that the human will turn left because it has the
false belief and another that predicts a right turn (see Figure 3-15). If the human chooses to turn left, the belief update will expect the special false belief observation if the false belief mental state were true, but a regular perceptual observation if the true belief state were true. Since the false belief state is in fact false, the perceptual observation will be emitted and the resulting belief will only contain the true mental state, since it is the only one that could have produced that observation. This accomplishes what we wanted, which is to model how the
(a) The false state
(b) The true state
Figure 3-15: Shows the stage where the human is one action away from learning what
the true state of the world is
human can come to perceive the error of its false beliefs.
3.5.7
Putting it All Together
In the previous sections, we have presented a way to frame the problem of Mind-Theoretic Planning as a Partially Observable Markov Decision Process. A method has been described for action prediction using probabilistic reasoning about the false beliefs and goals of the other agent, which creates a layered structure of predictive MDP models. We also presented a formulation of an observation function, which models how agents perceive their environment and how they can come to perceive the error of their false beliefs.
A few pieces are still needed to calculate the robot's POMDP action policy, which
will be discussed in the subsequent sections.
Robot's Goal Function
The MTP system is designed to support collaborative human-robot teaming. Therefore it is central to its design that it should generate helpful and collaborative behavior.

(a) The false state    (b) The true state
Figure 3-16: When the human agent has turned left, it will expect to see either the exit or the wall depending on its mental state. In the false belief state where it expects to see the exit, a special observation o_f is also expected since in this state the agent should be able to perceive the error of its false belief. Since this observation will actually never be emitted by the MTP system, the belief update will attribute zero probability to any state in the subsequent POMDP belief where that observation was expected.

To achieve this, we designed a goal function for the robot which stipulates
that the other agent should accomplish their goal, whatever it may be, within a given
probability threshold ε. The inherent challenge in this encoding is clearly that the
robot is initially uncertain about which goal the other agent has. But this is exactly
the core challenge which the MTP system is designed to solve, namely to generate
behavior that seeks to learn which goals and false beliefs the other agent holds and
then act to assist them in achieving those goals.
We define a formal goal function for the robot that uses a notational convenience function goalSatisfied (which, in turn, relies on the goal function from Equation 3.7):

goalSatisfied(s^m_b) = 1 when goal(g_{floor(m/F)}, s_b), 0 otherwise

goal_POMDP(b) = Σ_{s^m_b ∈ b} b(s^m_b) · goalSatisfied(s^m_b) ≥ 1 − ε    (3.11)

This goal function simply sums up the belief probabilities of the mental states whose goals are satisfied. If that sum is higher than 1 − ε then we say that our goal is satisfied in this belief.
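A sketch of Equation 3.11 in code (hypothetical names; the belief is the dictionary representation used in the earlier sketches):

    def goal_pomdp(belief, goals, F, epsilon=0.05):
        # belief: dict mapping augmented states (s_b, m) to probabilities
        # goals:  list of goal predicates g_j over base states (Equation 3.7)
        # F:      number of false belief hypotheses, including the true belief
        satisfied_mass = sum(
            p for (s_b, m), p in belief.items()
            if m // F < len(goals) and goals[m // F](s_b)  # skip the special no-goal index
        )
        return satisfied_mass >= 1.0 - epsilon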
Figure 3-17: Shows the complete MTP system on an example problem with two goal
hypotheses and one false belief hypothesis. On top sits a POMDP with an observation function that produces perceptually limited observations with the addition of
specialized false belief observations when appropriate. The POMDP transition function is deterministic in the action effects of the robot but uses the lower level mental
state situations to predict which actions the other agent is likely to take and models
the effects of those stochastically. The figure also shows how the value functions at
lower levels serve as initialization heuristics for higher-level value functions. The value
function of the highest level of the robot's predictive stack is used as an initialization
to the QMDP heuristic for the POMDP value function.
Solving the POMDP using a QMDP Heuristic
Now that we have defined the goal function, the transition function, and observation
function, we have all that is needed to go ahead and solve the POMDP. We use our
B3RTDP algorithm, which was described in Section 2.5. B3RTDP is a heuristic-search-based algorithm, and its performance can be greatly improved if the heuristic that it
is provided with is good. It is common practice to use what is called a QMDP heuristic
for POMDPs (see Section 2.5.2). This heuristic solves the underlying MDP problem
of the POMDP by ignoring the observation function. This is equivalent to assuming
that the state of the problem will be fully observed upon taking the first action.
As can be seen in Figure 3-17, we use a QMDP belief heuristic for our MTP
POMDP. Furthermore we use a Bounded-RTDP (BRTDP) (see Section 2.4.2) MDP
solver to solve for our QMDP.
BRTDP is also a heuristic based search algorithm
so it also benefits greatly from a good heuristic. Incidentally since we have computed predictive MDP value functions for both agents, originally for the purpose of
predicting the other agent's actions, we now have great heuristic values to provide
the BRTDP solver for our QMDP belief heuristic. We therefore define the following
heuristic function:
h(s^m_b) = getRobotValueFunction(g_{floor(m/F)})(s_b)    (3.12)

We use it to initialize the QMDP problem which, in turn, provides a belief heuristic for the final POMDP planning.
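The chain of heuristics could look roughly like this (an assumed sketch, with get_robot_value_function and q_mdp standing in for the structures described above):

    def mdp_heuristic(augmented_state, goal_situations, F):
        # Equation 3.12: seed the MDP solver with the top-level robot value function
        # of the goal situation selected by the mental state index.
        s_b, m = augmented_state
        v_robot = goal_situations[m // F].get_robot_value_function()  # Equation 3.8
        return v_robot[s_b]

    def qmdp_belief_heuristic(belief, q_mdp):
        # Standard QMDP bound for a cost-based POMDP: assume full observability after
        # one step; q_mdp maps (state, action) -> expected cost-to-go and is assumed
        # to be defined for every state in the belief.
        actions = {a for (_, a) in q_mdp}
        return min(sum(p * q_mdp[(s, a)] for s, p in belief.items()) for a in actions)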
3.5.8
Demonstrative Examples
In this section, we will demonstrate with graphical examples the resulting behavior
and inferences of the MTP system.
False Belief Uncertainty
We will begin by examining a navigational domain where the robot and a human
navigate a constrained environment to arrive at their respective goal locations. The
actions available to them are: TurnNorth, TurnSouth, TurnEast, TurnWest, Move
and Wait. Figure 3-18 shows the true environment configuration and the agent goals.
In this example, the robot is certain about the human's goal but is unsure whether
the human is aware of the obstacle that is immediately east of that goal.
Figure 3-18: Shows the configuration of the environment of this example. Gray areas
represent obstacles.
For demonstrative purposes, we apply two kinds of color overlay functions. Firstly,
we will slightly "gray out" the grids that cannot be perceived by the robot at any given
time to illustrate the parts of the world that can be observed. Secondly, we will color
grids in tones of green depending on the robot's certainty that the human occupies
that grid. We exaggerate the level of color to better visualize areas of low probability
with the following non-linear function: green = 1 − e^(−5·Pr).
Figure 3-19: Simulation at t = 0. The robot can perceive the human but is initially
uncertain of their mental state.
Figure 3-20: Simulation at t = 11. The robot has moved out of the human's way but did not see whether they moved east or west. The robot maintains both hypotheses, with slightly higher probability on the false belief since the human did not immediately turn east at t = 1.
Figure 3-21: Simulation at t = 20. The robot now expects that if the human originally
held the false belief that they would have perceived its error by now and is confident
that they currently hold the true belief. The robot expects that if the human originally
held the false belief then it should pass by the robot's visual field in the next few
time steps. Notice how the robot has been waiting for that perception to happen or
not happen (indicating that the human held the true belief the whole time) before it
proceeds to move to its goal.
Figure 3-22: Simulation at t = 28. Finally once the robot has seen the human pass
by it proceeds to follow it and subsequently both agents accomplish their goals. Even
if the robot had not seen the human pass by, once it became sure enough it would proceed in exactly the same manner, concluding that the human had originally held the true belief.
Goal Uncertainty
We now explore a different navigational domain. In this scenario, the robot is not
certain which of eight goals the human might have. The robot's goal is simply to
learn the human's goal and provide assistance (which in this domain mostly consists
of getting out of the way).
Figure 3-23: (a) and (c) refer to the simulation at t = 0, (b) and (d) refer to the
simulation at t = 1. We can see that initially the robot is completely uncertain
about the mental state of the human but after seeing that the human took no action,
it assumes that goals 5 and 6 are most likely (the ones that the robot is currently
blocking access to).
Figure 3-24: (a) and (c) refer to the simulation at t = 13, (b) and (d) refer to the
simulation at t = 18. Once the robot has retreated to allow the human to pursue the
two goal hypotheses that are most likely, it chooses to make one goal accessible. If
the human does not pursue that goal given the opportunity, the robot assumes that the other one is more likely and creates a passageway for the human to pursue it.
Figure 3-25: (a) and (c) refer to the simulation at t = 28, (b) and (d) refer to the
simulation at t = 36. If the human moves away while the robot cannot perceive
it, the robot uses its different goal hypotheses to predict the most likely location of
the human. The robot then proceeds to look for the human in the most likely locations.
In this case, its first guess was correct and by using our Mind-Theoretic Reasoning
techniques, it was able to find the human immediately.
Chapter 4
Evaluation of Mind Theoretic
Reasoning
4.1
Simulator
We decided to develop a first-person-perspective 3D simulator, both to support development of the MTP system and to provide an environment in which we can run user studies to evaluate the system and learn how it can better support human-robot teamwork.
4.1.1
Different Simulators
When pursuing research in robotics, it can often be very useful to have access to good
simulators to speed up the development and testing cycles since actual robot hardware
can be cumbersome and error-prone to run. We have used existing robot simulators,
and developed our own, for the various tasks that have been under development in
the past years in our research group.
Figure 4-1: A snapshot from the USARSim simulator which was used to simulate
urban search and rescue problems. On the right, we can see a probabilistic Carmen
occupancy map created by using only simulated laser scans and simulated odometry
from the robot.
The first robot simulator we evaluated was USARSim, a robotic simulator built on top of the Unreal Tournament™ game engine. This simulator is widely used within
the robotics community and proved useful to test and share code and projects across
collaborations with other researchers on previous projects. We ended up choosing not
to use USARSim since it is known to be a bit heavy to run, often buggy, and difficult
to customize.
When we moved away from using USARSim, we chose to develop a robot simulator
that could be tightly integrated with our existing Java™ codebase. We decided to build a simulator using the Java Monkey Engine™, which is an open-source game engine. This framework was sufficient for developing our simulations, and Figure 4-2(a) shows a team of MDS robots performing joint navigation to build a tower of
blocks. We ended up having to stop using this environment because of its lack of
documentation and rather small user base (leading to sparse forum posts and general
lack of support).
(a)
(b)
Figure 4-2: Screenshots from the (a) Java Monkey Engine™ and (b) Unity3D simulators that were developed to evaluate our robot systems.
At this point, we decided to move our development onto a better-supported environment and chose a leading commercial platform called Unity3D™. Figure 4-2(b) shows the first simulator we developed in this environment. In this figure, the
robot is actually controlled by an early version of the deterministic MTP system. In
this scenario, the robot is reasoning about a possible false belief the human construction worker might have (whether or not he knows about the fire blocking one of the
exits).
Finally, we created another version of the Unity3D™ game with much more
emphasis on making it a good platform to run user studies (see Figure 4-3). With
that in mind, we created a simulator, or video game, that would compile to run either
in desktop mode or in a web browser. In this game, the user controls a graphical
human character and operates in a grid-world environment with a robot. The robot
is controlled from a different terminal of the game and can either be puppeteered
by an AI system or controlled by another user. A trick that we employed so that
we could have two people play each other, each believing the other to be an autonomous robot, was to make the controlled character from either terminal always
render as the human model and the "other" character render as the robot.
Figure 4-3: Shows the video game from the perspective of the human user. The
character can navigate to adjacent grids, interact with objects on tables, and push
boxes around. The view of the world is always limited to what the character can
currently perceive, so the user needs to rotate the character and move it around to
perceive more of the world.
4.1.2
On-line Video Game for User Studies
The user can control the character either by using the keyboard or by clicking with
the mouse within the game window. The actions available to them are to rotate the
character left or right, move forward, pick up or put down items in front of the character,
and apply tools to the engine base if applicable. At any given time, the user can only
see the features in the environment that are visible to their character. As the user
moves the character around, different features of the environment "fade in" as they
become perceptible.
The main goal-oriented task that has been implemented in this simulator, other
than simple navigation tasks, is an engine assembly task. For this purpose, we have
included several graphical assets for items that are relevant to assembling an engine.
Figure 4-4 shows the different parts and tools that can be interacted with.
(a) Engine base
(b) Engine block
(c) Air filter
(d) Screwdriver
(e) Socket wrench
(f) Wrench
Figure 4-4: These are the objects in the world that can be picked up and applied to
the engine base. To fully assemble an engine, two engine blocks and an air filter need
to be placed on the engine base. After placing each item, a tool needs to be applied
before the next item can be placed. The type of tool needed is visualized with a small
hovering tool-tip over the engine.
The engine assembly task has a very sequential structure, in that a very specific
order of bringing items and tools to the engine base is required.
First an engine
block needs to be placed on the base and a socket wrench should be applied. The
socket wrench needs to be returned and another engine block placed on the base but
this time a regular wrench is required. Finally an air filter should be placed on the
base and a screwdriver is required to finish the assembly. This assembly process can
be seen in Figure 4-5 which was used to help study participants understand how to
achieve the goal.
We designed the game to run in an internet browser so that study participants could play from wherever they are, without the hassle of coming into the lab to participate; this made it easy to recruit many people for our studies.

Figure 4-5: This figure demonstrates the sequence of items and tools that need to be brought to and applied to an engine base to successfully assemble it.
We developed a framework which can handle many concurrent users playing the
game while simultaneously gathering behavioral data from the game as well as data
from a post-game questionnaire (see Figure 4-6).
A central web server manages
the connectivity between the different components of the system and serves up all
web pages that are needed (main sign-up page, game instructions, user specific task
links, etc).
But it also maintains connectivity between the user's browser and the
game servers during game-play to detect error conditions, whether or not the user is
actually playing the game, etc. The web server can either sign users up using their
email (and send confirmation email to verify identity) or it can accept game requests
from Amazon Mechanical Turk™ (in which case the user identity is obtained by using their MTurk workerId). A commercial multiplayer cloud service called Photon™ is
used to synchronize all game activity between the game servers and the game which
is running in the user's browser.
[Figure 4-6 diagram: web server (http://prg-robot-study.media.mit.edu), game servers and game logs on the local network, the Photon Cloud multiplayer network, the user's web browser, Survey Monkey, and Amazon Mechanical Turk.]
Figure 4-6: This figure demonstrates the connectedness of different components to
create the on-line study environment used in the evaluation. A user signs up for
the study either by going to the sign-up website (possibly because they received a
recruitment email or saw an advertisement) or because they are an Amazon MTurk
user and accepted to play. The web server assigns the user to a study condition and
finds a game server that is not currently busy and assigns it to the user. The game
server initializes the Unity game and puppeteers the robot character according to the
condition of the study assigned to the user. The game state, character actions, and
environment are synchronized between the game server and the user's browser using
a free cloud multiplayer service called Photon™. Study data comprises both the behavioral data in the game logs and the post-game questionnaire data
provided by the Survey Monkey service.
4.2
Human Subject Study
Human subject study evaluations of robotic systems can present many challenges.
Research robots require a lot of maintenance and can have frequent malfunctions.
Their sensory systems are imperfect and navigation and manipulation are usually
slow and error-prone.
These factors make it difficult to construct a study where
certain details about the robot's behavior are being manipulated while other variables
are held fixed (confounding factors in HRI are discussed further in (Steinfeld et al.,
2006)). Studies involving complicated research robots that are difficult to operate also
tend to make it difficult to gather sufficient data to confidently support or contradict
the research claims.
Simulations and virtual environments can often be used to provide an alternative
to physical experiments with robots. Computer simulations offer the benefit of repeatability and consistency across participant experiences. They are not constrained
by the limiting resource of physical robotic hardware, allowing many participants
to interact with the same simulated robot at the same time. This, combined with
making it easier for participants to take part in robot studies (allowing them to participate from their homes rather than having to travel to a location with a robot)
can allow the researcher to gather much more data than otherwise possible. One limitation of using simulations to evaluate systems for human-robot interaction is that
users generally do not get as engaged in the interaction with the simulated robot as
they would with a physical robot (Kidd & Breazeal, 2004).
This can significantly
impact people's perception of the systems being evaluated. These limitations should
be weighed against the benefits of using simulations for evaluating HRI systems on a
case-by-case basis.
We chose to evaluate the presented work using the on-line simulator presented
in Section 4.1.
Our reasoning for using a simulation is that, to demonstrate the benefit of mind-theoretic reasoning, the complexity of the task needs to be significant, and this is difficult to accomplish with an actual robotic system while gathering the required number of data points. The choice of using a simulator
is likely to impact our ability to accurately measure people's subjective impressions
of interacting with the robot.
4.2.1
Hypotheses
Our research hypotheses are categorized into two groups: attitudinal, relating to the
human's perceived traits of their autonomous teammate, and performance, relating
to the objective measures of performance and improvements in task metrics.
Attitudinal Hypotheses
We posit that the following hypotheses hold when people cooperate with a mind-theoretic agent rather than an alternate kind of autonomous agent:
H1 They perceive the MT agent to be:
(a) more competent at the task
(b) more helpful
(c) more team-minded
(d) more intelligent
H2 They attribute the MT agent with:
(a) a higher degree of Theory of Mind
(b) more human-like features
H3 Their experience with the MT agent is perceived to be:
(a) less mentally loading
(b) more enjoyable
Performance Hypotheses
Similarly, we posit that the following hypotheses hold when people cooperate with a
mind-theoretic agent rather than an alternate kind of autonomous agent:
H4 Team fluency is improved:
(a) by reducing the mean time between human actions
(b) by decreasing the rate of change (within task) of the time between
human actions
(c) by reducing the functional delay ratio (the ratio of the total human wait time, between the robot taking an action and the human taking theirs, to the total task duration)
(d) by decreasing the rate of change (within task) of the wait times in (c)
H5 Task efficiency is improved:
(a) by reducing the total time of task completion
(b) by reducing the number of actions taken by the human and the robot
4.2.2
Experimental Design
Task
For the MTP system to have any substantial benefit to a mixed team of humans
and autonomous agents, the task should require tightly coordinated operation of the
team-members.
This means that through some environmental constraints, agents'
actions will affect features of the world state that are important to others. We created
an engine assembly task that includes navigation within a constrained space, which
requires coordination, as well as access to shared resources.
In this scenario, two
agents navigate a grid-world where some grids are not navigable because they contain
walls or tables. The tables can contain either an engine base on which an engine can
be built, one of two engine items: engine block or air filter, or one of three tools
required: regular wrench, socket wrench or a screwdriver. The items and tools can
be picked up and applied to the engine and if done in a particular sequence then the
engine will be fully assembled (see Figures 4-4 and 4-5).
The study had three different configurations of this task with various rounds of
each configuration where item and tool placements were randomized:
1. One engine base and only the human agent can take actions. The robot is
immobilized. There was only one round of this level, and it was used to acclimate
users to the environment. Data from this level was not used in the analysis.
2. One engine base and both agents active.
This level had three rounds with
randomized item and tool placements. This environment was very constrained
and required navigational coordination to complete successfully.
3. Two engine bases and both agents active. This level had four rounds with
randomized item and tool placements. In two of the rounds, the participant
was told that more "reward" was given for assembling one engine over the other.
In the remaining rounds those instructions were switched. Human and robot
agent initial locations were switched between rounds.
Experimental Conditions
The experiment followed a between-participant design. We wanted to use this study
to better understand the MTP system itself as well as to see how it compares to
an alternative autonomous system and a human operator. To accomplish this, we
created four different experimental conditions:
C1 MTP-POMDP: Agent is controlled by the full MTP POMDP stack (see
Figure 3-17)
C2 MTP-QMDP: Agent is controlled by an action policy generated from the
QMDP
value function (used as a POMDP heuristic, see Figure 3-17)
C3 Handcoded+A*: Agent has access to a fully observable world state and
is controlled by a hand-crafted optimal rule-based policy that uses an A* search
algorithm for navigation
C4 Human: Agent is controlled by an expert human confederate that has been
instructed to perform task as efficiently and helpfully as possible
C1 and C2 both take advantage of the predictive abilities of the MTP system.
Both of them use the observation model of the POMDP problem formulation to
perform a Bayesian belief update. The difference between them is that the C2 agent
operates under the inherent assumption of the QMDP heuristic, which is to assume
that after taking the first action, all subsequent states will be directly observable.
This generally leads to behavior that is overly "confident" in its mental state estimate
of the human and will never choose to take any actions purely for the purpose of
information gain, or "sensing actions."
C3 follows a hand-coded policy that ensures the following while using an A* path
planner for navigation:
1. If holding object that is not currently needed for engine, return it to nearest
available location
2. If holding an object that is currently needed for engine, navigate to engine and
apply it
3. If not holding any object then navigate towards nearest item that is currently
needed and pick it up
4. If item needed for engine is unavailable, navigate to "safe location"
5. If there is more than one engine, do not take any action until the first item is applied to either engine, then designate that engine as the target engine to build
This policy is given the (admittedly unfair) advantage of perfect observability of the
world state. Item 4 in the hand-coded policy was added after pilot testing to prevent
the robot from creating a stand-off situation if it is blocking access to the engine when the human is holding the next tool to be applied but cannot apply it. C3 was developed to be a near-optimal (optimal when the human stays out of its way) policy that is strictly task-oriented. This policy benefits greatly from two advantages: (1) perfect observability and (2) never building the wrong engine.
C4 is a special case where the simulated robot assistant is actually operated by
an expert human confederate. In this condition, the human confederate operator is
instructed to simply accomplish the task with the participant as fast and efficiently
as possible while being helpful to the study participant.
Procedure
All pages and information presented to the participants are documented in Appendix B.2. Once a participant signed up for the study by providing an email to
the study website, they would be randomly assigned to any of the four conditions. If
they were assigned to C4 they would receive a confirmation email notifying them that
they would soon receive another email to schedule their participation. If they were
assigned to conditions one through three they would receive an email asking them to
(1) read game instructions, (2) play the simplest level to practice, (3) play all three
rounds of task one and all four rounds of task two and (4) fill out the post-study
questionnaire. Once enough participants had been assigned to condition C4, they
would be sent a scheduling email where they could sign up for 20 minute time slots.
Once signed up, they would receive an email with the same instructions as the one
above except that item (3) had to be completed within the scheduled slot.
Participants
Out of approximately 360 participants that signed up, only 86 (57 male) completed
the post-task questionnaire and passed our exclusion criteria (the most relevant being how many tasks they completed). The mean age of our participants was 25.4 (σ = 6).
Participants were randomly assigned to conditions but the distribution of users into
conditions after exclusion criteria was the following: C1=18, C2=23, C3=35, C4=10.
The reason for the uneven distribution mostly stems from the fact that more people
experienced technical difficulties in conditions C1 and C2 and were therefore excluded,
and in C4 many of the participants that had signed up did not respond to later
scheduling emails.
4.2.3
Metrics
Attitudinal Measures
A post-study questionnaire was used to acquire the attitudinal measures for this
study. Every effort was made to use known and validated surveys and metrics, with
only minor adaptations to fit this particular scenario. Because of some of the unique
aspects of this study, we also created a few ad hoc questions we felt were relevant. All questions used a seven-point Likert scale with the exception of a few free-text responses. The full questionnaire can be found in Appendix B.1 of this thesis.
To measure the participant's perception of the robot's competence, we used the
qualification factor from Berlo's Source Credibility Scale (Berlo et al., 1969).
For
evaluating the participant's perception of the robot's team-mindedness, we used the
goal sub-scale of the Working Alliance for Human-Robot Teams (Hoffman, 2014).
We also used selected questions from Hackman's Team Diagnostic Survey (Wageman
et al., 2005). The robot's perceived intelligence was measured using the Intelligence
sub-scale of the Godspeed metrics (Bartneck et al., 2009).
We used a few selected
questions from a study of perception of robot's personality to evaluate the perception
of the robot's social presence (Lee et al., 2006). We were interested in the degree to
which the participants attributed a Theory of Mind to the robot so we selected and
adapted some relevant questions from the Theory of Mind Index (Hutchins et al.,
2012). For measuring attribution of human traits to the robot, we used the Anthropomorphism and Animacy sub-scales of the Godspeed questionnaire (Bartneck et al.,
2009); enjoyment and likeability were measured using the Likeability sub-scale from Godspeed and the Bond sub-scale from the Working Alliance for Human-Robot Teams (Hoffman, 2014). Lastly, we measured task load using the standard NASA Task Load Index (this
was the first section in the questionnaire so people would answer it as soon after doing
the task as possible) (Hart & Staveland, 1988).
Behavioral Measures
All behavioral metrics were extracted from game log files that are recorded during
game play. We used two kinds of measures for the overall efficiency of the task: Time
to task completion, which is the time between when the first action taken by either
agent and the time when the goal is achieved, and total action count, which is the
total number of actions taken by both of the agents during the task time.
We used several behavioral metrics to measure fluency. We were interested in
both the action interval time of the participant, which is defined to be simply the
average time between actions, as well as the action interval rate of change, which
is the slope of a linear regression fit to the action intervals within each task. The
slope dictates how much change there was in this measure across the task duration
and could indicate a level of adjustment or learning. We were also interested in the
functional delay measurement, which Hoffman et al. defines to be the ratio of the
accumulated wait times between the robot finishing its action and the participant
taking their action, over the total task time (Hoffman, 2014).
In our task, it was
more appropriate to measure the wait time between when the robot starts its action
and when the participant starts theirs, because the participant could actually start
theirs before the robot's action had finished, and it was generally very obvious which
action the robot was taking once it had started.
Similarly to the action interval
measure, we were also interested in looking at the rate of change of this quantity
within a session.
4.2.4
Exclusion Criteria
In this study, we used the following criteria for completely excluding all data regarding
a given participant:
1. If a participant completed any fewer than six out of the seven total rounds of
tasks (allowing them to forget to complete one)
2. If participants answered with a five or higher on a seven point scale for the
question: "Did you experience any errors or technical troubles while playing the
game?"
3. If people did not select the only correct option out of seven possibilities for the
question: "Please select a tool that you used in the tasks"
4. If their task completion time on any of the rounds exceeded 150 seconds (average
task completion time was about 60 seconds with fairly low variation).
5. If they ever took more than 30 seconds to take the next action (average action
interval was roughly 1.2 seconds with low variation)
We applied the following criteria to exclude a particular feature from the data
from a given participant:
1. In both the action interval metric and the functional delay metric, we excluded
any single interval if it exceeded 15 seconds.
2. In both of the within-task temporal metrics, we omitted the slope of the linear
regression fit if the data points were fewer than 5 or if the absolute value of the
slope was higher than 150 (averages were in the range of [-5, -20])
4.3
Study Results
4.3.1
Statistical Analysis Methods
To evaluate whether data from participants in different experimental conditions actually differed significantly, we used pairwise one-way ANOVA tests and looked both at the produced p value and at the computed effect size η². Each pair of conditions was evaluated using a one-way ANOVA, resulting in six separate comparisons. To correct for the effect of multiple statistical tests (more tests increase the chance of rare occurrences), we performed the Bonferroni correction (Hochberg, 1988), which is a very conservative correction method. By this correction, an ANOVA p value that would otherwise indicate a significant difference in the means when lower than 0.05 now needs to be lower than 0.05/6 ≈ 0.008 to indicate that same difference. We therefore use the following thresholds in all graphs and discussion in this section:

Weak significant difference: p < 0.02, indicated with *
Significant difference: p < 0.008, indicated with **
Strong significant difference: p < 0.0002, indicated with ***
All tables report both the mean and standard error (SE) of the measured quantities for
all conditions. All graphs plot the means of the measured quantities for all conditions.
Graph error bars represent a 95% confidence interval, which has been suggested to prompt
better interpretation of statistical significance than null-hypothesis testing (Cumming,
2013). When discussing significance of differences we will also provide the effect size as
the classical η² metric (Brown, 2008), defined as follows:

    η² = SS_effect / SS_total = SS_between / SS_total

SS_effect represents the sum of squared errors between the independent variable means
and the total mean (we use SS_between from the ANOVA results table). SS_total represents
the total sum of squared errors between all data points and the total mean.
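As an illustration of the procedure described above, the following sketch (assuming NumPy and SciPy are available; this is not the actual analysis script, and the per-condition samples are hypothetical) runs the six pairwise one-way ANOVAs, computes η² from the sums of squares, and applies the Bonferroni-corrected threshold.

import numpy as np
from itertools import combinations
from scipy.stats import f_oneway

def eta_squared(*groups):
    """eta^2 = SS_between / SS_total for a one-way design."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_vals - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical per-condition samples (e.g. task completion times in ms).
data = {"C1": np.random.normal(80000, 5000, 30),
        "C2": np.random.normal(65000, 5000, 30),
        "C3": np.random.normal(75000, 5000, 30),
        "C4": np.random.normal(70000, 5000, 30)}

alpha = 0.05 / 6          # Bonferroni correction over the six pairwise tests
for a, b in combinations(data, 2):
    F, p = f_oneway(data[a], data[b])
    print(f"{a} vs {b}: p={p:.2e}, eta^2={eta_squared(data[a], data[b]):.2f}",
          "significant" if p < alpha else "n.s.")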
4.3.2 Behavioral Data

Time to task completion

The duration of a task, or time to task completion, is an important metric for efficiency.
This time is measured from when the first action is taken by either agent until the goal
is achieved. In this metric, a lower score is better.
Figure 4-7: Mean task completion times of all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      9.6e-04   0.26      0.580     6e-03     0.350     0.04
  C2      -         -         9.6e-05   0.26      0.033     0.15
  C3      -         -         -         -         0.433     0.02
Table 4.1: Task 1 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.271     0.04      0.289     0.02      0.004     0.30
  C2      -         -         0.015     0.11      0.046     0.16
  C3      -         -         -         -         2.8e-05   0.34
Table 4.2: Task 2 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
Figure 4-7 demonstrates the mean task completion times (over all rounds) for
both tasks. We can see that in the first task, C2 MTP-QMDP has the lowest completion
time and dominates all of the other conditions (significance not detected for C4
Human). In the second task, C4 Human is lowest.
          Round 1             Round 2             Round 3
          µ         SE        µ         SE        µ         SE
  C1      89352     5041      75942     4638      62516     3834
  C2      70559     3382      66740     3421      53959     1637
  C3      79991     2947      75378     2536      68503     1447
  C4      89180     6845      69862     5465      55911     1917
Table 4.3: Task 1 completion time in milliseconds.
Tables 4.3 and 4.4 show the mean completion times and standard errors from each
round of each task. This data underlies the means reported in Figure 4-7.
          Round 1             Round 2             Round 3             Round 4
          µ         SE        µ         SE        µ         SE        µ         SE
  C1      50057     2675      68706     5445      49835     2943      51347     5323
  C2      54644     2692      50548     1675      59259     3849      46225     1624
  C3      63919     1627      53434     1580      50186     1315      48415     1490
  C4      52323     3813      47442     2085      48136     2158      46949     954
Table 4.4: Task 2 completion time in milliseconds.
Total number of actions
Another important efficiency metric is the total number of actions taken by the robot
and the participant to accomplish the goal. This is simply measured as the sum of
the actions taken by either agent from the start of the trial until the goal is
achieved. In this metric, a lower score is better.
Figure 4-8: Mean number of actions taken by both agents over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      1.1e-08   0.60      0.967     4e-05     0.197     0.07
  C2      -         -         1.8e-15   0.71      1.5e-07   0.62
  C3      -         -         -         -         0.045     0.09
Table 4.5: Task 1 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.001     0.30      0.082     0.06      0.527     0.02
  C2      -         -         3.3e-04   0.23      0.007     0.28
  C3      -         -         -         -         0.525     9e-03
Table 4.6: Task 2 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
In Figure 4-8 we can see the mean total number of actions across all rounds for
each task. In task one, it is clear that C2 MTP-QMDP confidently dominates the
other conditions; those results are repeated in task two, but with slightly less
confidence.
          Round 1             Round 2             Round 3
          µ         SE        µ         SE        µ         SE
  C1      131.47    5.31      117.28    4.93      101.67    4.64
  C2      98.81     2.58      95.87     1.68      86.61     1.35
  C3      122.15    2.57      117.58    2.04      110.85    1.73
  C4      129.85    5.24      107.05    5.42      94.87     2.35
Table 4.7: Task 1 total number of actions.
Tables 4.7 and 4.8 show the mean number of actions and standard errors from
each round of each task. This data underlies the means reported in Figure 4-8.
          Round 1             Round 2             Round 3             Round 4
          µ         SE        µ         SE        µ         SE        µ         SE
  C1      85.22     2.67      104.50    6.85      85.72     3.11      78.50     1.96
  C2      81.27     2.60      78.05     1.40      82.69     3.61      72.55     1.14
  C3      91.32     1.46      86.65     1.45      84.80     1.63      74.54     1.37
  C4      91.89     5.04      81.40     2.29      87.10     2.67      84.00     1.96
Table 4.8: Task 2 total number of actions.
Human action interval
We measured the time intervals between human actions within each session. The
interval times are simply measured as the time between successive actions taken by
the study participant. Here we present the mean of those intervals across each round of
each task.
Figure 4-9: Mean action intervals of participants across all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.492     0.01      0.897     4e-04     0.237     0.06
  C2      -         -         0.316     0.02      0.783     3e-03
  C3      -         -         -         -         0.136     0.05
Table 4.9: Task 1 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
166
C3
C2
p
C1
0.792
C2
C4
p
2e-03
p
0.810
le-03
0.153
0.09
0.879
5e-04
0.299
0.05
0.111
0.06
C3
Table 4.10: Task 2 human action interval. ANOVA p value and effect sizes
pairwise comparisons of conditions.
ij2
for all
Figure 4-9 shows the mean action intervals of participants across all rounds of
each task. This graph shows that there is no significant difference between the means
of the conditions. There is a trend towards C4 Human being lower than others.
Within-task human action interval rate of change
This metric was calculated by gathering all action interval times for a given round of
a task and performing linear regression on those data points. The slope of the fitted
line was used as the metric.
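As a small illustration, the slope metric described above can be obtained with an ordinary least-squares fit. The snippet below (using NumPy, which is an assumption, on hypothetical one-round data) shows the idea.

import numpy as np

intervals_ms = [1900, 1750, 1600, 1500, 1450, 1300]     # hypothetical action intervals for one round
slope, intercept = np.polyfit(np.arange(len(intervals_ms)), intervals_ms, deg=1)
print(round(slope, 1))   # negative slope: the participant is speeding up across the round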
Figure 4-10: Mean rates of change in action intervals averaged over all rounds of each task, for conditions C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, and C4: Human (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.441     0.02      2.9e-04   0.25      0.005     0.28
  C2      -         -         3.8e-04   0.22      0.002     0.28
  C3      -         -         -         -         0.656     5e-03
Table 4.11: Task 1 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.028     0.15      0.010     0.13      0.309     0.05
  C2      -         -         4.1e-06   0.35      0.011     0.25
  C3      -         -         -         -         0.223     0.03
Table 4.12: Task 2 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
Figure 4-10 shows the mean rates of change in action intervals averaged over all
rounds of each task. A rather clear trend can be seen where C1 MTP-POMDP
and C2 MTP-QMDP dominate the other conditions; this is more pronounced in task
one.
          Round 1             Round 2             Round 3
          µ         SE        µ         SE        µ         SE
  C1      -6.84     1.99      -6.95     3.63      -27.44    3.13
  C2      -8.58     2.54      -9.61     2.13      -16.84    1.91
  C3      3.46      4.42      -5.52     1.26      -10.67    1.35
  C4      -7.18     3.21      -0.36     3.70      -6.50     3.96
Table 4.13: Task 1 human action interval rate of change.
Tables 4.13 and 4.14 show the mean rates of change in participant action intervals
and standard errors from each round of each task. This data underlies the means
reported in Figure 4-10.
          Round 1             Round 2             Round 3             Round 4
          µ         SE        µ         SE        µ         SE        µ         SE
  C1      -1.90     1.68      -2.75     3.48      -0.83     3.62      -0.16     1.45
  C2      -10.07    3.27      -13.55    4.36      -6.69     1.89      -4.41     1.58
  C3      0.25      1.51      7.35      1.71      6.37      1.95      5.65      1.94
  C4      0.12      1.74      1.91      3.53      3.53      4.67      3.27      2.60
Table 4.14: Task 2 human action interval rate of change.
Human functional delay ratio
We define functional delay to be the time between when the robot takes its action
and when the participant takes their next action. This time indicates a wait period
during which the participant might be trying to understand the action that the robot
just took. A game session will produce a sequence of these delays, and in this section
we use as a metric the ratio of the sum of those functional delays to the time of task
completion.
Figure 4-11: Mean participant functional delay ratios across rounds of each task, for conditions C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, and C4: Human (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.019     0.14      0.927     2e-04     0.149     0.08
  C2      -         -         0.008     0.13      0.526     0.01
  C3      -         -         -         -         0.142     0.05
Table 4.15: Task 1 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      4.5e-04   0.34      9.0e-04   0.20      0.242     0.06
  C2      -         -         0.026     0.10      0.002     0.34
  C3      -         -         -         -         6.8e-04   0.24
Table 4.16: Task 2 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
Figure 4-11 shows the functional delay ratios across rounds of each task. We can
see that in task one C2 MTP-QMDP generates the lowest value, although no significant
difference is detected for C4 Human. In the second task C2 MTP-QMDP still
produces the lowest value but is not significantly different from C3 Handcoded+A*.
          Round 1             Round 2             Round 3
          µ         SE        µ         SE        µ         SE
  C1      0.53      0.02      0.60      0.03      0.61      0.02
  C2      0.41      0.04      0.52      0.03      0.59      0.02
  C3      0.50      0.02      0.57      0.02      0.67      0.01
  C4      0.50      0.05      0.52      0.03      0.59      0.03
Table 4.17: Task 1 human functional delay ratio.
Tables 4.17 and 4.18 show the functional delay ratios and standard errors from
each round of each task. This data underlies the means reported in Figure 4-11.
          Round 1             Round 2             Round 3             Round 4
          µ         SE        µ         SE        µ         SE        µ         SE
  C1      0.53      0.02      0.52      0.02      0.53      0.02      0.42      0.02
  C2      0.43      0.03      0.51      0.04      0.32      0.03      0.39      0.02
  C3      0.40      0.01      0.51      0.01      0.50      0.01      0.38      0.02
  C4      0.52      0.04      0.53      0.03      0.58      0.03      0.49      0.03
Table 4.18: Task 2 human functional delay ratio.
Within-task human functional delay rate of change
This metric was calculated by gathering all functional delay times for a given round
of a task and performing linear regression on those data points. The slope of the
fitted line was used as a metric.
Figure 4-12: Mean rates of change in participant functional delays averaged over all rounds of each task, for conditions C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, and C4: Human (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.108     0.07      0.300     0.02      0.439     0.02
  C2      -         -         0.006     0.14      0.091     0.10
  C3      -         -         -         -         0.792     2e-03
Table 4.19: Task 1 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
          C2                  C3                  C4
          p         η²        p         η²        p         η²
  C1      0.280     0.05      0.038     0.08      0.899     7e-04
  C2      -         -         0.006     0.16      0.650     0.01
  C3      -         -         -         -         0.126     0.05
Table 4.20: Task 2 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
Figure 4-12 shows the mean rates of change in participant functional delays averaged
over all rounds of each task. In both tasks we can see that C2 MTP-QMDP
produces the lowest value, but only significantly so compared to C3 Handcoded+A*.
4.3.3 Attitudinal Data
None of the attitudinal measures from the post-study questionnaire showed any significant differences after the Bonferroni correction had been applied.
4.4 Discussion

4.4.1 Task Efficiency
Figure 4-7 shows us two things: firstly, that C2 MTP-QMDP confidently outperforms
all of the other conditions except C4 Human, and secondly, that C1 MTP-POMDP
and C3 Handcoded+A* show no significant difference. Essentially the same results
can be read from Figure 4-8, which shows that C2 MTP-QMDP takes significantly
fewer actions to accomplish the goal than all other conditions and that C1 MTP-POMDP
is not statistically different from the others. This is not surprising, as we would expect
the number of actions taken to correlate with time of task completion.

Given that C2 MTP-QMDP takes full advantage of all mechanisms of the MTP
system except for the final layer of POMDP planning, we interpret these results as a
general success for the MTP approach, with the reservation that the value added by
the POMDP layer needs to be further justified since it comes at a cost to the task
efficiency of the team. This should be further investigated with a follow-up study
where we try to better understand the difference between behavior produced by the
POMDP and QMDP versions of the MTP system and how they can be better tuned.

We are also pleased to see that there is no significant measured difference between
C1 MTP-POMDP and C3 Handcoded+A* or even C4 Human (except in time of
task completion for task two). This is especially notable in light of the fact that C3
Handcoded+A* is controlled by a policy that has the unfair advantage of perfect
observability of the world state at all times and uses a set of rules to accomplish the
task that would produce optimal behavior in the single-agent case.

We therefore conclude that our hypothesis H5 Improved Task Efficiency is
supported, even in comparison with C4 Human, which was unexpected.
4.4.2 Team Fluency
The behavioral metrics we chose for measuring team fluency were human action
interval and functional delay, as well as the within-task rates of change in these
metrics.

In Figure 4-9, we can see that there is not any significant difference in the average
action intervals in each task across the conditions. This can be confirmed by looking
at the p values and effect sizes in Tables 4.9 and 4.10.
On the other hand, Figure 4-10 and Tables 4.11 and 4.12 show clearly that when
users play in C1 MTP-POMDP and C2 MTP-QMDP their action intervals get
shorter across an episode of a task, and at a significantly faster rate than in C3
Handcoded+A* and C4 Human.

When a participant has a negative rate of change in their action interval, it means
that as the task progresses they take actions more quickly. This might suggest that they
are learning how to work with their teammate and therefore need less and less
time to choose their actions as the task progresses. The fact that the MTP agents
produce an improvement in this metric might suggest that participants are quicker
to learn how the agent operates; since the produced behavior better matches their
expectations, they require less and less contemplation as the task progresses.
Looking at the functional delay ratios in Figure 4-11 and Tables 4.15 and 4.16, we
can see that C2 MTP-QMDP has generally lower functional delay ratios than the other
conditions and that this difference is often significant. We also see that in task two, C3
Handcoded+A* scores lower than C1 MTP-POMDP.

The mean rate of change in the functional delay across the tasks can be seen in
Figure 4-12. There is a lot of variation in this metric, but we can still see
that C1 MTP-POMDP and C2 MTP-QMDP both have consistently negative rates
of change, and C2 MTP-QMDP is significantly lower than C3 Handcoded+A* in
both tasks.
We cautiously interpret the results shown by action intervals, functional delays,
and rates of change to suggest that an MTP-controlled agent produces behavior
which a human teammate can model and understand more quickly, leading to a more
fluent collaboration.

We conclude that hypothesis H5 Team Fluency is Improved is semi-upheld, as
sub-hypotheses (b) and (d) were upheld, (c) was partially upheld and (a) was not supported.
4.4.3 Attitudinal Data
We were disappointed to find that the questionnaire data did not reveal any statistically significant differences between experimental conditions after the Bonferroni
correction had been applied. The fact that behavioral differences were observed but
not self-reported attitudinal differences is not an uncommon occurrence with online
studies. It seems that people's engagement in the task in such studies is often significantly lowered, leaving the participant with an impression of the experience that is
less pronounced.
We did observe some trends in the questionnaire data that mostly favor C1 MTP-POMDP over C2 MTP-QMDP, especially along dimensions such as liking the robot
and perception of the robot liking and appreciating the participant. We believe that
those indications could be made more pronounced if the study were modified to follow
a within-participant design, where each participant experiences the different types of
agent controllers and can better assess the differences in subjective experience.
Consequently, the attitudinal hypotheses H1,
H2 and H3 are not supported by
the data and would require further investigation to confirm.
4.4.4 Informal Indications from Open-Ended Responses
We included a few open-ended questions where participants could share their thoughts
about various parts of their experience. It is helpful to look at this data to be able
to better understand the dynamics of the game as the participants experienced them
and how that might inform future studies.
Game Controls
Because of the discrete nature of AI planners, we were required to put in place a few
artificial limitations on the game which would allow the planner to cooperate with the
participant. For example, the environment was discretized to grid locations and the
game characters could only turn in the four cardinal directions. Secondly, we enforced that
a character would have to finish an "animation" (such as moving between grid cells or
rotating) before accepting another user command. This was done so that the game
would always be in a well-defined state when the planner decided which action it should
take, but it resulted in a slightly clunky user experience. Many participants reported
this in the open-ended questions and somewhat confused this "inefficiency" with the
inefficiencies that we are more interested in measuring, such as those caused by a bad
teammate. Some representative quotes are: "the reaction of the character was too
slow", "...the robot was simply faster in a lot of ways due to a bit of lag-centered
controls", "The major obstacle was the interaction with the controls; I found myself
turning more than I meant to because the game was slow to respond" and "For the
most part robot was helpful; only inefficient because for me there was a slight lag after
pressing keys".
Lack of Communication and Tendency to Lead
Many participants reported that they were annoyed that there was no way to explicitly
communicate with the robot: "It was intelligent because it knows what's the
next step. It would be better if we had some communication.", "Intelligent enough
to understand the task, not intelligent enough to understand basic communication",
"... effectiveness could have been improved if communication were possible.". People
have a strong tendency to want to communicate verbally, and neither the game nor any
of the planners were designed to allow that. Participants resorted to attempting to
communicate through their avatar's movement and behavior: "... it was unresponsive
to anything I tried to do to it, or anywhere I tried to lead it", "Sometimes, blocking
the way to the lower point engine worked and sometimes it didn't", but this would
conceivably only further confuse the planners as they were not equipped to model the
intentions behind that behavior. We think that a great advantage could be afforded
by giving autonomous robots the ability to model this behavior explicitly.
Differences Between Conditions
C4 Human: The responses from people in this condition were useful to determine a
baseline for the game experience, since the agent they interacted with was controlled
by an expert human and should therefore not have contributed much to any frustration
with the game. It was interesting to see that not many participants reported
that they believed the robot to be controlled by a human, except this one who figured
it out because of how the robot recovered from error: "Somehow, I feel like I was
playing with a real person since when we competed for a item, the one that failed to
get that will get away from the engine or step to the next item."
C1 MTP-POMDP: Many participants from this condition reported their satisfaction
with the robot's ability to infer their intention and anticipate their actions:
"The tasks were performed relatively efficiently. The robot in general was able to infer
the user's intention and anticipated the next step in the assembly process.", "It seemed
reasonably competent. It seemed to anticipate the action that was needed.", "The robot
would try to anticipate what I was doing, but at the same time, I would try to anticipate
what part I should be going for based on what I thought it was going to do.".
We also noticed that some participants from this condition complained that the robot
sometimes took a while to respond to their action. This is probably due to the fact
that this condition requires the most planning by the autonomous agent, which usually
happens so quickly that it cannot be detected by the participant but can sometimes halt
the game for a second or two (we used a planning timeout of about five seconds). Effort
should be expended in the design of future studies to neutralize this confounding effect,
possibly by fully pre-calculating policies before study trials or otherwise speeding up
policy look-up (possibly by employing a computer cluster to improve planning speed).
Representative quotes: "the reaction of the character was too slow", "Slow response
time; startup time was very unpredictable".
C2 MTP-QMDP: Participants in this condition also commented on the robot's
ability to anticipate their actions, but it seemed that the robot would more often
start building the wrong engine and they would have to settle for a sub-optimal
reward outcome: "On the second task, it seems like it didn't know which engine was
more important, so once it would place a part on any engine I had to follow", "Once
he(she it?) started to assemble the wrong engine, so I just went along", "... but I had
to make sure to put in the first part myself or it would choose the wrong engine to
assemble". Participants also noted that this agent would often retrieve objects that
were not needed immediately but rather a few moves later: "it also sometimes jumped
several steps ahead in the assembly process", "... it wasn't great at deciding whether
to bring the next item or the one after it", "The only time it seemed incompetent was
one of the tasks, when it started by picking up the air filter, so I had to do all of
the steps before that one". This is consistent with our qualitative understanding of
how the QMDP agent operates, which is generally to aggressively take advantage of the
predictive capabilities of MTP but often overestimate its confidence about the mental
state of the other agent.
C3 Handcoded+A*: The participants in this condition generally reported that
the robot was very efficient, but several commented that it didn't seem to consider
them very much: "a few times the robot moved directly in front of the avatar to return
something, rather than taking the alternate route which did not cross the space directly
in front of the avatar", "Sometimes the robot blocked my way, but for the most of the
time I felt that working with my assistant robot was efficient", "I feel I would be more
efficient, if the robot also knew to walk around me more. However, did a good job
knowing next step.", "Very efficient. The robot is quite capable of doing the task on
its own.". This is fairly consistent with our intuition for how this autonomous agent
would behave: very efficiently with respect to the task but without any model of the
behavior of the other.
4.4.5 Funny Comments
A few of the open-ended responses were humorous and it would be a shame not
to share any of them in this thesis. The following two comments were particularly
amusing: "If I could provide more input to direct its goals, then yes, the robot would
make a fine teammate for mechanical tasks. I wouldn't want to take it out for a beer,
though." and "The robot tried to block me! It was annoying because I tried this game
for money. I think the robot is evil!".
Chapter 5
Conclusion
5.1 Thesis Contributions
This thesis makes contributions both to the field of probabilistic planning and to
Human-Robot Interaction (HRI). In fact, one of its contributions is taking an HRI
challenge and formulating it with the appropriate representations from probabilistic
planning so that it may be solved in a computationally principled way.
Introduction of a novel general-purpose POMDP solver. We have presented
a novel algorithm called B3RTDP which extends the Real-Time Dynamic Programming
(RTDP) approach to solving POMDP planning problems. This approach employs
a bounded value function representation which it takes advantage of in novel ways.
Firstly, it calculates action convergence at every belief and prunes actions that are
dominated by others within a certain probability threshold. This technique is similar
to a branch-and-bound search strategy, but it calculates action convergence probabilities
such that actions may be pruned before convergence is achieved. Secondly, B3RTDP
introduces the concept of a Convergence Frontier, which serves to improve convergence
time by taking advantage of early action-selection convergence in the policy.
The B3RTDP algorithm was evaluated against a state-of-the-art POMDP planner
called SARSOP on two standard benchmark domains and showed that it can garner
higher Adjusted Discounted Reward with a shorter convergence time.
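To illustrate the flavor of this pruning idea, the following toy sketch treats each action's Q-value as an interval and drops actions that are very unlikely to beat the greedy choice. The uniform-interval assumption, the threshold value and the action names are hypothetical; this is an illustration, not the thesis implementation.

import random

def prob_better(a_bounds, b_bounds, samples=20_000):
    """Monte-Carlo estimate of P(Q_a < Q_b) for independent uniform intervals
    (lower Q is better, since Q is treated as a cost-to-go here)."""
    hits = sum(random.uniform(*a_bounds) < random.uniform(*b_bounds)
               for _ in range(samples))
    return hits / samples

def prune_actions(q_bounds, delta=0.01):
    """q_bounds: {action: (q_lo, q_hi)}. Keep only actions that retain at least
    a `delta` chance of being better than the greedy action."""
    greedy = min(q_bounds, key=lambda a: q_bounds[a][0])
    return {a: b for a, b in q_bounds.items()
            if a == greedy or prob_better(b, q_bounds[greedy]) >= delta}

print(prune_actions({"north": (1.0, 2.0), "south": (1.8, 4.0), "wait": (5.0, 9.0)}))
# "wait" is strictly dominated by "north" and gets pruned; "south" survives.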
Introduction of a novel approach to predictive planning based on mind-theoretic
reasoning. We presented the development of a novel planning system for
social agents. This Mind-Theoretic Planning (MTP) system employs predictive models
of others' behavior based on their underlying mental states. The MTP system
takes as input distributions over possible beliefs that the other agent might have
about the environment as well as possible goals they might have. It then proceeds
to create predictive models called mental state situations, which construct stacks of
Markov Decision Process models, each level taking advantage of policies and value
functions computed in the levels below, producing improved predictive power. The MTP
system leverages the predictive mental state situations to compute a forward transition
function for the environment that includes anticipated effects of the other agent
as a function of their mental state. Finally, a perceptually limiting observation function
is used in conjunction with the predictive transition functions to formulate a
POMDP that is solved using the B3RTDP algorithm.
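The following highly simplified sketch illustrates the general idea of folding a mental-state-conditioned prediction of the other agent into the robot's forward model. The function signature, the `policies` lookup, the "noop" default and the deterministic policy are illustrative assumptions rather than the actual MTP code.

from collections import defaultdict

def predictive_transition(state, robot_action, mental_state, policies, T_robot, T_other):
    """Return {next_state: probability} after the robot acts and the other agent
    reacts with the action its mental-state-specific policy prescribes."""
    dist = defaultdict(float)
    for s_mid, p_mid in T_robot(state, robot_action).items():       # robot's effect
        other_action = policies[mental_state].get(s_mid, "noop")    # predicted reaction
        for s_next, p_next in T_other(s_mid, other_action).items():
            dist[s_next] += p_mid * p_next
    return dict(dist)

In the full system the other agent's reaction would presumably be a distribution over actions rather than a single deterministic choice, but the way the prediction enters the transition function is analogous.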
Development of evaluation environment and simulator. To evaluate the
contributions as well as assist in the development of this thesis, we developed a robot
simulator environment that is oriented towards simulated human-robot interaction
experiments. This simulator is designed so that it can be deployed online and played by
participants through a web browser. In the game, two agents are paired together to
accomplish a task; typically one is the study participant and the other can either be an
autonomous system using the character control Java API, or another human controller.
The game uses a 3-D grid-based environment and the character is controlled in
first-person chase-camera mode. The environment is visually filtered to only show
features that are available to the character, which is particularly important for the
evaluation of mind-theoretic systems as it forces participants to take perception actions.
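As a purely illustrative aside, a perceptual filter of this kind can be sketched for a 2-D grid and four facing directions as below; the real simulator is a 3-D Unity environment, so the cone shape, parameters and function below are assumptions made only for illustration.

def visible_cells(pos, facing, grid_size, depth=5):
    """pos=(x, y); facing in {'N','S','E','W'}; returns the set of grid cells
    inside a simple forward-facing cone of the given depth."""
    dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[facing]
    px, py = -dy, dx                      # perpendicular (sideways) direction
    x0, y0 = pos
    cells = set()
    for d in range(1, depth + 1):         # cells ahead of the character
        for w in range(-d, d + 1):        # cone widens with distance
            x, y = x0 + d * dx + w * px, y0 + d * dy + w * py
            if 0 <= x < grid_size and 0 <= y < grid_size:
                cells.add((x, y))
    return cells

print(sorted(visible_cells((3, 0), "N", grid_size=8, depth=2)))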
Human subject evaluation of MTP system. An online user study was performed with approximately a hundred participants. In the study, participants would
interact with an autonomous agent, in the simulator discussed above, to accomplish a
task of assembling an engine from parts. The study showed that the MTP system can
significantly improve task efficiency and team fluency over an alternative autonomous
system and even a human expert controller in some cases.
5.2 Recommended Future Work

5.2.1 Planning with personal preferences
The MTP system currently takes as input the possible beliefs and goals that a human
agent might have about the environment, and it computes predictions of how that
agent might go about attempting to achieve its goals given those beliefs. Clearly
there are many ways to skin a cat, and often several different plans can achieve
the agent's goals. The MTP system uses approximate, but optimal, task planners to
create these action policies, and therein lies an assumption of perfectly rational behavior.
This assumption often holds, especially in very task- or goal-driven scenarios, and
even when incorrect it can provide good approximations of actual human behavior.
We know that people do not always behave optimally or necessarily rationally. There
are many reasons for people's sub-optimal behavior other than simple ignorance of
how to perform better, such as personal preference, superstition, curiosity, boredom
or simply creativity. We believe that the MTP system could be improved if it were
able to model some of the most common sources of variance in people's deviations
from optimal behavior, especially if it were able to learn those parameters specific to
each individual based on a history of interactions.
5.2.2 Planning for more agents
The presented approach to mind-theoretic reasoning is in no way inherently limited to
planning only for one human teammate, but it is also not particularly designed to scale
well to planning for many agents. This does not mean we intended for it to scale poorly,
but rather that we focused on demonstrating the concept and its impact in the simpler
case before thinking about optimization for scaling. The planning problem generally
grows exponentially with the number of agents, but we believe that clever optimizations
might be used to gain better leverage on the problem. In any given task, there
might be, for example, large regions of the state space where reasoning about all of
the agents' mental states or predicting their actions is completely irrelevant. There
might be an opportunity to apply state-space abstraction methods here.
5.2.3 State-space abstractions and factoring domains into MTP and Non-MTP
State space abstractions can be used to significantly reduce planning time; this reduction is gained by encoding which parts of the planning problem are relevant to the
current goal and which parts are not. An abstraction in a navigational domain might
for example recognize areas of the environment that should be treated as the same,
since their differences are irrelevant to the particular navigational target. This technique might have a huge impact on mind-theoretic planning since often only a small
(but important) part of the complete state space contains features that are significant
from a mind-theoretic perspective. If the MTP solver could be sensitive to this fact
and have an efficient way to identify those areas of the problem, it could possibly
solve them in an easier way without having to consider mind-theoretic consequences
of actions or predictions of others' reactions etc. Similarly, using a factored representation for the state space might produce significantly smaller transition and reward
functions as much of the action space might not have any mental state consequences
and mental state variables could therefore be considered independent of those actions
in their Dynamic Bayes Net (DBN) encodings.
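A minimal sketch of this factoring idea is shown below; the state factors, action names and model callables are hypothetical and serve only to illustrate how mental-state variables could be left untouched by purely physical actions.

MENTAL_ACTIONS = {"point_at_goal"}   # assumed: the only actions with mental-state effects

def factored_step(state, action, physical_model, mental_model):
    """state = {'physical': ..., 'mental': ...}. Purely physical actions leave the
    mental factor untouched, so no mental-state transition model is consulted for
    them; only actions in MENTAL_ACTIONS call mental_model."""
    new_mental = (mental_model(state["mental"], action)
                  if action in MENTAL_ACTIONS else state["mental"])
    return {"physical": physical_model(state["physical"], action),
            "mental": new_mental}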
5.2.4 Follow-up user study
In the human subject study we performed to evaluate this thesis, we found some
very interesting trends where a non-POMDP MTP agent using a QMDP action
selection strategy mostly outperformed all other conditions in terms of efficiency and
team fluency. Although we did not see significantly different results in our attitudinal
measures, which were acquired using a post-task questionnaire, they contained a trend
which suggested that the QMDP agent was less liked than the POMDP agent. We
would like to investigate this interaction further and see how the benefits of the two
strategies could be achieved by adjusting the parameters of the POMDP model. We
propose that this experiment should follow a "within-participant" design where each
participant experiences interacting with both types of agents and provides relative
attitudinal judgments. We hope this will help to really underline the important
perceived differences and provide guidance for how to improve the MTP system.
Appendix A
Existing Planning Algorithms
A.1 Various Planning Algorithms
Algorithm 7: Pseudocode for the GRAPHPLAN algorithm. The algorithm operates
in two steps, graph creation and plan extraction. The EXTRACTPLAN algorithm is a
level-by-level backward-chaining search algorithm that can make efficient use of mutex
relations within the graph.

GRAPHPLAN(s_I : Set, s_G : Set)
    Create S_0 from s_I
    foreach i ∈ N do
        Add NoOp "actions" to A_i for each state literal in S_i
        Add all actions that apply in S_i to A_i
        Construct S_{i+1} from A_i
        Inspect mutex relations in S_{i+1}
        if s_G exists non-mutexed in S_{i+1} then
            Attempt EXTRACTPLAN
            if solution found then
                return plan
            else if solution impossible then
                return failure
Algorithm 8: The RTDP algorithm interleaves planning with execution to find the
optimal value function over the relevant states relatively quickly.

RTDP(s_I : State, G : GoalSet, h(s) : Heuristic)
    // Initialize value function to admissible heuristic
    V(s) <- h(s)
    while not converged do
        RTDPTRIAL(s_I, G)

RTDPTRIAL(s_I : State, G : GoalSet)
    s := s_I
    depth := 0
    while (s ≠ ∅) ∧ (s ∉ G) ∧ (depth < MAX_depth) do
        // Pick action greedily, equation 2.1
        a := argmin_{a'} Q(s, a')
        // Perform the Bellman value update
        V(s) := Q(s, a)
        // Sample next state for exploration from T
        s := PICKNEXTSTATE(s, a)
        depth := depth + 1

PICKNEXTSTATE(s : State, a : Action)
    return s' ~ T(s, a, ·)
Algorithm 9: The BRTDP algorithm. Uses a bounded value function and a
search heuristic that is driven by information gain.

BRTDP(s_I : State, G : GoalSet)
    while not converged do
        BRTDPTRIAL(s_I, G)

BRTDPTRIAL(s_I : State, G : GoalSet)
    s := s_I
    depth := 0
    while (s ≠ ∅) ∧ (s ∉ G) ∧ (depth < MAX_depth) do
        // Pick action greedily from lower boundary
        a := argmin_{a' ∈ A} Q_L(s, a')
        // Perform the Bellman value updates for both boundaries
        V_L(s) := Q_L(s, a)
        V_H(s) := min_{a' ∈ A} Q_H(s, a')
        // Sample next state for exploration according to highest expected information gain
        s := PICKNEXTSTATEBRTDP(s, s_I, a)
        depth := depth + 1

PICKNEXTSTATEBRTDP(s, s_I : State, a : Action)
    // Create vector of transition-probability-weighted value gaps
    ∀ s' ∈ S, k(s') := T(s, a, s') (V_H(s') - V_L(s'))
    K := Σ_{s'} k(s')
    // Terminate when a state of relative certainty, compared to s_I, has been reached
    if K < (V_H(s_I) - V_L(s_I)) / τ then
        return ∅
    // Sample from normalized vector
    return s' ~ k(·)/K
Algorithm 10: The RTDP-Bel algorithm from (Geffner & Bonet, 1998) and
(Bonet & Geffner, 2009).

RTDP-BEL(b_0 : Belief, G : GoalSet)
    while not converged do
        d := 0
        b := b_0
        s ~ b(·)
        while (d < MAX_depth) ∧ (b ∉ G) do
            // Select action greedily
            a := argmin_{a' ∈ A} Q(b, a')
            // Perform Bellman value update
            V(b) := Q(b, a)    // Update value of belief
            // Sample next state and observation
            s' ~ T(s, a, ·)
            o ~ O(a, s', ·)
            // Update belief via equation 2.3
            b := b_a^o
            s := s'
            d := d + 1
Appendix B
Study Material
B.1 Questionnaire
The questionnaire data was gathered using the online service Survey Monkey, which
constrained some of the question formatting. Section headings were omitted.
Task Load
The following questions required free text responses.
1. Please enter your email. Use the same email address you used when you signed up for the study.
2. Describe how efficiently or inefficiently you feel the tasks were performed. Please explain why
you think that was the case.
The following six questions used a seven point Likert scale between Very low and Very high and
were sourced from (Hart & Staveland, 1988).
3. How mentally demanding were the tasks ?
4. How hurried or rushed was the pace of the tasks ?
5. How successful were you in accomplishing what you were asked to do ?
6. How hard did you have to work to accomplish your level of performance ?
7. How insecure, discouraged, irritated, stressed, and annoyed were you ?
8. How often did your team assemble the engine that you originally intended to assemble ?
The following question used a seven point Likert scale between Never and Always.
9. How often did your team assemble the engine that you originally intended to assemble ?
Fluency
The following questions used a seven point Likert scale between Strongly disagree and Strongly agree
and were largely sourced from (Hoffman, 2014).
10. The robot tended to ignore me
11. The human-robot team improved over time
12. The human-robot team worked fluently together.
13. I tended to ignore the robot
14. The robot's performance improved over time
15. The robot contributed to the fluency of the interaction.
16. The human-robot team's fluency improved over time
17. What the robot did affected what I did
18. What I did affected what the robot did
Competency
The following question required a free text response.
19. Briefly describe how competent or incompetent you felt the robot was and why.
The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second. These were partly sourced from (Bartneck et al., 2009) and partly from (Berlo et al., 1969).
20. Please rate your impression of the robot on a scale between "incompetent" and "competent"
21. Please rate your impression of the robot on a scale between "untrained" and "trained"
22. Please rate your impression of the robot on a scale between "unqualified" and "qualified"
23. Please rate your impression of the robot on a scale between "unskilled" and "skilled"
Team-Mindedness
The following question required a free text response.
24. Describe how you felt about having the robot as a team-mate. Would you like to have the
robot on your team in the future ?
The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second.
25. Please rate your impression of the robot on a scale between "unhelpful" and "helpful"
26. Please rate your impression of the robot on a scale between "inconsiderate" and "considerate"
27. Please rate your impression of the robot on a scale between "selfish" and "selfless"
28. Please rate your impression of the robot on a scale between "ego-oriented" and "team-oriented"
The following questions used a seven point Likert scale between Completely unaware and Completely
aware.
29. How aware of your plans do you think the robot was ?
The following questions used a seven point Likert scale between Strongly disagree and Strongly agree
and were largely sourced from (Hoffman, 2014) and (Wageman et al., 2005).
30. The robot was committed to the success of the team
31. I was committed to the success of the team
32. If it were possible then I would be willing to team up with the robot on other projects in the
future
33. The robot had an important contribution to the success of the team
34. The robot and I are working towards mutually agreed upon goals
35. The robot does not understand what I am trying to accomplish
36. The robot perceives accurately what my goals are
37. The robot was cooperative
Intelligence
The following question required a free text response.
38. Briefly describe how intelligent or unintelligent you felt the robot was and why.
The following questions used a seven point Likert scale where 1 was designated with the first quoted
word and 7 with the second and were largely sourced from (Bartneck et al., 2009).
39. Please rate your impression of the robot on a scale between "ignorant" and "knowledgeable"
40. Please rate your impression of the robot on a scale between "unintelligent" and "intelligent"
41. Please rate your impression of the robot on a scale between "uninformed" and "informed"
Theory of Mind
The following questions used a seven point Likert scale between Strongly disagree and Strongly agree
and were adapted from (Hutchins et al., 2012).
42. The robot understands that people's beliefs about the world can be incorrect
43. The robot understands that people can think about other peoples' thoughts
44. If I put my keys on the table, leave the room, and the robot moves the keys to a different
room the robot would understand that when I returned, I would begin by looking for my keys
where I left them
Humanness
The following questions used a seven point Likert scale where 1 was designated with the first quoted
word and 7 with the second.
45. Please rate your impression of the robot on a scale between "machinelike" and "humanlike"
46. Please rate your impression of the robot's behavior on a scale between "predetermined" (or
programmatic) and "interactive" (or responsive)
47. Please rate your impression of the robot on a scale between "introverted" (shy, timid) and
"extroverted" (outgoing, energetic)
The following question used a seven point Likert scale between Strongly disagree and Strongly agree.
48. I think it is possible that the robot was controlled by a person behind the scenes
Multiple choice: "Brown saw", "Orange pliers", "Green plastering tool", "Black chisel", "Red screwdriver", "Gray hammer" and "Blue level".
49. Please select a tool that you used in the tasks
Enjoyment
The following question required a free text response.
50. Briefly describe whether or not you enjoyed working with the robot (and why or why not)
The following questions used a seven point Likert scale where 1 was designated with the first quoted
word and 7 with the second. Largely sourced from (Bartneck et al., 2009).
51. How much did you enjoy performing the tasks with the robot ? (1=Not at all, 7=Very much)
52. Please rate how much you liked or disliked the robot ? (1=Dislike, 7=Like)
53. Please rate your impression of the robot on a scale between "unfriendly" and "friendly"
The following questions used a seven point Likert scale between Strongly disagree and Strongly agree
and were adapted from (Hoffman, 2014).
54. I am confident in the robot's ability to help me
55. The robot and I trust each other
56. I believe the robot likes me
57. The robot and I understand each other
58. I feel that the robot appreciates me
59. I feel uncomfortable with the robot
Demographics
The following questions required free text responses.
60. In what country do you currently live ?
61. What is your age in years ?
The following questions were multiple choice.
62. What is your gender ? ("Male", "Female", "Other/Don't want to answer")
63. What is your level of education ? ("Some high school", "High school degree", "Some college",
"College degree", "Some graduate school", "Graduate degree")
64. How often do you play video games where characters are controlled in 3D environments (which
are different from 2D games such as Tetris and Angry Birds etc.) ? ("Never", "Less than once
a month", "1-4 times a month", "5-10 times a month", "11-20 times a month", "More than 20
times a month")
65. Do you own or have you used a robotic toy or appliance (e.g. Sony AIBO, iRobot Roomba)
? ("Never", "Used them once or twice", "Used them many times", "Own one or more")
66. Did you experience any errors or technical troubles while playing the game ? ("No errors", "Very few", "Some", "Fairly many", "A lot")
The following questions asked "How much do you know about:" using a seven point Likert scale
between Nothing and A lot.
67. Computers
68. Robotics
69. Artificial Intelligence
The following question required a free text response.
70. Do you want to report any technical difficulty you had with the game or errors that you
experienced ?
B.2 Study Pages
Sign-up website:
MIT Robot Study
MIT MEDIA LAB
Welcome and thank you for signing up for our robot study !
Before participating in the study, please read the "Consent to Participate in Non-Biomedical Research" section below. We
recommend printing this page for your records [press here to print]. If you have any questions regarding the study which you
would like answered before you participate, please feel free to email them to mit.robot.study@gmail.com
Please enter your email, which will be used as your username for this study (We will never send spam or release this email
address to a 3rd party). Please note that this same email should be used when filling out the questionnaire, and is where we will
send you your Amazon gift card code if you complete all the tasks and questionnaire.
We will send further instructions on how to participate in this study to your inbox.
Please note that you must be 18 years old or older to participate and you can only participate once.
Email:
Please read the following consent form and provide your consent by checking the checkbox and pressing the 'submiV button on
the bottom of the page.
Consent to participate in non-biomedical research
Collaboration and Learning with Local and Remote Robot Teams
You are asked to participate in a research study conducted by Sigurdur Orn Adalgeirsson, M.Sc. and Cynthia Breazeal,
Ph.D., from the Media Lab at the Massachusetts Institute of Technology (M.I.T.) You were selected as a possible
participant in this study because you are a proficient speaker of English. You should read the information below, and ask
questions about anything you do not understand, before deciding whether or not to participate.
Participation and withdrawal
Your participation in this study is completely voluntary and you are free to choose whether to be in it or not. If you choose
to be in this study, you may subsequently withdraw from it at any time without penalty or consequences of any kind. The
investigator may withdraw you from this research if circumstances arise which warrant doing so.
Purpose of the study
The purpose of this study is to learn about the strategies that people employ when working together with teams of one or
more robots to solve situated, physical tasks. We are interested in how people try to teach new skills to robot teams, as
well as in how they collaborate with robot teams to solve situated tasks in the real world or simulation. We are
constructing robots that can team up to interact with and learn from people, and we hope that the results of this study will
help us to improve the design of these robots' collaborative abilities.
Procedures
You will be controlling a human character in a video game and work collaboratively with a simulated robot to achieve an
engine assembly task.
We will be asking you to participate in 3-4 rounds of two collaborative tasks with our simulated robot. You will be
interacting with a robot via a graphical simulation interface similar to a video game. In each task, you will have a different
goal to achieve and the robot will attempt to be helpful.
After you have finished the tasks, you will be asked to complete a questionnaire.
Each round of the tasks will take about 2-4 minutes, and the questionnaire will take about 15-20 minutes, so the total time
for this experiment will be approximately 30-50 minutes.
Potential risks and discomfort
There are no risks that are anticipated while participating in this study.
Potential benefits
There are no specific benefits that you should expect from participating in this study; however, we hope that you will find
the experience to be enjoyable and engaging.
Your participation in this study will help us to build robots that are better able to interact with and learn from humans.
Payment for participation
You will receive a value of at least $5 in the form of an Amazon gift card for having completed participation in this
experiment (playing the training round, all rounds of both tasks and filling out the questionnaire).
An additional $50 gift card will be given to the three participants that complete the tasks most efficiently (in the shortest
time).
Finally there will be a lottery for one $100 gift card.
The $5 gift card will be delivered to you via your email address within a week of your complete participation. To be
eligible for the awards and lottery, completed participation is required. The lottery and best performance awards will be
delivered at the end of running this study (within approximately two months).
Confidentiality
Any information that is obtained in connection with this study and that can be identified with you will remain confidential
and will be disclosed only with your permission or as required by law.
No data that would describe an individual participant will be used; we will only use aggregate data from all participants.
At any time during or after the experiment you can request that all data collected during your participation be destroyed.
Identification of Investigators
If you have any questions or concerns about the research, please feel free to contact:
Associate Professor Cynthia Breazeal
617-452-5601
MIT Media Lab, E15-468
Cambridge, MA 02139
cynthiab@media.mit.edu
Sigurdur Orn Adalgeirsson
617-452-5603
MIT Media Lab, E15-468
Cambridge, MA 02139
siggi@media.mit.edu
Emergency care and compensation for injury
If you feel you have suffered an injury, which may include emotional trauma, as a result of participating in this study,
please contact the person in charge of the study as soon as possible.
In the event you suffer such an injury, M.I.T. may provide itself, or arrange for the provision of, emergency transport or
medical treatment, including emergency treatment and follow-up care, as needed, or reimbursement for such medical
services. M.I.T. does not provide any other form of compensation for injury. In any case, neither the offer to provide
medical assistance, nor the actual provision of medical services shall be considered an admission of fault or acceptance
of liability. Questions regarding this policy may be directed to MIT's Insurance Office, (617) 253-2823. Your insurance
carrier may be billed for the cost of emergency transport or medical treatment, if such services are determined not to be
directly related to your participation in this study.
Rights of research subjects
You are not waiving any legal claims, rights or remedies because of your participation in this research study. If you feel
you have been treated unfairly, or you have questions regarding your rights as a research subject, you may contact the
Chairman of the Committee on the Use of Humans as Experimental Subjects, M.I.T., Room E25-143B, 77 Massachusetts
Ave, Cambridge, MA 02139, phone 1-617-253 6787.
I understand the procedures described above. My questions have been answered to my satisfaction, and I agree to
participate in this study. I have been given an opportunity to print this form.
Participation email:
MIT Robot Study
mit-robot-study@media.mit.edu <mit-robot-study@media.mit.edu>
To: siggioa@gmail.com
Mon, Apr 14, 2014 at 5:57 PM
Thank you for signing up for our study and helping to make our robots smarter!
PREPARATION:
1. Please start by making sure you have the Unity3D plugin installed in your browser:
http://unity3d.com/webplayer
2. Read through the instructions on how to play the game:
http://prg-robot-study.media.mit.edu/?action=instructions
3. Get familiar with the game, use this test level to navigate the space and test the controls:
http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854toghbq4ugopap9&action=test&type=DaL9K
PERFORM STUDY TASKS:
1. Complete the first task:
http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854t1oghbq4ugopap9&action=test&type=R57Zv
2. Complete the second task:
http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854t1oghbq4ugopap9&action=test&type=qNKK6
AFTER PLAYING GAME:
1. Fill out this questionnaire immediately after completing all rounds of all tasks. In the questionnaire, please
enter the same email address as used here.
https://www.surveymonkey.com/s/JNSQKW9
2. Once all data has been verified (questionnaire and game data) then your amazon gift card will be emailed to
you. This should happen within a few days of participation.
If you have any questions, please reply to this email.
Best regards
-Siggi
Instruction pages:
Game Instructions
You will be controlling a human avatar in an environment that also has a robot. The robot's purpose is to provide assistance to
you but it doesn't always know your goals.
Please NEVER use your browser's back, forward, refresh/reload or simply re-enter the game url into the url entry, once a
game has started loading. If you want to exit a game or reload it, please close the browser tab or window where the game is
currently playing and follow the link in your email again.
Perception
When moving around the rooms, objects will fade in and fade out as they become visible to you. If you want to see more of the
space you need to turn/move around to look.
Objects
There are a few different objects in the environment that can be picked up and used. One of the tasks is to assemble an engine
so all of the items and tools are relevant to that task.
Note: If you want to put down an object, it needs to be returned to the same table it was picked up from.
Engine parts
The engine block and the air filter can be picked up. They can also be added to the engine base if standing in front of it when
putting them down.
Engine base: This is the site of
an unassembled engine.
Engine block
Engine air filter
Tools
These can be picked up and applied to the engine only if there is a tool-tip visible above the engine indicating the tool in
question. Once the tool has been applied to the engine the tool-tip disappears, indicating that the tool isn't needed anymore.
Yellow screwdriver
Wrench
Red screwdriver NOTE: From
certain angles, the red handle
isn't very visible and only the
metal part can be clearly seen.
Actions
You can move around the rooms by either using your mouse or keyboard.
The available actions are:
- Move forward (Keyboard: 'Up arrow' or clicking on forward arrow): Moves forward if it is possible (nothing is blocking)
- Turn left (Keyboard 'Left arrow' or clicking on left arrow): Rotates to the left.
- Turn right (Keyboard 'Right arrow' or clicking on right arrow): Rotates to the right.
- Action (Keyboard 'Space bar' or clicking on the item in front of the character): Performs the following actions:
  - Pick up: If you aren't holding anything, and you are standing in front of an item, then you will pick it up
  - Put down: If you are holding something and standing in front of the table where you picked it up, you will put it down
  - Apply to engine: If you are holding the right item or tool for the engine when standing in front of it, that item or tool can be applied to the engine.
Game processing
Please note that sometimes the game needs to process information. During that time you cannot take any actions. If you notice
that you can't move your character, please check to see if the icon in the top left corner of the game is saying "Please wait" or
"Proceed".
Keyboard navigation
For keyboard navigation to work, the game area of the webpage needs to be selected. This can be accomplished by simply
clicking with the mouse anywhere within the Unity game.
Use the arrow keys for navigating forward or turning left and right. Use space bar to perform actions.
Mouse navigation
The character can be controlled with the mouse as well; there are small "arrow" icons underneath the character which can be clicked to navigate. When the character is in front of a table with an object, the engine base, or a valid put-down location while holding an object, a mouse click on that location will perform the appropriate action (pick up, put down, or apply).
page 2 of 3
Engine assembly
The engine base can be assembled into a complete engine if the right items are placed on it and the appropriate tools are used.
The following sequence is needed (a small illustrative sketch in code follows the list):
1. Pick up an engine block and place it onto the engine base
2. Pick up the yellow screwdriver and apply it to the engine base
3. Pick up another engine block and place it onto the engine base
4. Pick up the wrench and apply it to the engine base
5. Pick up the air filter and place it onto the engine base
6. Pick up the red screwdriver and apply it to the engine base
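Viewed as data, the sequence is a fixed ordered checklist; a minimal Python sketch (labels and names assumed, not taken from the study software) is:

    # Assumed labels for the ordered parts and tools required by the engine base.
    ASSEMBLY_SEQUENCE = [
        "engine block",
        "yellow screwdriver",
        "engine block",
        "wrench",
        "air filter",
        "red screwdriver",
    ]

    def next_needed(applied):
        # Returns the next required item, or None once all six steps are done.
        if applied != ASSEMBLY_SEQUENCE[:len(applied)]:
            raise ValueError("applied items deviate from the required order")
        if len(applied) == len(ASSEMBLY_SEQUENCE):
            return None
        return ASSEMBLY_SEQUENCE[len(applied)]

    print(next_needed(["engine block"]))  # yellow screwdriver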
This diagram will be provided next to the game window when playing the game.
[Engine assembly diagram]
Game screens
This is the screen that you will see immediately after the game has loaded and before the level has finished loading. NOTE: If this screen stays for more than about a minute, then a problem might have occurred, and you should close the window and re-click your game link.

Once the level successfully loads, the game should look something like this (depending on which level you are joining).

Once the game finishes (either successfully or not) you will see this screen. In white letters you will see whether you succeeded or not and what to do next.
page 3 of 3
Intermediate "rounds" page (three rounds in the first task, four in the second):
Your Task
Your task is to assemble the engine as fast as you can. The robot will attempt to assist you.
Please finish ALL of the following 3 rounds of this task.
1. Round nr. 1
2. Round nr. 2
3. Round nr. 3
Game screen for task 1:
Your Task
Your task is to assemble the engine as fast as you can. The robot will attempt to assist you.
Game goes here
If you can see the character but the level hasn't loaded and it says "Proceed" in the upper left corner for more than 20 seconds, then please close the browser tab and follow the link again (initial connection error).
Engine assembly instructions (refresher)
[Engine assembly diagram]
Game screen for task 2:
Your Task
This level will have two engine bases but only enough parts for assembling one.
Your task is to assemble a single engine as fast as you can. The robot will attempt to assist you. Here are the points for this level (the robot is unaware of the points; see the brief sketch after this list):
- Engine on the LEFT assembled is worth 15 points
- Engine on the RIGHT assembled is worth 20 points
Game goes here
If you can see the character but the level hasn't loaded and it says "Proceed" in the upper left corner for more than 20 seconds
then please close the browser tab and follow the link again (initial connection error).
Engine assembly instructions (refresher)
[Engine assembly diagram]