Mind-Theoretic Planning for Social Robots

by

Sigurður Örn Aðalgeirsson

MSc. Media Arts and Sciences, MIT (2009)
BSc. Electrical and Computer Engineering, University of Iceland (2007)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Media Arts and Sciences at the Massachusetts Institute of Technology, June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: (signature redacted) Program in Media Arts and Sciences, May 2, 2014

Certified by: (signature redacted) Dr. Cynthia Breazeal, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences, Thesis Supervisor

Accepted by: (signature redacted) Pattie Maes, Associate Academic Head, Program in Media Arts and Sciences

Mind-Theoretic Planning for Social Robots

by Sigurður Örn Aðalgeirsson

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on May 2, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Media Arts and Sciences

Abstract

As robots move out of factory floors and into human environments, out from safe barricaded workstations to operating in close proximity with people, they will increasingly be expected to understand and coordinate with basic aspects of human behavior. If they are to become useful and productive participants in human-robot teams, they will require effective methods of modeling their human counterparts in order to better coordinate and cooperate with them. Theory of Mind (ToM) is defined as people's ability to reason about others' behavior in terms of their internal states, such as beliefs and desires. Having a ToM allows an individual to understand the observed behavior of others based not only on directly observable perceptual features but also on an understanding of underlying mental states; this understanding allows the individual to anticipate and better react to future actions.

In this thesis a Mind-Theoretic Planning (MTP) system is presented which attempts to provide robots with some of the basic ToM abilities that people rely on for coordinating and interacting with others. The MTP system frames the problem of mind-theoretic reasoning as a planning problem with mixed observability. A predictive forward model of others' behavior is computed by creating a set of mental state situations (MSS), each composed of stacks of Markov Decision Process (MDP) models whose solutions provide approximations of anticipated rational actions and reactions of that agent. This forward model, in addition to a perceptual-range-limiting observation function, is combined into a Partially Observable MDP (POMDP). The presented MTP approach increases computational efficiency by taking advantage of approximation methods offered by a novel POMDP solver, B3RTDP, as well as by leveraging value functions at various levels of the MSS as heuristics for value functions at higher levels.

For the purpose of creating an efficient MTP system, a novel general-purpose online POMDP solver, B3RTDP, was developed. This planner extends the Real-Time Dynamic Programming (RTDP) approach to solving POMDPs.
By using a bounded value function representation, we are able to apply a novel approach to pruning the belief-action search graph and to maintain a Convergence Frontier, a novel mechanism for taking advantage of early action convergence, which can greatly improve RTDP's search time.

Lastly, an online video game was developed for the purpose of evaluating the MTP system by having people complete tasks in a virtual environment with a simulated robotic assistant. A human subject study was performed to assess both the objective behavioral differences in performance of the human-robot teams and the subjective attitudinal differences in how people perceived agents with varying MTP capabilities. We demonstrate that providing agents with mind-theoretic capabilities can significantly improve the efficiency of human-robot teamwork in certain domains, and we suggest that it may also positively influence humans' subjective perception of their robotic teammates.

Thesis Supervisor: Dr. Cynthia Breazeal
Title: Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Mind-Theoretic Planning for Social Robots

by Sigurður Örn Aðalgeirsson

The following people served as readers for this thesis:

Thesis Reader: (signature redacted) Dr. Julie Shah, Assistant Professor of Aeronautics and Astronautics, Massachusetts Institute of Technology

Thesis Reader: (signature redacted) Dr. Leila Takayama, Senior User Experience Researcher, Google[x]

Acknowledgments

I am so grateful for all the help and support I have received from friends, family, and colleagues during my time here at MIT.

First of all I would like to thank my advisor, Professor Cynthia Breazeal, for taking a chance on me all those years ago and admitting me into her awesome group. Since then, she has always supported me in all of the things I have wanted to explore and learn about, and she has provided me with very insightful advice and guidance in my research. She has trusted me to have a great level of autonomy as a researcher while simultaneously making herself available to discuss ideas and share wisdom when requested.

One couldn't ask for a better research group to be a part of, where people are as concerned with the success of their fellow grad students as they are with their own. I want to thank all of my predecessors in the Personal Robots Group for creating the legacy of the group and leaving a wonderfully collaborative and helpful group culture, which I have done my best to pass on. The great group atmosphere is in no little part due to our administrator Polly Guggenheim; she is the beating heart of this group and a surrogate mother to us all (watch out for her affectionate/bone-crushing jabs).

Of the students that were present for my "formative" years as a grad student, I particularly want to acknowledge my good friends Matthew Berlin and Jesse Gray for their friendship and general helpfulness with everything. Any ability I have to problem-solve and think abstractly about programming was acquired in constant attempts to keep up with those guys. My friend Philipp Robbel has been equally important in helping me through discussions about my research and comparing different approaches to solving problems as he was in helping me forget all of that and simply have fun and enjoy life. I have had countless discussions with my office mate Nick DePalma about both of our research projects, which have often been very productive and helpful in airing out and developing ideas.
Jin Joo Lee has also listened to me talk about my research more times than I care to count, and not only did she listen but she actually diligently proofread my proposal as well as this thesis and provided very insightful and valuable feedback and assistance. I am particularly thankful for her help with designing the human subject study for this thesis.

A benefit of working in a great research group at MIT is that it attracts incredibly intelligent and capable post-doctoral researchers to work with us. I have learned a lot by getting to work with both Sonia Chernova and Brad Knox. I am inspired by Brad's style of continuous learning and self-improvement. I feel that he applies his incredibly critical and rigorous academic thinking equally to his research and his personal life. I aspire to attain his wisdom and critical thought and can only hope that I will handle it with the same level of casual humility and humor that he has.

I would like to thank my general examination committee as well as my thesis committee. Professors Rosalind Picard and Rebecca Saxe spent a significant amount of their limited time to help me understand the relevant literature and hone in on my final dissertation topic while serving on my general examination committee. My research was influenced by Leila Takayama's work well before I asked her to serve on my thesis committee. I have attended several workshops she has organized and read her papers; in fact, the evaluation task of my master's thesis was adapted from the one she used in her dissertation work. Conversations with her were absolutely invaluable to the development of my thesis and in particular the evaluation part of the work. No less than her actual contributions to the thesis work, her personal support throughout the process has been just incredible.

About six years ago, when I took Professor Brian Williams's class on Cognitive Robotics, one of his grad students, Julie Shah, gave a great guest lecture in the class. That was the first time I was thoroughly impressed with her work and devotion to creating systems to support human-robot teamwork, but far from the last. I was incredibly excited when she agreed to be on my committee, and her guidance and support with the more computational parts of my thesis were invaluable.

I thank my Icelandic friends Hössi and Siggi Pétur (my namesake) for helping me take my mind off my research with various shenanigans and for helping me not forget the mother tongue. My dear girlfriend and love Nancy deserves much more praise than I could write down here, both for her direct involvement in both my M.S. and PhD dissertation work, and for her endless patience and support, especially in this last year.

Lastly, I am so grateful for my wonderful family for all of their love and support throughout the years. Mamma, Villi pabbi and Alli pabbi have spent my whole lifetime preparing me, motivating me and inspiring me to do whatever I want, and I am so very thankful for all that they have given me. The same goes for my grandparents and all of my siblings; it is a rare privilege to have such a wonderful family to lean on and learn from. Ég elska ykkur öll! (I love you all!)

My research has been funded by the Media Lab Consortia and by both the MURI8 and MURI6 grants from the Office of Naval Research.

Contents

Abstract

1 Introduction
1.1 Motivations
1.2 A Mind-Theoretic Robot Planning System
1.2.1 Proposed System
1.3 Research Questions
1.4 Overview of This Document

2 Autonomous Planning
2.1 Introduction
2.2 Background
2.2.1 Classical Planning
2.2.2 Decision Theoretic Planning
2.3 POMDP Planning Algorithms
2.4 Real-Time Dynamic Programming
2.4.1 RTDP
2.4.2 Extensions to RTDP
2.5 Belief Branch and Bound Real-Time Dynamic Programming
2.5.1 RTDP-Bel
2.5.2 Bounded Belief Value Function
2.5.3 Calculating Action Selection Convergence
2.5.4 Convergence Frontier
2.5.5 Belief Branch and Bound RTDP
2.6 Results
2.6.1 Rocksample
2.6.2 Tag
2.7 Discussion and Future Work
2.7.1 Discussion of Results
2.7.2 Future Work

3 Mind Theoretic Reasoning
3.1 Introduction
3.1.1 Mind-Theoretic Planning
3.1.2 Overview of Chapter
3.2 Background
3.2.1 Theory of Mind
3.2.2 Internal Representation of ToM
3.2.3 False Beliefs
3.2.4 Mental State
3.2.5 Knowledge Representation
3.2.6 Bayesian Networks
3.3 Overview of Related Research
3.3.1 ToM for Humanoid Robots
3.3.2 Polyscheme and ACT-R/E
3.3.3 ToM Modeling Using Markov Random Fields
3.3.4 Plan Recognition in Belief-Space
3.3.5 Inferring Beliefs using Bayesian Plan Inversion
3.3.6 Game-Theoretic Recursive Reasoning
3.3.7 Fluency and Shared Mental Models
3.3.8 Perspective Taking and Planning with Beliefs
3.3.9 Belief Space Planning for Sidekicks
3.4 Earlier Approaches to Problem
3.4.1 Deterministic Mind Theoretic Planning
3.4.2 The Belief Action Graph
3.5 Mind Theoretic Planning
3.5.1 Definitions of Base States and Actions
3.5.2 Types of Mental States
3.5.3 Inputs to the Mind Theoretic Planner
3.5.4 Mental State as Enumeration of Goals and False Beliefs
3.5.5 Action Prediction
3.5.6 POMDP Problem Formulation
3.5.7 Putting it All Together
3.5.8 Demonstrative Examples

4 Evaluation of Mind Theoretic Reasoning
4.1 Simulator
4.1.1 Different Simulators
4.1.2 On-line Video Game for User Studies
4.2 Human Subject Study
4.2.1 Hypotheses
4.2.2 Experimental Design
4.2.3 Metrics
4.2.4 Exclusion Criteria
4.3 Study Results
4.3.1 Statistical Analysis Methods
4.3.2 Behavioral Data
4.3.3 Attitudinal Data
4.4 Discussion
4.4.1 Task Efficiency
4.4.2 Team Fluency
4.4.3 Attitudinal Data
4.4.4 Informal Indications from Open-Ended Responses
4.4.5 Funny Comments

5 Conclusion
5.1 Thesis Contributions
5.2 Recommended Future Work
5.2.1 Planning with personal preferences
5.2.2 Planning for more agents
5.2.3 State-space abstractions and factoring domains into MTP and Non-MTP
5.2.4 Follow-up user study

Appendices

A Existing Planning Algorithms
A.1 Various Planning Algorithms

B Study Material
B.1 Questionnaire
B.2 Study Pages

References

List of Figures

2-1 (a) Demonstrates how the state tree can be traversed by selecting actions and transition links to successor states according to the transition function T(s, a, s'). (b) Shows how traversing the belief tree is similar to traversing the state tree, except that when an action is taken in a belief b we use equation 2.4 to determine the "belief transition probability" to the successor beliefs, through observation probabilities, which can be calculated with equation 2.3.

2-2 Demonstrates how the transition function for the discounted Tiger POMDP is transformed into a Goal POMDP. From (Bonet & Geffner, 2009).

2-3

2-4 Shows the Q boundaries for two example actions. The value of the true Q*(a) is uniformly distributed between the bounds for both actions.

2-5 In addition to the Q*(a') distributions, the probability function Pr(q < g) is plotted. This function always evaluates to the probability mass of the Q(a') function that exists between q and QH(a'), which for uniform distributions is a piecewise linear function of a particular shape.

2-6 Finally this figure shows the function whose integral is our quantity of interest, Pr(Q*(a) < Q*(a')). This integral will always simply be the sum of rectangle and triangle areas for two uniform Q distributions.
2-7 Demonstrates how action choice can converge over a belief, creating effectively a frontier of reachable successive beliefs with associated probabilities. This effect can be taken advantage of to shorten planning.

2-8 Shows the ADR of B3RTDP in the RockSample_7_8 domain. The algorithm was run with D = 10 and α = 0.75, and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs.

2-9 Shows the ADR of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D. ADR is plotted with error bars showing 95% confidence intervals calculated from 20 runs of the algorithms.

2-10 Shows the convergence time of B3RTDP in the Tag domain as a function of the action pruning parameter α and discretization D. ADR is plotted with error bars showing a 95% confidence interval calculated from 20 runs of the algorithms. We can see that the convergence time of B3RTDP increases both with higher discretization as well as with a higher requirement of action convergence before pruning. This is an intuitive result, as the algorithm also garners more ADR from the domain in those scenarios.

2-11 Shows the ADR of B3RTDP in the Tag domain. The algorithm was run with D = 15 and α = 0.65, and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs. We can see that B3RTDP converges at around 370 ms; at that time SARSOP is far from convergence but has started to produce very good ADR values.

3-1 Demonstrates the recursive nature of ToM. Adapted from (Dunbar, 2005).

3-2 A classic false belief situation involving the characters Sally and Anne (image courtesy of (Frith, 1989)).

3-3 Shows the inter-connectivity of the MRF network. Each agent is represented with an observation vector y_i and a state vector x_i. From (Butterfield et al., 2009).

3-4 An example of the visual stimuli that participants would be shown in (Baker et al., 2009). An agent travels along the dotted line and pauses at the points marked with a (+) sign. At that point participants are asked to rate how likely the agent is to have each marked goal (A, B, and C). In (b) the participants were asked to retroactively guess the goal of the agent at a previous timepoint.

3-5 An example of an expanded tree structure of pay-off matrices as perceived by P when choosing whether to execute action A or B. With probability p_1, P thinks that Q views the payoff matrix in a certain way (represented by a different matrix) and with probability (1 - p_1) in another. This recursion continues until no more knowledge exists, in which case a real value is attributed to each action (0.5 in the uninformed case) (Durfee, 1999).

3-6 A view of the ABB RoboStudio Virtual Environment during task execution. The human controls the white robot (Nikolaidis & Shah, 2013).

3-7 Shows the trajectory by which data flows and decisions get made within the "Self-as-Simulator" architecture. Demonstrates how the robot's own behavior generation mechanisms are used to reason about observed behavior of others as well as to perform perspective taking. From (Breazeal et al., 2009).

3-8 A robot hypothesizes about how the mental state of a human observer would get updated if it would proceed to take a certain sequence of motor actions. This becomes a search through the space of motor actions that the robot could take, which gets terminated when a sequence is found that achieves the robot's goals as well as the mental state goals about other agents (Gray & Breazeal, 2012).

3-9 (a) Shows the game with a simulated human player in the upper right corner and a POMCoP sidekick in the lower left. (b) Shows comparative results of steps taken to achieve the goal between a QMDP planner and different configurations of POMCoP.

3-10 (a) A demonstrative example of a simple BAG. (b) An example of a BAG instantiated for an actual navigation problem.

3-11 Shows how a goal situation is composed of stacks of predictive MDP models for each agent. Each model contains a value function, a transition function and a resulting policy. Each transition function takes into account predictions from lower level policies for the actions of the other agent. Value functions are initialized by heuristics that are extracted from the optimal state values from the level below; this speeds up planning significantly. Since every level of the stack depends on lower levels, special care needs to be taken for the lowest level. In the MTP system, we have chosen to solve a joint centralized planning problem as if one central entity was controlling both agents to optimally achieve both of their goals, since this is a good and optimistic approximation of perfect collaborative behavior.

3-12 This figure shows an example MTP action prediction system with two goal hypotheses and a single false belief hypothesis (in addition to the true belief), resulting in four distinct mental situations. An action prediction from any level can be queried by following the enumerated steps in section 3.5.5 under "Mental State Situation".

3-13 Shows two states that would produce the same observation because they are indistinguishable within the perceptual field of the robot (which is limited by field of view and line of sight in this domain). If the other agent would move slightly into the white space in the state on the right, then the observation function would produce a different observation.

3-14 Shows an example scenario where the robot knows that the human's goal is to exit the room, and the robot also knows the location of the exit. The robot is uncertain about whether the human knows where the exit is located and therefore creates two belief hypotheses, one representing the true state and another representing a false alternative.

3-15 Shows the stage where the human is one action away from learning what the true state of the world is.

3-16 When the human agent has turned left, it will expect to see either the exit or the wall depending on its mental state. In the false belief state where it expects to see the exit, a special observation is also expected, since in this state the agent should be able to perceive the error of its false belief. Since this observation will actually never be emitted by the MTP system, the belief update will attribute zero probability to any state in the subsequent POMDP belief where that observation was expected.

3-17 Shows the complete MTP system on an example problem with two goal hypotheses and one false belief hypothesis. On top sits a POMDP with an observation function that produces perceptually limited observations with the addition of specialized false belief observations when appropriate. The POMDP transition function is deterministic in the action effects of the robot but uses the lower level mental state situations to predict which actions the other agent is likely to take and models the effects of those stochastically. The figure also shows how the value functions at lower levels serve as initialization heuristics for higher-level value functions. The value function of the highest level of the robot's predictive stack is used as an initialization to the QMDP heuristic for the POMDP value function.

3-18 Shows the configuration of the environment of this example. Gray areas represent obstacles.

3-19 Simulation at t = 0. The robot can perceive the human but is initially uncertain of their mental state.

3-20 Simulation at t = 11. The robot has moved out of the human's way but did not see if they moved east or west. The robot maintains both hypotheses, with slightly higher probability of the false belief since the human did not immediately turn east at t = 1.

3-21 Simulation at t = 20. The robot now expects that if the human originally held the false belief they would have perceived its error by now, and it is confident that they currently hold the true belief. The robot expects that if the human originally held the false belief then they should pass by the robot's visual field in the next few time steps. Notice how the robot has been waiting for that perception to happen or not happen (indicating that the human held the true belief the whole time) before it proceeds to move to its goal.

3-22 Simulation at t = 28. Finally, once the robot has seen the human pass by, it proceeds to follow them, and subsequently both agents accomplish their goals. Even if the robot had not seen the human pass by, eventually, once it became sure enough, it would proceed in exactly the same manner, thinking that the human had originally held the true belief.

3-23 (a) and (c) refer to the simulation at t = 0; (b) and (d) refer to the simulation at t = 1. We can see that initially the robot is completely uncertain about the mental state of the human, but after seeing that the human took no action, it assumes that goals 5 and 6 are most likely (the ones that the robot is currently blocking access to).

3-24 (a) and (c) refer to the simulation at t = 13; (b) and (d) refer to the simulation at t = 18. Once the robot has retreated to allow the human to pursue the two goal hypotheses that are most likely, it chooses to make one goal accessible. If the human does not pursue that goal given the opportunity, the robot assumes that the other one is more likely and creates a passageway for it to pursue.

3-25 (a) and (c) refer to the simulation at t = 28; (b) and (d) refer to the simulation at t = 36. If the human moves away while the robot cannot perceive it, the robot uses its different goal hypotheses to predict the most likely location of the human. The robot then proceeds to find the human in the most likely locations. In this case, its first guess was correct, and by using our Mind-Theoretic Reasoning techniques it was able to find the human immediately.

4-1 A snapshot from the USARSim simulator which was used to simulate urban search and rescue problems. On the right, we can see a probabilistic Carmen occupancy map created by using only simulated laser scans and simulated odometry from the robot.

4-2 Screenshots from the (a) Java Monkey Engine and (b) Unity3D simulators that were developed to evaluate our robot systems.

4-3 Shows the video game from the perspective of the human user. The character can navigate to adjacent grids, interact with objects on tables, and push boxes around. The view of the world is always limited to what the character can currently perceive, so the user needs to rotate the character and move it around to perceive more of the world.

4-4 These are the objects in the world that can be picked up and applied to the engine base. To fully assemble an engine, two engine blocks and an air filter need to be placed on the engine base. After placing each item, a tool needs to be applied before the next item can be placed. The type of tool needed is visualized with a small hovering tool-tip over the engine.

4-5 This figure demonstrates the sequence of items and tools that need to be brought to and applied to an engine base to successfully assemble it.

4-6 This figure demonstrates the connectedness of different components to create the on-line study environment used in the evaluation. A user signs up for the study either by going to the sign-up website (possibly because they received a recruitment email or saw an advertisement) or because they are an Amazon MTurk user and accepted to play. The web server assigns the user to a study condition and finds a game server that is not currently busy and assigns it to the user. The game server initializes the Unity game and puppeteers the robot character according to the condition of the study assigned to the user. The game state, character actions, and environment are synchronized between the game server and the user's browser using a free cloud multiplayer service called Photon. Study data is comprised of both the behavioral data in the game logs as well as the post-game questionnaire data provided by the Survey Monkey service.

4-7 Shows the mean task completion times of all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

4-8 Shows the mean number of actions taken by both agents over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

4-9 Mean action intervals of participants across all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

4-10 Shows the mean rates of change in action intervals averaged over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

4-11 Shows the mean participant functional delay ratios across rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

4-12 Shows the mean rates of change in participant functional delays averaged over all rounds of each task (* p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

List of Tables

2.1 Results from RockSample_7_8 for SARSOP and B3RTDP in various different configurations. We can see that B3RTDP confidently outperforms SARSOP both in reward obtained from the domain and in convergence time (* means that the algorithm had not converged but was stopped to evaluate the policy). The ADR value is provided with a 95% confidence interval.

2.2 Results from Tag for SARSOP and B3RTDP in various different configurations (* means that the algorithm had not converged but was stopped to evaluate the policy). ADR stands for Adjusted Discounted Reward and is displayed with 95% confidence bounds.

4.1 Task 1 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.2 Task 2 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.3 Task 1 completion time in milliseconds.
4.4 Task 2 completion time in milliseconds.
4.5 Task 1 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.6 Task 2 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.7 Task 1 total number of actions.
4.8 Task 2 total number of actions.
4.9 Task 1 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.10 Task 2 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.11 Task 1 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.12 Task 2 human action interval rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.13 Task 1 human action interval rate of change.
4.14 Task 2 human action interval rate of change.
4.15 Task 1 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.16 Task 2 human functional delay ratio. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.17 Task 1 human functional delay ratio.
4.18 Task 2 human functional delay ratio.
4.19 Task 1 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
4.20 Task 2 human functional delay rate of change. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
List of Algorithms

1 Convergence Frontier
2 The Belief Branch and Bound RTDP (B3RTDP) algorithm
3 Subroutines of the B3RTDP algorithm
4 Pseudocode for a simplified deterministic approach to MTP
5 Pseudocode for constructing the transition functions T_h/r, at levels l > 0, of the goal situations. Note that the subscript h/r denotes that this works for either agent's predictive stack, but the order of h/r versus r/h marks that if one refers to the human then the other refers to the robot and vice versa.
6 Pseudocode for constructing the transition function T_POMDP for the MTP POMDP.
7 Pseudocode for the GRAPHPLAN algorithm. The algorithm operates in two steps, graph creation and plan extraction. The EXTRACTPLAN algorithm is a level-by-level backward chaining search algorithm that can make efficient use of mutex relations within the graph.
8 The RTDP algorithm interleaves planning with execution to find the optimal value function over the relevant states relatively quickly.
9 The BRTDP algorithm. Uses a bounded value function and a search heuristic that is driven by information gain.
10 The RTDP-Bel algorithm from (Geffner & Bonet, 1998) and (Bonet & Geffner, 2009).

Chapter 1

Introduction

1.1 Motivations

As robots move out of factory floors and into human environments, out from safe barricaded workstations to operating in close proximity with people, they will increasingly be expected to understand and be able to coordinate with basic aspects of human behavior. If they are to really become useful and productive participants in human-robot teams, they will require good methods of modeling their human counterparts in order to be able to better coordinate and cooperate with them.

One of the ways that people reason about other people's behavior is to explain and describe the actions of others in terms of their presumed intentions or goals (Blakemore & Decety, 2001). In effect, people are constantly performing plan recognition when observing the actions of others to better understand the underlying reasons for their behavior. It has been argued that our human obsession with teleological interpretation (the explanation of phenomena by their ultimate purpose) of actions stems from its importance both for on-line action prediction and for social learning, enabling an agent to learn about new affordances of actions or resources (Csibra & Gergely, 2007).
This ability is observed at a very early age; an experiment was conducted which showed that infants begin reasoning about observed actions in a goal-directed way near the time that they gain control over those actions themselves, or as early as nine months old (Woodward, 2009).

When people work closely with each other they often naturally reach a high level of fluent coordination. Explicit planning, verbal communication, training, and experience can speed this process but are not necessary, as people can achieve fluent joint action using non-verbal behaviors such as attention cueing, autonomous mimicry, mentalizing and anticipation, and more (Sebanz et al., 2006). In this thesis we plan to endow robots with some of the core capacities required for achieving such autonomously coordinated behavior by developing methods that begin to provide robots with a basic understanding of how actions are predicated on beliefs and directed towards goals, and how to leverage that information for planning.

A great example of the utility of mental state reasoning, and how we employ it naturally, is that of the piano mover. A piano mover's task is very complicated, not only because of the challenging geometric problem of moving a large, heavy, and yet delicate object through a cluttered environment, but because of the intense need for tight coordination with other movers. With many hands on the piano, each mover needs to constantly reason about the environment as they perceive it, about which parts of it the others might be able to perceive and what they are attempting to accomplish, predict their behavior and react to changes, and make sure they all move in synchrony. The fact that the human brain can perform such challenging tasks with relative ease supports the theory that this computation is so important for our survival that it has been allotted dedicated neural circuitry in the brain (mirror neurons) (Gallese & Goldman, 1998).

The piano mover's challenge is one where coordination and mental reasoning are particularly important, but even the challenge of navigating crowded environments can present interesting problems. When we plan to move through such environments, we need to avoid running into others, which calls for constant mental state reasoning and behavior anticipation. For example, when passing someone on a sidewalk we intuitively give individuals who we believe are unaware of us a wider berth than others, since we can anticipate no cooperation from them (as they are unaware of us) and in fact they might do something completely unexpected like turn around or change direction at any moment.

Lastly, mental state reasoning can be used as an extra "sensor" in the environment. If one truly understands how behavior is based on beliefs and desires formed about the environment, additional information about the environment can be indirectly inferred from the behavior of others. An example of this is a bicyclist who cannot see whether they can cross an intersection because of a visual obstruction. If that cyclist were to see that another person, whose line of sight is not obstructed, moves into the intersection with a baby stroller, then the cyclist can draw the conclusion that no traffic is oncoming, since that person would not rationally take those actions if they believed that a car was coming and their goal was to cross the street unharmed.
In light of the apparent importance of mental state reasoning and its useful application to human coordination and interaction, we believe that it is a crucial capability for a robot to have if it is to be a useful teammate in human-robot teams. This thesis presents an approach that makes progress on solving this problem.

1.2 A Mind-Theoretic Robot Planning System

A robotic agent that can reason about people in terms of their mental states needs to possess a range of different capabilities. The following is a "wish list" of capabilities for a mind-theoretic agent.

Action prediction: A mind-theoretic agent needs to have a way of predicting which actions the other agent will take. This prediction will serve to help the robot anticipate future changes in the environment, which can help the robot avoid damages that might occur as well as exploit opportunities that get created.

Means-end understanding of action: In addition to being able to predict future actions of the other agent, the robot needs to understand on what basis that prediction was built and how it should be adapted as the environment or the agent's beliefs change.

Recursive mental state reasoning: Mental state reasoning is inherently a recursive process. As we think about the thoughts and beliefs of others, do we take into account that they might have beliefs about us? And if so, how deeply should we recurse? Should we reason about the beliefs of the other concerning the degree to which they think we are reasoning about their beliefs?

Information seeking behavior: An agent reasoning about the mental states of others and using them to predict their behavior needs to understand the value of information. Specifically, it should understand that there is value in being certain about which mental state the other has, so it may improve the prediction of their actions. Such an agent would need to understand that it might be worth taking a few actions simply to gather information before starting to take task-directed actions.

Hedging against uncertainty: Once the agent has the capacity to anticipate possible different future configurations of the environment, as caused by different predictions of others' actions based on their mental states, the agent should be able to hedge against uncertainty in those predictions.

Planning to manipulate mental states: Lastly, once agents are reasoning about each other's beliefs and using them to predict actions and plan, they should be able to have goals that relate to mental states. There are plenty of examples of this type of behavior in games and sports. Usually in games, the goal is to achieve some task for your team while trying to block the other team from achieving theirs. This often involves selectively sharing information with your team members while denying it to the opposing team. A mental-state goal might therefore be to take actions in a way that maximally informs your team about your intentions and the state of the world while simultaneously trying to hide them from the opposing team, causing them to have false beliefs about the world, which might disadvantage them in the game.

1.2.1 Proposed System

In this thesis we present a system that attempts to accomplish most of the aforementioned desired features for a mind-theoretic agent system. The Mind-Theoretic Planner (MTP) presented is able to create predictions of others' actions based on what they believe about the environment and what goals they have.
These predictions in turn take into account what actions the other agent expects of other agents and how they will react. These predictions are leveraged to create a predictive forward model of the world, which includes how the world state will be affected as a function of the mental states of the others. This forward model is used in conjunction with a perceptual observation model of the world to produce mind-theoretic agent behavior that seeks to better understand the mental states of others in order to better predict state changes and to produce better plans.

1.3 Research Questions

The research questions we are interested in investigating with this work concern both the technical feasibility of making a mind-theoretic planning system and the objective and subjective difference that system would have on the performance and attitudes of human-robot teams.

1. How can mind-theoretic reasoning abilities be encoded in the formalisms of autonomous planning and reasoning?

2. To calculate near real-time solutions to realistic problems in Human-Robot Interaction (HRI) domains, what kinds of approximations and heuristics can be applied in the MTP method?

3. What are the strengths and weaknesses of the MTP approach with respect to the task performance and subjective experience of a mixed human-agent team?

The following questions concern comparisons between a person's subjective attitudes towards the competencies of their autonomous teammate when it uses the MTP system as opposed to other autonomous systems.

4. Will teaming with an MTP system, as opposed to a different autonomous system, influence how people judge an autonomous partner in a human-agent interaction?

5. More specifically, how will an MTP system influence perceptions of the partner as being engaging, likeable, capable, intelligent, or team-oriented?

1.4 Overview of This Document

This thesis is mainly separated into three different chapters. Chapter 2 presents a novel Partially Observable Markov Decision Process planner called B3RTDP. This planner is capable of producing approximate solutions to belief planning problems, which is crucial to the implementation of the mind-theoretic system presented in this thesis. Chapter 3 presents the implementation of the MTP system, and lastly Chapter 4 covers the human subject study that was performed to evaluate the MTP system.

Chapter 1: Introduction

The current chapter presents motivations for the mind-theoretic reasoning problem along with the research questions for the presented work and this overview.

Chapter 2: Autonomous Planning

This chapter introduces some of the basic methods and representations within the autonomous planning literature that are relevant to this thesis, specifically the Real-Time Dynamic Programming method (Barto et al., 1995) and several extensions that solve Markov Decision Process planning problems. Belief planning is introduced and a few solution techniques discussed. The novel B3RTDP algorithm is presented and all of its approximation methods detailed. Lastly, B3RTDP is evaluated on known benchmark problems against a state-of-the-art planning algorithm.

Chapter 3: Mind-Theoretic Reasoning

In this chapter, the motivations and some of the psychological concepts underlying mental-state reasoning are presented. Existing work on problems in this domain is discussed and compared, and our own previous approaches to solving the problem are briefly presented. The MTP system is presented in detail with all of its internal mechanisms and representations explained.
Lastly, some demonstrative examples of the MTP system in action are presented.

Chapter 4: Evaluation of Mind-Theoretic Reasoning

This chapter introduces an on-line video game and simulator that was developed to evaluate the MTP system. A user study is presented in which people interact with a virtual agent in a task-oriented environment. The experimental conditions of the study were different methods of controlling the agent, two of which were the MTP system in different configurations. Results from the study are presented and discussed.

Chapter 5: Conclusions

This chapter discusses the impact of this work along with the research contributions of this thesis. Some future directions for the work are also discussed.

Chapter 2

Autonomous Planning

2.1 Introduction

In this section I will introduce the research field of autonomous planning as well as provide background on some important algorithms and representations that are frequently used in that field. I will introduce classical planning, both plan-space and state-space methods, as well as decision-theoretic planning based on Markov Decision Processes, both directly observable (MDPs) and partially observable (POMDPs). I will introduce a novel POMDP solver algorithm named Belief Branch and Bound Real-Time Dynamic Programming (B3RTDP), which extends an existing RTDP approach. B3RTDP employs a bounded value function representation and uses a novel pruning technique, Confidence Action Pruning, which allows pruning actions from the search tree before they become provably dominated by other actions, and a Convergence Frontier, which serves to speed up search time. I present empirical results showing that B3RTDP can outperform a state-of-the-art planning system named SARSOP both in convergence time and in total adjusted discounted reward on two well-known POMDP benchmark problems.

2.2 Background

Planning is a term that is commonly used in many fields related to AI and Robotics. In the most general case, it describes a process of determining which action should be taken at a given time to transform the world state in a way that is conducive to satisfying a goal criterion. This process can be implemented in many different ways, and we will discuss a few of them in this section.

States and actions are two of the most important representational concepts in autonomous planning. Most planning systems deal with state either explicitly or implicitly. It serves as a description of the environment at a given time, and how it is encoded can have an incredible effect on the complexity of the planning task (Ghallab et al., 2004). Generally, only the features of the environment that are significant to solving the planning task should be encoded in the state, but equally no important feature can be missing. State-space "explosion" is a term that has been used for when the set of states needed to solve a planning problem grows so large that it either becomes unmanageable by the planning algorithm or even too large to store in memory. Several planning approaches attempt to use factored state-spaces (Boutilier et al., 2000) or state-space abstractions (Dearden & Boutilier, 1997) to reduce the negative effect that a large state-space has on the planning process.

The ultimate goal of planning is to figure out the best action to take at any given time, where actions can be thought of as operators on the states.
Each action a is defined by a function that transforms a state s into a different state s', or even into a set of states in the case of probabilistic planning. This function is generally referred to as the transition function.

2.2.1 Classical Planning

Classical Planning (CP) systems generally refer to ones that solve a restricted planning problem that satisfies the following simplifying assumptions (Ghallab et al., 2004):

1. The state-space is finite and discretely represented as a set of literals that hold true in a state.
2. States are always fully observable.
3. The environment is static and deterministic (only planning actions can affect state, and they do so in a predictable and deterministic manner).
4. Actions are instantaneous in time.
5. Actions are described by three sets of literals:
   (a) A set of precondition literals that need to hold true in the current state for the action to be applicable.
   (b) A set of "add-effects" which will be added to the current state literals should the action be taken.
   (c) A set of "del-effects" which will be removed from the current state literals should the action be taken.
6. A plan consists of a linearly-ordered sequence of actions.

Classical planning domains are fully described by a tuple (A, P), where P represents a set of lifted logical predicates and A a set of lifted actions. The term lifted here refers to an un-instantiated variable; for example, IsHolding(?human, ?object) is a lifted logical predicate which could be grounded over a set of humans and objects to produce a list of grounded literals like this one: IsHolding(John, redball). CP planning problems are fully described by a tuple (A, P, O, I, G): their domain, a set of grounded world objects O, an initial state I, and a logical goal state description G (a conjunctive set of grounded predicates that need to hold true in a state to qualify). A standard has been developed for representing CP domains and problems, as well as various planning features and functionality, in the Planning Domain Definition Language (PDDL) (Ghallab et al., 1998).
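To make the preceding formalism concrete, the following is a small illustrative sketch (written in Python rather than PDDL) of the literal and action machinery described above. The IsHolding predicate comes from the text; the remaining predicates, the objects, and the Pickup action are invented here purely for illustration.

```python
from itertools import product

# A grounded literal is represented as a tuple: (predicate, arg1, arg2, ...).
def ground_literals(predicate, *object_sets):
    """Ground a lifted predicate over sets of world objects."""
    return [(predicate, *args) for args in product(*object_sets)]

humans = ["John", "Mary"]
things = ["redball", "bluecup"]
print(ground_literals("IsHolding", humans, things))
# [('IsHolding', 'John', 'redball'), ('IsHolding', 'John', 'bluecup'), ...]

# A grounded action in the precondition / add-effect / del-effect style
# described above (the Pickup action itself is a made-up example).
pickup = {
    "name": "Pickup(John, redball)",
    "pre": {("HandEmpty", "John"), ("Reachable", "John", "redball")},
    "add": {("IsHolding", "John", "redball")},
    "del": {("HandEmpty", "John")},
}

def applicable(state, action):
    return action["pre"] <= state           # all preconditions hold in the state

def apply_action(state, action):
    return (state - action["del"]) | action["add"]

state = {("HandEmpty", "John"), ("Reachable", "John", "redball")}
if applicable(state, pickup):
    state = apply_action(state, pickup)
print(("IsHolding", "John", "redball") in state)   # True
```

Grounding the lifted predicate over the sets of world objects is exactly the enumeration step that can make the state-space grow quickly, as discussed in the previous section.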
Plan-space planning

CP systems generally fall into one of two categories: state-space planners and plan-space planners. Originally, plan-space systems such as Partial-Order Planners (Barrett & Weld, 1994) and Hierarchical Task Network (HTN) planners (Nau et al., 2003) were considered faster and more efficient than their state-space counterparts. Plan-space planners search over a space of partial plans (note that this space is infinitely large), constantly attempting to refine the current plan and resolve flaws and unsatisfied constraints. These planners tend to produce short plans very quickly but cannot guarantee their optimality. They can naturally take advantage of hierarchical structures in the action space (such as macro actions composed of a fixed sequence of regular actions) and have therefore been favored by the game development community for a long time.

Examples of state-space planners

With the advent of Graphplan (Blum, 1995) and subsequent systems that used its compact and efficient graph-based state representation, state-space planners became more scalable to realistic domains. The GraphPlan algorithm operates on a structure called the planning graph, which consists of sequential temporal layers of state variables and action variables in the following fashion: {S_0, A_0, S_1, A_1, S_2, A_2, ...}. The algorithm proceeds in two interleaved steps until termination: graph expansion and plan extraction (see Algorithm 7). After each graph expansion, the state literals at the current level are inspected for mutex relations. These basically represent constraints on which state literals can truly "co-exist" at any given time. Once we find all the goal literals non-mutexed in a level, we can attempt to extract a plan. The EXTRACTPLAN subroutine is implemented as a backward-chaining search algorithm that takes advantage of the pre-calculated mutex relations and uses heuristics available from the graph.

Modern CP systems are often based on heuristic search, and their performance is critically impacted by the quality and efficiency of that heuristic calculation. Domain-independent heuristics are desirable but often hard to come by. A popular usage of the GraphPlan algorithm is to produce exactly such a heuristic. The first level where the goal literals appear non-mutexed is the theoretical minimum length at which the goal might be reached by a valid plan; it can therefore serve as an admissible heuristic for another search algorithm. A "tighter" heuristic can be extracted if one also performs the EXTRACTPLAN routine but omits the del-effects of actions (significantly reducing the number of mutexes in the graph and therefore simplifying the planning problem). Several heuristic-based search algorithms take advantage of these heuristics, such as the FF planner (Hoffmann & Nebel, 2011) and FastDownward (Helmert, 2006).

2.2.2 Decision Theoretic Planning

Markov Decision Processes

Markov Decision Processes (MDPs) (Bellman, 1957a) have been a favored problem representation amongst AI researchers for a long time, especially in the Reinforcement Learning and Probabilistic Planning communities. The model is based on the assumption that all relevant information for solving a planning problem can be encoded in a state, and furthermore that no element of the domain dynamics (transition probabilities, rewards, etc.) should ever depend on any state history other than the single previous state. This is referred to as the Markovian Property. A fully specified MDP is represented by:

* S: A finite set of states.
* A: A finite set of actions.
* T(s, a, s'): A transition function that defines the transition distributions Pr(s' | s, a).
* C(s, a) or R(s, a): A cost or reward function.
* γ: A discount factor.

The representation of an MDP can either encode action costs as positive quantities or action rewards as positive quantities; this creates no significant difference except for whether to use a min or max operator in the Bellman value update calculation (equation 2.2) and how to interpret upper and lower bounds of value functions. For reward-based domains the upper bound is the "greedy" boundary which should be initialized to an admissible heuristic, whereas for cost-based domains the opposite is true. In this document we will always refer to cost-based domains unless otherwise specified.

The transition function T encodes the dynamics of the environment. It specifies how each action will (possibly stochastically) transition the current state to the next one. The cost/reward function C/R can be used to encode the goal of the planning task or, more generally, to specify states and/or actions that are desirable. A solution to an MDP is called an action policy and is often denoted by the symbol π. It represents a mapping between a state and an action that should be taken in that state, π : S → A.
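As a concrete point of reference for the equations that follow, here is a minimal sketch of how such a cost-based MDP might be written down in code. The three-state corridor domain, its action names, and its cost values are made up for illustration and are not one of the domains used in this thesis.

```python
# A tiny cost-based MDP (S, A, T, C, gamma) in the notation used above.
# The corridor domain below is a made-up example for illustration only.
S = ["s0", "s1", "goal"]
A = ["east", "west"]
GAMMA = 0.95

# T[(s, a)] is a dict mapping successor states s' to Pr(s' | s, a).
T = {
    ("s0", "east"): {"s1": 0.9, "s0": 0.1},
    ("s0", "west"): {"s0": 1.0},
    ("s1", "east"): {"goal": 0.9, "s1": 0.1},
    ("s1", "west"): {"s0": 0.9, "s1": 0.1},
    ("goal", "east"): {"goal": 1.0},   # absorbing goal state
    ("goal", "west"): {"goal": 1.0},
}

# C[(s, a)]: strictly positive action costs, zero once the goal is reached.
C = {(s, a): (0.0 if s == "goal" else 1.0) for s in S for a in A}

# A policy is simply a mapping from states to actions.
policy = {"s0": "east", "s1": "east", "goal": "east"}
```

The Bellman equations below are defined directly over these quantities.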
A solution to an MDP is called an action policy and is often denoted by the symbol π. It represents a mapping between a state and the action that should be taken in that state, π : S → A. The optimal action policy is often referred to as π*, and it is the policy that maximizes the expected future reward (equivalently, minimizes the expected future cost) of acting in the domain. The following equations are called the Bellman equations; they recursively define the value of a state as a function of the cost of greedily choosing an action and an expectation over successor state values. The solution to these equations can be found via dynamic programming.

Q(s, a) := C(s, a) + γ Σ_{s'∈S} T(s, a, s') V(s')    (2.1)

V(s) := min_{a∈A} Q(s, a)    (2.2)

(For reward-based domains, equation 2.1 would use R instead of C and equation 2.2 would use a max operator instead of min.) The optimal action policy can therefore be defined as always choosing the action with the lowest Q value: π*(s) := argmin_{a∈A} Q(s, a).

Partially Observable Markov Decision Processes

MDPs still play an important role in the autonomous planning literature and are a sufficient representation for a large host of problems. An important limitation of MDPs is that they can only represent uncertainty in the transition function; they assume that the planning agent can always perfectly sense its state. Partially Observable Markov Decision Processes (POMDPs) (Kaelbling et al., 1998) are an extension of the MDP model that can represent both transitional and observational uncertainty (this can be thought of as "actuator noise" and "sensor noise"). To fully specify a POMDP model, in addition to the aforementioned MDP parameters S, A, T, C/R and γ, we need to specify:

* O: A finite set of observations.
* Ω(a, s', o): An observation function dictating probability distributions over observations given an action and a resulting state, Pr(o | a, s').

In a partially observable domain, the agent cannot directly observe its state and therefore needs to perform state estimation simultaneously with planning. A planning agent represents its uncertainty about its current state as a probability distribution over possible states, which we refer to as a belief b, where b(s) := Pr(s | b). Equation 2.3 shows how state estimation updates the belief b given that action a is taken and observation o is received. To calculate b_a^o(s'), we sum over all possible transitions from any state s with non-zero probability in b to s', weighed by T(s, a, s') and b(s). That sum is then multiplied by the probability of observing o when taking a and landing in s', Ω(a, s', o). Finally, this quantity is divided by a normalization factor that can be calculated with equation 2.4, but which is not needed if we perform the belief update for all observations o ∈ O, since it will simply be the factor that makes all of the numerators of equation 2.3 sum to one.

b_a^o(s') = Ω(a, s', o) Σ_{s∈S} T(s, a, s') b(s) / Pr(o | b, a)    (2.3)

Pr(o | b, a) = Σ_{s'∈S} Ω(a, s', o) Σ_{s∈S} T(s, a, s') b(s)    (2.4)
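The state-estimation step of equations 2.3 and 2.4 can be written compactly. The following sketch assumes a dictionary-based model layout (T[s][a][s'] and Omega[a][s'][o]) that is purely illustrative and is not the representation used in the thesis implementation.

```python
def belief_update(b, a, o, states, T, Omega):
    """One step of POMDP state estimation (equations 2.3 and 2.4).

    b      : dict state -> probability (the current belief)
    T      : T[s][a][s'] = Pr(s' | s, a)
    Omega  : Omega[a][s'][o] = Pr(o | a, s')
    Returns (b_new, Pr(o | b, a)); names and data layout are assumptions.
    """
    unnormalized = {}
    for s_next in states:
        # Sum over all transitions into s_next weighted by the current belief ...
        mass = sum(b[s] * T[s][a].get(s_next, 0.0) for s in b)
        # ... and weigh by the probability of actually observing o there.
        unnormalized[s_next] = Omega[a][s_next].get(o, 0.0) * mass
    prob_o = sum(unnormalized.values())            # equation 2.4
    if prob_o == 0.0:
        return {}, 0.0                              # o is impossible under (b, a)
    b_new = {s: p / prob_o for s, p in unnormalized.items() if p > 0.0}
    return b_new, prob_o
```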
2.3 POMDP Planning Algorithms

Algorithms exist to solve for the optimal value function of POMDPs exactly (Sondik, 1971), but this is rarely a good idea, as the belief space is infinitely large and only a small subset of it is relevant to the planning problem. A discovery by Sondik about the value function's piecewise-linear and convex properties led to a popular value function representation which consists of maintaining a set of |S|-dimensional α vectors, each representing a hyperplane over the state space. This representation is named after its author, and many algorithms, both exact and approximate, take advantage of it.

The Heuristic Search Value Iteration (HSVI) algorithm (Smith & Simmons, 2004) extends the heuristic search ideas of (Geffner & Bonet, 1998) and combines them with the Sondik value function representation, but maintains both upper and lower bound estimates. It employs an observation-sampling technique akin to that of BRTDP (introduced below), aimed at minimizing excess uncertainty. The Point-Based Value Iteration (PBVI) algorithm does not use a bounded value function, but introduced the concept of maintaining a finite set of relevant belief points and only performing value updates for those sampled beliefs. Lastly, the SARSOP algorithm combines the techniques of HSVI and PBVI, performing updates to its bounded Sondik-style value function over a finite set of sampled belief points. Additionally, SARSOP uses a novel observation-sampling method that applies a simple learning technique to predict which beliefs should have higher sampling probabilities in order to close the value gap at the initial belief faster.

2.4 Real-Time Dynamic Programming

Several methods exist for solving MDPs, most of which are based on learning the value function V : S → R over the state space by solving the Bellman equations (equations 2.1 and 2.2) using dynamic programming. Algorithms such as Value Iteration (Bellman, 1957b) and Policy Iteration (Howard, 1960) are successive approximation methods that solve this problem either explicitly, by initializing V arbitrarily and then iteratively performing Bellman value updates, or implicitly, by iteratively improving an arbitrarily initialized policy.

2.4.1 RTDP

Real-Time Dynamic Programming (RTDP) (Barto et al., 1995) is a family of algorithms that perform asynchronous updates to the value function by combining simulated greedy action selection with Bellman updates. This approach leads to more focused updates in the part of the state space that is relevant to the optimal action policy. The RTDP algorithm in its native form operates on a special case of MDPs called Stochastic Shortest Path (SSP) problems, the subset of MDPs that have absorbing terminal goal states and strictly positive action costs. Even though these constraints seem like they would limit RTDP's applicability to general MDPs, they really do not, as there are methods to transform general MDPs into SSP MDPs. Bonet and Geffner have shown how this can be done for POMDPs, and it is trivial to apply their method to MDPs (Bonet & Geffner, 2009).

The basic RTDP algorithm (Algorithm 8) repeatedly simulates acting on what is currently the best estimate of the optimal greedy policy, while simultaneously updating state values, until it either finds the goal state or hits a depth limit. The value function can be initialized arbitrarily, but if it is initialized to an admissible heuristic then it can be shown that, under the assumption that the goal is reachable with positive probability from every state, repeated trials of RTDP will yield the optimal value function V(s) = V*(s) for all relevant states.
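The core RTDP trial is short enough to sketch. The version below is a generic illustration of one trial on an SSP MDP (greedy action selection plus asynchronous Bellman updates), not the pseudocode of Algorithm 8 or the thesis code; the dictionary-based model layout and function names are assumptions for the example.

```python
import random

def rtdp_trial(s0, goal, actions, T, C, V, max_depth=200):
    """One RTDP trial: simulate greedy actions and update values along the way.

    V is a dict state -> value, ideally initialized to an admissible lower-bound
    heuristic; it is updated in place. T[s][a] is a list of (s', prob) pairs,
    C[s][a] a strictly positive cost, and actions(s) returns the applicable actions.
    """
    s, depth = s0, 0
    while s not in goal and depth < max_depth:
        # Greedy action according to current value estimates (cost-based: min).
        def q(a):
            return C[s][a] + sum(p * V.get(sp, 0.0) for sp, p in T[s][a])
        a_best = min(actions(s), key=q)
        V[s] = q(a_best)                           # asynchronous Bellman update
        # Simulate the chosen action by sampling from the transition function.
        succ, probs = zip(*T[s][a_best])
        s = random.choices(succ, weights=probs, k=1)[0]
        depth += 1
```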
2.4.2 Extensions to RTDP

Several extensions have been proposed to improve the RTDP algorithm. These generally attempt to improve convergence time by focusing updates on "fruitful" parts of the state space.

Labeling solved states

Labeled RTDP (LRTDP) (Bonet & Geffner, 2003) introduced a method of marking states as solved once their values, and those of all states reachable from them, have converged. Solved states are subsequently avoided in RTDP's exploration of the state space. This effectively improves convergence time by creating a conceptual boundary of solved states, which initially contains only the goal states, and successively expanding that boundary out toward the initial state with every iteration of the algorithm. Each iteration of the algorithm is in turn more efficient, as it needs to travel a shorter distance to meet the boundary.

Bounding the value function

In the approaches discussed above, a single value function is maintained. If initialized to an admissible heuristic for the problem, this value function represents a lower bound on the true optimal value function. The following extensions of RTDP also maintain an upper bound, which should be initialized such that:

V_L(s) ≤ V*(s) ≤ V_H(s) for all s ∈ S, with V_L(s) = V_H(s) = 0 for all s ∈ G.

We also define the following specializations of equation 2.1:

Q_L(s, a) := C(s, a) + γ Σ_{s'∈S} T(s, a, s') V_L(s')

Q_H(s, a) := C(s, a) + γ Σ_{s'∈S} T(s, a, s') V_H(s')

Bounded RTDP (BRTDP) (McMahan et al., 2005) takes advantage of this bounded value function representation in two ways: for trial termination and for search guidance (Algorithm 9). Each trial follows greedy action selection according to the lower value function boundary (as in RTDP) but performs value updates on both boundaries. A trial is terminated when the expected value gap of the next states becomes smaller than a certain fraction (defined by the parameter τ) of the value gap of the initial state s_1. This achieves much the same effect as the LRTDP termination criterion, except that the boundary of "solved" states is dynamic and moves further away from the initial state as its value becomes more certain. Lastly, BRTDP samples the next state to explore not from the transition function but from the distribution created by the value gaps at the successor states weighed by their transition probabilities. This amounts to a search heuristic that seeks out uncertainty in the value function in order to quickly "collapse" its boundaries onto the optimal value V*.

Several other algorithms have been proposed to improve RTDP using a bounded value function, each providing different search exploration heuristics and trial termination criteria. These generally attempt to focus exploration onto states that are likely to contribute the most toward learning the optimal value function (Sanner et al., 2009) and (Smith & Simmons, 2006).
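The gap-weighted successor sampling used by BRTDP (and, later in this chapter, by B3RTDP in belief space) can be sketched as follows. This is an illustrative rendering of the idea under an assumed dictionary model layout, not an excerpt from either planner.

```python
import random

def sample_next_state_by_gap(s, a, T, VL, VH):
    """BRTDP-style successor sampling (sketch): instead of sampling s' from
    T(s, a, .), weigh each successor by its value gap VH(s') - VL(s') times its
    transition probability, so exploration is drawn toward uncertain values.
    Returns (next_state, expected_gap), or (None, 0.0) when all successor
    values have already converged."""
    weights = {sp: p * (VH[sp] - VL[sp]) for sp, p in T[s][a]}
    expected_gap = sum(weights.values())
    if expected_gap <= 0.0:
        return None, 0.0
    succ = list(weights)
    s_next = random.choices(succ, weights=[weights[sp] for sp in succ], k=1)[0]
    return s_next, expected_gap

# A BRTDP trial would terminate once expected_gap < (VH[s0] - VL[s0]) / tau.
```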
2.5 Belief Branch and Bound Real-Time Dynamic Programming

In this section we present a novel planning algorithm for POMDPs called Belief Branch and Bound Real-Time Dynamic Programming (B3RTDP). The algorithm extends the RTDP-Bel system with a bounded value function, Branch and Bound style search tree pruning, and influences from existing extensions to the original RTDP algorithm for MDPs.

2.5.1 RTDP-Bel

Geffner and Bonet proposed an extension of the RTDP algorithm (introduced in Section 2.4.1), called RTDP-Bel, which is able to handle partially observable domains (Geffner & Bonet, 1998). The two most significant differences between RTDP and RTDP-Bel are the type of graph the algorithms search over and how they store the value function. RTDP searches a graph composed of states, a selection of actions, and stochastic transitions into other states. RTDP-Bel searches a graph of beliefs, a selection of actions, and stochastic transitions, through observations and their associated probabilities, into other beliefs. The two graph structures are depicted in figure 2-1.

The second and more significant contribution of this work is the value function representation. Implementing a value function over beliefs is much more challenging than over states: even for domains with a finite number of states, the belief space is infinitely large, as the probability mass of a belief can be distributed arbitrarily over the finite states. One of the most commonly used representations (attributed to Sondik (Sondik, 1971)) maintains a set of α vectors, each of dimension |S|, where V(b) = max_α α · b. RTDP-Bel instead uses a function-approximation scheme which discretizes beliefs and stores their values in a hash-table keyed by the discretized belief:

b̂(s) := ceil(D · b(s))    (2.5)

An RTDP-Bel value function is therefore defined as:

V(b) := h(b) if b̂ ∉ HASHTABLE, and HASHTABLE(b̂) otherwise    (2.6)

The calculation of action Q values is adjusted to use the discretized value function:

Q(b, a) := c(b, a) + γ Σ_{o∈O} Pr(o | b, a) V(b_a^o)    (2.7)

where Pr(o | b, a) is calculated with equation 2.4 and b_a^o with equation 2.3.

Figure 2-1: (a) Demonstrates how the state tree can be traversed by selecting actions and transition links to successor states according to the transition function T(s, a, s'). (b) Shows how traversing the belief tree is similar to traversing the state tree, except that when an action is taken in a belief b we use equation 2.4 to determine the "belief transition probability" to the successor beliefs, through observation probabilities, which can be calculated with equation 2.3.

Transformation from General POMDP to Goal POMDP

As previously discussed in section 2.4.1, the RTDP algorithm operates on so-called Stochastic Shortest Path MDP problems. Similarly, RTDP-Bel operates on Goal POMDPs, which satisfy the following criteria (only listing differences from general POMDPs for brevity):

1. All action costs are strictly positive.
2. A set of goal states exists that are:
   (a) absorbing and terminal.
   (b) fully observable: upon entering them a unique goal observation is emitted.

These constraints seem at first sight quite restrictive and would threaten to limit RTDP-Bel to only a small subset of all possible POMDP problems. This is not the case in practice, as a general POMDP can be transformed into a Goal POMDP without much effort. The transformation is explained in detail in (Bonet & Geffner, 2009) but basically proceeds as follows (a code sketch follows the list):

1. The highest positive reward in the discounted POMDP is identified and a constant C is defined as C := max_{s,a} R(s, a) + 1.
2. A "fake" goal state g is constructed along with a new unique observation o_g.
3. The observation function is modified to include the goal observation: Ω(a, g, o_g) := 1.
4. A cost function is defined such that C(s, a) := C − R(s, a) and C(g, a) := 0.
5. A new transition function is formulated that introduces a probabilistic transition to the goal state from any state with probability 1 − γ, where γ is the discount factor of the discounted POMDP: T_new(s, a, ·) := γ · T_old(s, a, ·), with the addition that T_new(s, a, g) := 1 − γ.
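The transformation can be sketched directly from the five steps above. The function below is an illustrative rendering under an assumed dictionary-based model layout (lists S, A, O; T[s][a][s'], R[s][a], Omega[a][s'][o]); the placeholder names for the artificial goal state and observation are invented for the example.

```python
def to_goal_pomdp(S, A, O, T, R, Omega, gamma):
    """Sketch of the discounted-to-goal POMDP transformation described above
    (Bonet & Geffner, 2009). Not the thesis implementation."""
    g, o_g = "__goal__", "__goal_obs__"             # fake goal state and observation
    Cmax = max(R[s][a] for s in S for a in A) + 1   # step 1
    S2, O2 = S + [g], O + [o_g]
    # Step 4: costs replace rewards; the goal state is free.
    C = {s: {a: Cmax - R[s][a] for a in A} for s in S}
    C[g] = {a: 0.0 for a in A}
    # Step 5: scale old transitions by gamma and leak 1 - gamma into the goal.
    T2 = {s: {a: {**{sp: gamma * p for sp, p in T[s][a].items()}, g: 1 - gamma}
              for a in A} for s in S}
    T2[g] = {a: {g: 1.0} for a in A}
    # Step 3: the goal state emits its unique observation with probability one.
    Omega2 = {a: {**Omega[a], g: {o_g: 1.0}} for a in A}
    return S2, A, O2, T2, C, Omega2
```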
Figure 2-2: Demonstrates how the transition function for the discounted Tiger POMDP is transformed into a Goal POMDP. From (Bonet & Geffner, 2009).

2.5.2 Bounded Belief Value Function

As was previously mentioned, B3RTDP maintains a bounded value function over beliefs. In the following discussion, I will refer to these boundaries as separate value functions V_L(b) and V_H(b), but the implementation actually stores a two-dimensional vector of values for each discretized belief in the hash-table (see equation 2.6) and so only requires a single lookup operation to retrieve both boundaries.

It is desirable to initialize the lower bound of the value function to an admissible heuristic for the planning problem; this requirement must hold for an optimality guarantee. It is easy to convince oneself why: imagine that at belief b_1, all successor beliefs of taking the optimal action a_1 have been improperly assigned inadmissible heuristic values (that are too high). This results in an artificially high Q(b_1, a_1) value, causing the search algorithm to choose a_2 instead. If we assume that the successor beliefs of action a_2 were initialized to an admissible heuristic, then after some number of iterations we can expect to have learned the true value of Q*(b_1, a_2); but if that value is still lower than the (incorrect) Q(b_1, a_1), then we will never actually choose to explore a_1 and never learn that it is in fact the optimal action to take.

It is equally desirable to initialize the upper boundary V_H(b) to a value that overestimates the cost of getting to the goal. This becomes evident when we discuss the bounding nature of the B3RTDP algorithm, namely that it prunes actions whose values are dominated with a certain confidence threshold. For that calculation, we require that the upper boundary be an admissible upper heuristic for the problem.

In the previous section, we discussed how to transform discounted POMDPs into Goal POMDPs. This transformation is particularly useful because when working with Goal POMDPs we get theoretical bounds on the value function for free. Namely, no action has zero or negative cost, which means that no belief other than a goal belief has zero or negative value; the heuristic h_L(b) = 0 is therefore an admissible (although not very informative) lower heuristic. Similarly, since the domain is effectively discounted (through the artificial transition with probability 1 − γ to the artificial goal state), the absolute worst action policy an agent could follow would be to repeatedly take the action with the highest cost. Because of the discounted nature of the domain, this (bad) policy has a finite expected value of max_{s,a} C(s, a)/(1 − γ), which provides a theoretical upper bound. This value bound is called the Blind Action value and was introduced by (Hauskrecht, 2000).
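These "free" bounds are trivial to compute once a domain has been transformed. The sketch below illustrates the idea under an assumed cost-table layout (C[s][a]) and is not the thesis implementation.

```python
def blind_value_bounds(C, gamma):
    """Zero is an admissible lower bound for a goal POMDP with positive costs,
    and repeatedly paying the largest action cost under the original discount
    bounds any policy's value from above (the Blind Action value, Hauskrecht 2000).
    C[s][a] holds the (strictly positive) costs of non-goal states."""
    worst_cost = max(c for costs in C.values() for c in costs.values())
    upper = worst_cost / (1.0 - gamma)   # Blind Action value
    lower = 0.0                          # admissible but uninformative
    return lower, upper
```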
Even though it is important that both the lower and upper heuristics be admissible, and it is nice that we can have guaranteed "blind" admissible values, it is still desirable that the heuristics be informative and provide tighter value bounds. Uninformed heuristics can require exhaustive exploration to learn which parts of the belief space are "fruitful". An informed heuristic can quickly guide a search algorithm toward a solution that can then be incrementally improved. This is especially important for RTDP-based algorithms, as they effectively search for regions of good values and then propagate those values back to the initial search node through Bellman updates. What this means in practice is that the sooner the algorithm finds the "good" part of the belief space, the quicker it will converge. Since the lower boundary of the value function is used for exploration, the informative quality of the lower heuristic plays a much bigger role in the convergence time of the algorithm.

Domain-dependent heuristics can be hand-coded by domain experts, which is a nice option to have when the user of the system possesses a lot of domain knowledge that can be leveraged to solve the problem. In lieu of good domain-dependent heuristics, we require methods to extract domain-independent heuristics that are guaranteed to be admissible.

There are several ways to obtain admissible lower heuristics from a problem domain. The most common is the QMDP approach, introduced by (Littman et al., 1995). This approach ignores the observation model of the POMDP and simply solves the MDP problem defined by the specified transition and cost/reward models. This problem can be solved with any MDP solver, much faster than the full POMDP, and provides a heuristic value for each state in the domain, which can be combined into a belief heuristic as h(b) := Σ_{s∈S} h(s) b(s). The QMDP heuristic provides an admissible lower bound for the POMDP problem, as it solves a strictly easier, fully observable MDP problem. It tends to work well in many different domains, but it generally fails to provide a good heuristic in information-seeking domains, since it completely ignores the observation function.

Another admissible domain-independent heuristic is the Fast Informed Bound (FIB), developed by Hauskrecht (Hauskrecht, 2000). This heuristic is provably more informative than QMDP, as it incorporates the observation model of the domain. The added informativeness comes at a higher cost of O(|A|²|S|²|O|), whereas QMDP has the complexity of regular Value Iteration, or O(|A||S|²).

There also exist methods to improve the upper bound of the value function. One method is to use a point-based approximate POMDP solver to approximately solve the actual POMDP problem, as long as the approximation is strictly over-estimating (Ross et al., 2008). This is a very costly operation but can be worth it for certain domains.

The B3RTDP algorithm can be initialized with any of the above (or other) heuristic strategies, but we have empirically found its performance satisfactory when initialized with the QMDP heuristic as the lower bound and the blind worst-action policy heuristic as the upper bound.
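A sketch of how such a QMDP lower bound can be computed is shown below; the fixed-sweep value-iteration loop and dictionary model layout are illustrative assumptions rather than the thesis implementation.

```python
def qmdp_heuristic(S, A, T, C, gamma, sweeps=500):
    """Sketch of the QMDP lower-bound heuristic (Littman et al., 1995): ignore the
    observation model, solve the underlying MDP by value iteration, and combine
    the state values under the current belief."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):                                   # plain value iteration
        V = {s: min(C[s][a] + gamma * sum(p * V[sp] for sp, p in T[s][a].items())
                    for a in A)
             for s in S}

    def h(belief):                                            # h(b) = sum_s h(s) b(s)
        return sum(V[s] * p for s, p in belief.items())
    return h
```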
2.5.3 Calculating Action Selection Convergence

A central component of the B3RTDP algorithm is determining when the search tree over beliefs and actions can be pruned. This pruning leads to faster subsequent Bellman value updates over the belief in question, less memory to store the search tree, and quicker greedy policy calculation.

Traditionally, Branch and Bound algorithms only prune search nodes when their bounds are provably dominated by the bounds of a different search node (therefore making traversal of the node a sub-optimal choice). To mitigate the large belief space that POMDP models can generate, we experiment with pruning actions before they are actually provably dominated. Figure 2-3 demonstrates how the search algorithm might find itself in a position where it can be quite certain (within some threshold) that one action dominates another, but where it could still cost many search iterations to make absolutely sure.

We use the assumption that the true value of a belief is uniformly distributed between the upper and lower bounds of its value. Note that the following calculations could be carried out for other types of value distributions, but the uniform distribution is both easy to evaluate in closed form and appropriate, since we have no evidence to support a differently shaped distribution.

Figure 2-3: (a) Demonstrates example Q boundaries for three actions; action a_3 is the greedy action to choose and it seems likely that its Q value will dominate that of a_2. (b) Beliefs and actions relevant to the figure on the left.

For readability we introduce the following shorthand notation:

Q_min^H := min(Q_H(a), Q_H(a'))
Q_max^L := max(Q_L(a), Q_L(a'))
G(a) := Q_H(a) − Q_L(a)

If we assume that the true value of a belief is uniformly distributed between its bounds, then the action Q values are also uniformly distributed and the following holds:

Pr(q = Q*(a)) = 1/G(a) if Q_L(a) ≤ q ≤ Q_H(a), and 0 otherwise    (2.8)

Pr(q < Q*(a') | q) = 1 if q < Q_L(a'); (Q_H(a') − q)/G(a') if Q_L(a') ≤ q ≤ Q_H(a'); 0 if q > Q_H(a')    (2.9)

We are interested in the probability that one action's Q value is lower than another's at any given time during the runtime of the algorithm, so that we can determine whether or not to discard the latter. This is a crucial operation in the bounding portion of the algorithm. The quantity of interest is therefore Pr(Q*(a) < Q*(a')) when deciding whether we can prune action a' because its Q value is dominated by that of a. We start by noticing two special cases that can be determined quickly, in which the quantity of interest is either 0 or 1. If Q_H(a) ≤ Q_L(a'), then all of the probability mass of Q*(a) is guaranteed to lie below that of Q*(a'), and therefore Pr(Q*(a) < Q*(a')) = 1. By the same rationale, Pr(Q*(a) < Q*(a')) = 0 when Q_H(a') ≤ Q_L(a).

Pr(Q*(a) < Q*(a')) = 1 if Q_H(a) ≤ Q_L(a'); 0 if Q_H(a') ≤ Q_L(a); equation 2.11 otherwise    (2.10)

In the remaining case we carry out the following calculation:

Pr(Q*(a) < Q*(a'))
  = ∫ Pr(q < Q*(a') | q) Pr(q = Q*(a)) dq        {law of total probability}
  = (1/G(a)) ∫ from Q_L(a) to Q_H(a) of Pr(q < Q*(a') | q) dq        {apply equation 2.8}
  = (1/G(a)) [ ∫ from Q_L(a) to Q_max^L of 1 dq + ∫ from Q_max^L to Q_min^H of (Q_H(a') − q)/G(a') dq ]        {apply equation 2.9 and split the integral into intervals}
  = (Q_max^L − Q_L(a))/G(a) + [ Q_H(a')(Q_min^H − Q_max^L) − ((Q_min^H)² − (Q_max^L)²)/2 ] / (G(a) G(a'))        (2.11)

This calculation can also be demonstrated graphically for a deeper intuitive understanding. Figures 2-4, 2-5 and 2-6 demonstrate how it equates to finding the area under rectangles and triangles, and can therefore be carried out quite efficiently.
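The closed form of equations 2.10 and 2.11 reduces to a few arithmetic operations. The sketch below is an illustrative implementation of that calculation under the uniform-distribution assumption; the function name and argument layout are invented for the example, and the guards for already-converged (zero-gap) bounds are an addition beyond the derivation above.

```python
def prob_dominates(QL_a, QH_a, QL_b, QH_b):
    """Pr(Q*(a) < Q*(a')) when both true Q values are assumed uniform between
    their bounds (equations 2.10 and 2.11); a' is the candidate for pruning.
    (QL_a, QH_a) bound action a and (QL_b, QH_b) bound action a'."""
    if QH_a <= QL_b:                      # all of a's mass lies below a' -> certainly dominated
        return 1.0
    if QH_b <= QL_a:                      # all of a's mass lies above a'
        return 0.0
    G_a, G_b = QH_a - QL_a, QH_b - QL_b
    if G_a == 0.0:                        # a's value is known exactly
        return (QH_b - QL_a) / G_b
    if G_b == 0.0:                        # a''s value is known exactly
        return (QL_b - QL_a) / G_a
    q_min_h = min(QH_a, QH_b)
    q_max_l = max(QL_a, QL_b)
    # Region where q < QL(a'): a' is certainly larger there (the rectangle).
    certain = (q_max_l - QL_a) / G_a
    # Overlap region: integrate the linear tail of a' (the triangle/trapezoid).
    overlap = (QH_b * (q_min_h - q_max_l) - (q_min_h**2 - q_max_l**2) / 2.0) / (G_a * G_b)
    return certain + overlap

# Prune a' whenever prob_dominates(...) exceeds the threshold alpha.
```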
Figure 2-4: Shows the Q boundaries for two example actions. The true value Q*(a) is uniformly distributed between the bounds for both actions.

Figure 2-5: In addition to the Q distributions, the probability function Pr(q < Q*(a') | q) is plotted. This function always evaluates to the probability mass of the Q(a') distribution that lies between q and Q_H(a'), which for uniform distributions is a piecewise linear function of a particular shape.

Figure 2-6: Finally, this figure shows the function whose integral is our quantity of interest, Pr(Q*(a) < Q*(a')). This integral will always simply be the sum of rectangle and triangle areas for two uniform Q distributions.

2.5.4 Convergence Frontier

The Convergence Frontier (CF) is a concept created to take advantage of early action convergence close to the initial search node. An intuitive understanding of it can be gained by considering the point at which the action choice for a belief has converged to a single action that dominates all others. At that point, actually simulating action selection is unnecessary, as it will always evaluate to this converged action. This can be the case for several successive beliefs. Figure 2-7 demonstrates how the action choice can converge over the initial belief as well as some of its successor beliefs, effectively extending the CF further out whenever the action policy over a belief within it converges. When planning for a given POMDP problem, the usefulness of the Convergence Frontier depends on the domain-dependent difficulty of choosing actions early on.

The CF is initialized to contain only the initial belief, with probability one. Whenever the action policy converges to one best action over any belief in the CF, that belief is removed and the successor beliefs of taking that action are added, with their respective observation probabilities weighted by the CF probability of the originating belief. Sometimes the value function converges over a belief before the action policy has converged. In this case, we simply remove the belief from the CF (effectively reducing the total CF probability sum from one), as the value function has been successfully learned at that node. This presents two separate termination criteria:

1. The total probability of the CF falls below a threshold, or
2. The total probability-weighted value gap of all beliefs in the CF falls below a threshold.

Figure 2-7: Demonstrates how action choice can converge over a belief, effectively creating a frontier of reachable successor beliefs with associated probabilities. This effect can be taken advantage of to shorten planning.

Pseudocode for the UPDATECF routine is provided in algorithm 1. The routine iterates through every belief b currently in the frontier; in line 5 it checks whether the value function has collapsed over b, and if so b is simply removed (which reduces the total probability of the frontier, since no successor beliefs are added for that node).
In line 10 the routine checks whether only one action is left to be taken for b (the action policy has converged); if so, b is removed from the frontier and all successor beliefs of taking the converged action from b are added, with their respective observation probabilities multiplied by the probability of b in the frontier.

The short SHOULDTERMINATECF routine is also defined in algorithm 1. It simply dictates that the algorithm should terminate when either of two conditions is satisfied. The first condition activates when the total probability of all beliefs in the frontier falls below a threshold φ. This means that when acting on the optimal policy, starting at the initial belief, it is sufficiently unlikely that any frontier belief is experienced. The second condition activates when the probability-weighted value gap of the frontier falls below the threshold ε; this means that the value function has been sufficiently learned over all the beliefs in the frontier.

Lastly, the SAMPLECF routine in algorithm 1 simply creates a probability distribution by normalizing the probability-weighted value gaps of the beliefs in the frontier. At line 33 this distribution is sampled and the corresponding belief returned.

Algorithm 1: Convergence Frontier
 1  UPDATECF(c : ConvergenceFrontier)
 2    foreach b ∈ c do
 4      // If the value of a belief in c becomes certain, we remove it
 5      if V_H(b) − V_L(b) < ε then c.REMOVE(b);
 9      // If the action policy has converged over the belief, we add all successor beliefs
10      else if SIZE(A(b)) = 1 then
12        a := PICKACTION(A(b));
13        foreach o ∈ O such that Pr(o | b, a) > 0 do
16          if b_a^o ∉ c then c.APPEND(b_a^o);
18          c.prob(b_a^o) := c.prob(b_a^o) + c.prob(b) · Pr(o | b, a);
20        c.REMOVE(b);
23  SHOULDTERMINATECF(c : ConvergenceFrontier)
24    // Terminate when either the total probability of the CF or its probability-weighted value gap falls below a threshold
25    return (Σ_{b∈c} c.prob(b) < φ) ∨ (Σ_{b∈c} c.prob(b)(V_H(b) − V_L(b)) < ε);
27  SAMPLECF(c : ConvergenceFrontier)
29    // Sample a belief from the frontier, seeking high value uncertainty
31    ∀b ∈ c, g(b) := c.prob(b)(V_H(b) − V_L(b));   G := Σ_b g(b);
33    return b' ~ g(·)/G;

2.5.5 Belief Branch and Bound RTDP

In the previous sections we introduced the relevant concepts that we now combine into a novel POMDP planning system called Belief Branch and Bound RTDP (B3RTDP). The algorithm extends the RTDP-Bel system (Geffner & Bonet, 1998) but uses a bounded value function. It follows a belief exploration sampling strategy similar to the one BRTDP (McMahan et al., 2005) uses for MDPs, adapted to operate on beliefs rather than states. This exploration heuristic chooses to expand the successor belief node that has the highest promise of reduction in value uncertainty, realizing a search algorithm that is motivated to "seek" areas where information about the value function can be gained. An interesting side-effect of this strategy is that the algorithm never visits beliefs whose values are known (V_L(b) = V_H(b)), such as the goal belief, because there is nothing to be gained by visiting them. Finally, B3RTDP is a Branch and Bound algorithm, meaning that it leverages its upper and lower bounds to determine when certain actions should be pruned out of the search tree. The B3RTDP system is described in detail in algorithms 2 and 3, but we will first informally describe its general execution.
Initially, the upper bound of the value function V_H is initialized to a Blind Policy value (Hauskrecht, 2000), which can generally be determined easily from the problem parameters (namely the discount factor γ and the reward/cost function R/C). The lower bound of the value function is initialized to an admissible search heuristic such as QMDP, which can be efficiently calculated by solving the MDP that underlies the POMDP in question while ignoring the observation model. The initial belief b_1 is added to the Convergence Frontier (CF, discussed in section 2.5.4) with probability one in line 5. Until convergence, as determined by the SHOULDTERMINATECF routine in algorithm 1, the B3RTDP algorithm samples a trial-initial belief b_T from the CF in line 8, performs a B3RTDPTRIAL, and finally updates the CF in line 12.

Each B3RTDPTRIAL in algorithm 2 initializes a stack of visited beliefs in line 19 and proceeds to execute a loop until either the maximum search depth is reached or the termination criterion (discussed below) is met. On every iteration we push the current belief onto the stack, and the currently "best" action according to the lower boundary Q_L is found in line 28. To find this action we need to calculate the Q_L values for all actions, so we might as well use them to perform a Bellman value update on the current belief in line 32 (we also perform a Bellman update for the upper bound of the value function in line 34). We then select the next successor belief to explore using the PICKNEXTBELIEF routine in line 38. Once the loop terminates, we perform Bellman updates on both value boundaries (lines 55 and 57) and action pruning (line 51) on all the beliefs visited in this trial, in reverse order. This both propagates improved value boundaries back to the initial belief and makes successive search iterations more efficient.

The PICKNEXTBELIEF routine in algorithm 3 starts by creating a vector containing the observation-probability-weighted value gaps of the successor beliefs of taking action a in line 3. The sum of the values in this vector is called G and is used both for normalization and for determining termination (line 8). If G, which is the expected value gap of the next belief to be experienced, is lower than a certain portion (defined by τ) of the value gap at the trial's initial belief, then the trial is terminated. Otherwise we sample from the normalized vector and return the associated belief.

Lastly, the PRUNEACTIONS routine in algorithm 3 simply iterates through the set of all actions available at the current belief, calculates the probability that each is dominated by the currently best action (line 19), and removes it from the set if that probability is higher than a threshold α.

Algorithm 2: The Belief Branch and Bound RTDP (B3RTDP) algorithm
 1  B3RTDP(b_1 : Belief)
 3    // Initialize the Convergence Frontier to b_1
 5    INITIALIZECF(c, b_1);
 6    while !SHOULDTERMINATECF(c) do
 8      b_T := SAMPLECF(c);
10      B3RTDPTRIAL(b_T);
12      UPDATECF(c);
14    return GREEDYUPPERVALUEPOLICY();
15  B3RTDPTRIAL(b_T : Belief)
17    // Maintain a stack of visited beliefs
19    visited.CLEAR();
21    b := b_T;
22    while (visited.SIZE() < MAX_depth) ∧ (b ≠ ∅) do
24      visited.PUSH(b);
26      // Pick the action greedily from the lower Q boundary
28      a := argmin_{a∈A_b} Q_L(b, a);
30      // Perform Bellman updates for both boundaries
32      V_L(b) := Q_L(b, a);
34      V_H(b) := min_{a'∈A_b} Q_H(b, a');
36      // Sample the next belief to explore (see alg. 3)
38      b := PICKNEXTBELIEF(b, b_T, a);
40    // Value-update and prune the visited beliefs in reverse order
41    while visited.SIZE() > 0 do
43      b := visited.POP();
45      // Pick the action greedily from the lower Q boundary
47      a := argmin_{a∈A_b} Q_L(b, a);
49      // Prune dominated actions (see alg. 3)
51      PRUNEACTIONS(b, a, A_b);
53      // Perform Bellman updates for both boundaries
55      V_L(b) := Q_L(b, a);
57      V_H(b) := min_{a'∈A_b} Q_H(b, a');
Algorithm 3: Subroutines of the B3RTDP algorithm
 1  PICKNEXTBELIEF(b, b_T : Belief, a : Action)
 3    ∀o ∈ O, g(b_a^o) := Pr(o | b, a)(V_H(b_a^o) − V_L(b_a^o));
 5    G := Σ_o g(b_a^o);
 7    // Terminate the search iteration when the expected value uncertainty at the current
      // transition is lower than some portion of the trial-initial belief's value uncertainty
 8    if G < (V_H(b_T) − V_L(b_T)) / τ then return ∅;
12    // Sample the next belief according to probability-weighted value uncertainty
14    return b' ~ g(·)/G;
15  PRUNEACTIONS(b : Belief, a_best : Action, A_b : ActionSet)
16    foreach a ∈ A_b, a ≠ a_best do
18      // If the probability that a_best dominates a is higher than α (eqn. 2.11), remove a
19      if Pr(Q*(b, a_best) < Q*(b, a)) > α then
21        A_b := A_b \ {a};

The following are the important parameters of B3RTDP, along with a discussion of their impact on the algorithm:

* D: Belief discretization factor. Determines how densely the belief space is clustered for value updates. If set too low, belief clustering might group together beliefs that should receive very different values and negatively impact the planning result. If set too high, the value hash-table will grow very large and many updates will be required to learn the true values of beliefs. Typical range: [5, 20]. See equation 2.5.
* α: Action convergence probability threshold. Determines when it is appropriate to prune an action from the selection at a given belief: if Q*(b, a_1) dominates Q*(b, a_2) with probability higher than α, then a_2 is pruned. Typical range: [0.65, 1]. See equation 2.11.
* ε: Minimum value gap. This threshold dictates that the search algorithm has converged when V_H(b_1) − V_L(b_1) < ε, or when the probability-weighted value gap of the Convergence Frontier beliefs is below ε. Typical range: [0.0001, 0.1].
* φ: Minimum Convergence Frontier probability. This provides a secondary termination criterion: when the total probability of the CF falls below φ, the algorithm terminates. Typical range: [0.0001, 0.01].
* τ: Trial termination ratio. Used to determine whether a search trial should be terminated: when the search arrives at a belief where the expected value gap of the successor beliefs is lower than a 1/τ fraction of the value gap at the trial's initial belief, the iteration is terminated and value updates are propagated back to the trial's initial belief. Typical range: [5, 100].

The parameters that have the biggest impact on the efficiency of B3RTDP are D and α. For all of the evaluations and the following discussion we use the values ε = 0.01, φ = 0.001 and τ = 10, and show results for varying values of D and α.

2.6 Results

In this section we present evaluation results for the B3RTDP POMDP planning algorithm on two well-known evaluation domains.
We have chosen to include evaluation results for a state-of-the-art POMDP planner called SARSOP (Kurniawati et al., 2008), a popular belief point-based heuristic search algorithm. The following points are good to keep in mind when comparing the performance of SARSOP and B3RTDP:

1. Because of the RTDP-Bel hash-table value function implementation, B3RTDP consumes significantly more memory than SARSOP.
2. SARSOP takes advantage of the factored structure of the problem domains, which makes for a significantly more efficient belief update calculation. There is no reason why B3RTDP could not do the same, but it simply has not been implemented yet (see section 2.7).
3. Even though both algorithms are evaluated on the same machine, SARSOP is implemented in C++, which compiles natively for the machine, whereas this implementation of B3RTDP is written in Java. Its performance can therefore suffer from the level of virtualization introduced by the JVM.
4. In its standard implementation, SARSOP uses the Fast Informed Bound heuristic (see section 2.5.2). SARSOP is similar to B3RTDP in that, as a search algorithm, it can be provided with any number of different heuristics to initialize its value function, so we modified it to use the QMDP heuristic to make its results more comparable with those of B3RTDP. In actuality, the difference between the two heuristics was not very noticeable in these evaluations.

To evaluate the B3RTDP algorithm, we have chosen two commonly used POMDP problems called Rocksample and Tag. We will show the anytime performance of B3RTDP on these domains, that is, how well a policy performs when the algorithm is stopped at an arbitrary time and the policy is extracted. We will also compare convergence times using the Average Discounted Reward (ADR) measure, which measures how much discounted reward one can expect to garner from a problem by acting on a policy produced by the algorithm. All reported times are online planning times; we do not include the time to calculate the QMDP heuristic for either system, as there exist many different ways to solve MDPs and this does not fall within the contributions made by either POMDP planner.

2.6.1 Rocksample

Rocksample was introduced by Smith and Simmons to evaluate their Heuristic Search Value Iteration (HSVI) algorithm (Smith & Simmons, 2004). In this domain, a robotic rover on Mars navigates a grid-world of fixed width and height. Certain (known) grid locations contain rocks which the robot wants to sample. Each rock can either have the value good or bad. If the rover is in a grid location that has a rock, it can sample it and receive a reward of 10 if the rock is good (in which case sampling makes the rock turn bad) or -10 if the rock was bad. The rover can sense any rock in the domain from any grid location with a sense_i action, which returns an observation of the rock's value stochastically, such that the observation is more accurate the closer the rover is to the rock when it senses it. The rover also receives a reward of 10 for entering a terminal grid location on the east side of the map. The rover's location is always fully observable, and the rock locations are static and fully observable, but the rock values are initially unknown and only partially observable through the sense actions. In Rocksample_n_k, the world is of size n x n and there are k rocks. The robot can choose from the actions move_north, move_south, move_east, move_west, sample, and sense_1, ..., sense_8.
Algorithm                   ADR             Time [ms]
SARSOP                      21.28 ± 0.60    100000*
SARSOP                      20.35 ± 0.58    1000*
B3RTDP (D=15, α=0.95)       21.47 ± 0.02    1300
B3RTDP (D=15, α=0.75)       21.45 ± 0.02    1241
B3RTDP (D=10, α=0.75)       21.03 ± 0.39    508

Table 2.1: Results from RockSample_7_8 for SARSOP and B3RTDP in various configurations. We can see that B3RTDP confidently outperforms SARSOP both in reward obtained from the domain and in convergence time (* means that the algorithm had not converged but was stopped in order to evaluate its policy). The ADR values are provided with 95% confidence intervals.

Figure 2-8: Shows the ADR of B3RTDP in the RockSample_7_8 domain (B3RTDP anytime performance and convergence time, with SARSOP evaluated at 1000 ms). The algorithm was run with D = 10 and α = 0.75, and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs.

Algorithm                   ADR             Time [ms]
SARSOP                      -5.57 ± 0.52    100000*
SARSOP                      -6.38 ± 0.52    1000*
B3RTDP (D=20, α=0.85)       -5.43 ± 0.09    72137
B3RTDP (D=15, α=0.65)       -5.88 ± 0.08    4950
B3RTDP (D=10, α=0.95)       -6.13 ± 0.28    1476
B3RTDP (D=10, α=0.65)       -6.28 ± 0.35    680

Table 2.2: Results from Tag for SARSOP and B3RTDP in various configurations (* means that the algorithm had not converged but was stopped in order to evaluate its policy). ADR stands for Average Discounted Reward and is displayed with 95% confidence bounds.

2.6.2 Tag

The Tag domain was introduced by Pineau, Gordon and Thrun to evaluate their Point-Based Value Iteration algorithm (Pineau et al., 2003). In this domain, a robot and a human move around a grid-world with a known configuration. The robot's position is fully observable at all times, but the human's position can only be observed when the two occupy the same grid location. The robot chooses among the actions move_north, move_south, move_east, move_west and tag, and receives a reward of -1 for every move action, a reward of -10 for the tag action if it is not in the same grid location as the human, and a reward of 10 if it is (which leads to a terminal state). For every move action, the human moves away from the robot's position in a stochastic but predictable way.

Figure 2-9: Shows the ADR of B3RTDP in the Tag domain as a function of the action pruning parameter α and the discretization D (D = 10, 15, 20). ADR is plotted with error bars showing 95% confidence intervals calculated from 20 runs of the algorithm.

Figure 2-10: Shows the convergence time of B3RTDP in the Tag domain as a function of the action pruning parameter α and the discretization D, plotted with error bars showing 95% confidence intervals calculated from 20 runs of the algorithm. We can see that the convergence time of B3RTDP increases both with higher discretization and with a higher requirement of action convergence before pruning. This is an intuitive result, as the algorithm also garners more ADR from the domain in those scenarios.
Figure 2-11: Shows the ADR of B3RTDP in the Tag domain. The algorithm was run with D = 15 and α = 0.65, and ADR is plotted with error bars showing 95% confidence intervals calculated from 50 runs. We can see that B3RTDP converges at around 3700 ms; at that time SARSOP is far from convergence but has started to produce very good ADR values.

2.7 Discussion and Future Work

2.7.1 Discussion of Results

As we can see from tables 2.1 and 2.2, B3RTDP can outperform the SARSOP algorithm both in the Average Discounted Reward (ADR) measure and in convergence time. We can also see from the graphs in figures 2-8 and 2-11 that the anytime behavior of the algorithm is quite good, such that if the planner were stopped at any time before convergence, it could produce a policy that returns decent reward. We know that these benefits are largely due to the following factors:

1. The belief clustering which is inherent in the discretization scheme of our value function representation. This benefit comes at the cost of memory.
2. The action pruning that significantly improves convergence time and is enabled by the boundedness of the value function.
3. The search exploration heuristic, which is guided by seeking uncertainty and learning the value function rapidly.

It is satisfying to see such positive results, but it should be mentioned that the two parameters of the B3RTDP algorithm which most heavily impact its performance, namely D and α, have quite domain-dependent implications and should be reconsidered for each new domain the algorithm is run on. We show in our results how the performance of the planner varies, both in ADR and in convergence time, as a function of these parameters on the two domains.

2.7.2 Future Work

During the development of B3RTDP we identified several areas where it could be improved with further research and development.

Much of the current running time of the algorithm can be attributed to the belief update calculation of equation 2.3. This update is computationally expensive to carry out, O(|O||S|²) in the worst case for each action (this can be mitigated by using sparse vector math when beliefs do not have very high entropy). Many POMDP problems have factored structure which can be leveraged. This structure means that a state is described by a set of variables, each having its own transition, observation and reward functions. Factored transition functions are traditionally represented as Dynamic Bayes Nets (DBNs) and can yield a significant reduction both in the memory required to store the transition matrix and in the computational complexity of the belief update. This benefit is gained if the inter-variable dependence of the transition DBNs is not too complex. RTDP-based algorithms would clearly benefit greatly from taking advantage of factored structure, possibly even more so than other algorithms, as the hash-table value function representation might then be implemented more efficiently.

To improve the convergence of search-based dynamic programming algorithms, it is desirable to spend most of the value updates on "relevant" and important beliefs. If the point of planning is to learn the true values of the successor beliefs of the initial belief so that a good action choice can be made, then we should prioritize the exploration and value updating of beliefs whose values will make the greatest contribution to that learning. B3RTDP already does this to some degree but could do more.
The SARSOP algorithm (Kurniawati et al., 2008) uses a learning strategy where it bins beliefs by discretized features such as the upper value bound and the belief entropy. It then uses the average value in a bin as a prediction of the value of a new belief when determining whether to expand it. Focused RTDP (Smith & Simmons, 2006) also attempts to predict how useful certain states are to "good" policies and focuses or prioritizes state value updates and tree expansion toward such states. B3RTDP could take advantage of these and other existing strategies to further focus its belief expansion.

Chapter 3

Mind Theoretic Reasoning

3.1 Introduction

In this chapter, we address the main challenge of producing robotic teammate behavior that approaches the way a human teammate can dynamically anticipate, react to, and interact with another human teammate. When people work together, they can reach a natural level of fluency and coordination that stems not just from understanding the task at hand but also from understanding how the other reacts to the environment, what part of the task they are working on at any given time, and what they might or might not know about the environment. Castelfranchi wrote about modeling social action for AI agents and had some insightful thoughts that are relevant here (Castelfranchi, 1998):

"Anticipatory coordination: The ability to understand and anticipate the interference of other events with our actions and goals is highly adaptive. This is important both for negative interference and the avoidance of damages, and for positive interference and the exploitation of opportunities." -Castelfranchi

As Castelfranchi points out, anticipating future events by modeling the social environment can not only help to avoid damages, it can also reveal opportunities which can be exploited. Let's take an example: say that an agent knows that to accomplish its goal, a certain sub-task needs to be completed, and that the same sub-task is needed for the successful completion of another agent's goal. The agent can choose to either complete that sub-task itself, and do so in a way that makes it obvious to the other agent that it was completed in order to spare them the work, or it can spare itself the work by exploiting the opportunity presented by knowing that the other is likely to complete the sub-task later.

"Influencing: The explicit representation of the agents' minds in terms of beliefs, intentions, etc., allows for reasoning about them, and even more importantly it allows for the explicit influencing of others, trying to change their behavior (via changing their goals/beliefs)..." -Castelfranchi

Not only can agents that model their social environment passively avoid damages and exploit opportunities, they can also actively predict how they might manipulate the other to produce a more favorable outcome. This distinction has also been called weak social action vs. strong social action. This can clearly be advantageous in many scenarios, but particularly in collaborative settings such as human-robot teamwork. A robot should be able to model its teammate and leverage that model not only to passively exploit predicted actions but also to actively manipulate, for example by making sure the teammate is aware of relevant features of the environment that they might not have been aware of otherwise.

3.1.1 Mind-Theoretic Planning

In this chapter, we present the development of a Mind-Theoretic Planning (MTP) system.
We believe that it is crucial for a robot or autonomous agent that is charged with collaborating with people in human-robot teams to have some understanding of human mental states and to be able to take them into account when reasoning about and planning in the world. We strive to accomplish the following design goals with such a system. It should be able to:

* reason about the following types of unobserved mental states of its human teammate:
  - possible false beliefs they might have about the environment
  - possible goals or desires they have
* predict future actions based on what their mental state is
* plan its own actions in a way that:
  - seeks to better determine which mental state the other has, if useful
  - seeks to correct false beliefs of the other, if useful
  - exploits opportunities created by predicted actions
  - avoids damages anticipated from predicted actions
  - accomplishes the goals of the human-robot team

The rest of this chapter is dedicated to introducing the relevant background concepts, reviewing existing literature, demonstrating the earlier approaches which led to the MTP system, presenting the final version of the MTP system, and finally presenting methods and results of evaluating it.

3.1.2 Overview of Chapter

Background

In this section, a few of the important underlying psychological concepts of mental state reasoning are presented and their relevance to this thesis discussed.

Overview of Related Research

This section outlines some important existing research in the areas of computational mind-theoretic reasoning, human-robot teamwork, and belief space planning.

Earlier Approaches to the Problem

Here we give a brief synopsis of a couple of approaches that we experimented with while implementing the MTP system. This section simply serves to show how we arrived at the final formulation, which is presented in the next section. Much more detail on these earlier approaches is provided in the appendix of this thesis.

Mind Theoretic Planning

This section details the final implementation of the MTP system using a variety of Markov Decision Process models, both fully and partially observable.

Evaluation

In this section, we describe the simulators and online game environment that were developed to evaluate the MTP system. We also describe the experimental setup and results of a user study that demonstrates the capabilities of the MTP system and evaluates its impact on human-robot teamwork.

3.2 Background

In this section, we introduce some of the psychological concepts that underlie human mental state reasoning and discuss their relevance to our MTP system. We also introduce basic concepts from logic, knowledge representation, and probabilistic reasoning. Note that we provide no background on the autonomous planning literature here, as that was covered in section 2.2.

3.2.1 Theory of Mind

Theory of Mind (ToM) is the term that has been coined for our ability to reason about other people's behavior in terms of their internal states (such as beliefs and desires). Having a ToM allows an individual to understand and predict others' behaviors based not only on directly observable perceptual features but also on knowledge about the other person: what they have done in the past, what they know about their environment, what types of relationships they have with others who are involved, and more. Reasoning about their thoughts, beliefs, and desires can also include what beliefs they hold about you, or even about your beliefs.
This can become a recursive process, as demonstrated in figure 3-1, although it seems unlikely that a human reasoner would take more than about two recursive steps, or what would correspond to level four in the figure (not shown: "Jill believes that Jack knows that she is mind-reading him"). This skill does not only help us understand others' behavior better and make more accurate predictions about possible future actions; it also supports traits that are important for our society to function. This includes developing and feeling empathy and compassion for other people and their suffering (Christian Keysers, 2008), which has been empirically tested and confirmed with fMRI studies (Singer et al., 2004) and (Singer et al., 2006).

Figure 3-1: Demonstrates the recursive nature of ToM. Adapted from (Dunbar, 2005).

3.2.2 Internal Representation of ToM

The internal mechanisms that are used to understand, interpret, and reason about others' behavior have also become a topic of much debate in the ToM community. The debate concerns whether people actually use a naive theory of psychology (Theory-Theory or TT) to make inferences or predictions about human behavior, or whether they feed the perceptual features they believe their model observes through their own cognitive mechanisms and then read the suppressed outputs of those mechanisms to reason about the model's mental states (Simulation-Theory or ST). Many researchers prefer the TT account (Saxe, 2005) (Gopnik & Wellman, 1992), or at least a hybrid account that explains mind-reading as a combination of simulation and theorizing, whereby a theory might govern how the perceptual inputs should be transformed before being fed into the cognitive mechanisms, and possibly also how to interpret the results of the simulation (Saxe, 2005) (Goldman, 2006) (Nichols & Stich, 2003).

3.2.3 False Beliefs

The ability to attribute false beliefs to others has become recognized as a fundamental capability that a ToM affords. Understanding that another might hold a belief about a certain state of the world that is incorrect (or at least different from your own) is called understanding false belief. False belief tasks have become a benchmark for the development of ToM in children, as an understanding of false belief is believed to be an important developmental milestone. Failure to attribute false belief to another by a certain age can be considered a sign of cognitive developmental deficiencies such as autism (Perner et al., 1989).

A classic false belief task is the Sally-Anne task, originally proposed by (Wimmer & Perner, 1983). In this task, a child is shown a cartoon strip (see figure 3-2 for all of the strips). In this story, Sally puts the ball in the basket and leaves. Then Anne moves the ball from the basket to the box and leaves. After seeing the strip, the child is asked where Sally will start looking for the ball. If the child has developed a ToM, she will understand that Sally did not witness Anne moving the ball and will therefore indicate that Sally will look where she thinks the ball is: in the basket. How we come to develop ToM, and when, is still a matter of some contention.
Some researchers claim that children interpret others' behavior in terms of their individual beliefs and desires from birth (Onishi & Baillargeon, 2005), but most studies agree that children are able to pass the false belief task at around the age of 3-5 years (Wellman et al., 2001).

3.2.4 Mental State

The term "mental state" is fairly ambiguous in common usage and requires further specification to be meaningful. Several theories in human psychology and philosophy use this term to describe non-physical properties of our being, things that cannot be reduced to physical or biological states (Wikipedia, 2014).

Figure 3-2: A classic false belief situation involving the characters Sally and Anne (Image courtesy of (Frith, 1989)).

Common interpretations include beliefs, desires, intentions, judgements, preferences and even thoughts and emotional states such as happiness and sadness. In this thesis, we focus on two important types of mental states, namely propositional attitudes (beliefs) and desires (goals). These two categories of mental states cover a broad spectrum of concepts and are particularly helpful for predicting the behaviors of others (Schiffer, 2012). If one understands what another person believes to be true and false about the world, in addition to knowing what that person desires or wants to achieve, then the only piece missing to fully predict their behavior is a model of how they will choose to bring about the changes needed to transition the world from what their beliefs say it is to what their goals say they want it to be. This can of course be a challenging task and is further complicated by uncertainty about how the predicted behavior is affected by the behavior of other agents, perception of unknown features of the environment, and possible goal switching or re-prioritizing, just to name a few.

3.2.5 Knowledge Representation

We now switch gears slightly and discuss different means of reasoning about and representing knowledge. Much classical work in Artificial Intelligence has dealt with logical reasoning and inference with symbolic representations such as in First Order Logic (Russell & Norvig, 2003). First order systems provide a formal way to manipulate truth statements about objects and their relations to other objects, and to derive or infer the truth-values of statements that do not already exist in the knowledge base. It is now widely accepted in the field of AI that logical reasoning alone is not sufficient for making useful decisions in the real world because knowledge is almost never certain and our models of the world are inaccurate. There are three sources of uncertainty that intelligent systems need to be able to cope with (Korb & Nicholson, 2004):

1. Ignorance: Even certain knowledge is limited.
2. Physical indeterminism: There are many sources of randomness in the real world.
3. Vagueness: Our models cannot be specific for every possible input or outcome.

The caveat is that we also know that learning and reasoning with fully unconstrained knowledge, or with no bias for focusing on relevant concepts, is intractable and also somehow does not quite fit with our intuition for how human intelligence might work.
We believe that a truly intelligent system should know when to use purely deterministic logical reasoning methods and when to take uncertainty into account.

Probabilistic Representation

Probabilistic information is often represented as a set of stochastic variables with either discrete or continuous domains. The value of a stochastic variable can depend on the values of other variables in the set through a Conditional Probability Distribution (CPD) or Conditional Probability Table (CPT). In this discussion, we will only consider variables with discrete domains. Computing any quantity of interest in this set of variables is in the general case very computationally expensive; for example, finding the marginal probability that a variable takes a particular value can mean summing over 2^(N-1) states for N binary variables (Barber, 2011). In order to be able to reason efficiently about the set of variables, we need to constrain their interactions. In particular, we want to be able to leverage inherent independencies between variables and make independence assumptions when the computational gain of doing so does not come at too severe a cost in accuracy.

3.2.6 Bayesian Networks

Bayesian Networks (BNs) have become a widely accepted method to represent probability distributions because of how they can make the independence relations that exist in a distribution explicit, help in visualizing the distribution, and support efficient inference. BNs are Directed Acyclic Graphs, which means that they prohibit the possibility of traversing from a variable in the network, along the directed arcs, and arriving back at that variable. A BN is fully defined by a list of variables and their CPTs, but it is often useful to think of them only in terms of their inter-connectivity and then parameterize the generation of their CPTs. This parameterization is domain dependent but often takes the form of logic gates, where the truth value of a binary variable could for example be defined as a (possibly noisy) OR gate of its parent values.

Reasoning with Bayesian Networks

Making a query to a BN means to either request the probability distribution of a variable (possibly given some evidence) or a sample from that distribution. One might be interested in the marginal likelihood of a variable, sometimes called "model evidence," which is the likelihood that a variable takes a value given only the network it belongs to. Another quantity of interest is the posterior probability of a variable, which is the probability that a variable will take a value given the observed values of other variables. Four types of reasoning can be done with BNs (Korb & Nicholson, 2004):

1. Diagnostic: An effect is observed and a query is made about its cause
2. Predictive: A cause is observed and a query is made about its effects
3. Inter-causal: A cause and one of its effects are observed and a query is made about other causes of that effect
4. Combined: A node's cause and effect are observed and a query is made about the node

Each of these types of reasoning requires probabilistic inference in the BN. This inference can take several forms, and there are a multitude of both exact and approximate inference techniques; see (Guo & Hsu, 2002) for an exhaustive review. For reasonably large BNs, approximate inference algorithms are a sensible choice.
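Before turning to approximate methods, the following is a minimal sketch (not taken from this thesis) of the exact, enumeration-based computation behind a diagnostic query on a tiny hypothetical Rain/Sprinkler/WetGrass network; the variable names and probabilities are invented purely for illustration.

# Minimal sketch: exact inference by enumeration on a tiny Bayesian network
# Rain -> WetGrass <- Sprinkler, illustrating a diagnostic query
# P(Rain | WetGrass = true). All numbers are made up for illustration.

# CPTs: P(Rain), P(Sprinkler), P(WetGrass | Rain, Sprinkler)
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.1, False: 0.9}
p_wet = {  # keyed by (rain, sprinkler) -> P(WetGrass = True | parents)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Full joint probability of one assignment, via the chain rule of the BN."""
    pw = p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (pw if wet else 1.0 - pw)

def posterior_rain_given_wet(wet_value=True):
    """Diagnostic query: sum out Sprinkler, then normalize over Rain."""
    unnorm = {}
    for rain in (True, False):
        unnorm[rain] = sum(joint(rain, s, wet_value) for s in (True, False))
    z = sum(unnorm.values())
    return {rain: p / z for rain, p in unnorm.items()}

if __name__ == "__main__":
    print(posterior_rain_given_wet(True))
    # -> Rain is roughly 0.65 given wet grass under these made-up numbers

Even in this three-variable example, the enumeration touches every combination of the hidden variables, which is exactly the exponential cost that motivates the approximate, sampling-based methods discussed next.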
A family of sampling-based algorithms called Markov Chain Monte Carlo is of particular use. This family includes the Gibbs Sampling, Metropolis-Hastings, and Hamiltonian dynamics methods, all of which are different ways to create an estimate of the target distribution that we wish to sample from. Some novel and interesting approaches to representing and sampling from BNs that have relevance to this thesis are (1) methods that exploit Context Specific Independence, where a CPT can be represented more efficiently with rules rather than a table (Boutilier et al., 1996), (Poole, 1997), and (2) Probabilistic Programming approaches, where programs, rather than tables, govern a variable's CPD (Gerstenberg & Goodman, 2011).

3.3 Overview of Related Research

This section describes some of the prior work in the field of reasoning about agents' beliefs, goals, and plans. Much progress has been made, and some very interesting results gathered, but there are many challenges that still remain unaddressed. Notably, not much work exists that can reason explicitly about false beliefs (discussed in section 3.2.3). Furthermore, much of the work describes systems that are limited to either (1) estimating beliefs given observed actions, (2) predicting actions given estimated goals, or (3) planning actions that take others' actions into account. Limited work exists that tries to combine these methods in a holistic approach to a mind-theoretic agent system. In this thesis, we intend to frame and make considerable progress on some of these problems.

3.3.1 ToM for Humanoid Robots

Brian Scassellati was one of the first authors to discuss the possibility of endowing robots with ToM capabilities (Scassellati, 2002). In this work, he compared different models of ToM, introduced by Leslie (Leslie, 1994) and Baron-Cohen (Baron-Cohen, 1995), with respect to their applicability to robots. Scassellati discussed how Leslie's model breaks the world down into three categories of agency: mechanical agency, actional agency and attitudinal agency. The theory further suggests that a Theory of Body mechanism (ToBY) supports people's understanding of physical objects and mechanical causality. ToBY's inputs are theorized to be two-fold: one a three-dimensional object-centered representation of the world which has been processed by high-level cognition, and another which is more directly perception-based and focuses on motion. In the course of cognitive development, a Theory of Mind Mechanism (ToMM) emerges. This mechanism deals with the laws of agents, specifically their attitudes, goals, and desires. This model proposes a hierarchical segmentation of the world into a super-class of "background," of which "objects" that obey physical laws of movement and mechanics are a subclass, which can be further classified into "agents." The realm of "objects" is governed by the ToBY mechanism whereas "agents" are processed using the ToMM. This seems like a sensible organization of objects in the world as far as ToM is concerned. By the author's account, Baron-Cohen's model of ToM breaks perceptual stimuli into two categories of interest: the first is concerned with objects (perceived through any modality) that appear to be self-propelled, and the second category contains all objects within the visual stimuli that appear to have eye-like shapes. In Baron-Cohen's model of ToM there are four distinct important modules: the Intentionality Detector (ID), the Eye-Direction Detector (EDD), the Shared Attention Mechanism (SAM), and the Theory of Mind Mechanism (ToMM).
The processing of ToM-related cognition filters perceptual features through the ID and EDD to the SAM and finally the ToMM module. The author presented implementations of some of the lower-level cognitive or perceptual modules required for robots if they were to possess these models of ToM, namely a module for differentiating inanimate and animate objects as well as automatic gaze following.

3.3.2 Polyscheme and ACT-R/E

Trafton et al. analyzed multiple hours of audio recordings of astronauts' utterances while performing a training task and showed that about 25% to 31% of the utterances involved perspective-taking (such as "on your left side" or "come straight down from where you are," etc.) (Trafton et al., 2005). Their interpretation of the results is that it is vital for robots to be able to take the perspective of the human in order to provide effective interactions. The authors then proceeded to implement a robotic perspective-taking system based on the Polyscheme architecture. Their system is a symbolic reasoning and planning system that integrates multiple representations and inference techniques to produce intelligent behavior. Trafton's lab has explored how to make robots more efficient teammates by enabling the robots to model their human teammates using what they call a "cognitively plausible" system that mentally simulates the decision making process of the teammate (Adams et al., 2008). Their reasoning for modeling the other agent for the sake of the quality of teamwork is that it would reduce the amount of monitoring required. We are not sure that this reasoning holds, as the simulation of the teammate can only be as good as the inputs into its computation. We assume that for the simulation of the teammate's decision-making strategy to be as efficient as possible, all available data should be provided to the simulation via as much monitoring as the task allows. Therefore, we think a better rationale for performing ToM modeling of other agents for teamwork is to reduce the communication required rather than the monitoring, as communication can be redundant if all agents can monitor and understand the behavior of their teammates. Their implementation of the system extended the ACT-R cognitive architecture to handle embodied agents by incorporating a spatial reasoning module and navigation (ACT-R/E). The researchers evaluated their work using a computer simulation of a warehouse patrol problem with two patrollers, one robot and one human. When an alarm is heard, the agents need to make it to two guard stations. The problem they focused on is which agent should go to which station. They experimented with two strategies that the robot could pursue:

1. Self-centered strategy: The agent simply moves to the closest guard station
2. Collaborative strategy: The agent attempts to predict the teammate's choice of stations and selects the other one

A concern that immediately surfaces is that even when the robot selects the more sophisticated strategy of predicting the teammate's choice of stations, it makes the simplifying assumption that the human would select the less sophisticated strategy of simply choosing the closest station. This might be an oversimplification of the intricate socio-cognitive capacities of the human's decision-making process. Finally, the authors report that in simulation, the robot and simulated human agent performed fewer steps when the robot followed the "collaborative strategy" than when it followed the "self-centered strategy."
They claim that some actual human-subject experiments were performed using an iRobot B21r and that the results were very similar. It would be interesting to know how the human experiments were similar or different, as it is possible that the humans were performing a much more sophisticated modeling of the robot's decision making than the robot was of them, and would therefore always compensate for the robot's actions, effectively rendering the robot's station choice irrelevant. It is even possible that in the non-simulated case, having the robot choose the "self-centered strategy" could be more efficient, as the human is likely to be able to move much faster than the robot and therefore the distance from the robot to its station would become the bottleneck of the system.

3.3.3 ToM Modeling Using Markov Random Fields

Butterfield, Jenkins et al. performed experiments with Markov Random Fields (MRFs) as a probabilistic representation for ToM modeling (Butterfield et al., 2009). In their setup of the MRF, each agent was represented by two vector-valued random variables: x_i for the agent's internal state, representing intentions, beliefs and desires, and y_i for the agent's perception of the physical world, the presence or absence of certain objects, etc. The agent models the team by a network of these variable pairs, one for each teammate, as can be seen in figure 3-3.

Figure 3-3: Shows the inter-connectivity of the MRF network. Each agent is represented with an observation vector y_i and a state vector x_i. From (Butterfield et al., 2009).

Each agent's state vector x_i is conditioned on its perception vector y_i through the local evidence function φ(x_i, y_i), as well as on all of the other agents' state vectors through the compatibility function ψ(x_i, x_j). This setup was then adapted to coordinate action selection in a multi-robot scenario by using a Belief Propagation (BP) algorithm in which any robot's communication is restricted to a limited set of "neighbors." The result of performing the BP algorithm is effectively an action posterior for each agent, which they can sample to execute actions within a joint task of the team. The MRF framework has been evaluated and has shown promise in being able to correctly explain known patterns in ToM tasks performed by children, given appropriate local evidence and compatibility functions. The framework has well-founded support for probabilistic manipulation of data and shows promise as an action selection mechanism in a multi-robot team task, but its capacity to model ToM in humans seems limited, or at least to operate at a very low level of cognition.

3.3.4 Plan Recognition in Belief-Space

The purpose of plan recognition systems is to infer the goals and plans of an agent given observations of its actions, which makes them very relevant to the presented work. A harder version of the plan recognition problem is to try to infer which beliefs an agent needs to hold so that its observed actions are likely to be a part of a rational plan towards its estimated goal; (Pynadath & Marsella, 2004) and (Ito et al., 2007) developed a computational model for this kind of abductive reasoning for agents in a multi-agent domain.
This model is effectively a plan-recognition system that models the beliefs, goals and actions of other agents with a POMDP formulation (see background section 2.2.2). They calculate optimal policies for each agent to achieve each of its goals and then perform a maximum likelihood search for the POMDP policy that gives the highest value for the observed sequence of actions. This gives the observer an estimate of the belief state of the observed agent as well as what their most likely goal might be. This approach lacks the representational power to reason about cases where other agents might be acting on false beliefs about the world (i.e. the false-belief problem from section 3.2.3), and it also does not model in an explicit way how agents' actions depend on each other or how agents could anticipate or react to each others' behaviors. These are important features of the presented work, which differentiate it substantially from that work.

3.3.5 Inferring Beliefs using Bayesian Plan Inversion

This section describes a class of work where researchers speculate about the internal mechanisms that are employed by humans to predict and reason about the behavior of other agents. Many of the experiments are conducted in such a way that human observers are made to watch an animation of an agent travelling in a grid-like world and are asked to make predictions about its future behavior at various time points. The animated agent simply plays out a pre-described motion as deterministically designed by the researcher. The researcher then creates several graphical models that employ different inference strategies and compares the prediction results of these models with those of the human observers. The relative similarities between behavioral predictions made by the models and the humans provide evidence that the models capture some aspects of the kinds of probabilistic behavioral inference that human minds might employ. Tauber et al. performed an experiment where they showed that people use Line-of-Sight (LoS) cues as well as Field-of-View (FoV) to assign belief states to observed agents and then use these beliefs to reason about the agents' behaviors (Tauber & Steyvers, 2011). Their experiment showed that the graphical model that only used the agent's LoS and FoV created predictions that were much more similar to humans' predictions than models that used X-ray vision (could see through obstacles but were bound by FoV), used only proximity, or were all-knowing. This suggests that if a mind-theoretic agent wishes to accurately predict the actions of other agents (human or autonomous) it would benefit from reasoning about their beliefs and perspective.

Figure 3-4: An example of the visual stimuli that participants would be shown in (Baker et al., 2009). An agent travels along the dotted line and pauses at the points marked with a (+) sign. At that point participants are asked to rate how likely the agent is to have each marked goal (A, B and C). In (b) the participants were asked to retroactively guess the goal of the agent at a previous timepoint.

Baker et al.
illustrate how Bayesian Inverse Planning (BIP) can be used in conjunction with the rationality principle (the expectation that agents will act optimally/rationally towards achieving their goals) to formalize the concept of an "ideal observer" or "ideal inference agent," terms that are used heavily in the cognitive science literature (Baker et al., 2009). This approach is especially useful for rational goal inference, as demonstrated in figure 3-4. Baker et al.'s inference model varied the richness of the goal representations (allowing agents to switch their goals during an episode, as well as having sub-goals) by adjusting different prior probability models over goals. The researchers showed good correspondence between model predictions and human judgments. Baker's work was extended to make predictions about social judgments (Ullman et al., 2010). In this scenario, two agents move around a 2D maze, which contains a boulder that can block access through the grid and two goal-objects (a flower and a tree). One agent is small and the other is large. The large agent can push the small agent and the boulder around. Participants are exposed to animated episodes of the agents moving around in the maze and are asked to rate the goals of "flower," "tree," "help," or "hinder" (where "help" or "hinder" refer to whether the large agent is trying to help or hinder the smaller agent). The predictions of the BIP model were compared to human judgments as well as to a simple movement-cue based model. Surprisingly, the cue-based model generated predictions more similar to human judgments about simple object-based goals than the BIP model did. But the cue-based model was not able to capture the more complex social goals (helping and hindering), which the BIP model predicted quite well. More recently, this work was extended to represent what the authors call a "Bayesian Theory of Mind" (Baker et al., 2011) (Jara-Ettinger et al., 2012), a system of Bayesian reasoning to infer mental states from actions. This work combines the ideas in the previously discussed work of (Baker et al., 2009) and (Ullman et al., 2010) to create a computational system that can generate plausible explanations for observed agent behavior using Bayesian Inverse Planning in conjunction with belief and rationality modeling with preferences. In this work, the authors present a model that can predict the preferences of a college student looking for their favorite food truck. The results show that the model can predict the agent's preferences in a compelling way. Again, this work provides a computational model that can generate predictions of agent behavior that closely resemble human judgments, but it does not address the question of how to act in this domain or how an agent might manipulate the beliefs of others. It does provide a principled way to think about and frame this problem and shows that Bayesian inference provides a lot of flexibility to represent different goal representations and agent preferences.

3.3.6 Game-Theoretic Recursive Reasoning

As figure 3-1 suggests, reasoning about Theory of Mind is inherently a recursive process. Depending on the kinds of modeling one wants to perform in this domain, recursion should be addressed to some degree. The following work discusses a way to handle recursion gracefully and make an informed decision about how deep the model should reach. The Recursive Modeling Method (RMM) gained some popularity in the field of Decision Theoretic Agents in multi-agent systems (Vidal & Durfee, 1995).
This method effectively reasons about pay-off matrices, which dictate each agent's preference for choosing an action given the action choice of all other agents. RMM arranges these matrices in a tree structure with a probability distribution over every set of child matrices (see figure 3-5). The method assumes that those pay-off matrices are provided by some external entity (presumably a planning system) or are derived from statistics of observations of actions taken previously. How to produce these matrices is actually not well explained in the paper. The authors admit that they are pre-calculated per domain and that simplifying assumptions were used, such as maintaining no history and having all mental states always be immediately derivable from the instantaneous physical situation. The authors chose to focus on showing how to effectively use dynamic programming methods to solve the problem of finding the best action to take given the pay-off matrix hierarchy. Their approach is especially useful for reasoning about when not to expend more computation on going deeper into the recursive hierarchy of matrices and is therefore able to provide a good compromise between deep social reasoning and computational cost. They call this Limited Rationality and argue that agents' reasoning capabilities can often be outstripped by the amount of data available, which makes it necessary to meta-reason about when and what to reason about (Durfee, 1999).

Figure 3-5: An example of an expanded tree structure of pay-off matrices as perceived by P when choosing whether to execute action A or B. With probability p_i, P thinks that Q views the payoff matrix in a certain way (represented by a different matrix) and with probability (1 - p_i) in another. This recursion continues until no more knowledge exists, in which case a real value is attributed to each action (0.5 in the uninformed case) (Durfee, 1999).

RMM was extended to incorporate Influence Diagrams (IDs), which are able to approximate optimal policies in Simple Sequential Bayesian Games (SSBGs) (Sondberg-Jeppesen & Jensen, 2010). Framing the problem as an SSBG places it in a Game-Theoretic framework and therefore should produce policies that tend to model, and be adaptive to, the other agent's policy. An insight that the authors apply in their implementation to approximate the solution is to experiment with both removing arcs in their ID and adding them, and observing the effects they have on the policy. This is equivalent to assuming that the agent knows "less" or "more" about the variables in the domain, either of which could make the computation less difficult according to the authors' claims. Finally it is worth mentioning the work of (Zettlemoyer et al., 2009) in this context, as they have formalized the computation of infinitely recursive belief management in multi-agent POMDP problems. They developed an algorithm that uses a finite belief representation for infinitely nested beliefs and showed that in some cases these can be computed in constant time per filtering step. In more complex cases, the algorithm can prune low probability trajectories to produce approximately correct results.
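To make the recursive matrix hierarchy concrete, the following is a minimal sketch (our own simplification for illustration, not the RMM implementation from the cited papers) of evaluating such a tree: each node holds the deciding agent's payoff matrix plus a probability distribution over child models of how that agent thinks the other agent sees the game, and leaves fall back to the uninformed uniform prediction mentioned in the caption of figure 3-5. All payoff values and node names here are invented.

# Minimal sketch of recursively evaluating a hierarchy of payoff matrices.
from dataclasses import dataclass, field
from typing import List, Tuple

ACTIONS = ("A", "B")

@dataclass
class RMMNode:
    payoff: dict                                       # payoff[my_action][their_action] -> float
    children: List[Tuple[float, "RMMNode"]] = field(default_factory=list)

def action_distribution(node: "RMMNode") -> dict:
    """Distribution over the deciding agent's actions at this node
    (a point mass on its expected-payoff-maximizing action)."""
    best = best_action(node)
    return {a: (1.0 if a == best else 0.0) for a in ACTIONS}

def predicted_other(node: "RMMNode") -> dict:
    """Mix the other agent's predicted action distribution over the child models."""
    if not node.children:
        return {a: 1.0 / len(ACTIONS) for a in ACTIONS}  # uninformed leaf: uniform (0.5)
    mix = {a: 0.0 for a in ACTIONS}
    for prob, child in node.children:
        child_dist = action_distribution(child)         # recurse one level deeper
        for a in ACTIONS:
            mix[a] += prob * child_dist[a]
    return mix

def best_action(node: "RMMNode") -> str:
    """Pick the action maximizing expected payoff against the predicted other agent."""
    other = predicted_other(node)
    expected = {
        mine: sum(other[theirs] * node.payoff[mine][theirs] for theirs in ACTIONS)
        for mine in ACTIONS
    }
    return max(expected, key=expected.get)

if __name__ == "__main__":
    # P's view: one child models how P thinks Q sees the game.
    q_view = RMMNode(payoff={"A": {"A": 2, "B": 0}, "B": {"A": 0, "B": 1}})
    p_root = RMMNode(payoff={"A": {"A": 3, "B": 0}, "B": {"A": 0, "B": 2}},
                     children=[(1.0, q_view)])
    print(best_action(p_root))   # P best-responds to Q's predicted choice ("A" here)

The Limited Rationality idea of the original work corresponds to cutting this recursion off early, at whatever depth further expansion is not expected to change the chosen action enough to justify the extra computation.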
3.3.7 Fluency and Shared Mental Models

Researchers have pursued many different ways of modeling the "other" agent in multi-agent scenarios; here we mention some notable work where the concept of the "other," including their behavior patterns, preferences, beliefs, and goals, is represented in different and interesting ways. Human-robot teamwork, whether in research or industry, tends to be implemented as a rigid turn-taking process to avoid dealing with the many issues that arise with concurrent action execution and modeling teammates. In the Human-Robot Interaction (HRI) domain, prior systems achieve only some of the seamless coordination and fluency that human teams tend to display when they have been jointly trained and cooperate in harmony. (Hoffman & Breazeal, 2007) developed an adaptive agent system capable of adjusting to the behavior of a human partner in a simulated domain of jointly assembling a cart from its parts. This system maintained statistics of world transitions given the actions of the robot and employed a naive Bayesian estimate of the transition distribution. It would then consider the four alternative proto-action sequences of <pick up a tool and use it>, <return a tool and return to workbench>, <return a tool, pick up a new tool and return to workbench> and <do nothing>, and perform an optimization to reduce the expected cost of performing any of those sequences given the statistical estimates of how the world-state would be advanced as a result. An experiment compared this approach to a purely reactive system and demonstrated that in the best case there was a very significant difference in the objective measure of time-of-task-completion, but when averaged over all episodes there was actually no significant effect. The experiment's post-study questionnaire showed a significant difference in the perceived contribution of the robot to the team's fluency and success. This work demonstrates how powerful an effect a flexible and adaptive autonomous agent can have on a teammate's subjective experience of the agent's competencies. This work does not focus on the planning aspect of the problem but informs how behavioral statistics could be collected and exploited to create more adaptive and fluent robot teammates.

Figure 3-6: A view of the ABB RobotStudio Virtual Environment during task execution. The human controls the white robot (Nikolaidis & Shah, 2013).

Nikolaidis et al. took a completely different approach to a similar problem in their work (Nikolaidis & Shah, 2013). They strove to achieve a "Shared Mental Model" (SMM) between the human and their robot teammate by formulating the problem of joint action selection of the team as a specialized MDP (see background section 2.2.2). In this formulation, the robot's uncertainty about the human action is encoded in the probability distributions of the transition function T of the MDP model, and similarly the human's expectation of the robot action is encoded in the reward function R. These functions are learned through an interactive training phase between the teammates where roles are switched to gather data to estimate both functions. When the SMM has converged and the robot uses the optimal policy of its learned transition and reward function, the algorithm is really selecting an action whose effects, compounded with the expected response of the human (encoded in T), will lead to a good world state, as measured by the accumulated reward of visited states (which is accrued by matching the expectations of the human).
This method was shown to produce excellent results, both on objective measures of fluency such as concurrent motion and human idle time, as well as on subjective measures.

3.3.8 Perspective Taking and Planning with Beliefs

"Perspective taking" has been shown to be an important cognitive ability that humans use in various socio-cognitive as well as spatio-cognitive tasks. This activity can provide the perspective taker with a more intuitive understanding of the other agent's situation and is often useful in disambiguating communication or providing support for behavior. Perspective-taking was integrated as a core function into the cognitive architecture Polyscheme and was used to help the robot better understand human instruction (Trafton et al., 2005). The idea of using perspective-taking for robots was taken one step further by Breazeal et al. in their behavioral architecture for a robot which maintains an almost fully replicated version of all of its reasoning mechanisms for every observed human (Breazeal et al., 2009). These replicated behavior systems operate in parallel with that of the robot and process the robot's sensor stream as filtered through a perspective filter appropriate for the human's FoV and perceptual range (see Figure 3-7). This "Self as Simulator" system implements perspective taking at a very deep level of its behavioral architecture and was shown to improve learning sessions between a human and a robot by disambiguating demonstrations (Berlin et al., 2006).

Figure 3-7: Shows the trajectory by which data flows and decisions get made within the "Self as Simulator" architecture. Demonstrates how the robot's own behavior generation mechanisms are used to reason about observed behavior of others as well as to perform perspective taking. From (Breazeal et al., 2009).

In his thesis work, Jesse Gray employed the behavioral architecture from (Breazeal et al., 2009) to create a robotic "Theory of Mind" system that can realize goals which exist both in the real world as well as in the mental states and beliefs of others (Gray & Breazeal, 2012), effectively implementing Castelfranchi's strong social action. This work leverages the embodied aspect of robots (as different from virtual agents) and how it connects the physical world and its observable/occludable features with the hidden mental states of other agents. The system is able to come up with very high-resolution motion plans for fairly short time horizons, which execute useful actions in the world while attempting to maintain certain beliefs in another agent's mental state. In a demonstration of this system, the robot transports an object from one location to another while either hiding that object from the human or making sure it was observable by them, depending on study conditions. This was done to satisfy different mental-state goals for the human while achieving the real-world goal of transporting the object. This research is relevant to many of the components of the presented thesis work, but it approaches the problem from a different direction and uses very different representations and algorithms. The presented thesis is informed by the efforts made in Gray's work but is not a direct continuation of it.
In fact, the presented system is essentially complementary to Gray's and could employ it to find high-resolution solutions to sub-problems in a larger plan. The presented work will use more abstract action and goal representations, which will allow it to be applied to problems with longer time horizons and larger domains. Another differentiation is that in this work we will frame the problem using probability theory, which provides more robustness to different action outcomes and unanticipated human behavior. This could come at a cost of increased computational complexity and size of representation, which can be alleviated by the use of heuristics and approximate algorithms. Furthermore, the presented work emphasizes predicting the future actions of others and how they might influence the state of the world as a function of time. This is different from Gray's work, where more consideration is given to how the future actions of the robot will influence the mental state of agents that are relatively static in time.

3.3.9 Belief Space Planning for Sidekicks

The existing work that is most similar to the work presented in this thesis is the dissertation of Owen Macindoe (Macindoe et al., 2012). Macindoe presented a system called Partially Observable Monte Carlo Cooperative Planning (POMCoP), which is a POMDP planning system based on Silver and Veness' Monte Carlo planning framework POMCP (Silver & Veness, 2010). POMCoP performs belief space planning in a fully observable world with deterministic transitions where the only unobservable variable is the agent model. Agent models are basically action policies which are provided to the POMCoP system as input. This work was evaluated with a simulated game where the actions of a human player were simulated with a noisy A* path planner. It would have been more satisfying to see the system evaluated with real human users, to see how people perceive video game sidekicks that perform belief space planning over several possible hypothetical player models.

Figure 3-8: A robot hypothesizes about how the mental state of a human observer would get updated if it were to proceed to take a certain sequence of motor actions. This becomes a search through the space of motor actions that the robot could take, which gets terminated when a sequence is found that achieves the robot's goals as well as the mental state goals about other agents (Gray & Breazeal, 2012).

Figure 3-9: (a) Shows the game with a simulated human player in the upper right corner and a POMCoP sidekick in the lower left. (b) Shows comparative results of steps taken to achieve the goal between a QMDP planner and different configurations of POMCoP.

This work proceeds from similar motivations as the presented thesis work (in addition to framing the problem as a POMDP). After reading the work carefully and meeting with the authors to discuss it, we have determined the main points that differentiate the two contributions. Firstly, POMCoP does not consider how the human predictive policies are generated but expects those to be provided as inputs. The generation of the human predictive policies is one of the core contributions of the presented work, in particular how those predictive policies take into account their own prediction of the actions and reactions of the other.
POMCoP also assumes perfect observability of the state but takes the actions of the other agent as its observation, whereas the presented work produces a unique observation for states that are indistinguishable from the robot's perspective. This difference is not so significant, as we are sure that POMCoP would only need slight modification to incorporate this; likewise, POMCoP's perception model can easily be implemented in the presented system. Lastly, and very importantly, POMCoP does not take into account how the human agent might be acting on false beliefs about the environment but instead assumes that they can perfectly observe the true state at all times. This is another important contribution of the presented work.

3.4 Earlier Approaches to Problem

Here we will give a brief synopsis of a couple of approaches that we experimented with to implement the MTP system. This section simply serves to show how we arrived at the final formulation, which is presented in the next section 3.5.

3.4.1 Deterministic Mind Theoretic Planning

Our initial idea for implementing MTP was to create a system that used Classical Planning (CP) methods (see section 2.2.1) to create plans from the perspective of the other agent, which could then be used as predictions of how the other agent will act and be incorporated in a plan for the robot (also generated by CP methods). This approach was meant to benefit from the relative speed of certain CP methods while producing plans that incorporate anticipation of the other's actions. This approach is described generally with pseudocode in Algorithm 4. An off-the-shelf PDDL-capable BASEPLANNER is used to solve sub-problems of the MTP domain. This planner is used to create simple predictions of how agents will act. Those predictions are simulated and iteratively reconstructed when found to be inconsistent. Once a prediction is complete, the planning domain, along with the prediction, is compiled into a Quasi-Dynamic domain which enforces the action choice of the others while allowing flexible action choice by the planning agent. This problem is finally solved with the BASEPLANNER or any other CP system, which produces a plan for the robot that takes into account a prediction of the actions of others and their effects. This approach has been demonstrated to successfully create plans which take into account predictions of others' actions, possibly based on false beliefs about the environment and reactions to perception. Algorithm 4 can be used to create plans for the robot for every false belief or goal hypothesis for the human, but this implementation still leaves many features of a mind-theoretic planner to be desired.
Algorithm 4: Pseudocode for a simplified deterministic approach to MTP

DETERMINISTICMTP
    foreach agent i do
        Use BASEPLANNER to create an initial plan hypothesis p_i from the (possibly false) initial belief state of agent i to its goal
    Align each plan p_i temporally to form P
    foreach timestep t in P do
        Simulate executing all agents' actions at time t
        Make note of when each agent perceives the value of new state literals
        if agent i's action a can't be taken then
            Backtrack to t', when agent i first perceived the offending state literal
            Use BASEPLANNER to create re-plan p'_i for i from t'
            Replace agent i's actions after t' in P with p'_i
            Bring simulation back from t to t'
        if agent i perceives the error of its false belief then
            Use BASEPLANNER to re-plan from t using the true state
            Replace agent i's actions after t in P with p'_i
    Remove actions taken by the robot from P
    Compile the planning domain and P into a Quasi-Dynamic domain QD
    Use BASEPLANNER to solve QD and create a plan p_r for the robot which takes into account the static prediction of the others' actions
    return p_r

One such shortcoming is that it lacks robustness to randomness or stochasticity in the action choice of the human; the next section introduces an approach that was designed to overcome this limitation.

3.4.2 The Belief Action Graph

To increase the robustness of the deterministic MTP approach presented above, we designed a system that stores the state literals, which were previously fully determined, and the actions as stochastic variables in a Bayesian Network (BN) (see section 3.2.6). The Belief Action Graph (BAG) is a BN containing stochastic variables for state literals and action variables. Action predictions are created in the same way as in deterministic MTP except that the BAG is used in place of the predictive P structure. Each action's pre-conditions and effects on the state variables are modeled as dependencies in the BN, and the Conditional Probability Tables (CPTs) are generated using rules which ensure that actions cannot be taken unless pre-conditions are met. To represent agent beliefs, each state variable has a belief variable associated with each agent; these belief variables will take the same value as the actual state variable only when the corresponding agent can perceive that part of the state. Similarly, an agent will act based on the values of its belief variables, but the success of those actions is determined by the values of the corresponding "true" state variables. An instructional view of the BAG as well as an example of an actual BAG representation of a navigation problem can be seen in figure 3-10. The BAG is constructed in the following manner:

1. In the first layer of the BAG (t = 0), create a probability variable for every state variable in s^0. Set the prior probabilities of the state variables according to the confidence in their assignments.

2. For each agent a_i, create a goal variable G_i with Domain(G_i) = [1, P_i], where P_i is the number of goal hypotheses for a_i. Similarly create an initial belief variable I_i with Domain(I_i) = [1, Q_i], where Q_i is the number of hypotheses for a_i's initial belief. Set the prior probabilities for G_i and I_i according to the confidence in those hypotheses.

3. For each agent a_i, create a belief variable (b^0_j)_i for every state variable s^0_j and add s^0_j as the corresponding belief variable's parent. Connect each (b^0_j)_i to I_i and set P((b^0_j)_i | s^0_j, I_i) according to what that agent's initial belief dictates.
4. Create a layer for t + 1: for each state variable s^t_j, create s^{t+1}_j and add s^t_j as its parent. Similarly, for each agent a_i and variable j, create (b^{t+1}_j)_i and add (b^t_j)_i and s^{t+1}_j as its parents.

5. For each agent a_i, and each plan candidate of that agent m ∈ [1, M_i], create a variable for the action (p^t_i)_m and add the agent's goal variable G_i as its parent. Find all variables in s^t and b^t_i that correspond to the action's preconditions, pre((p^t_i)_m), and add them to its parents. Now find all variables in s^{t+1} and b^{t+1}_i that correspond to the action's effects, add((p^t_i)_m) and del((p^t_i)_m), and add the action variable to their set of parents.

6. Increment t and repeat steps 4 through 6 until there are no more actions in the plan candidates for the agents.

Once the initial BAG has been constructed, several iterations are performed of evaluating the network using standard BN inference mechanisms, detecting either failed actions or perceptive events that might alter action choice, re-planning, and inserting new partial plans into the BAG. The BAG approach can be used to improve the robustness of the deterministic MTP approach, but it still lacks many fundamental capabilities that we would expect from a mind-theoretic planning system. It also requires frequent, and fairly computationally expensive, inference operations.

Figure 3-10: (a) A demonstrative example of a simple BAG. (b) An example of a BAG instantiated for an actual navigation problem.

Even though this implementation can compute plans that anticipate the actions of others and both avoid their negative effects as well as exploit the positive ones, it suffers from the fundamental limitation of not being able to produce behavior that pro-actively seeks to learn or disambiguate between the mental states of the other. Similarly, it lacks a principled method for probabilistically reasoning about which mental state the other truly has, and therefore about which set of actions should be anticipated and how to hedge against that uncertainty. These limitations led us to make a heavier commitment to probabilistic representations and adopt the Markov Decision Process formalisms for our state and action representations.

3.5 Mind Theoretic Planning

In the previous sections, we described earlier attempts at implementing a system to perform mind-theoretic planning. In pursuing those projects and learning more about the challenges involved, we developed better instincts about how to approach the problem, which led to the formulation presented in this chapter. The most significant insights that were employed in this solution are that the behavior of others can be predicted by knowing how they place value on states in the world, and that this knowledge can then be leveraged in our own forward model of the environment for planning purposes. This implies the need for value functions, policies, and transition functions. The approach taken in this work is to predict the actions of others by reasoning about possible beliefs they may have about the environment and goals they wish to achieve. We will construct mechanisms called goal situations and mental state situations that can be used to predict which actions the other agent is likely to take in any state.
These predictions will be leveraged by the transition function of a specialized POMDP that relies on a customized observation function to help with the mental state inference of the other. The product of the POMDP planning process will be an action policy for the robot that produces behavior conducive to learning about the beliefs and goals of the other agent, attempting to correct false beliefs if useful, and finally assisting in the completion of the determined goals. A serious effort is expended to make these calculations tractable. A heuristic-based search algorithm, B3RTDP (see section 2.5), was developed mainly for this purpose and is employed by the system. Value functions at various levels of the system are initialized with heuristics extracted from previous calculations wherever possible, and several approximation techniques offered by the B3RTDP system are taken advantage of (such as action convergence and belief discretization).

3.5.1 Definitions of Base States and Actions

In this section, the notation and formalisms that are used to specify the mechanism of the MTP system will be defined. We begin by defining the basic building blocks of our MTP system. These are the actual representations of the real and physical (as opposed to mental) features and dynamics of the environment in which mind-theoretic reasoning about others is being performed. These functions and representations are domain-dependent and would be crafted for any environment, as is common practice in the automatic planning literature. We define the following:

* S_b: The actual state representation for the physical problem being solved. Any state s ∈ S_b should contain any relevant information about the environment and, importantly, about all of the agents (for example their locations, orientations, etc.)

* {A_h, A_r}: Sets of deterministic actions that are available to the robot and the human (note that different agents might have different capabilities)

* T_b: Since all actions in A are deterministic, the transition function is not very interesting. A more useful notation to have defined is the resulting state when taking action a from state s. We will refer to actions as functions a : S_b → S_b such that a(s) = s'

* C_b: The action cost function simply defines the base cost of expended resources for any action. Since we are using the Stochastic Shortest Path (SSP) model of MDPs, this function needs to be strictly positive

When encoding the base state space S_b, care should be taken to keep it as small as possible while still encoding all of the relevant features of the world required for the task at hand. More specifically, when encoding a state representation that will be used for mind-theoretic reasoning, it is important to encode any feature of the agents' configurations that could be useful for mental state reasoning. For example, in a task where navigation is important and the type of mental state reasoning being performed is goal inference, observing agent orientation might be an important visual cue which might otherwise not need to be encoded. The base actions in A_{h/r} should simply encode the actual "real-world" effects of taking that action and contain no information about anticipated behavior of other agents. For example, the action MoveForward(i) should only affect the location of agent i and change no other feature of the state, and PickUp(i, k) should only affect the part of the state that refers to what agent i is holding, etc.
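As an illustration only, the base building blocks for a hypothetical grid-navigation domain might be encoded as in the following sketch; the class and action names are invented here and are not the thesis implementation, but the structure mirrors the definitions above: deterministic actions as functions a(s) = s' and a strictly positive cost function.

# Minimal sketch of S_b, A_r, a(s) = s', and C_b for a hypothetical grid domain.
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class BaseState:
    """One element of S_b: agent locations, the robot's heading, and held items."""
    robot_pos: Tuple[int, int]
    human_pos: Tuple[int, int]
    robot_heading: str              # 'N', 'E', 'S', 'W'; a potentially useful visual cue
    holding: Tuple[str, ...] = ()   # items held by the robot

# Deterministic base actions are simply functions S_b -> S_b.
Action = Callable[[BaseState], BaseState]

HEADING_DELTA = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def move_forward_robot(s: BaseState) -> BaseState:
    """MoveForward(robot): changes only the robot's location."""
    dx, dy = HEADING_DELTA[s.robot_heading]
    return replace(s, robot_pos=(s.robot_pos[0] + dx, s.robot_pos[1] + dy))

def pick_up_robot(item: str) -> Action:
    """PickUp(robot, item): changes only what the robot is holding."""
    def apply(s: BaseState) -> BaseState:
        return replace(s, holding=s.holding + (item,))
    return apply

A_r: Dict[str, Action] = {
    "MoveForward(r)": move_forward_robot,
    "PickUp(r,cup)": pick_up_robot("cup"),
}

def cost(s: BaseState, action_name: str) -> float:
    """C_b: a strictly positive unit cost, as required by the SSP formulation."""
    return 1.0

if __name__ == "__main__":
    s0 = BaseState(robot_pos=(0, 0), human_pos=(3, 2), robot_heading="E")
    s1 = A_r["MoveForward(r)"](s0)      # a(s) = s'
    print(s1.robot_pos, cost(s0, "MoveForward(r)"))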
These base states and actions are treated as basic building blocks by the MTP system, which combines them in different ways to construct specialized transition and observation functions.

3.5.2 Types of Mental States

As was previously discussed in the background section (see Section 3.2.4), we will focus only on two kinds of mental states, namely beliefs and desires, which we will refer to as false beliefs and goals.

Goal Mental States

The goal mental state is one that fits particularly well with the existing planning metaphor of MDPs and POMDPs, especially the more restricted Stochastic Shortest Path versions of those models. An agent's goal hypothesis will simply be a boolean function over the base state space, evaluating to true when the goal is satisfied in the state and false otherwise. For the human we define g_j(s) to be its j-th goal hypothesis.

g_j : S_b → {true, false}   (3.1)

g_j(s) = { true, if the agent's j-th goal is satisfied in s; false, otherwise }   (3.2)

False Belief Mental States

The belief mental state requires a little more adaptation to the MDP metaphor. We will restrict ourselves to only representing belief mental states that refer to the physical state space of the world, as opposed to other possibilities such as reasoning about agent beliefs about actions and their effects, other agents and their capabilities, and so on. We will define a false belief f_k to represent two concepts: (1) mapping the true state to the false state and (2) dictating in what true state the error of the false belief can be perceived by the agent holding it. These two functions are defined as follows:

convertToFalse : S_b → S_b   (3.3)

canPerceiveFalse : S_b → {true, false}   (3.4)

convertToFalse(f_k, s_t) = s_f, where s_f is the "false" version of the true state s_t for the agent's k-th false belief

canPerceiveFalse(f_k, s_t) = { true, if the error of the agent's k-th false belief can be perceived by the agent in s_t; false, otherwise }   (3.5)

3.5.3 Inputs to the Mind Theoretic Planner

As previously discussed, the MTP system attempts to predict the behavior of other agents, based on their beliefs and goals, and to plan actions that aid in both better understanding which goal the other agents have as well as assisting in accomplishing them. Since the MTP system should be relatively domain-independent, it requires the domain model as input (as discussed in Section 3.5.1). The types of mental states that the MTP system is concerned with are false beliefs and goals (the formal descriptions of which are explained in the above Section 3.5.2). The MTP system requires as input distributions over hypotheses of both which goal the other agent might have as well as which initial false beliefs it might hold (one of which will be the true belief). Lastly, the MTP system requires a perception function, which dictates what features of a state any given agent can perceive. This basically provides the robot with an understanding of how its perception (as well as that of other agents) of the environment is limited by its sensors' ranges and other limitations. This will allow the robot to incorporate action sequences in its plans whose goal is simply to bring parts of the state space into the perceptual range of the robot so it can make judgements about how to better proceed. For example, these could be features which best distinguish between different possible mental states of the other agents. The more significant inputs to the MTP system are the following:
* S_b, {A_h, A_r}, C_b: The base state space, action spaces, and cost function

* s_init ∈ S_b: The initial base state

* Pr(g_{1:G}): A distribution over G goal function hypotheses for the other agent

* Pr(f_{1:F}): A distribution over F false belief function hypotheses for the other agent

* canPerceive(h/r, v, s_b): A boolean perception function which specifies what features v of state s_b can be perceived by either the human or the robot

The system also has various tuning parameters which can be specific to the particular structures that will be introduced in the subsequent sections:

* α: The probability that should be assigned to predicting random action choice by the other agent

* β: Dictates how much preference should be placed on predicted actions from higher levels in the predictive stacks of the goal situations versus the lower levels

* L: The number of predictive levels that should be used in the goal situations

* Any parameters needed for the B3RTDP POMDP solver and the BRTDP MDP solver

3.5.4 Mental State as Enumeration of Goals and False Beliefs

Given our existing representations of false belief functions and goals of the human agent, we define a mental state index to be simply the index of any combinatorial assignment of false beliefs and goals to the other agent. As an example, in a domain where we have two goal hypotheses g_0 and g_1 and one false belief hypothesis f_1 (in addition to the NULL or true belief f_t), we have the following set of mental state indices:

Goal | False Belief | Mental State
g_0  | f_t          | 0
g_0  | f_1          | 1
g_1  | f_t          | 2
g_1  | f_1          | 3

We are able to retrieve either the goal function or the false belief from the mental state index by simply using integer division and the modulus operator. For a given mental state index m, the goal function is picked out by g_{floor(m/F)} and the false belief function is found using f_{(m % F)}. These functions simply provide a way to go from having the mental state index to having the actual goal and false belief that it represents.

3.5.5 Action Prediction

Goal Situation

We define a goal situation for each of the other agent's goal hypotheses g_j (see Figure 3-11). The purpose of this structure is to predict the other agent's action given certain information about its goal. Each goal situation contains a stack of predictive models for each agent. A stack is composed of levels, the number of which is a parameter of the MTP system. Each level l contains a value function V^l_{h/r}, a transition function T^l_{h/r}, and an action policy π^l_{h/r}. The transition functions are constructed according to Algorithm 5. Note that action predictions from the higher levels in the stack are given higher probability than predictions from lower levels. This is achieved with a discounting factor β (when β = 1 all prediction levels get assigned the same probability; when it is close to zero only the highest level gets assigned any value). We define the following function to retrieve a predictive policy for the other agent's action selection at level l from a particular goal situation:

getPredictivePolicy(g_j, l) = π^l_h   (3.6)

We also define the following simple function to determine whether the goal is accomplished in a goal situation:

goal(g_j, s_b) = { true, if the goal encoded by g_j is satisfied in s_b; false, otherwise }   (3.7)

Lastly, we define a function to extract the state values of the highest-level robot MDP; this will be used later to provide heuristic values to a POMDP solver:

getRobotValueFunction(g_j) = V^L_r   (3.8)
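The following is a minimal sketch (illustrative stand-ins only, not the thesis code) of the bookkeeping introduced in sections 3.5.4 and 3.5.5: recovering a (goal, false belief) pair from a mental state index m with integer division and modulus, and a goal situation container exposing the accessors of equations 3.6 through 3.8. All concrete types and field names here are invented for illustration.

# Minimal sketch of the mental state index and a goal situation container.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple                          # stand-in for a base state in S_b
Action = str
GoalFn = Callable[[State], bool]       # g_j : S_b -> {true, false}
Policy = Callable[[State], Action]     # pi^l for one agent at one level
ValueFn = Dict[State, float]

def mental_state(m: int, goals: List[GoalFn], false_beliefs: List[object]):
    """m -> (g_floor(m/F), f_(m % F)); index 0 of false_beliefs holds the true belief f_t."""
    F = len(false_beliefs)
    return goals[m // F], false_beliefs[m % F]

@dataclass
class GoalSituation:
    """One goal situation: a stack of predictive levels for each agent."""
    goal_fn: GoalFn
    human_policies: List[Policy]       # pi^0_h .. pi^L_h
    robot_policies: List[Policy]       # pi^0_r .. pi^L_r
    robot_values: List[ValueFn]        # V^0_r .. V^L_r

    def get_predictive_policy(self, level: int) -> Policy:
        """Equation 3.6: the human's predictive policy at the requested level."""
        return self.human_policies[level]

    def goal(self, s: State) -> bool:
        """Equation 3.7: is the goal encoded by g_j satisfied in s?"""
        return self.goal_fn(s)

    def get_robot_value_function(self) -> ValueFn:
        """Equation 3.8: the highest-level robot value function, later used as a POMDP heuristic."""
        return self.robot_values[-1]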
Algorithm 5 creates a transition distribution from each state s for every action a such that the agent's own action a's effects are achieved in every possible outcome with full certainty. In addition to the effects of action a, several possible actions of the other agent are predicted and their outcomes added to the transition with varying probability weights. Some probability is assigned to the other agent taking any of its actions randomly, but higher probability is assigned to actions drawn from the predictive policies of the levels below the current one. The algorithm is initiated to compute either the robot's or the human's transition function, and it iterates over each state s and each applicable action a of the agent in question (lines 5 and 6). Line 8 asserts that every transition from s by taking a has zero probability initially. The deterministic successor state s' is retrieved from the base transition function T_b in line 12. The algorithm then proceeds to iterate over all actions of the other agent that are applicable in s', retrieve their successive deterministic states s'', and attribute some minimum probability to those outcomes in the transition function in line 19. It then iterates through each lower level of the goal situation, picks out the predicted action of the other agent at that level in line 24, and assigns probability to the deterministic outcome state of that action (giving higher probability to action outcomes from higher levels) in line 28. It finally makes sure that every row in the transition function is normalized in line 30.

Algorithm 5: Pseudocode for constructing the transition functions T^l_{h/r}, at levels l > 0, of the goal situations. Note that the subscript h/r denotes that this works for either agent's predictive stack, but the order of h/r versus r/h marks that if one refers to the human then the other refers to the robot and vice versa.

     2  Input: α ∈ [0, 1], probability assigned to random action choice
     4  Input: β ∈ ]0, 1], preference factor for predicted actions from higher levels
     5  foreach s ∈ S_b do
     6      foreach a ∈ A_{h/r} do
     8          initialize all entries in T^l_{h/r}(s, a, :) to 0
    10          // Apply our own base action to state s
    12          s' = a(s)
    14          // Assign small probability to random action choice by the other agent
    15          foreach a' ∈ A_{r/h} do
    17              s'' = a'(s')
    19              T^l_{h/r}(s, a, s'') += α
    21          // Pick the predicted action of the other agent from every lower level
                // and apply it with preference for higher levels
    22          foreach l' ∈ [0, l[ do
    24              a' = π^{l'}_{r/h}(s')
    26              s'' = a'(s')
    28              T^l_{h/r}(s, a, s'') += β^{(l-1)-l'}
    30          normalize T^l_{h/r}(s, a, :)

Figure 3-11: Shows how a goal situation is composed of stacks of predictive MDP models for each agent. Each model contains a value function, a transition function, and a resulting policy. Each transition function takes into account predictions from lower-level policies for the actions of the other agent. Value functions are initialized by heuristics that are extracted from the optimal state values of the level below, which speeds up planning significantly.
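The following is a minimal Python sketch of the same construction, under the assumptions that base actions are deterministic callables from state to state, states are hashable, and lower-level policies are callables from state to action. The weighting term mirrors the description above (equal weights when β = 1, concentration on the highest level as β → 0); the function name and default parameters are illustrative.

    from collections import defaultdict

    def build_level_transition(states, my_actions, other_actions, lower_policies,
                               alpha=0.05, beta=0.5):
        """Build T^l(s, a, s'') for one agent at level l = len(lower_policies)."""
        T = {}
        l = len(lower_policies)
        for s in states:
            for a in my_actions:
                row = defaultdict(float)
                s1 = a(s)                                  # our own action is deterministic
                for a_other in other_actions:              # small floor for any random action
                    row[a_other(s1)] += alpha
                for l_prime, policy in enumerate(lower_policies):
                    a_pred = policy(s1)                    # other agent's predicted action at level l'
                    row[a_pred(s1)] += beta ** ((l - 1) - l_prime)   # higher levels weigh more
                total = sum(row.values())
                T[(s, a)] = {s2: p / total for s2, p in row.items()}
        return T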
Lowest Level Prediction

As was discussed earlier, each goal situation defines a predictive stack of MDP models for each agent, where the transition function at every level references the predictive policies of all lower levels from the other agent's stack. Obviously, this can only be done for levels that have other levels beneath them. Special care therefore needs to be taken for the lowest level of the predictive stacks, especially since the behavior produced by the predictive policies of that level "seeds" the predictive stack with an over-simplified prediction that gets improved and made more sophisticated with every level that is added.

In the MTP system, we have chosen to solve a joint centralized planning problem at the lowest level, as if one central entity were controlling both agents to optimally achieve both of their goals, since this is a good and optimistic approximation of perfect collaborative behavior. Because of the intended collaborative nature of the MTP system, we thought it appropriate to make the simplifying assumption in the lowest level that every agent acts optimally with respect to the joint set of goals for all agents and with perfect information about which actions the other will take. This is equivalent to assuming that there exist perfect trust, benevolence, and communication between the agents. It is an optimistic simplification, since in the real world each agent generally acts greedily with respect to achieving its own goals, it does not have any certain knowledge of the other's actions (and often not even information about their state), and communication is often limited or costly.

We implement this simplified prediction with a centralized planner that has access to the actions of all the agents and uses them to calculate a joint value function. This planner then defines a greedy policy for each agent that chooses greedily from the joint value function using the restricted set of only that agent's actions.

Mental State Situation

We now define a mental state situation (MSS) to be a complete assignment of a goal hypothesis and a false belief hypothesis (which can be the "true" false belief) from their respective input distributions (see Figure 3-12). This means that for an MTP system there will be as many MSSs as there are mental states. An MSS can be queried for what action should be expected from the other agent (from any of its predictive levels). It will produce the action predicted by inputting the appropriate false projection of the state to the requested level's policy of the other agent's predictive stack from the appropriate goal situation. This is performed in the following sequence:

1. The appropriate false belief is picked out using the mental state index by taking the modulus: f = f_{m % F}
2. The incoming state is transformed into the false state via Equation 3.3: s_f = convertToFalse(f, s)
3. The appropriate goal is picked out using the mental state index: g = g_{floor(m/F)}
4. The action policy at level l of the other agent's predictive stack in goal situation g is selected (Equation 3.6) and its prediction from the false state is returned: π = getPredictivePolicy(g, l), a = π(s_f)

This provides us with a method for predicting the actions of an agent if we know with certainty the false belief that they hold and the goal state that they desire.
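A minimal sketch of this query, reusing the hypothetical GoalSituation and FalseBelief containers from the earlier sketches; the function name and argument layout are illustrative.

    def predict_other_action(m, level, true_state, goal_situations, false_beliefs, F):
        f = false_beliefs[m % F]                     # step 1: pick the false belief
        s_f = f.convert_to_false(true_state)         # step 2: project onto the false state
        g = m // F                                   # step 3: pick the goal hypothesis
        policy = goal_situations[g].get_predictive_policy(level)   # step 4: level-l policy ...
        return policy(s_f)                           # ... queried at the false state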
3.5.6 POMDP Problem Formulation

Up until now, we have discussed how agent actions can be predicted given that we know their mental state (what they believe and what they desire). This is useful for coordinating our actions with those of the other agent, and for anticipating which changes to expect in the state as a consequence of the other's actions. But we do not actually know the mental state of others with certainty, and we therefore need a way to reason probabilistically about which mental state others hold, and to learn to act not only in coordination with them but in a way that can:

1. Seek better knowledge of their mental state
2. Having identified that an agent holds a false belief, seek to help correct it if useful
3. Take actions that are good with respect to all currently possible mental states while avoiding taking ones that are detrimental to any of them
4. Weigh the expected benefits of the above activities against the cost of taking actions, in a principled way

Figure 3-12: This figure shows an example MTP action prediction system with two goal hypotheses and a single false belief hypothesis (in addition to the true belief), resulting in four distinct mental state situations. An action prediction from any level can be queried by following the enumerated steps in Section 3.5.5 under "Mental State Situation".

Partially Observable Markov Decision Processes (POMDPs) (described in Section 2.2.2) are a natural choice for this problem, as they operate on probability distributions over possible states rather than on states themselves. These probability distributions are called beliefs and should not be confused with the type of mental state that we have been calling false beliefs, which really refer to a misconception about the state of the world rather than a distribution over possible states. So if we say that an agent has some particular false belief about the world, then we are stating that the agent thinks, with full certainty, that the state of the world is in some false configuration. Another agent might be unsure about which false belief that first agent has and therefore maintains a belief over all possible mental states of that agent.

Because POMDPs do not assume that the true world state can be directly observed (otherwise there would never be any uncertainty about the state and beliefs would be unnecessary), they require that actions emit observations when taken. These observations can depend on the state that the action transitioned to as well as on the action itself. This is what gives POMDPs the expressiveness to produce actions that seek information about the world. This is an important feature for a mind-theoretic agent, as it can often be advantageous to act with the specific purpose of trying to learn the true mental state of the other agent so as to better predict their future actions.

Beliefs Over Augmented States

Up until now, we have been referring informally to base states and mental state indices. As was discussed in Section 3.5.4, a mental state index simply refers to a unique combination of a false belief and a goal of the other agent. Once the mental state index is known with full certainty, then the false belief the other agent holds and their goal are also fully known. We now define a new augmented state s^m, which is simply the combination of those two types of states (base state s_b and mental state index m):

    s^m := {s_b, m}                                                  (3.9)

This augmented mental state s^m will now serve as our state representation for the subsequent sections of this chapter. We have also defined a special mental state index, which we will denote m_∅, that does not correspond to any false belief or goal index but is simply used to represent the absence of goal-directed behavior. This is useful when interacting with agents that are people, as they do not always act on explicit goals but sometimes just wander or explore. This "extra" mental state serves as a first-order approximation of recognizing that behavior.
The POMDP beliefs are defined as probability distributions over these augmented mental states. The initial belief is constructed to contain one state per mental state index, each initialized with the same base state s_init. The initial probability of any mental state is defined by the product of the probabilities of the false belief and the goal corresponding to the mental state index, taken from their respective input distributions. As an example, this is what the initial belief looks like for an MTP problem with three goal hypotheses and one false belief hypothesis:

    s^0 = {s_init, 0} : Pr(g_0) · Pr(f_t)
    s^1 = {s_init, 1} : Pr(g_0) · Pr(f_1)
    s^2 = {s_init, 2} : Pr(g_1) · Pr(f_t)
    s^3 = {s_init, 3} : Pr(g_1) · Pr(f_1)
    s^4 = {s_init, 4} : Pr(g_2) · Pr(f_t)
    s^5 = {s_init, 5} : Pr(g_2) · Pr(f_1)
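A minimal sketch of this construction, assuming the two priors are given as plain lists (with the true belief f_t counted among the belief hypotheses) and augmented states are represented as (base state, m) tuples; names are illustrative.

    def initial_belief(s_init, goal_prior, belief_prior):
        """b_0 over augmented states (s_init, m); belief_prior includes the true belief f_t."""
        F = len(belief_prior)
        b0 = {}
        for g, p_g in enumerate(goal_prior):
            for f, p_f in enumerate(belief_prior):
                m = g * F + f                        # mental state index (Section 3.5.4)
                b0[(s_init, m)] = p_g * p_f          # product of the two input priors
        return b0

    # Three goal hypotheses and one false belief hypothesis (plus the true belief f_t):
    b0 = initial_belief("s_init", [0.5, 0.3, 0.2], [0.5, 0.5])
    assert abs(sum(b0.values()) - 1.0) < 1e-9        # six augmented states, probabilities sum to 1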
Transition Function

The transition function for the MTP POMDP is used to model our prediction of how the other agent will act in any mental state. We have already created and described the mechanism for such prediction in Section 3.5.5: a mental state situation takes a base state and a mental state index (the exact contents of an augmented state s^m), converts the base state through the appropriate false belief transformation, and picks out the predicted action from any level of the appropriate goal situation. This process is explained graphically in Figure 3-12. The construction of T_POMDP is similar to the construction of the transition functions of the goal situations (which are explained in Algorithm 5), except that in this case the state is augmented with the mental state index, so we need to extract the correct mental state situation to predict the other agent's action. The construction is explained in detail in Algorithm 6.

Algorithm 6: Pseudocode for constructing the transition function T_POMDP for the MTP POMDP.

    Input: α ∈ [0, 1], probability assigned to random action choice
    Input: β ∈ ]0, 1], preference factor for predicted actions from higher levels
    foreach s^m = {s_b, m} ∈ S^m do
        foreach a ∈ A_r do
            initialize all entries in T_POMDP(s^m, a, :) to 0
            // Apply our own base action to the base state
            s'_b = a(s_b)
            // Assign small probability to random action choice by the other agent
            foreach a' ∈ A_h do
                s''_b = a'(s'_b)
                T_POMDP(s^m, a, {s''_b, m}) += α
            // If we have a goal-oriented mental state index
            if m ≠ m_∅ then
                // Use the appropriate false belief conversion (Eq. 3.3)
                s'_f = convertToFalse(f_{m % F}, s'_b)
                foreach l' ∈ [0, L[ do
                    // Find the appropriate goal situation given m and get its policy for level l' (Eq. 3.6)
                    π = getPredictivePolicy(g_{floor(m/F)}, l')
                    // We pick the action given the false state
                    a' = π(s'_f)
                    // And then apply the action to the true state
                    s''_b = a'(s'_b)
                    T_POMDP(s^m, a, {s''_b, m}) += β^{(L-1)-l'}
            normalize T_POMDP(s^m, a, :)

Observation Function

The observation function of the MTP system serves two purposes. Firstly, it models the perceptual perspective of the robot by generating a unique observation for any unique configuration of the features of a state that are currently perceivable to the robot. This is visualized in Figure 3-13. Secondly, the observation function is used to expel false belief hypotheses about the other agent once that agent should have perceived the error of the false belief in the true state.

The MTP system is agnostic to which kind of perception is needed for different kinds of problems. It only requires as input a boolean function describing what features of the state can be observed by a particular agent given the true state. Even though the system accepts any such perceptual function, it is useful to think of a domain where agents perceive the environment with a camera and perceptual availability is limited by the field of view and line of sight. We will use this scenario for demonstrative purposes.

Figure 3-13: Shows two states that would produce the same observation because they are indistinguishable within the perceptual field of the robot (which is limited by field of view and line of sight in this domain). If the other agent were to move slightly into the white space in the state on the right, the observation function would produce a different observation.

We chose to make the MTP system use a deterministic observation function. This design decision was made to increase the tractability of the problem. Given how we use the observation function, the space of possible observations will be fairly large for any kind of interesting domain; if sensor noise were also modeled by the function, that would require significantly more computational effort. Since dealing with sensor noise is not the main goal of this work, but reasoning about others' mental states is, we decided to spend our computational effort on the more relevant parts of the problem. Lastly, we have chosen to have the observation function depend only on the resulting state of an action and be independent of both the originating state and the action itself. This reduces the observation function to a mapping from state to a positive natural observation number, where states that are indistinguishable within the perceptual range of the robot get assigned the same number and distinguishable states get different numbers. This is visualized in Figure 3-13.

The second role of the observation function is to expel false belief states from the POMDP belief when the other agent is able to perceive the error of the false belief. We encode the observation function to emit a special unique observation o^f_i when the mental state index indicates that the other agent holds a false belief and the error of that false belief can be perceived by it in the true base state. This is formalized as follows:

    ∀ i ∈ [1, F]:  (m % F = i) ∧ canPerceiveFalse(f_i, s_t)  ⟹  O(s^m) = o^f_i          (3.10)

In actuality, these special false belief observations o^f will never be emitted, which forces the Bayesian belief update (see Equation 2.3) to assign zero probability to any state in a belief that should have emitted such an observation (namely the false belief states). To demonstrate this, we will use an example scenario where the robot and a human are in a room and the robot knows both that the human wants to exit the room and the location of the exit. The robot is uncertain whether the human knows the location of the exit, so it generates two false belief hypotheses (one of them being the true belief) with equal probability (see Figure 3-14).

Figure 3-14: Shows an example scenario where the robot knows that the human's goal is to exit the room, and the robot also knows the location of the exit. The robot is uncertain about whether the human knows the location of the exit and therefore creates two false belief hypotheses, one representing the true state and another representing a false alternative.

In this demonstrative scenario, the human might take several actions to move towards the exit until it is facing north and only one turn action away from learning the truth about the location of the exit.
At this point the robot holds a belief with two mental states, one that predicts that the human will turn left because it holds the false belief, and another that predicts a right turn (see Figure 3-15). If the human chooses to turn left, the belief update will expect the special false belief observation if the false belief mental state is true, but a regular perceptual observation if the non-false-belief state is true. Obviously the false belief state is false, and therefore the perceptual observation will be emitted; the resulting belief will contain only the true mental state, since it is the only one that could have produced that observation. This accomplishes what we wanted, which is to model how the human can come to perceive the error of its false beliefs.

Figure 3-15: Shows the stage where the human is one action away from learning what the true state of the world is.

Figure 3-16: When the human agent has turned left, it will expect to see either the exit or the wall, depending on its mental state. In the false belief state, where it expects to see the exit, a special observation o^f is also expected, since in this state the agent should be able to perceive the error of its false belief. Since this observation will never actually be emitted by the MTP system, the belief update will attribute zero probability to any state in the subsequent POMDP belief where that observation was expected.

3.5.7 Putting it All Together

In the previous sections, we have presented a way to frame the problem of Mind-Theoretic Planning as a Partially Observable Markov Decision Process. A method has been described for action prediction using probabilistic reasoning about the false beliefs and goals of the other agent, which creates a layered structure of predictive MDP models. We also presented a formulation of an observation function, which can predict how agents perceive their environment and also the error of their false beliefs. A few pieces are still needed to calculate the robot's POMDP action policy, which will be discussed in the subsequent sections.

Robot's Goal Function

The MTP system is designed to support collaborative human-robot teaming. Therefore it is central to its design that it should generate helpful and collaborative behavior. To achieve this, we designed a goal function for the robot which stipulates that the other agent should accomplish their goal, whatever it may be, within a given probability threshold ε. The inherent challenge in this encoding is clearly that the robot is initially uncertain about which goal the other agent has. But this is exactly the core challenge which the MTP system is designed to solve, namely to generate behavior that seeks to learn which goals and false beliefs the other agent holds and then to act to assist them in achieving those goals. We define a formal goal function for the robot that uses a notational convenience function goalSatisfied (which, in turn, relies on the goal function from Equation 3.7):

    goalSatisfied(s^m) = 1  when goal(g_{floor(m/F)}, s_b)
                         0  otherwise

    goal_POMDP(b) = true  iff  Σ_{s^m ∈ b} b(s^m) · goalSatisfied(s^m) ≥ 1 − ε            (3.11)

This goal function simply sums up the belief probabilities of the mental states whose goals are satisfied. If that sum is higher than 1 − ε, then we say that our goal is satisfied in this belief.
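A minimal sketch of this belief-space goal test, assuming a belief is a dictionary from hashable (base state, m) pairs to probabilities and that the goal-less mental state index is passed in as m_null; names are illustrative.

    def goal_satisfied_in_belief(belief, goals, F, epsilon=0.05, m_null=None):
        """Return True when at least (1 - epsilon) of the belief mass lies in mental
        states whose goal is already satisfied in the corresponding base state."""
        mass = 0.0
        for (s_b, m), p in belief.items():
            if m == m_null:
                continue                             # the goal-less "wandering" mental state
            if goals[m // F](s_b):                   # goal of mental state m, tested on s_b
                mass += p
        return mass >= 1.0 - epsilon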
Figure 3-17: Shows the complete MTP system on an example problem with two goal hypotheses and one false belief hypothesis. On top sits a POMDP with an observation function that produces perceptually limited observations, with the addition of specialized false belief observations when appropriate. The POMDP transition function is deterministic in the action effects of the robot but uses the lower-level mental state situations to predict which actions the other agent is likely to take, and models the effects of those stochastically. The figure also shows how the value functions at lower levels serve as initialization heuristics for higher-level value functions. The value function of the highest level of the robot's predictive stack is used as an initialization to the QMDP heuristic for the POMDP value function.

Solving the POMDP using a QMDP Heuristic

Now that we have defined the goal function, the transition function, and the observation function, we have all that is needed to solve the POMDP. We use our B³RTDP algorithm, which was described in Section 2.5. B³RTDP is a heuristic-search-based algorithm, and its performance can be greatly improved if the heuristic that it is provided with is good. It is common practice to use what is called a QMDP heuristic for POMDPs (see Section 2.5.2). This heuristic solves the underlying MDP problem of the POMDP by ignoring the observation function, which is equivalent to assuming that the state of the problem will be fully observed upon taking the first action. As can be seen in Figure 3-17, we use a QMDP belief heuristic for our MTP POMDP. Furthermore, we use a Bounded-RTDP (BRTDP) MDP solver (see Section 2.4.2) to solve for our QMDP. BRTDP is also a heuristic-based search algorithm, so it too benefits greatly from a good heuristic. Incidentally, since we have computed predictive MDP value functions for both agents, originally for the purpose of predicting the other agent's actions, we now have great heuristic values to provide to the BRTDP solver for our QMDP belief heuristic. We therefore define the following heuristic function:

    h(s^m) = getRobotValueFunction(g_{floor(m/F)})(s_b)              (3.12)

and use it to initialize the QMDP problem, which, in turn, provides a belief heuristic for the final POMDP planning.

3.5.8 Demonstrative Examples

In this section, we will demonstrate with graphical examples the resulting behavior and inferences of the MTP system.

False Belief Uncertainty

We will begin by examining a navigational domain where the robot and a human navigate a constrained environment to arrive at their respective goal locations. The actions available to them are: TurnNorth, TurnSouth, TurnEast, TurnWest, Move, and Wait. Figure 3-18 shows the true environment configuration and the agent goals. In this example, the robot is certain about the human's goal but is unsure whether the human is aware of the obstacle that is immediately east of that goal.

Figure 3-18: Shows the configuration of the environment of this example. Gray areas represent obstacles.

For demonstrative purposes, we apply two kinds of color overlay functions.
Firstly, we slightly "gray out" the grid cells that cannot be perceived by the robot at any given time, to illustrate the parts of the world that can be observed. Secondly, we color grid cells in tones of green depending on the robot's certainty that the human occupies that cell. We exaggerate the level of color to better visualize areas of low probability with the following non-linear function: green = 1 − e^{−5 · Pr}.

Figure 3-19: Simulation at t = 0. The robot can perceive the human but is initially uncertain of their mental state.

Figure 3-20: Simulation at t = 11. The robot has moved out of the human's way but did not see whether they moved east or west. The robot maintains both hypotheses, with slightly higher probability on the false belief since the human did not immediately turn east at t = 1.

Figure 3-21: Simulation at t = 20. The robot now expects that, if the human originally held the false belief, they would have perceived its error by now, and it is confident that they currently hold the true belief. The robot expects that if the human originally held the false belief, then they should pass through the robot's visual field in the next few time steps. Notice how the robot has been waiting for that perception to happen or not happen (the latter indicating that the human held the true belief the whole time) before it proceeds to move to its goal.

Figure 3-22: Simulation at t = 28. Finally, once the robot has seen the human pass by, it proceeds to follow them, and subsequently both agents accomplish their goals. Even if the robot had not seen the human pass by, eventually, once it became sure enough, it would proceed in exactly the same manner, concluding that the human had originally held the true belief.

Goal Uncertainty

We now explore a different navigational domain. In this scenario, the robot is not certain which of eight goals the human might have. The robot's goal is simply to learn the human's goal and provide assistance (which in this domain mostly consists of getting out of the way).

Figure 3-23: (a) and (c) refer to the simulation at t = 0; (b) and (d) refer to the simulation at t = 1. We can see that initially the robot is completely uncertain about the mental state of the human, but after seeing that the human took no action, it assumes that goals 5 and 6 are most likely (the ones that the robot is currently blocking access to).

Figure 3-24: (a) and (c) refer to the simulation at t = 13; (b) and (d) refer to the simulation at t = 18. Once the robot has retreated to allow the human to pursue the two goal hypotheses that are most likely, it chooses to make one goal accessible. If the human does not pursue that goal given the opportunity, the robot assumes that the other one is more likely and creates a passageway for the human to pursue it.

Figure 3-25: (a) and (c) refer to the simulation at t = 28; (b) and (d) refer to the simulation at t = 36.
If the human moves away while the robot cannot perceive it, the robot uses its different goal hypotheses to predict the most likely location of the human. The robot then proceeds to look for the human in the most likely locations. In this case, its first guess was correct, and by using our mind-theoretic reasoning techniques it was able to find the human immediately.

Chapter 4

Evaluation of Mind Theoretic Reasoning

4.1 Simulator

We decided to develop a first-person-perspective 3D simulator, both to support development of the MTP system and to provide an environment in which we could run user studies to evaluate the system and learn how it can better support human-robot teamwork.

4.1.1 Different Simulators

When pursuing research in robotics, it can often be very useful to have access to good simulators to speed up the development and testing cycles, since actual robot hardware can be cumbersome and error-prone to run. We have used existing robot simulators, and developed our own, for the various tasks that have been under development in the past years in our research group.

Figure 4-1: A snapshot from the USARSim simulator, which was used to simulate urban search and rescue problems. On the right, we can see a probabilistic Carmen occupancy map created by using only simulated laser scans and simulated odometry from the robot.

The first robot simulator we evaluated was USARSim, a robotic simulator built on top of the Unreal Tournament™ game engine. This simulator is widely used within the robotics community and proved useful for testing and sharing code and projects across collaborations with other researchers on previous projects. We ended up choosing not to use USARSim since it is known to be somewhat heavy to run, often buggy, and difficult to customize.

When we moved away from using USARSim, we chose to develop a robot simulator that could be tightly integrated with our existing Java™ codebase. We decided to build a simulator using the Java Monkey Engine™, an open-source game engine. This framework was sufficient for developing our simulations, and Figure 4-2 (a) shows a team of MDS robots performing joint navigation to build a tower of blocks. We ended up having to stop using this environment because of its lack of documentation and rather small user base (leading to sparse forum posts and a general lack of support).

Figure 4-2: Screenshots from the (a) Java Monkey Engine™ and (b) Unity3D simulators that were developed to evaluate our robot systems.

At this point, we decided to move our development onto a better-supported environment and chose a world-leading commercial platform called Unity3D™. Figure 4-2 (b) shows the first simulator we developed in this environment. In this figure, the robot is actually controlled by an early version of the deterministic MTP system. In this scenario, the robot is reasoning about a possible false belief the human construction worker might have (whether or not he knows about the fire blocking one of the exits).

Finally, we created another version of the Unity3D™ game with much more emphasis on making it a good platform for running user studies (see Figure 4-3). With that in mind, we created a simulator, or video game, that would compile to run either in desktop mode or in a web browser. In this game, the user controls a graphical human character and operates in a grid-world environment with a robot.
The robot is controlled from a different terminal of the game and can either be puppeteered by an AI system or controlled by another user. A trick that we employed, so that we could have two people play each other while both thinking the other one was an autonomous robot, was to make the controlled character from either terminal always render as the human model and the "other" character render as the robot.

Figure 4-3: Shows the video game from the perspective of the human user. The character can navigate to adjacent grids, interact with objects on tables, and push boxes around. The view of the world is always limited to what the character can currently perceive, so the user needs to rotate the character and move it around to perceive more of the world.

4.1.2 On-line Video Game for User Studies

The user can control the character either by using the keyboard or by clicking with the mouse within the game window. The actions available to them are to rotate the character left or right, move forward, pick up or put down items in front of the character, and apply tools to the engine base if applicable. At any given time, the user can only see the features in the environment that are visible to their character. As the user moves the character around, different features of the environment "fade in" as they become perceptible.

The main goal-oriented task that has been implemented in this simulator, other than simple navigation tasks, is an engine assembly task. For this purpose, we have included several graphical assets for items that are relevant to assembling an engine. Figure 4-4 shows the different parts and tools that can be interacted with.

Figure 4-4: These are the objects in the world that can be picked up and applied to the engine base: (a) engine base, (b) engine block, (c) air filter, (d) screwdriver, (e) socket wrench, (f) wrench. To fully assemble an engine, two engine blocks and an air filter need to be placed on the engine base. After placing each item, a tool needs to be applied before the next item can be placed. The type of tool needed is visualized with a small hovering tool-tip over the engine.

The engine assembly task has a very sequential structure, in that a very specific order of bringing items and tools to the engine base is required. First, an engine block needs to be placed on the base and a socket wrench applied. The socket wrench then needs to be returned and another engine block placed on the base, but this time a regular wrench is required. Finally, an air filter should be placed on the base and a screwdriver is required to finish the assembly. This assembly process can be seen in Figure 4-5, which was used to help study participants understand how to achieve the goal.

Figure 4-5: This figure demonstrates the sequence of items and tools that need to be brought to and applied to an engine base to successfully assemble it.

To make it easy to recruit many people to participate in our studies, we designed the game to run in an internet browser, so that study participants could play the game from wherever they are without the hassle of coming into the lab. We developed a framework which can handle many concurrent users playing the game while simultaneously gathering behavioral data from the game as well as data from a post-game questionnaire (see Figure 4-6).
A central web server manages the connectivity between the different components of the system and serves up all web pages that are needed (main sign-up page, game instructions, user-specific task links, etc.). It also maintains connectivity between the user's browser and the game servers during game-play to detect error conditions, whether or not the user is actually playing the game, and so on. The web server can either sign users up using their email (and send a confirmation email to verify identity) or accept game requests from Amazon Mechanical Turk™ (in which case the user identity is obtained by using their MTurk workerId). A commercial multiplayer cloud service called Photon™ is used to synchronize all game activity between the game servers and the game which is running in the user's browser.

Figure 4-6: This figure demonstrates how the different components are connected to create the on-line study environment used in the evaluation. A user signs up for the study either by going to the sign-up website (possibly because they received a recruitment email or saw an advertisement) or because they are an Amazon MTurk user and accepted to play. The web server (http://prg-robot-study.media.mit.edu) assigns the user to a study condition and finds a game server that is not currently busy and assigns it to the user. The game server initializes the Unity game and puppeteers the robot character according to the condition of the study assigned to the user. The game state, character actions, and environment are synchronized between the game server and the user's browser using a free cloud multiplayer service called Photon™. Study data is comprised of both the behavioral data in the game logs and the post-game questionnaire data provided by the Survey Monkey service.

4.2 Human Subject Study

Human subject study evaluations of robotic systems can present many challenges. Research robots require a lot of maintenance and can have frequent malfunctions. Their sensory systems are imperfect, and navigation and manipulation are usually slow and error-prone. These factors make it difficult to construct a study where certain details about the robot's behavior are being manipulated while other variables are held fixed (confounding factors in HRI are discussed further in (Steinfeld et al., 2006)). Studies involving complicated research robots that are difficult to operate also tend to make it difficult to gather sufficient data to confidently support or contradict the research claims.

Simulations and virtual environments can often be used to provide an alternative to physical experiments with robots. Computer simulations offer the benefit of repeatability and consistency across participant experiences. They are not constrained by the limiting resource of physical robotic hardware, allowing many participants to interact with the same simulated robot at the same time. This, combined with making it easier for participants to take part in robot studies (allowing them to participate from their homes rather than having to travel to a location with a robot), can allow the researcher to gather much more data than otherwise possible.
One limitation of using simulations to evaluate systems for human-robot interaction is that users generally do not get as engaged in the interaction with a simulated robot as they would with a physical robot (Kidd & Breazeal, 2004). This can significantly impact people's perception of the systems being evaluated. These limitations should be weighed against the benefits of using simulations for evaluating HRI systems on a case-by-case basis.

We chose to evaluate the presented work using the on-line simulator presented in Section 4.1. Our reasoning is that, to demonstrate the benefit of mind-theoretic reasoning, the complexity of the task needs to be significant, and this can be difficult to accomplish with an actual robotic system while gathering the required number of data points. The choice of using a simulator is likely to impact our ability to accurately measure people's subjective impressions of interacting with the robot.

4.2.1 Hypotheses

Our research hypotheses are categorized into two groups: attitudinal, relating to the human's perceived traits of their autonomous teammate, and performance, relating to objective measures of performance and improvements in task metrics.

Attitudinal Hypotheses

We posit that the following hypotheses hold when people cooperate with a mind-theoretic agent rather than an alternate kind of autonomous agent:

H1 They perceive the MT agent to be: (a) more competent at the task, (b) more helpful, (c) more team-minded, (d) more intelligent

H2 They attribute the MT agent with: (a) a higher degree of Theory of Mind, (b) more human-like features

H3 Their experience with the MT agent is perceived to be: (a) less mentally loading, (b) more enjoyable

Performance Hypotheses

Similarly, we posit that the following hypotheses hold when people cooperate with a mind-theoretic agent rather than an alternate kind of autonomous agent:

H4 Team fluency is improved: (a) by reducing the mean time between human actions, (b) by decreasing the rate of change (within task) of the time between human actions, (c) by reducing the functional delay ratio (the ratio of the total human wait time between the robot taking an action and the human taking an action to the total task duration), (d) by decreasing the rate of change (within task) of the wait times in (c)

H5 Task efficiency is improved: (a) by reducing the total time of task completion, (b) by reducing the number of actions taken by the human and the robot

4.2.2 Experimental Design

Task

For the MTP system to have any substantial benefit to a mixed team of humans and autonomous agents, the task should require tightly coordinated operation of the team members. This means that, through some environmental constraints, agents' actions will affect features of the world state that are important to others. We created an engine assembly task that includes navigation within a constrained space, which requires coordination, as well as access to shared resources. In this scenario, two agents navigate a grid-world where some grids are not navigable because they contain walls or tables. The tables can contain either an engine base on which an engine can be built, one of two engine items (engine block or air filter), or one of three required tools (regular wrench, socket wrench, or screwdriver). The items and tools can be picked up and applied to the engine, and if this is done in a particular sequence then the engine will be fully assembled (see Figures 4-4 and 4-5).
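As a small illustration of the task's sequential structure, the following sketch encodes the assembly order described in Section 4.1.2 and Figure 4-5; the stage labels and function names are illustrative and not the study's implementation.

    ASSEMBLY_SEQUENCE = [
        "engine block", "socket wrench",
        "engine block", "wrench",
        "air filter", "screwdriver",
    ]

    def apply_to_engine(progress, thing):
        """Advance only when the next expected item or tool is applied to the base."""
        if progress < len(ASSEMBLY_SEQUENCE) and thing == ASSEMBLY_SEQUENCE[progress]:
            return progress + 1
        return progress

    def engine_assembled(progress):
        return progress == len(ASSEMBLY_SEQUENCE)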
The study had three different configurations of this task, with several rounds of each configuration in which item and tool placements were randomized:

1. One engine base, and only the human agent can take actions; the robot is immobilized. There was only one round of this level, and it was used to acclimate users to the environment. Data from this level was not used in the analysis.

2. One engine base and both agents active. This level had three rounds with randomized item and tool placements. The environment was very constrained and required navigational coordination to complete successfully.

3. Two engine bases and both agents active. This level had four rounds with randomized item and tool placements. In two of the rounds, the participant was told that more "reward" was given for assembling one engine over the other; in the remaining rounds those instructions were switched. Human and robot initial locations were switched between rounds.

Experimental Conditions

The experiment followed a between-participant design. We wanted to use this study to better understand the MTP system itself as well as to see how it compares to an alternative autonomous system and a human operator. To accomplish this, we created four different experimental conditions:

C1 MTP-POMDP: The agent is controlled by the full MTP POMDP stack (see Figure 3-17)

C2 MTP-QMDP: The agent is controlled by an action policy generated from the QMDP value function (used as a POMDP heuristic; see Figure 3-17)

C3 Handcoded+A*: The agent has access to a fully observable world state and is controlled by a hand-crafted optimal rule-based policy that uses an A* search algorithm for navigation

C4 Human: The agent is controlled by an expert human confederate who has been instructed to perform the task as efficiently and helpfully as possible

C1 and C2 both take advantage of the predictive abilities of the MTP system. Both of them use the observation model of the POMDP problem formulation to perform a Bayesian belief update. The difference between them is that the C2 agent operates under the inherent assumption of the QMDP heuristic, which is to assume that after taking the first action, all subsequent states will be directly observable. This generally leads to behavior that is overly "confident" in its mental state estimate of the human and never chooses to take any actions purely for the purpose of information gain, or "sensing actions."

C3 follows a hand-coded policy that ensures the following, while using an A* path planner for navigation (a sketch of this priority ordering is given after this list):

1. If holding an object that is not currently needed for the engine, return it to the nearest available location

2. If holding an object that is currently needed for the engine, navigate to the engine and apply it

3. If not holding any object, navigate towards the nearest item that is currently needed and pick it up

4. If the item needed for the engine is unavailable, navigate to a "safe location"

5. If there is more than one engine, do not take any action until the first item is applied to either engine, then designate that engine as the target engine to build

This policy is given the (admittedly unfair) advantage of perfect observability of the world state. Item 4 in the hand-coded policy was added after pilot testing to prevent the robot from creating a stand-off situation when it is blocking access to the engine while the human is holding the next tool to be applied but cannot apply it. C3 was developed to be a near-optimal (optimal when the human stays out of its way) policy that is strictly task-oriented. This policy benefits greatly from two advantages: (1) perfect observability and (2) never building the wrong engine.
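The sketch below illustrates the priority ordering of these rules; the state queries and action labels are hypothetical stand-ins, and navigation for each label would be delegated to the A* planner.

    def handcoded_action(holding, needed, needed_available, num_engines, target_engine):
        """Return a high-level action label following the C3 rule priority."""
        if num_engines > 1 and target_engine is None:
            return "wait"                    # rule 5: wait until a target engine is designated
        if holding is not None and holding != needed:
            return "return_held_object"      # rule 1: put back an object that is not needed
        if holding is not None and holding == needed:
            return "apply_to_engine"         # rule 2: bring the needed object to the engine
        if needed is not None and needed_available:
            return "fetch_needed_item"       # rule 3: go pick up the next needed item
        return "go_to_safe_location"         # rule 4: stay out of the way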
C4 is a special case where the simulated robot assistant is actually operated by an expert human confederate. In this condition, the human confederate operator is instructed to simply accomplish the task with the participant as quickly and efficiently as possible while being helpful to the study participant.

Procedure

All pages and information presented to the participants are documented in Appendix Section B.2. Once a participant signed up for the study by providing an email address to the study website, they were randomly assigned to one of the four conditions. If they were assigned to C4, they would receive a confirmation email notifying them that they would soon receive another email to schedule their participation. If they were assigned to conditions one through three, they would receive an email asking them to (1) read the game instructions, (2) play the simplest level to practice, (3) play all three rounds of task one and all four rounds of task two, and (4) fill out the post-study questionnaire. Once enough participants had been assigned to condition C4, they would be sent a scheduling email where they could sign up for 20-minute time slots. Once signed up, they would receive an email with the same instructions as above, except that item (3) had to be completed within the scheduled slot.

Participants

Out of approximately 360 participants who signed up, only 86 (57 male) completed the post-task questionnaire and passed our exclusion criteria (the most relevant being how many tasks they completed). The mean age of our participants was 25.4 (σ = 6). Participants were randomly assigned to conditions, but the distribution of users into conditions after applying the exclusion criteria was the following: C1 = 18, C2 = 23, C3 = 35, C4 = 10. The reason for the uneven distribution mostly stems from the fact that more people experienced technical difficulties in conditions C1 and C2 and were therefore excluded, and that in C4 many of the participants who had signed up did not respond to later scheduling emails.

4.2.3 Metrics

Attitudinal Measures

A post-study questionnaire was used to acquire the attitudinal measures for this study. Every effort was made to use known and validated surveys and metrics, with only minor adaptations to fit this particular scenario. Because of some of the unique aspects of this study, we also created a few ad hoc questions we felt were relevant. All questions used a seven-point Likert scale, with the exception of a few free-text responses. The full questionnaire used can be found in Appendix B.1 of this thesis.

To measure the participant's perception of the robot's competence, we used the qualification factor from Berlo's Source Credibility Scale (Berlo et al., 1969). For evaluating the participant's perception of the robot's team-mindedness, we used the goal sub-scale of the Working Alliance for Human-Robot Teams (Hoffman, 2014). We also used selected questions from Hackman's Team Diagnostic Survey (Wageman et al., 2005). The robot's perceived intelligence was measured using the Intelligence sub-scale of the Godspeed metrics (Bartneck et al., 2009). We used a few selected questions from a study of perception of robot personality to evaluate the perception of the robot's social presence (Lee et al., 2006).
We were interested in the degree to which the participants attributed a Theory of Mind to the robot, so we selected and adapted some relevant questions from the Theory of Mind Index (Hutchins et al., 2012). For measuring attribution of human traits to the robot, we used the Anthropomorphism and Animacy sub-scales of the Godspeed questionnaire (Bartneck et al., 2009); enjoyment and likeability were measured using the Likeability sub-scale from Godspeed and the Bond sub-scale from the Working Alliance for Human-Robot Teams (Hoffman, 2014). Lastly, we measured task load using the standard NASA Task Load Index (this was the first section in the questionnaire so that people would answer it as soon after doing the task as possible) (Hart & Staveland, 1988).

Behavioral Measures

All behavioral metrics were extracted from game log files that are recorded during game play. We used two kinds of measures for the overall efficiency of the task: time to task completion, which is the time between the first action taken by either agent and the time when the goal is achieved, and total action count, which is the total number of actions taken by both agents during the task.

We used several behavioral metrics to measure fluency. We were interested in both the action interval time of the participant, which is defined simply as the average time between their actions, and the action interval rate of change, which is the slope of a linear regression fit to the action intervals within each task. The slope indicates how much this measure changed over the course of the task and could indicate a level of adjustment or learning. We were also interested in the functional delay measurement, which Hoffman defines as the ratio of the accumulated wait times between the robot finishing its action and the participant taking their action, over the total task time (Hoffman, 2014). In our task, it was more appropriate to measure the wait time between when the robot starts its action and when the participant starts theirs, because the participant could actually start theirs before the robot's action had finished, and it was generally very obvious which action the robot was taking once it had started. Similarly to the action interval measure, we were also interested in looking at the rate of change of this quantity within a session.

4.2.4 Exclusion Criteria

In this study, we used the following criteria for completely excluding all data regarding a given participant:

1. If the participant completed fewer than six out of the seven total rounds of tasks (allowing them to forget to complete one)

2. If the participant answered with a five or higher on a seven-point scale for the question: "Did you experience any errors or technical troubles while playing the game?"

3. If the participant did not select the only correct option out of seven possibilities for the question: "Please select a tool that you used in the tasks"

4. If their task completion time on any of the rounds exceeded 150 seconds (the average task completion time was about 60 seconds with fairly low variation)

5. If they ever took more than 30 seconds to take their next action (the average action interval was roughly 1.2 seconds with low variation)

We applied the following criteria to exclude a particular feature from the data of a given participant:

1. In both the action interval metric and the functional delay metric, we excluded any single interval if it exceeded 15 seconds.
2. In both of the within-task temporal metrics, we omitted the slope of the linear regression fit if there were fewer than 5 data points or if the absolute value of the slope was higher than 150 (averages were in the range of [-5, -20]).

4.3 Study Results

4.3.1 Statistical Analysis Methods

To evaluate whether differences between participants in different experimental conditions were statistically significant, we used pairwise one-way ANOVA tests and looked both at the produced p value and at the computed effect size η². Each pair of conditions was evaluated using a one-way ANOVA, resulting in six separate comparisons. To correct for the effect of multiple statistical tests (more tests increase the chances of rare occurrences), we performed the Bonferroni correction (Hochberg, 1988), which is a very conservative correction method. Under this correction, where an ANOVA p value would otherwise be considered to indicate a significant difference in the means if it were lower than 0.05, it now needs to be lower than 0.05/6 ≈ 0.008 to indicate that same difference. We therefore use the following thresholds in all graphs and discussion in this section:

Weak significant difference: p < 0.02, indicated with *
Significant difference: p < 0.008, indicated with **
Strong significant difference: p < 0.0002, indicated with ***

All tables report both the mean and standard error (SE) of the measured quantities for all conditions. All graphs plot the means of the measured quantities for all conditions. Graph error bars represent a 95% confidence interval, which has been suggested to prompt better interpretation of statistical significance than null-hypothesis testing (Cumming, 2013). When discussing significance of differences we also provide the effect size as the classical η² metric (Brown, 2008), defined as follows:

    η² = SS_effect / SS_total = SS_between / SS_total

SS_effect represents the sum of squared errors between the independent variable means and the total mean (we use SS_between from the ANOVA results table). SS_total represents the total sum of squared errors between all data points and the total mean.

4.3.2 Behavioral Data

Time to task completion

The duration of a task, or time to task completion, is an important metric for efficiency. This time is measured from when the first action is taken by either agent to the time when the goal is achieved. For this metric, a lower score is better.

Figure 4-7: Shows the mean task completion times of all rounds of each task (* p < 0.02, ** p < 0.008, *** p < 0.0002; error bars indicate a 95% confidence interval).

            C2                  C3                  C4
            p        η²         p        η²         p        η²
  C1        9.6e-04  0.26       0.580    6e-03      0.350    0.04
  C2                            9.6e-05  0.26       0.033    0.15
  C3                                                0.433    0.02

Table 4.1: Task 1 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

            C2                  C3                  C4
            p        η²         p        η²         p        η²
  C1        0.271    0.04       0.289    0.02       0.004    0.30
  C2                            0.015    0.11       0.046    0.16
  C3                                                2.8e-05  0.34

Table 4.2: Task 2 task completion time. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

Figure 4-7 shows the mean task completion times (over all rounds) for both tasks. We can see that in the first task, C2 MTP-QMDP dominates all of the other conditions with the lowest completion times (significance was not detected for C4 Human). In the second task, C4 Human is lowest.
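The following sketch illustrates the analysis described in Section 4.3.1: pairwise one-way ANOVAs, a Bonferroni-corrected threshold, and the η² effect size (SS_between / SS_total). It uses NumPy and SciPy; the dictionary-of-conditions data layout is an assumption for illustration.

    from itertools import combinations
    import numpy as np
    from scipy import stats

    def eta_squared(*groups):
        """Classical eta^2 = SS_between / SS_total."""
        all_vals = np.concatenate([np.asarray(g, dtype=float) for g in groups])
        grand = all_vals.mean()
        ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
        ss_total = ((all_vals - grand) ** 2).sum()
        return ss_between / ss_total

    def pairwise_anova(data, alpha=0.05):
        """data: dict mapping condition name -> list of per-participant values."""
        pairs = list(combinations(data, 2))
        threshold = alpha / len(pairs)               # Bonferroni correction (0.05 / 6 here)
        results = {}
        for c1, c2 in pairs:
            g1, g2 = np.asarray(data[c1], float), np.asarray(data[c2], float)
            _, p = stats.f_oneway(g1, g2)
            results[(c1, c2)] = (p, eta_squared(g1, g2), p < threshold)
        return results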
            Round 1           Round 2           Round 3
            μ        SE       μ        SE       μ        SE
  C1        89352    5041     75942    4638     62516    3834
  C2        70559    3382     66740    3421     53959    1637
  C3        79991    2947     75378    2536     68503    1447
  C4        89180    6845     69862    5465     55911    1917

Table 4.3: Task 1 completion time in milliseconds.

Tables 4.3 and 4.4 show the mean completion times and standard errors from each round of each task. This data underlies the means reported in Figure 4-7.

            Round 1           Round 2           Round 3           Round 4
            μ        SE       μ        SE       μ        SE       μ        SE
  C1        50057    2675     68706    5445     49835    2943     51347    5323
  C2        54644    2692     50548    1675     59259    3849     46225    1624
  C3        63919    1627     53434    1580     50186    1315     48415    1490
  C4        52323    3813     47442    2085     48136    2158     46949    954

Table 4.4: Task 2 completion time in milliseconds.

Total number of actions

Another important efficiency metric is the total number of actions taken by the robot and the participant to accomplish the goal. This is simply measured as the sum of the actions taken by either agent between the start of the trial and when the goal is achieved. For this metric, a lower score is better.

Figure 4-8: Shows the mean number of actions taken by both agents over all rounds of each task (* p < 0.02, ** p < 0.008, *** p < 0.0002; error bars indicate a 95% confidence interval).

            C2                  C3                  C4
            p        η²         p        η²         p        η²
  C1        1.1e-08  0.60       0.967    4e-05      0.197    0.07
  C2                            1.8e-15  0.71       1.5e-07  0.62
  C3                                                0.045    0.09

Table 4.5: Task 1 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

            C2                  C3                  C4
            p        η²         p        η²         p        η²
  C1        0.001    0.30       0.082    0.06       0.527    0.02
  C2                            3.3e-04  0.23       0.007    0.28
  C3                                                0.525    9e-03

Table 4.6: Task 2 total number of actions. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.

In Figure 4-8 we can see the mean total number of actions across all rounds for each task. In task one, it is clear that C2 MTP-QMDP confidently dominates the other conditions, and those results are repeated in task two but with slightly less confidence.

            Round 1           Round 2           Round 3
            μ        SE       μ        SE       μ        SE
  C1        131.47   5.31     117.28   4.93     101.67   4.64
  C2        98.81    2.58     95.87    1.68     86.61    1.35
  C3        122.15   2.57     117.58   2.04     110.85   1.73
  C4        129.85   5.24     107.05   5.42     94.87    2.35

Table 4.7: Task 1 total number of actions.

Tables 4.7 and 4.8 show the mean number of actions and standard errors from each round of each task. This data underlies the means reported in Figure 4-8.

            Round 1           Round 2           Round 3           Round 4
            μ        SE       μ        SE       μ        SE       μ        SE
  C1        85.22    2.67     104.50   6.85     85.72    3.11     78.50    1.96
  C2        81.27    2.60     78.05    1.40     82.69    3.61     72.55    1.14
  C3        91.32    1.46     86.65    1.45     84.80    1.63     74.54    1.37
  C4        91.89    5.04     81.40    2.29     87.10    2.67     84.00    1.96

Table 4.8: Task 2 total number of actions.

Human action interval

We measured the time intervals between human actions within each session. The interval times are simply measured as the time between successive actions taken by the study participant. Here we present the mean of those intervals across each round of each task.

Figure 4-9: Mean action intervals of participants across all rounds of each task (* p < 0.02, ** p < 0.008, *** p < 0.0002; error bars indicate a 95% confidence interval).

            C2                  C3                  C4
            p        η²         p        η²         p        η²
  C1        0.492    0.01       0.897    4e-04      0.237    0.06
  C2                            0.316    0.02       0.783    3e-03
  C3                                                0.136    0.05

Table 4.9: Task 1 human action interval. ANOVA p value and effect sizes η² for all pairwise comparisons of conditions.
Table 4.10: Task 2 human action interval. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.792, η²=2e-03      p=0.810, η²=1e-03      p=0.153, η²=0.09
  C2                           p=0.879, η²=5e-04      p=0.299, η²=0.05
  C3                                                  p=0.111, η²=0.06

Figure 4-9 shows the mean action intervals of participants across all rounds of each task. The graph shows no significant difference between the means of the conditions, with a trend towards C4 Human being lower than the others.

Within-task human action interval rate of change

This metric was calculated by gathering all action interval times for a given round of a task and performing linear regression on those data points. The slope of the fitted line was used as the metric.

Figure 4-10: Mean rates of change in action intervals averaged over all rounds of each task (C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, C4: Human; * p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

Table 4.11: Task 1 human action interval rate of change. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.441, η²=0.02       p=2.9e-04, η²=0.25     p=0.005, η²=0.28
  C2                           p=3.8e-04, η²=0.22     p=0.002, η²=0.28
  C3                                                  p=0.656, η²=5e-03

Table 4.12: Task 2 human action interval rate of change. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.028, η²=0.15       p=0.010, η²=0.13       p=0.309, η²=0.05
  C2                           p=4.1e-06, η²=0.35     p=0.011, η²=0.25
  C3                                                  p=0.223, η²=0.03

Figure 4-10 shows the mean rates of change in action intervals averaged over all rounds of each task. A rather clear trend can be seen where C1 MTP-POMDP and C2 MTP-QMDP dominate the other conditions; this is more pronounced in task one.

Table 4.13: Task 1 human action interval rate of change (mean and SE per round).
        Round 1           Round 2           Round 3
  C1    -6.84 (1.99)      -6.95 (3.63)      -27.44 (3.13)
  C2    -8.58 (2.54)      -9.61 (2.13)      -16.84 (1.91)
  C3     3.46 (4.42)      -5.52 (1.26)      -10.67 (1.35)
  C4    -7.18 (3.21)      -0.36 (3.70)       -6.50 (3.96)

Tables 4.13 and 4.14 show the mean rates of change in participant action intervals and standard errors from each round of each task. This data underlies the means reported in Figure 4-10.

Table 4.14: Task 2 human action interval rate of change (mean and SE per round).
        Round 1           Round 2           Round 3           Round 4
  C1    -1.90 (1.68)      -2.75 (3.48)      -0.83 (3.62)      -0.16 (1.45)
  C2    -10.07 (3.27)     -13.55 (4.36)     -6.69 (1.89)      -4.41 (1.58)
  C3     0.25 (1.51)       7.35 (1.71)       6.37 (1.95)       5.65 (1.94)
  C4     0.12 (1.74)       1.91 (3.53)       3.53 (4.67)       3.27 (2.60)

Human functional delay ratio

We define functional delay as the time between when the robot takes its action and when the participant takes their next action. This time indicates a wait period during which the participant might be trying to understand the action that the robot just took. A game session produces a sequence of these delays, and in this section we use as a metric the ratio of the sum of those functional delays to the time of task completion.

Figure 4-11: Mean participant functional delay ratios across rounds of each task (C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, C4: Human; * p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

Table 4.15: Task 1 human functional delay ratio. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.019, η²=0.14       p=0.927, η²=2e-04      p=0.149, η²=0.08
  C2                           p=0.008, η²=0.13       p=0.526, η²=0.01
  C3                                                  p=0.142, η²=0.05
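To make the two derived metrics concrete, the following is a minimal sketch (hypothetical data and helper names, not the study's analysis code) of the within-task rate of change, i.e. the slope of a linear fit over a round's interval or delay sequence with the filtering rule stated at the start of this section, and of the functional delay ratio.

```python
# Illustrative sketch of the two derived metrics: the within-task rate of
# change (slope of a least-squares line fit over a round's sequence, with
# the filtering rule described earlier) and the functional delay ratio.
import numpy as np

def rate_of_change(values, min_points=5, max_abs_slope=150.0):
    """Slope of a linear regression over the within-round sequence."""
    if len(values) < min_points:
        return None                               # too few data points
    x = np.arange(len(values), dtype=float)
    slope, _intercept = np.polyfit(x, np.asarray(values, dtype=float), 1)
    return None if abs(slope) > max_abs_slope else slope

def functional_delay_ratio(delays_ms, completion_time_ms):
    """Sum of robot-to-human functional delays over total task time."""
    return sum(delays_ms) / completion_time_ms

# Hypothetical numbers only:
print(rate_of_change([1400, 1250, 1180, 1050, 990, 940]))
print(functional_delay_ratio([800, 600, 750, 500], completion_time_ms=60000))
```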
Table 4.16: Task 2 human functional delay ratio. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=4.5e-04, η²=0.34     p=9.0e-04, η²=0.20     p=0.242, η²=0.06
  C2                           p=0.026, η²=0.10       p=0.002, η²=0.34
  C3                                                  p=6.8e-04, η²=0.24

Figure 4-11 shows the functional delay ratios across rounds of each task. We can see that in task one C2 MTP-QMDP generates the lowest value, although no significant difference is detected for C4 Human. In the second task C2 MTP-QMDP still produces the lowest value but is not significantly different from C3 Handcoded+A*.

Table 4.17: Task 1 human functional delay ratio (mean and SE per round).
        Round 1         Round 2         Round 3
  C1    0.53 (0.02)     0.60 (0.03)     0.61 (0.02)
  C2    0.41 (0.04)     0.52 (0.03)     0.59 (0.02)
  C3    0.50 (0.02)     0.57 (0.02)     0.67 (0.01)
  C4    0.50 (0.05)     0.52 (0.03)     0.59 (0.03)

Tables 4.17 and 4.18 show the functional delay ratios and standard errors from each round of each task. This data underlies the means reported in Figure 4-11.

Table 4.18: Task 2 human functional delay ratio (mean and SE per round).
        Round 1         Round 2         Round 3         Round 4
  C1    0.53 (0.02)     0.52 (0.02)     0.53 (0.02)     0.42 (0.02)
  C2    0.43 (0.03)     0.51 (0.04)     0.32 (0.03)     0.39 (0.02)
  C3    0.40 (0.01)     0.51 (0.01)     0.50 (0.01)     0.38 (0.02)
  C4    0.52 ( - )      0.53 (0.03)     0.58 (0.03)     0.49 (0.03)

Within-task human functional delay rate of change

This metric was calculated by gathering all functional delay times for a given round of a task and performing linear regression on those data points. The slope of the fitted line was used as the metric.

Figure 4-12: Mean rates of change in participant functional delays averaged over all rounds of each task (C1: MTP-POMDP, C2: MTP-QMDP, C3: Handcoded+A*, C4: Human; * p<0.02, ** p<0.008, *** p<0.0002; error bars indicate a 95% confidence interval).

Table 4.19: Task 1 human functional delay rate of change. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.108, η²=0.07       p=0.300, η²=0.02       p=0.439, η²=0.02
  C2                           p=0.006, η²=0.14       p=0.091, η²=0.10
  C3                                                  p=0.792, η²=2e-03

Table 4.20: Task 2 human functional delay rate of change. ANOVA p values and effect sizes η² for all pairwise comparisons of conditions.
        C2                     C3                     C4
  C1    p=0.280, η²=0.05       p=0.038, η²=0.08       p=0.899, η²=7e-04
  C2                           p=0.006, η²=0.16       p=0.650, η²=0.01
  C3                                                  p=0.126, η²=0.05

Figure 4-12 shows the mean rates of change in participant functional delays averaged over all rounds of each task. In both tasks we can see that C2 MTP-QMDP produces the lowest value, but only significantly so compared to C3 Handcoded+A*.

4.3.3 Attitudinal Data

None of the attitudinal measures from the post-study questionnaire showed any significant differences after the Bonferroni correction had been applied.

4.4 Discussion

4.4.1 Task Efficiency

Figure 4-7 shows us two things: firstly, that C2 MTP-QMDP confidently outperforms all of the other conditions except C4 Human, and secondly, that C1 MTP-POMDP and C3 Handcoded+A* show no significant difference. Essentially the same results can be read from Figure 4-8, which shows that C2 MTP-QMDP takes significantly fewer actions to accomplish the goal than all other conditions and that C1 MTP-POMDP is not statistically different from C3 Handcoded+A* or C4 Human. This is not surprising, as we would expect the number of actions taken to correlate with the time of task completion.
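The expectation stated above, that the number of actions should track the time of task completion, is the kind of relationship a quick correlation check could confirm; the sketch below applies scipy's Pearson correlation to hypothetical per-session records rather than the study data.

```python
# Illustrative check (hypothetical per-session records, not study data):
# correlation between total actions taken and task completion time (seconds).
from scipy.stats import pearsonr

sessions = [(131, 89.4), (99, 70.6), (122, 80.0), (130, 89.2),
            (117, 75.9), (96, 66.7), (118, 75.4), (107, 69.9)]
actions, completion_s = zip(*sessions)
r, p = pearsonr(actions, completion_s)
print(f"Pearson r = {r:.2f}, p = {p:.4f}")
```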
Given that C2 MTP-QMDP takes full advantage of all mechanisms of the MTP system except for the final layer of POMDP planning, we interpret these results as a general success for the MTP approach, with the reservation that the value added by the POMDP layer needs to be further justified, since it comes at a cost to the task efficiency of the team. This should be further investigated with a follow-up study in which we try to better understand the difference between the behavior produced by the POMDP and QMDP versions of the MTP system and how they can be better tuned.

We are also pleased to see that there is no significant measured difference between C1 MTP-POMDP and C3 Handcoded+A*, or even C4 Human (except in time of task completion for task two), especially in light of the fact that C3 Handcoded+A* is controlled by a policy that has the unfair advantage of perfect observability of the world state at all times and uses a set of rules to accomplish the task that would produce optimal behavior in the single-agent case. We therefore conclude that our hypothesis H5 Improved Task Efficiency is supported, even in comparison with C4 Human, which was unexpected.

4.4.2 Team Fluency

The behavioral metrics we chose for measuring team fluency were the human action interval and functional delay, as well as the within-task rates of change in these metrics. In Figure 4-9, we can see that there is no significant difference in the average action intervals in each task across the conditions. This can be confirmed by looking at the p values and effect sizes in Tables 4.9 and 4.10.

On the other hand, Figure 4-10 and Tables 4.11 and 4.12 show clearly that when users play in C1 MTP-POMDP and C2 MTP-QMDP their action intervals get shorter across an episode of a task, and at a significantly faster rate than in C3 Handcoded+A* and C4 Human. When a participant has a negative rate of change in their action interval, it means that as the task progresses they take actions more quickly. This might suggest that they are learning how to work with their teammate and therefore need less and less time to choose their actions as the task progresses. The fact that the MTP agents produce an improvement in this metric might suggest that participants are quicker to learn how the agent operates; since the produced behavior better matches their expectations, they require less and less contemplation as the task progresses.

Looking at the functional delay ratios in Figure 4-11 and Tables 4.15 and 4.16, we can see that C2 MTP-QMDP has generally lower functional delay ratios than the other conditions and that this difference is often significant. We also see that in task two, C3 Handcoded+A* scores lower than C1 MTP-POMDP. The mean rate of change in the functional delay across the tasks can be seen in Figure 4-12. There is a lot of variation in this metric, but we can still see that C1 MTP-POMDP and C2 MTP-QMDP both have consistently negative rates of change, and that C2 MTP-QMDP is significantly lower than C3 Handcoded+A* in both tasks.

We cautiously interpret the results shown by action intervals, functional delays, and rates of change to suggest that an MTP-controlled agent produces behavior which a human teammate can model and understand more quickly, leading to a more fluent collaboration. We conclude that hypothesis H5 Team Fluency is Improved is semi-upheld, as sub-hypotheses (b) and (d) were upheld, (c) was partially upheld, and (a) was not supported.
4.4.3 Attitudinal Data

We were disappointed to find that the questionnaire data did not reveal any statistically significant differences between experimental conditions after the Bonferroni correction had been applied. The fact that behavioral differences were observed but not self-reported attitudinal differences is not an uncommon occurrence in online studies. People's engagement in the task in such studies is often significantly lowered, leaving the participant with a less pronounced impression of the experience.

We did observe some trends in the questionnaire data that mostly favor C1 MTP-POMDP over C2 MTP-QMDP, especially along dimensions such as liking the robot and perceiving that the robot liked and appreciated the participant. We believe that those indications could be made more pronounced if the study were modified to follow a within-participant design, where each participant experiences the different types of agent controllers and can better assess the differences in subjective experience. Consequently, the attitudinal hypotheses H1, H2 and H3 are not supported by the data and would require further investigation to confirm.

4.4.4 Informal Indications from Open-Ended Responses

We included a few open-ended questions where participants could share their thoughts about various parts of their experience. It is helpful to look at this data to better understand the dynamics of the game as the participants experienced them and how that might inform future studies.

Game Controls

Because of the discrete nature of AI planners, we were required to put in place a few artificial limitations on the game which would allow the planner to cooperate with the participant. For example, the environment was discretized to grid locations and the game characters could only turn to face the four cardinal directions. Secondly, we enforced that a character had to finish an "animation" (such as moving between grid cells or rotating) before accepting another user command. This was done so that the game would always be in a decided state when the planner chose which action it should take, but it resulted in a slightly clunky user experience. Many participants reported this in the open-ended questions and somewhat confused this "inefficiency" with the inefficiencies that we are more interested in measuring, such as those caused by a bad teammate. Some representative quotes are: "the reaction of the character was too slow", "...the robot was simply faster in a lot of ways due to a bit of lag-centered controls", "The major obstacle was the interaction with the controls; I found myself turning more than I meant to because the game was slow to respond" and "For the most part robot was helpful; only inefficient because for me there was a slight lag after pressing keys".

Lack of Communication and Tendency to Lead

Many participants reported that they were annoyed that there was no way to explicitly communicate with the robot: "It was intelligent because it knows what's the next step. It would be better if we had some communication.", "Intelligent enough to understand the task, not intelligent enough to understand basic communication", "... effectiveness could have been improved if communication were possible." People have a strong tendency to want to communicate verbally, and neither the game nor any of the planners were designed to allow that. Participants resorted to attempting to communicate through their avatar's movement and behavior: "...
it was unresponsive to anything I tried to do to it, or anywhere I tried to lead it", "Sometimes, blocking the way to the lower point engine worked and sometimes it didn't", but this would conceivably only further confuse the planners, as they were not equipped to model the intentions behind that behavior. We think that a great advantage could be afforded by giving autonomous robots the ability to model this behavior explicitly.

Differences Between Conditions

C4 Human: The responses from people in this condition were useful for determining a baseline for the game experience, since the agent they interacted with was controlled by an expert human and should therefore not have contributed much to any frustration with the game. It was interesting to see that not many participants reported believing the robot to be controlled by a human, except this one, who figured it out because of how the robot recovered from error: "Somehow, I feel like I was playing with a real person since when we competed for a item, the one that failed to get that will get away from the engine or step to the next item."

C1 MTP-POMDP: Many participants from this condition reported their satisfaction with the robot's ability to infer their intention and anticipate their actions: "The tasks were performed relatively efficiently. The robot in general was able to infer the user's intention and anticipated the next step in the assembly process.", "It seemed reasonably competent. It seemed to anticipate the action that was needed.", "The robot would try to anticipate what I was doing, but at the same time, I would try to anticipate what part I should be going for based on what I thought it was going to do." We also noticed that some participants from this condition complained that the robot sometimes took a while to respond to their action. This is probably due to the fact that this condition requires the most planning by the autonomous agent, which usually happens so quickly that it cannot be detected by the participant but can sometimes halt the game for a second or two (we used a planning timeout of about five seconds). Effort should be expended in the design of future studies to neutralize this confounding effect, possibly by fully pre-calculating policies before study trials or otherwise speeding up policy look-up (possibly by employing a computer cluster to improve planning speed). Representative quotes: "the reaction of the character was too slow", "Slow response time; startup time was very unpredictable".

C2 MTP-QMDP: Participants in this condition also commented on the robot's ability to anticipate their actions, but it seemed that the robot would more often start building the wrong engine and they would have to settle for a sub-optimal reward outcome: "On the second task, it seems like it didn't know which engine was more important, so once it would place a part on any engine I had to follow", "Once he (she? it?) started to assemble the wrong engine, so I just went along", "... but I had to make sure to put in the first part myself or it would choose the wrong engine to assemble". Participants also noted that this agent would often retrieve objects that were not needed immediately but rather a few moves later: "it also sometimes jumped several steps ahead in the assembly process", "...
it wasn't great at deciding whether to bring the next item or the one after it", "The only time it seemed incompetent was one of the tasks, when it started by picking up the air filter, so I had to do all of the steps before that one". This is consistent with our qualitative understanding of how the QMDP agent operates, which is generally to aggressively take advantage of the predictive capabilities of MTP but often overestimate its confidence about the mental state of the other agent.

C3 Handcoded+A*: The participants in this condition generally reported that the robot was very efficient, but several commented that it didn't seem to consider them very much: "a few times the robot moved directly in front of the avatar to return something, rather than taking the alternate route which did not cross the space directly in front of the avatar", "Sometimes the robot blocked my way, but for the most of the time I felt that working with my assistant robot was efficient", "I feel I would be more efficient, if the robot also knew to walk around me more. However, did a good job knowing next step.", "Very efficient. The robot is quite capable of doing the task on its own.". This is fairly consistent with our intuition for how this autonomous agent would behave: very efficiently with respect to the task, but without any model of the behavior of the other.

4.4.5 Funny Comments

A few of the open-ended responses were humorous and it would be a shame not to share any of them in this thesis. The following two comments were particularly amusing: "If I could provide more input to direct its goals, then yes, the robot would make a fine teammate for mechanical tasks. I wouldn't want to take it out for a beer, though." and "The robot tried to block me! It was annoying because I tried this game for money. I think the robot is evil!".

Chapter 5

Conclusion

5.1 Thesis Contributions

This thesis makes contributions both to the field of probabilistic planning and to Human-Robot Interaction (HRI). In fact, one of its contributions is taking an HRI challenge and formulating it with the appropriate representations from probabilistic planning so that it may be solved in a computationally principled way.

Introduction of a novel general-purpose POMDP solver. We have presented a novel algorithm called B3RTDP which extends the Real-Time Dynamic Programming approach to solving POMDP planning problems. This approach employs a bounded value function representation which it takes advantage of in novel ways. Firstly, it calculates action convergence at every belief and prunes actions that are dominated by others within a certain probability threshold. This technique is similar to a branch-and-bound search strategy but calculates action convergence probabilities such that actions may be pruned before convergence is achieved. Secondly, B3RTDP introduces the concept of a Convergence Frontier, which serves to improve convergence time by taking advantage of early convergence of action selection in the policy. The B3RTDP algorithm was evaluated against a state-of-the-art POMDP planner called SARSOP on two standard benchmark domains and showed that it can garner higher Adjusted Discounted Reward with a shorter convergence time.

Introduction of a novel approach to predictive planning based on mind-theoretic reasoning. We presented the development of a novel planning system for social agents. This Mind-Theoretic Planning (MTP) system employs predictive models of others' behavior based on their underlying mental states.
The MTP system takes as input distributions over the possible beliefs that the other agent might have about the environment as well as the possible goals they might have. It then proceeds to create predictive models called mental state situations, which construct stacks of Markov Decision Process models, each level taking advantage of policies and value functions computed in the levels below, producing improved predictive power. The MTP system leverages the predictive mental state situations to compute a forward transition function for the environment that includes the anticipated effects of the other agent as a function of their mental state. Finally, a perceptually limiting observation function is used in conjunction with the predictive transition functions to formulate a POMDP that is solved using the B3RTDP algorithm.

Development of an evaluation environment and simulator. To evaluate the contributions of this thesis, as well as assist in its development, we developed a robot simulator environment that is oriented towards simulated human-robot interaction experiments. This simulator is designed so that it can be deployed online and be played by participants through a web browser. In the game, two agents are paired together to accomplish a task; one is the study participant and the other can be either an autonomous system using the character-control Java API or another human controller. The game uses a 3-D grid-based environment and the character is controlled in first-person chase-camera mode. The environment is visually filtered to only show features that are available to the character, which is particularly important for the evaluation of mind-theoretic systems as it forces participants to take perception actions.

Human subject evaluation of the MTP system. An online user study was performed with approximately a hundred participants. In the study, participants interacted with an autonomous agent, in the simulator discussed above, to accomplish the task of assembling an engine from parts. The study showed that the MTP system can significantly improve task efficiency and team fluency over an alternative autonomous system and, in some cases, even a human expert controller.

5.2 Recommended Future Work

5.2.1 Planning with personal preferences

The MTP system currently takes as input the possible beliefs and goals that a human agent might have about the environment, and it computes predictions of how that agent might go about attempting to achieve its goals given those beliefs. Clearly there are many ways to skin a cat, and often several different plans can achieve the agent's goals. The MTP system uses approximate, but optimal, task planners to create these action policies, and therein lies an assumption of perfectly rational behavior. This assumption often holds, especially in very task- or goal-driven scenarios, and even when incorrect it can provide good approximations of actual human behavior. We know, however, that people do not always behave optimally or necessarily rationally. There are many reasons for people's sub-optimal behavior other than simple ignorance of how to perform better, such as personal preference, superstition, curiosity, boredom or simply creativity. We believe that the MTP system could be improved if it were able to model some of the most common sources of variance in people's deviations from optimal behavior, especially if it were able to learn those parameters for each individual based on a history of interactions.
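One common way to relax the perfect-rationality assumption, mentioned here only as an illustrative possibility rather than as part of the implemented system, is to predict actions with a softmax (Boltzmann-rational) model over the Q-values that the MDP stacks already compute, with a per-person temperature parameter that could in principle be fit from interaction history. A minimal sketch, with all names hypothetical:

```python
# Minimal sketch (not part of the implemented MTP system): a Boltzmann-rational
# ("noisy rational") action model. Instead of assuming the human always takes
# the optimal action, actions are predicted with probability proportional to
# exp(Q(s, a) / temperature); `temperature` could be fit per person.
import numpy as np

def predict_action_distribution(q_values, temperature=1.0):
    """q_values: dict mapping action -> Q(s, a) from the relevant MDP level."""
    actions = list(q_values)
    q = np.array([q_values[a] for a in actions], dtype=float)
    logits = (q - q.max()) / max(temperature, 1e-6)   # subtract max for stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return dict(zip(actions, probs))

# A temperature near zero recovers the current perfectly-rational assumption;
# larger temperatures spread probability mass over sub-optimal actions.
print(predict_action_distribution({"left": 1.0, "right": 0.2, "wait": 0.1},
                                  temperature=0.5))
```

Fitting the temperature (or richer preference parameters) per individual would be one concrete way to capture the personal variation in behavior discussed above.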
5.2.2 Planning for more agents

The presented approach to mind-theoretic reasoning is in no way inherently limited to planning for only one human teammate, but it is also not particularly designed to scale well to planning for many agents. This does not mean we intended for it to scale poorly, but rather that we focused on demonstrating the concept and its impact in the simpler case before thinking about optimizing for scale. The planning problem generally grows exponentially with the number of agents, but we believe that clever optimizations might be used to get better leverage on the problem. In any given task there might be, for example, large regions of the state space where reasoning about all of the agents' mental states or predicting their actions is completely irrelevant. There might be an opportunity to apply state-space abstraction methods here.

5.2.3 State-space abstractions and factoring domains into MTP and non-MTP

State-space abstractions can be used to significantly reduce planning time; this reduction is gained by encoding which parts of the planning problem are relevant to the current goal and which parts are not. An abstraction in a navigational domain might, for example, recognize areas of the environment that should be treated as the same, since their differences are irrelevant to the particular navigational target. This technique might have a huge impact on mind-theoretic planning, since often only a small (but important) part of the complete state space contains features that are significant from a mind-theoretic perspective. If the MTP solver could be sensitive to this fact and had an efficient way to identify those areas of the problem, it could possibly solve them in an easier way without having to consider mind-theoretic consequences of actions or predictions of others' reactions. Similarly, using a factored representation for the state space might produce significantly smaller transition and reward functions, as much of the action space might not have any mental state consequences and mental state variables could therefore be considered independent of those actions in their Dynamic Bayes Net (DBN) encodings.

5.2.4 Follow-up user study

In the human subject study we performed to evaluate this thesis, we found some very interesting trends where a non-POMDP MTP agent using a QMDP action selection strategy mostly outperformed all other conditions in terms of efficiency and team fluency. Although we did not see significantly different results in our attitudinal measures, which were acquired using a post-task questionnaire, they contained a trend which suggested that the QMDP agent was less liked than the POMDP agent. We would like to investigate this interaction further and see how the benefits of the two strategies could be combined by adjusting the parameters of the POMDP model. We propose that this experiment should follow a within-participant design where each participant experiences interacting with both types of agents and provides relative attitudinal judgments. We hope this will help to underline the important perceived differences and provide guidance for how to improve the MTP system.

Appendix A

Existing Planning Algorithms

A.1 Various Planning Algorithms

Algorithm 7: Pseudocode for the GRAPHPLAN algorithm. The algorithm operates in two steps: graph creation and plan extraction. The EXTRACTPLAN algorithm is a level-by-level backward-chaining search algorithm that can make efficient use of mutex relations within the graph.
GRAPHPLAN(s_I : Set, s_G : Set)
    Create S_0 from s_I
    foreach i ∈ N do
        Add NoOp "actions" to A_i for each state literal in S_i
        Add all actions that apply in S_i to A_i
        Construct S_{i+1} from A_i
        Inspect mutex relations in S_{i+1}
        if s_G exists non-mutexed in S_{i+1} then
            Attempt EXTRACTPLAN
            if solution found then
                return plan
            else if solution impossible then
                return failure

Algorithm 8: The RTDP algorithm interleaves planning with execution to find the optimal value function over the relevant states relatively quickly.

RTDP(s_I : State, G : GoalSet, h(s) : Heuristic)
    // Initialize value function to an admissible heuristic
    V(s) <- h(s)
    while not converged do
        RTDPTRIAL(s_I, G)

RTDPTRIAL(s_I : State, G : GoalSet)
    s := s_I
    depth := 0
    while (s ≠ ∅) ∧ (s ∉ G) ∧ (depth < MAX_depth) do
        // Pick action greedily (equation 2.1)
        a := argmin_{a'} Q(s, a')
        // Perform the Bellman value update
        V(s) := Q(s, a)
        // Sample next state for exploration from T
        s := PICKNEXTSTATE(s, a)
        depth := depth + 1

PICKNEXTSTATE(s : State, a : Action)
    return s' ~ T(s, a, ·)

Algorithm 9: The BRTDP algorithm. Uses a bounded value function and a search heuristic that is driven by information gain.

BRTDP(s_I : State, G : GoalSet)
    while not converged do
        BRTDPTRIAL(s_I, G)

BRTDPTRIAL(s_I : State, G : GoalSet)
    s := s_I
    depth := 0
    while (s ≠ ∅) ∧ (s ∉ G) ∧ (depth < MAX_depth) do
        // Pick action greedily from the lower boundary
        a := argmin_{a' ∈ A} Q_L(s, a')
        // Perform the Bellman value updates for both boundaries
        V_L(s) := Q_L(s, a);  V_H(s) := min_{a' ∈ A} Q_H(s, a')
        // Sample next state for exploration according to highest expected information gain
        s := PICKNEXTSTATEBRTDP(s, a)
        depth := depth + 1

PICKNEXTSTATEBRTDP(s, s_I : State, a : Action)
    // Create vector of transition-probability-weighted value gaps
    ∀s' ∈ S, k(s') := T(s, a, s') * (V_H(s') - V_L(s'))
    K := Σ_{s'} k(s')
    // Terminate when a state of relative certainty, compared to s_I, has been reached
    if K < (V_H(s_I) - V_L(s_I)) / τ then
        return ∅
    // Sample from the normalized vector
    return s' ~ k(·)/K

Algorithm 10: The RTDP-Bel algorithm from (Geffner & Bonet, 1998) and (Bonet & Geffner, 2009).

RTDP-BEL(b_0 : Belief, G : GoalSet)
    while not converged do
        d := 0
        b := b_0
        s ~ b(·)
        while (d < MAX_depth) ∧ (b ∉ G) do
            // Select action greedily
            a := argmin_{a' ∈ A} Q(b, a')
            // Perform Bellman value update (update value of the belief)
            V(b) := Q(b, a)
            // Sample next state and observation
            s' ~ T(s, a, ·)
            o ~ Q(a, s', ·)
            // Update belief via equation 2.3
            b := b_a^o
            s := s'
            d := d + 1

Appendix B

Study Material

B.1 Questionnaire

The questionnaire data was gathered using the online service SurveyMonkey, which constrained some of the question formatting. Section headings were omitted.

Task Load

The following questions required free text responses.
1. Please enter your email. Use the same email address you used when you signed up for the study.
2. Describe how efficiently or inefficiently you feel the tasks were performed. Please explain why you think that was the case.

The following six questions used a seven point Likert scale between Very low and Very high and were sourced from (Hart & Staveland, 1988).
3. How mentally demanding were the tasks?
4. How hurried or rushed was the pace of the tasks?
5. How successful were you in accomplishing what you were asked to do?
6. How hard did you have to work to accomplish your level of performance?
7.
How insecure, discouraged, irritated, stressed, and annoyed were you?
8. How often did your team assemble the engine that you originally intended to assemble?

The following question used a seven point Likert scale between Never and Always.
9. How often did your team assemble the engine that you originally intended to assemble?

Fluency

The following questions used a seven point Likert scale between Strongly disagree and Strongly agree and were largely sourced from (Hoffman, 2014).
10. The robot tended to ignore me
11. The human-robot team improved over time
12. The human-robot team worked fluently together.
13. I tended to ignore the robot
14. The robot's performance improved over time
15. The robot contributed to the fluency of the interaction.
16. The human-robot team's fluency improved over time
17. What the robot did affected what I did
18. What I did affected what the robot did

Competency

The following question required a free text response.
19. Briefly describe how competent or incompetent you felt the robot was and why.

The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second. These were partly sourced from (Bartneck et al., 2009) and partly from (Berlo et al., 1969).
20. Please rate your impression of the robot on a scale between "incompetent" and "competent"
21. Please rate your impression of the robot on a scale between "untrained" and "trained"
22. Please rate your impression of the robot on a scale between "unqualified" and "qualified"
23. Please rate your impression of the robot on a scale between "unskilled" and "skilled"

Team-Mindedness

The following question required a free text response.
24. Describe how you felt about having the robot as a team-mate. Would you like to have the robot on your team in the future?

The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second.
25. Please rate your impression of the robot on a scale between "unhelpful" and "helpful"
26. Please rate your impression of the robot on a scale between "inconsiderate" and "considerate"
27. Please rate your impression of the robot on a scale between "selfish" and "selfless"
28. Please rate your impression of the robot on a scale between "ego-oriented" and "team-oriented"

The following question used a seven point Likert scale between Completely unaware and Completely aware.
29. How aware of your plans do you think the robot was?

The following questions used a seven point Likert scale between Strongly disagree and Strongly agree and were largely sourced from (Hoffman, 2014) and (Wageman et al., 2005).
30. The robot was committed to the success of the team
31. I was committed to the success of the team
32. If it were possible then I would be willing to team up with the robot on other projects in the future
33. The robot had an important contribution to the success of the team
34. The robot and I are working towards mutually agreed upon goals
35. The robot does not understand what I am trying to accomplish
36. The robot perceives accurately what my goals are
37. The robot was cooperative

Intelligence

The following question required a free text response.
38. Briefly describe how intelligent or unintelligent you felt the robot was and why.

The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second and were largely sourced from (Bartneck et al., 2009).
39.
Please rate your impression of the robot on a scale between "ignorant" and "knowledgeable"
40. Please rate your impression of the robot on a scale between "unintelligent" and "intelligent"
41. Please rate your impression of the robot on a scale between "uninformed" and "informed"

Theory of Mind

The following questions used a seven point Likert scale between Strongly disagree and Strongly agree and were adapted from (Hutchins et al., 2012).
42. The robot understands that people's beliefs about the world can be incorrect
43. The robot understands that people can think about other peoples' thoughts
44. If I put my keys on the table, leave the room, and the robot moves the keys to a different room, the robot would understand that when I returned, I would begin by looking for my keys where I left them

Humanness

The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second.
45. Please rate your impression of the robot on a scale between "machinelike" and "humanlike"
46. Please rate your impression of the robot's behavior on a scale between "predetermined" (or programmatic) and "interactive" (or responsive)
47. Please rate your impression of the robot on a scale between "introverted" (shy, timid) and "extroverted" (outgoing, energetic)

The following question used a seven point Likert scale between Strongly disagree and Strongly agree.
48. I think it is possible that the robot was controlled by a person behind the scenes

Multiple choice: "Brown saw", "Orange pliers", "Green plastering tool", "Black chisel", "Red screwdriver", "Gray hammer" and "Blue level".
49. Please select a tool that you used in the tasks

Enjoyment

The following question required a free text response.
50. Briefly describe whether or not you enjoyed working with the robot (and why or why not)

The following questions used a seven point Likert scale where 1 was designated with the first quoted word and 7 with the second. Largely sourced from (Bartneck et al., 2009).
51. How much did you enjoy performing the tasks with the robot? (1=Not at all, 7=Very much)
52. Please rate how much you liked or disliked the robot? (1=Dislike, 7=Like)
53. Please rate your impression of the robot on a scale between "unfriendly" and "friendly"

The following questions used a seven point Likert scale between Strongly disagree and Strongly agree and were adapted from (Hoffman, 2014).
54. I am confident in the robot's ability to help me
55. The robot and I trust each other
56. I believe the robot likes me
57. The robot and I understand each other
58. I feel that the robot appreciates me
59. I feel uncomfortable with the robot

Demographics

The following questions required free text responses.
60. In what country do you currently live?
61. What is your age in years?

The following questions were multiple choice.
62. What is your gender? ("Male", "Female", "Other/Don't want to answer")
63. What is your level of education? ("Some high school", "High school degree", "Some college", "College degree", "Some graduate school", "Graduate degree")
64. How often do you play video games where characters are controlled in 3D environments (which are different from 2D games such as Tetris and Angry Birds etc.)? ("Never", "Less than once a month", "1-4 times a month", "5-10 times a month", "11-20 times a month", "More than 20 times a month")
65. Do you own or have you used a robotic toy or appliance (e.g. Sony AIBO, iRobot Roomba)?
("Never", "Used them once or twice", "Used them many times", "Own one or more") 66. Did you experience any errors or technical troubles while playing the game ? ("No errors", 199 "Very few", "Some", "Fairly many", "A lot") The following question asked "How much do you know about:" using a seven point Likert scale between Nothing and A lot. 67. Computers 68. Robotics 69. Artificial Intelligence The following question required a free text response. 70. Do you want to report any technical difficulty you had with the game or errors that you experienced ? 200 B.2 Study Pages Sign-up website: MIT Robot Study Pers..a Robats wmmp MIT MEDIA LAB Welcome and thank you for signing up for our robot study ! Before participating in the study, please read the "Consent to Participate in Non-Biomedical Research" section below. We recommend printing this page for your records [press here to print. If you have any questions regarding the study which you would like answered before you participate, please feel free to email them to mit .robot. studv@gmail cor Please enter your email, which will be used as your username for this study (We will never send spam or release this email address to a 3rd party). Please note that this same email should be used when filling out the questionnaire, and is where we will send you your Amazon gift card code if you complete all the tasks and questionnaire. We will send further instructions on how to participate in this study to your inbox. Please note that you must be 18 years old or older to participate and you can only participate once. Email: Please read the following consent form and provide your consent by checking the checkbox and pressing the 'submiV button on the bottom of the page. Consent to participate in non-biomedical research Collaboration and Learning with Local and Remote Robot Teams You are asked to participate in a research study conducted by Sigurdur Orn Adalgeirsson, M.Sc. and Cynthia Breazeal, Ph.D., from the Media Lab at the Massachusetts Institute of Technology (M.I.T.) You were selected as a possible participant in this study because you are a proficient speaker of English. You should read the information below, and ask questions about anything you do not understand, before deciding whether or not to participate. Participation and withdrawal Your participation in this study is completely voluntary and you are free to choose whether to be in it or not. If you choose to be in this study, you may subsequently withdraw from it at any time without penalty or consequences of any kind. The investigator may withdraw you from this research if circumstances arise which warrant doing so. Purpose of the study The purpose of this study is to learn about the strategies that people employ when working together with teams of one to more robots to solve situated, physical tasks. We are interested in how people try to teach new skills to robot teams, as well as in how they collaborate with robot teams to solve situated tasks in the real world or simulation. We are constructing robots that can team up to interact with and learn from people, and we hope that the results of this study will help us to improve the design of these robots' collaborative abilities. Procedures You will be controlling a human character in a video game and work collaboratively with a simulated robot to achieve an engine assembly task. We will be asking you to participate in a 3-4 rounds of two collaborative tasks with our simulated robot. 
You will be interacting with a robot via a graphical simulation interface similar to a video game. In each task, you will have a different 201 goal to achieve and the robot will attempt to be helpful. After you have finished the tasks, you will be asked to complete a questionnaire Each round of the tasks will take about 2-4 minutes, and the questionnaire will take about 15-20 minutes, so the total time for this experiment will be approximately 30-50 minutes. Potential risks and discomfort There are no risks that are anticipated while participating in this study. Potential benefits There are no specific benefits that you should expect from participating in this study; however, we hope that you will find the experience to be enjoyable and engaging. Your participation in this study will help us to build robots that are better able to interact with and learn from humans. Payment for participation You will receive a value of at least $5 in the form of an Amazon gift card for having completed participation in this experiment (playing the training round, all rounds of both tasks and filling out the questionnaire). An additional $50 gift card will be given to the three participants that complete the tasks most efficiently (in the shortest time). Finally there will be a lottery for one $100 gift card. The $5 gift card will be delivered to you via your email address within a week of your complete participation. To be eligible for the awards and lottery, completed participation is required. The lottery and best performance awards will be delivered at the end of running this study (within approximately two months). Confidentiality Any information that is obtained in connection with this study and that can be identified with you will remain confidential and will be disclosed only with your permission or as required by law. No data that would describe an individual participant will be used, we will only use aggregate data from all participants. At any time during or after the experiment you can request that all data collected during your participation be destroyed. Identification of Investigators If you have any questions or concerns about the research, please feel free to contact: Associate Professor Cynthia Breazeal 617-452-5601 MIT Media Lab, E15-468 Cambridge, MA 02139 cynthiab@rnedia.mit.edu Sigurdur Orn Adalgeirsson 617-452-5603 MIT Media Lab, E15-468 Cambridge, MA 02139 siggi@media.mit.edu Emergency care and compensation for injury Ifyou feel you have suffered an injury, which may include emotional trauma, as a result of participating in this study, please contact the person in charge of the study as soon as possible. In the event you suffer such an injury, M.I.T. may provide itself, or arrange for the provision of, emergency transport or medical treatment, including emergency treatment and follow-up care, as needed, or reimbursement for such medical services. M.I.T. does not provide any other form of compensation for injury. In any case, neither the offer to provide medical assistance, nor the actual provision of medical services shall be considered an admission of fault or acceptance of liability. Questions regarding this policy may be directed to MITs Insurance Office, (617) 253-2823. Your insurance carrier may be billed for the cost of emergency transport or medical treatment, if such services are determined not to be directly related to your participation in this study. 
Rights of research subjects
You are not waiving any legal claims, rights or remedies because of your participation in this research study. If you feel you have been treated unfairly, or you have questions regarding your rights as a research subject, you may contact the Chairman of the Committee on the Use of Humans as Experimental Subjects, M.I.T., Room E25-143B, 77 Massachusetts Ave, Cambridge, MA 02139, phone 1-617-253-6787.

I understand the procedures described above. My questions have been answered to my satisfaction, and I agree to participate in this study. I have been given an opportunity to print this form. [Submit]

Participation email:

MIT Robot Study
mit-robot-study@media.mit.edu <mit-robot-study@media.mit.edu>
To: siggioa@gmail.com
Mon, Apr 14, 2014 at 5:57 PM

Thank you for signing up for our study and helping to make our robots smarter!

PREPARATION:
1. Please start by making sure you have the Unity3D plugin installed in your browser: http://unity3d.com/webplayer
2. Read through the instructions on how to play the game: http://prg-robot-study.media.mit.edu/?action=instructions
3. Get familiar with the game, use this test level to navigate the space and test the controls: http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854toghbq4ugopap9&action=test&type=DaL9K

PERFORM STUDY TASKS:
1. Complete the first task: http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854t1oghbq4ugopap9&action=test&type=R57Zv
2. Complete the second task: http://prg-robot-study.media.mit.edu/?email=siggioa@gmail.com&eConfCode=d3897vou854t1oghbq4ugopap9&action=test&type=qNKK6

AFTER PLAYING GAME:
1. Fill out this questionnaire immediately after completing all rounds of all tasks. In the questionnaire, please enter the same email address as used here. https://www.surveymonkey.com/s/JNSQKW9
2. Once all data has been verified (questionnaire and game data) then your Amazon gift card will be emailed to you. This should happen within a few days of participation.

If you have any questions, please reply to this email.
Best regards,
-Siggi

Instruction pages:

MIT Robot Study | MIT MEDIA LAB

Game Instructions

You will be controlling a human avatar in an environment that also has a robot. The robot's purpose is to provide assistance to you but it doesn't always know your goals.

Please NEVER use your browser's back, forward, refresh/reload or simply re-enter the game url into the url entry, once a game has started loading. If you want to exit a game or reload it, please close the browser tab or window where the game is currently playing and follow the link in your email again.

Perception
When moving around the rooms, objects will fade in and fade out as they become visible to you. If you want to see more of the space you need to turn/move around to look.

Objects
There are a few different objects in the environment that can be picked up and used. One of the tasks is to assemble an engine so all of the items and tools are relevant to that task. Note: If you want to put down an object, it needs to be returned to the same table it was picked up from.

Engine parts
The engine block and the air filter can be picked up. They can also be added to the engine base if standing in front of it when putting them down.
Engine base: This is the site of an unassembled engine.
Engine block
Engine air filter

Tools
These can be picked up and applied to the engine only if there is a tool-tip visible above the engine indicating the tool in question.
Once the tool has been applied to the engine the tool-tip disappears, indicating that the tool isn't needed anymore.
Yellow screwdriver
Wrench
Red screwdriver. NOTE: From certain angles, the red handle isn't very visible and only the metal part can be clearly seen.

(page 1 of 3)

MIT Robot Study | MIT MEDIA LAB

Actions
You can move around the rooms by either using your mouse or keyboard. The available actions are:
- Move forward (Keyboard: 'Up arrow' or clicking on forward arrow): Moves forward if it is possible (nothing is blocking)
- Turn left (Keyboard: 'Left arrow' or clicking on left arrow): Rotates to the left.
- Turn right (Keyboard: 'Right arrow' or clicking on right arrow): Rotates to the right.
- Action (Keyboard: 'Space bar' or clicking on the item in front of the character): Performs the following actions:
  - Pick up: If you aren't holding anything, and you are standing in front of an item, then you will pick it up
  - Put down: If you are holding something and standing in front of the table where you picked it up, you will put it down
  - Apply to engine: If you are holding the right item or tool for the engine when standing in front of it, that item or tool can be applied to the engine.

Game processing
Please note that sometimes the game needs to process information. During that time you can not take any actions. If you notice that you can't move your character, please check to see if the icon in the top left corner of the game is saying "Please wait" or "Proceed".

Keyboard navigation
For keyboard navigation to work, the game area of the webpage needs to be selected. This can be accomplished by simply clicking with the mouse anywhere within the Unity game. Use the arrow keys for navigating forward or turning left and right. Use the space bar to perform actions.

Mouse navigation
The character can be controlled with the mouse as well; there are small "arrow" icons underneath the character which can be clicked to navigate. When the character is in front of a table with an object, the engine base, or a valid put-down location when holding an object, a mouse click on that location will perform the appropriate action (pick up, put down or apply).

(page 2 of 3)

MIT Robot Study | MIT MEDIA LAB

Engine assembly
The engine base can be assembled into a complete engine if the right items are placed on it and the appropriate tools used. The following sequence is needed:
1. Pick up an engine block and place it onto the engine base
2. Pick up the yellow screwdriver and apply it to the engine base
3. Pick up another engine block and place it onto the engine base
4. Pick up the wrench and apply it to the engine base
5. Pick up the air filter and place it onto the engine base
6. Pick up the red screwdriver and apply it to the engine base
This diagram will be provided next to the game window when playing the game.

Game screens
This is the screen that you will see immediately after the game has loaded and before the level has finished loading.
Once the level successfully loads, the game should look something like this (depending on which level you are joining).
Once the game finishes (either successfully or not) you will see this screen. In white letters you will see whether you succeeded or not and what to do next.
NOTE: If this screen stays for more than about a minute then a problem might have occurred, and you should close the window and re-click your game link.

(page 3 of 3)

Intermediate "rounds" page (three in the first task, four in the second):

MIT Robot Study | MIT MEDIA LAB
Your Task
Your task is to assemble the engine as fast as you can. The robot will attempt to assist you. Please finish ALL of the following 3 rounds of this task.
1. Round nr. 1
2. Round nr. 2
3. Round nr. 3

Game screen for task 1:

MIT Robot Study | MIT MEDIA LAB
Instructions
Your Task
Your task is to assemble the engine as fast as you can. The robot will attempt to assist you.
[Game goes here]
If you can see the character but the level hasn't loaded and it says "Proceed" in the upper left corner for more than 20 seconds then please close the browser tab and follow the link again (initial connection error).
Engine assembly instructions (refresher)

Game screen for task 2:

MIT Robot Study | MIT MEDIA LAB
Instructions
Your Task
This level will have two engine bases but only enough parts for assembling one. Your task is to assemble a single engine as fast as you can. The robot will attempt to assist you. Here are the points for this level (the robot is unaware of the points):
- Engine on the LEFT assembled is worth 15 points
- Engine on the RIGHT assembled is worth 20 points
[Game goes here]
If you can see the character but the level hasn't loaded and it says "Proceed" in the upper left corner for more than 20 seconds then please close the browser tab and follow the link again (initial connection error).
Engine assembly instructions (refresher)

References

Adams, William, Trafton, JG, Bugajska, M.D., Schultz, A.C., & Kennedy, W.G. 2008. Incorporating Mental Simulation for a More Effective Robotic Teammate. In: Twenty-third Conference on Artificial Intelligence (AAAI 2008). AAAI Press, Menlo Park. Storming Media.

Baker, C L, Saxe, R, & Tenenbaum, J B. 2009. Action understanding as inverse planning. Cognition, 113(3), 329-349.

Baker, C L, Saxe, R R, & Tenenbaum, J B. 2011. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution. Proceedings of the Thirty-Second Annual Conference of the Cognitive Science Society.

Barber, D. 2011. Bayesian reasoning and machine learning.

Baron-Cohen, S. 1995. Mindblindness. MIT Press, Cambridge, MA.

Barrett, Anthony, & Weld, Daniel S. 1994. Partial-order planning: Evaluating possible efficiency gains. Artificial Intelligence, 67(1), 71-112.

Bartneck, Christoph, Kulić, Dana, Croft, Elizabeth, & Zoghbi, Susana. 2009. Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics, 1(1), 71-81.

Barto, Andrew G, Bradtke, Steven J, & Singh, Satinder P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1), 81-138.

Bellman, Richard. 1957a. A Markovian decision process. Tech. rept. DTIC Document.

Bellman, Richard E. 1957b. Dynamic Programming. Princeton, N.J.: Princeton University Press.

Berlin, Matthew, Gray, Jesse, Thomaz, A L, & Breazeal, C. 2006. Perspective taking: An organizing principle for learning in human-robot interaction. Page 1444 of: Proceedings of the National Conference on Artificial Intelligence, vol. 21. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

Berlo, David K, Lemert, James B, & Mertz, Robert J.
1969. Dimensions for evaluating the acceptability of message sources. Public Opinion Quarterly, 33(4), 563-576.

Blakemore, S J, & Decety, J. 2001. From the perception of action to the understanding of intention. Nature Reviews Neuroscience, 2(8), 561-567.

Blum, A L. 1995. Fast Planning Through Planning Graph Analysis. Tech. rept. DTIC Document.

Bonet, Blai, & Geffner, Héctor. 2003. Labeled RTDP: Improving the convergence of real-time dynamic programming. Pages 12-21 of: ICAPS, vol. 3.

Bonet, Blai, & Geffner, Héctor. 2009. Solving POMDPs: RTDP-Bel vs. Point-based Algorithms. Pages 1641-1646 of: IJCAI.

Boutilier, Craig, Friedman, Nir, Goldszmidt, Moisés, & Koller, Daphne. 1996. Context-Specific Independence in Bayesian Networks.

Boutilier, Craig, Dearden, Richard, & Goldszmidt, Moisés. 2000. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2), 49-107.

Breazeal, C, Gray, Jesse, & Berlin, Matthew. 2009. An embodied cognition approach to mindreading skills for socially intelligent robots. International Journal of Robotics Research, 28(5), 656-680.

Brown, James Dean. 2008. Effect size and eta squared. JALT Testing & Evaluation SIG Newsletter, 12(April), 38-43.

Butterfield, J, Jenkins, O C, Sobel, D M, & Schwertfeger, J. 2009. Modeling aspects of Theory of Mind with Markov random fields. International Journal of Social Robotics, 1(1), 41-51.

Castelfranchi, Cristiano. 1998. Modelling social action for AI agents. Artificial Intelligence, 103(1-2), 157-182.

Christian Keysers, Valeria Gazzola. 2008. Unifying Social Cognition. Chap. 1 of: Pineda, J.A. (ed), Mirror Neuron Systems: the Role of Mirroring Processes in Social Cognition. Springer.

Csibra, G, & Gergely, G. 2007. 'Obsessed with goals': Functions and mechanisms of teleological interpretation of actions in humans. Acta Psychologica, 124(1), 60-78.

Cumming, Geoff. 2013. The New Statistics: Why and How. Psychological Science.

Dearden, Richard, & Boutilier, Craig. 1997. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1), 219-283.

Dunbar, Robin. 2005. Why God won't go away.

Durfee, E H. 1999. Practically coordinating. AI Magazine, 20(1), 99.

Frith, Uta. 1989. Autism: Explaining the enigma.

Gallese, V., & Goldman, A. 1998. Mirror neurons and the simulation theory of mindreading. Trends in Cognitive Sciences, 2(12), 493-501.

Geffner, Héctor, & Bonet, Blai. 1998. Solving Large POMDPs using Real Time Dynamic Programming. In: Proc. AAAI Fall Symp. on POMDPs.

Gerstenberg, Tobias, & Goodman, Noah D. 2011. Ping Pong in Church: Productive use of concepts in human probabilistic inference. 1, 1590-1595.

Ghallab, M, Aeronautiques, C, Isi, C K, Wilkins, D, & Others. 1998. PDDL - the planning domain definition language.

Ghallab, M, Nau, D S, & Traverso, P. 2004. Automated Planning: theory and practice. Morgan Kaufmann Publishers.

Goldman, A.I. 2006. Conceptualizing Simulation Theory. Chap. 2 of: Simulating minds: The philosophy, psychology, and neuroscience of mindreading. Oxford University Press, USA.

Gopnik, Alison, & Wellman, Henry M. 1992. Why the child's theory of mind really is a theory. Mind & Language, 7(1-2), 145-171.

Gray, Jesse, & Breazeal, Cynthia. 2012. Manipulating Mental States Through Physical Action. Pages 1-14 of: Social Robotics. Springer.

Guo, H, & Hsu, W. 2002. A survey of algorithms for real-time Bayesian network inference. In: AAAI/KDD/UAI02 Joint Workshop on Real-Time Decision Support and Diagnosis Systems. Edmonton, Canada.
Hart, Sandra G, & Staveland, Lowell E. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in psychology, 52, 139-183. Hauskrecht, Milos. 2000. Value-function approximations for partially observable Markov decision processes. J. Artif. Int. Res., 13(1), 33-94. 214 Helmert, M. 2006. The fast downward planning system. Journal of Artificial Intelligence Research, 26(1), 191-246. Hochberg, Yosef. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4), 800-802. Hoffman, G, & Breazeal, C. 2007. Cost-based anticipatory action selection for humanrobot fluency. Robotics, IEEE Transactions on, 23(5), 952-961. Hoffman, Guy. 2014. Measuring Fluency in Human-Robot Collaboration : Objective and Subjective Metrics. In: IROS. Hoffmann, J, & Nebel, Bernhard. 2011. The FF planning system: Fast plan generation through heuristic search. arXiv preprint arXiv:1106.0675, 14, 253-302. Howard, Ronald A. 1960. Dynamic Programming and Markov Processes. Hutchins, Tiffany L, Prelock, Patricia A, & Bonazinga, Laura. 2012. Psychometric evaluation of the theory of mind inventory (ToMI): A study of typically developing children and children with autism spectrum disorder. Journal of autism and developmental disorders, 42(3), 327-341. Ito, J Y, Pynadath, D V, & Marsella, S C. 2007. A decision-theoretic approach to evaluating posterior probabilities of mental models. In: AAAI-07 workshop on plan, activity, and intent recognition. Jara-ettinger, Julian, Baker, Chris L, & Tenenbaum, Joshua B. 2012. Learning What is Where from Social Observations. Kaelbling, Leslie Pack, Littman, Michael L, & Cassandra, Anthony R. 1998. Planning and acting in partially observable stochastic domains. 101(1-2), 99-134. 215 Artificial Intelligence, Kidd, C D, & Breazeal, C. 2004. Effect of a robot on user perceptions. In: IEEE/RSJ InternationalConference on Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings, vol. 4. Korb, K B, & Nicholson, A E. 2004. Bayesian artificialintelligence. cRc Press. Kurniawati, Hanna, Hsu, David, & Lee, Wee Sun. 2008. SARSOP: Efficient PointBased POMDP Planning by Approximating Optimally Reachable Belief Spaces. Pages 65-72 of: Robotics: Science and Systems. Lee, Kwan Min, Peng, Wei, Jin, Seung-A, & Yan, Chang. 2006. Can robots manifest personality?: An empirical test of personality recognition, social responses, and social presence in human-robot interaction. Journal of communication, 56(4), 754-772. Leslie, A M. 1994. ToMM, ToBy, and Agency: Core architecture and domain specificity. Mapping the mind: Domain specificity in cognition and culture, 119-148. Littman, Michael L, Cassandra, Anthony R, & Kaelbling, Leslie Pack. 1995. Learning policies for partially observable environments: Scaling up. Pages 362-370 of: ICML, vol. 95. Citeseer. Macindoe, 0, Kaelbling, L P, & Lozano-P6rez, T. 2012. POMCoP: Belief Space Planning for Sidekicks in Cooperative Games. In: Eighth Artificial Intelligence and Interactive Digital Entertainment Conference. McMahan, H Brendan, Likhachev, Maxim, & Gordon, Geoffrey J. 2005. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. Pages 569-576 of: Proceedings of the 22nd international conference on Machine learning. ACM. Nau, D S, Au, T C, Ilghami, 0, Kuter, U, Murdock, J W, Wu, D, & Yaman, F. 2003. SHOP2: An HTN planning system. J. Artif. Intell. Res. (JAIR), 20, 379-404. 216 Nichols, S, & Stich, S.P. 2003. 
Pieces of Mind: A Theory of Third-Person Mindreading. Chap. 3 of: Nichols, S. And Stich, S.P. (ed), Mindreading: An integrated account of pretence, self-awareness, and understandingother minds. Oxford University Press, USA. Nikolaidis, Stefanos, & Shah, Julie. 2013. Human-Robot Cross-Training: Computational Formulation, Modeling and Evaluation of a Human Team Training Strategy. Pages 33-40 of: Proceedings of the 8th ACM/IEEE internationalconference on Human-robot interaction. IEEE Press. Onishi, K H, & Baillargeon, R. 2005. Do 15-month-old infants understand false beliefs? Science, 308(5719), 255. Perner, J, Frith, U, Leslie, A M, & Leekam, S R. 1989. Exploration of the autistic child's theory of mind: Knowledge, belief, and communication. Child develop- ment, 689-700. Pineau, Joelle, Gordon, Geoff, Thrun, Sebastian, & Others. 2003. Point-based value iteration: An anytime algorithm for POMDPs. Pages 1025-1032 of: IJCAI, vol. 3. Poole, D. 1997. Probabilistic partial evaluation: Exploiting rule structure in probabilistic inference. Pages 1284-1291 of: INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 15. LAWRENCE ERLBAUM ASSOCIATES LTD. Pynadath, D V, & Marsella, S C. 2004. Fitting and compilation of multiagent models through piecewise linear functions. Pages 1197-1204 of: Proceedings of the Third InternationalJoint Conference on Autonomous Agents and Multiagent SystemsVolume 3. IEEE Computer Society. Ross, St6phane, Pineau, Joelle, Paquet, S6bastien, & Chaib-Draa, Brahim. 2008. 217 Online Planning Algorithms for POMDPs. J. Artif. Intell. Res. (JAIR), 32, 663704. Russell, S, & Norvig, P. 2003. Artificial Intelligence: A Modern Approach - 2nd Edition. Sanner, Scott, Goetschalckx, Robby, Driessens, Kurt, & Shani, Guy. 2009. Bayesian real-time dynamic programming. In: Proc. of IJCAI, vol. 9. Saxe, R. 2005. Against simulation: the argument from error. Trends in Cognitive Sciences, 9(4), 174-179. Scassellati, Brian. 2002. Theory of Mind for a Humanoid Robot. Autonomous Robots, 12(1), 13-24-24. Schiffer, Stephen. 2012. Propositions, What Are They Good For ? de Gruyter. Sebanz, Natalie, Bekkering, Harold, & Knoblich, Gunther. 2006. Joint action: bodies and minds moving together. Trends in Cognitive Sciences, 10(2), 70-76. Silver, David, & Veness, Joel. 2010. Monte-Carlo planning in large POMDPs. Pages 2164-2172 of: Advances in Neural Information Processing Systems. Singer, T, Seymour, B, O'Doherty, J, Kaube, H, Dolan, R J, & Frith, C D. 2004. Empathy for pain involves the affective but not sensory components of pain. Science, 303(5661), 1157. Singer, T, Seymour, B, O'Doherty, J P, Stephan, K E, Dolan, R J, & Frith, C D. 2006. Empathic neural responses are modulated by the perceived fairness of others. Nature, 439(7075), 466-469. Smith, Trey, & Simmons, Reid. 2004. Heuristic search value iteration for POMDPs. Pages 520-527 of: Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press. 218 Smith, Trey, & Simmons, Reid. 2006. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. Page 1227 of: Proceedings of the National Conference on Artificial Intelligence, vol. 21. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Sondberg-Jeppesen, N, & Jensen, F V. 2010. A PGM framework for recursive modeling of players in simple sequential Bayesian games. International Journal of Approximate Reasoning, 51(5), 587-599. Sondik, Edward Jay. 1971. The optimal control of partially observable Markov processes. Tech. rept. 
DTIC Document. Steinfeld, A, Fong, T, Kaber, D, Lewis, M, Scholtz, J, Schultz, A, & Goodrich, M. 2006. Common metrics for human-robot interaction. Pages 33-40 of: Proceedings of the 1st ACM SIGCHI SIGART conference on Human-robot interaction.ACM. Tauber, S, & Steyvers, M. 2011. Using Inverse Planning and Theory of Mind for Social Goal Inference. Proceedings of the Thirtieth Third Annual Conference of the Cognitive Science Society. Trafton, J.G., Cassimatis, N.L., Bugajska, M.D., Brock, D.P., Mintz, F.E., & Schultz, A.C. 2005. Enabling effective human-robot interaction using perspective-taking in robots. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 35(4), 460-470. Ullman, T D, Baker, C L, Macindoe, 0, Evans, 0, Goodman, N D, & Tenenbaum, J B. 2010. Help or hinder: Bayesian models of social goal inference. Advances in Neural Information Processing Systems (NIPS), 22. Vidal, J M, & Durfee, E H. 1995. Recursive agent modeling using limited rationality. Pages 376-383 of: Proceedings of the First International Conference on Multi- Agent Systems (ICMAS-95). 219 Wageman, Ruth, Hackman, J Richard, & Lehman, Erin. 2005. Team Diagnostic Survey Development of an Instrument. The Journal of Applied Behavioral Science, 41(4), 373-398. Wellman, Henry M., Cross, David, & Watson, Julanne. 2001. Meta-Analysis of Theory-of-Mind Development: The Truth about False Belief. Child Develop- ment, 72(3), 655-684. Wikipedia. 2014. Philosophy of mind - Wikipedia, The Free Encyclopedia. Wimmer Josef, H. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition, 13(1), 103-128. Woodward, A L. 2009. Infants' grasp of others' intentions. Current Directions in Psychological Science, 18(1), 53. Zettlemoyer, L S, Milch, B, & Kaelbling, L P. 2009. Multi-agent filtering with infinitely nested beliefs. 220