Are We There Yet?: The Role of Gender on the Effectiveness and Efficiency of User-Robot Communication in Navigational Tasks

THEODORA KOULOURI, STANISLAO LAURIA AND ROBERT D. MACREDIE
Department of Information Systems and Computing, Brunel University, UK

SHERRY CHEN
Graduate Institute of Network Learning Technology, National Central University, Taiwan

Many studies have identified gender differences in communication related to spatial navigation in real and virtual worlds. Most of this research has focused on single-party communication (monologues), such as the way in which individuals either give or follow route instructions. However, very little work has been reported on spatial navigation dialogues and whether there are gender differences in the way that they are conducted. This paper addresses the lack of research evidence by exploring the dialogues between partners of the same and of different gender in a simulated Human-Robot Interaction study. In the experiments discussed in this paper, pairs of participants communicated remotely; in each pair, one participant (the instructor) was under the impression that s/he was giving route instructions to a robot (the follower), avoiding any perception of gendered communication. To ensure the naturalness of the interaction, the followers were given no guidelines on what to say; however, each had to control the robot based on the user's instructions. While many monologue-based studies suggest male superiority in a multitude of spatial activities and domains, this study of dialogues highlights a more complex pattern of results. As anticipated, gender influences task performance and communication. However, the findings suggest that it is the interaction – the combination of gender and role (i.e., instructor or follower) – that has the most significant impact. In particular, pairs of female users/instructors and male 'robots'/followers are associated with the fastest and most accurate completion of the navigation tasks. Moreover, dialogue-based analysis illustrates how pairs of male users/instructors and female 'robots'/followers achieved successful communication through 'alignment' of spatial descriptions. In particular, males seem to adapt the content of their instructions when interacting with female 'robots'/followers and employ more landmark references compared to female users/instructors or when addressing males (in male-male pairings). This study describes the differences in how males and females interact with the system, and proposes that any female 'disadvantage' in spatial communication can disappear through interactive mechanisms. Such insights are important for the design of navigation systems that are equally effective for users of either gender.

1. INTRODUCTION
How we talk about places and objects in the world challenges researchers from a variety of disciplines. This research has resulted in the development of theories of human communication, cognition and behaviour and informs the design of computer applications and user interfaces, including those related to spatial information, such as Geographic Information Systems, dialogue systems for robot navigation and spatially-aware artificial agents.
It is widely recognised that there are large individual and group differences in how people process and communicate spatial information, related to, amongst other things, core spatial and verbal abilities (Vanetti and Allen, 1988), age (Golding, 1996), education, and previous experience (Newcombe et al., 1983). Research has shown that observed individual differences may also be 'stylistic' – that is, relating to preference rather than ability (Barkowsky et al., 2007). Gender-related differences have also been consistently reported in research looking at a variety of spatial tasks; contrary to popular belief, though, evidence remains inconclusive about male superiority in this area (Lawton, 1994; Allen, 2000a).

Research into communicating spatial information has identified a complex pattern of findings with regards to gender differences and the spatial and linguistic strategies employed. In particular, men frequently formulate their instructions using the cardinal system and metric distances, whereas women prefer to include references to proximal landmarks (Ward et al., 1986; Lawton, 1994). Research on wayfinding shows that women rely on local landmarks for orientation as well, whereas men tend to employ a global perspective that makes use of spatial relations within the environment (Coluccia et al., 2007a). Researchers have suggested that the wayfinding strategy that is more commonly associated with men is more efficient and robust. There is corroborating evidence that men outperform women in various navigational tasks (Chen et al., 2009; Coluccia and Losue, 2004), though gender differences seem to reduce or disappear as the task becomes easier (Coluccia and Losue, 2004) or the field of view increases (Czerwinski et al., 2002). Lawton (1994) suggests that gender differences are not always observed in spatial tasks, but that, when they are, the results tend to favour males. Lawton (1994) also draws attention to the choice of methodological approach used in studies – namely, abstract lab tasks versus real-world tasks – which is argued to be a contributory factor in the diversity of findings reported. Findings from psychometric tasks (e.g., mental rotation) consistently favour males, but results are less clear when it comes to more ecologically-valid tasks, for instance map learning and navigation on a campus (Montello et al., 1999). In other domains of spatial research, men have been found to draw, read and interpret maps more accurately than women (Allen, 2000a; Beatty and Troster, 1987; Coluccia et al., 2007b). Allen (2000a) attributes this advantage to better spatial working memory, though he points out that these differences apply to specific aspects of map reading and interpretation and should not be readily extended to general spatial abilities.

The complexity in the findings suggests that there is a need for further targeted and systematic investigation of gender-related differences. The research study reported in this paper seeks to contribute to the existing corpus of research through an analysis of route instructions, with the focus on route-based dialogue systems where human users give route instructions to computer-based devices, such as robots. Route instructions are an interesting area in the study of spatial language as they are not only produced to describe the world, but also to elicit a particular navigational behaviour from an individual (or system) which will result in their reaching a destination efficiently (Daniel and Denis, 1998).
2. ROUTE INSTRUCTIONS: BACKGROUND
Spatial language typically and naturally occurs in dialogue. As such, we hardly ever produce route instructions without an intended recipient. Moreover, communicating route knowledge is a collaborative, goal-oriented process, anchored in a specific spatial context. This makes it a prototypical dialogic situation. Spatial language is a lively area of research but, surprisingly, the overwhelming majority of studies explore spatial language in monologue – and often in highly controlled and artificial settings. There has been a range of empirical studies that have sought to bridge the gap between abstract and real-world route instruction tasks, yet this research tends to rest on particular assumptions or simplifications that are problematic. The most important of these is that the research often sees language production and comprehension in isolation, lacking interactivity between the parties.

Among the most cited studies are those of Allen, whose contributions include a framework for the analysis of route instructions (Vanetti and Allen, 1988), investigation of individual differences (Allen, 2000a; 1997) and an account of properties of route instructions that facilitate wayfinding (Allen, 2000b). Similarly, the Human Cognition Group in Paris (Daniel and Denis, 1998; 2004; Denis, 1997) has provided analyses of spontaneously-produced route instructions, looking at the effect of their conciseness and effectiveness on wayfinding. In all of the studies mentioned so far in this section, the instructions that the participants followed had been independently produced, either by another group of subjects beforehand or by the experimenter. Golding (1996) also describes a study in which subjects gave instructions, answering questions of the type: "how do you get to X?" and "where is X?". While Golding (1996) acknowledges route explanations as a question-answering process that supports the addressee's goals and identifies common ground between the parties involved, it is surprising that the addressees, who were informed confederates, were allowed minimal contribution to the interaction (being limited to providing only 'yes'/'no' answers).

The importance of the element of interaction that is missing from these studies is highlighted by Allen (2000b). The study's findings suggest that instructors adhere to certain conventions related to the principles of "referential determinacy" and "mutual understanding" captured in the Collaborative Model of dialogue (Clark, 1996). That is, people are expected to produce route descriptions which minimise the uncertainty along the route by concentrating on linguistic elements that: (i) provide specificity and additional information about the environment at points that pose potential orientation problems (such as crossroads); and (ii) are easy for the listeners to make sense of. The participants in Allen's (2000b) study followed scripts of route instructions that differed in terms of these elements, and it was found that navigation errors increased when route instructions violated these principles. The lack of interaction in many studies is a problematic simplification, as it assumes that understanding how people produce and comprehend language in isolation can lead to an understanding of how people communicate.
A normal dialogic situation is, however, more than an information transfer between speakers, and empirical research using dialogue paradigms has provided ample evidence of how the dialogic situation shapes language (e.g., Garrod and Anderson, 1987; Pickering and Garrod, 2004; Clark, 1996). In studies of this kind, language is seen as a collaborative activity in which partners introduce, negotiate and accept information. This is illustrated in studies that identify partner adaptation in a variety of tasks and contexts (see, for example, Brennan and Clark (1996); Schober (1993); Fischer (2007)). The interaction that occurs through dialogue means that interlocutors gradually align their linguistic expressions. This is evident in Garrod and Anderson's (1987) maze task experiment, which found that participants converged on similar spatial descriptions. Evidence of alignment has also been found within route-giving dialogues, where participants aligned in terms of reference and perspective (Filipi and Wales, 2009; Schober, 2009). Whereas monologue-based accounts treat language production and language comprehension as distinct, autonomous processes, the Interactive Alignment Model of dialogue (Pickering and Garrod, 2004) assumes that they are closely coupled to each other in dialogue. According to the model, as the dialogue proceeds interlocutors come to align their language at many levels (phonological, lexical, syntactic, semantic, reference frames and situation models). In other words, an interlocutor matches the most recent utterance from his/her partner with respect to lexical choice, lexical meaning, syntax, etc. Alignment acts as a mechanism to promote mutual understanding and highlights the collaborative nature of dialogic communication. A specific value that can be gained from using dialogue methods to study route instructions is that they can elucidate the effects of comprehension on the subsequent production of route information (Pickering and Garrod, 2004). They can also help to identify and understand the relation between route descriptions and natural communication phenomena that are suppressed in monologue situations – such as feedback, clarification and confirmation requests, and repairs (Muller and Prévot, 2009). Such events are important as they can be used to classify the effectiveness and efficiency of communication and are likely to be important in future innovations in human-computer navigation systems.

3. AIMS OF THE STUDY
The study reported in this paper adopts a dialogic approach to explore the conduct and completion of route tasks – an approach fruitfully applied in recent studies of car-based systems that stress the importance of navigation as a collaborative task (such as Forlizzi et al., 2010). This allows us to investigate the influence of gender – not only at an individual but also at a pair level – on communication efficiency and effectiveness, by looking at the performance of spatial route tasks and the content and structure of the instructions given and the responses to them. Although existing literature points to males having an advantage in monologic direction-giving and -following tasks, it remains an open question whether this advantage persists in interactions with other people, whether of the same or opposite gender. Through applying a dialogic approach, this study aims to determine whether males are better at comprehending, executing and negotiating route instructions in real-time interaction with their partner.
Moreover, the study explores the hypothesis that men are capable of producing more efficient route instructions than women. Essentially, it tests the hypothesis that pairs of male participants will outperform pairs that consist of at least one female, but in a setting where the users/instructors think that the follower is a robot (therefore avoiding bias as a result of gender perceptions). The study's focus, however, is not only to elucidate differences in performance but also in the content and structure of the instructions given, by looking at the use of certain linguistic components (namely, delimiters and landmark references) that are associated with effective instructions (Vanetti and Allen, 1988; Allen, 2000b; Michon and Denis, 2001). This will provide evidence in relation to the recurring finding (discussed in section 1) that women favour landmark-based references more than men do, in both the follower and instructor roles – which could, in turn, be used to improve the design of computer-based navigation systems. To address these study aims, an experiment was designed to elicit natural dialogues which contained spontaneously-generated route instructions within a controlled spatial network. The details of the method are set out in the next section.

4. METHOD
The study employed a modified version of a Wizard-of-Oz experiment: in a Wizard-of-Oz experiment, two people interact, one of whom is under the impression that s/he is talking to a system. The users/instructors in this experiment were made to believe that they were interacting directly with a robot (the follower). However, in order to ensure the naturalness of the interaction, the 'robots'/followers were also naive participants; they were given no guidelines on what to say and no dialogue script. The domain used in the experiment was navigation in a town and the user had to guide the robot to six designated locations. The cooperative nature of the task lay in two additional characteristics. First, in each pairing, only the user knew the destinations and had a global view of the environment, so the 'robot' had to rely on the user's instructions and location descriptions. Secondly, the user needed the 'robot's' descriptions to determine its current position and perspective. Participants were able to interact freely and develop their own strategies to carry out the experimental and discourse task.

Placing a 'robot' (rather than making explicit that it was another person) at the other end of the communication channel serves three purposes. Firstly, the obvious merit of this approach is that the results can be used in the future design of robotic/computer systems and embodied conversational agents. Secondly, communication in normal conversational settings makes use of assumptions and shared knowledge as well as general linguistic conventions. These features are often transparent to those involved and are likely to be confounding in terms of the aims of this study. When talking to a 'robot', users are expected to avoid using this knowledge and to depend only on assumptions and conventions set up within the course of the particular dialogue, allowing clearer insights into their patterns of interaction. Thirdly, it masks the gender of participants, avoiding gender stereotype issues which might influence the communication, such as men being less likely to listen to instructions from female voices (see, for example, Jonsson et al., 2008).
To allow gender differences in route-giving and -following tasks as they emerge from interaction to be explored, pairs were formed with all possible combinations of roles and gender:
1. Male user/instructor, Male 'robot'/follower (henceforth referred to as MM)
2. Male user/instructor, Female 'robot'/follower (MF)
3. Female user/instructor, Male 'robot'/follower (FM)
4. Female user/instructor, Female 'robot'/follower (FF)

4.1 Experimental set-up
For the purposes of the experiment, a custom system was developed that supported the interactive simulation and enabled real-time direct text communication between the user-'robot' pairs. The system connected two interfaces over a Local Area Network (LAN) using TCP/IP as the communication protocol. The system kept a log of the dialogues and also recorded the coordinates of the current position of the 'robot' at the moment that each message was transmitted, making it possible to analyse a textual description against a matching record of the robot's position and reproduce its path with temporal and spatial accuracy. The interfaces consisted of a graphical display and an instant messaging facility (the dialogue box). The dialogue box displayed the participant's own messages in the top part of the box, with the messages received from the other participant displayed in the lower part of the box. Figures 1 and 2 show the interfaces operated by the user/instructor and 'robot'/follower, respectively. The interface of the user/instructor displayed the full map of the simulated town. The destination location was shown in red and the tasks that had been completed were shown in blue.

Figure 1: The interface for the user/instructor.

The 'robot's'/follower's interface displayed a fraction of the map: the surroundings of the robot's current position. The 'robot' was operated by the follower using the arrow keys on the keyboard. The dialogue box also displayed a history of the user's previous messages. To simulate the ability of the 'robot' to learn routes, after each task was completed a button for this route appeared on the 'robot's'/follower's screen. If the 'robot' was instructed to go to a previous destination, the 'robot'/follower could press the corresponding button and the 'robot' would automatically execute the move.

Figure 2: The interface for the 'robot'/follower.

4.2 Experimental procedure
A total of 56 participants (31 males and 25 females) were recruited from various departments of a UK university. The allocation of participants to the two roles (user/instructor versus 'robot'/follower) was random and no computer expertise or other skill was required to take part in the experiment. The participants were allocated to pairs as shown in Table I.

Table I: The pair configurations
Pair Configuration    Number of Pairs
FF                    5
FM                    7
MF                    8
MM                    8

Users/instructors and 'robots'/followers were seated in separate rooms equipped with desktop PCs, on which the respective interfaces were displayed. Participants received verbal and written instructions related to the task from their role perspective. For the 'robots'/followers this included the fact that they were to pretend to be robots. The 'robots'/followers were also given a brief demonstration and time to familiarise themselves with the operation of the interface. The users/instructors were told that they would interact directly with a robot, which for practical reasons was a computer-based, simulated version of the actual robot.
They were informed that the robot had limited vision, but an advanced capacity to understand and produce spatial language and to learn previous routes, reducing the likelihood of users/instructors inferring during the interactions that the 'robot' was actually a person. They were asked to open each interaction with "hello" (which actually initialised the application used by the 'robot'/follower) and end it with "goodbye" (which closed both of the applications used by the pair). Users/instructors were asked not to employ cardinal reference systems (such as "North", "South", "up", "down"), since use of reference systems was not a focus of the study and it was thought that they might lead to confusion or ambiguity, given that no reference system was provided on the map. Instead, 'forward', 'backward', 'right' and 'left' were to be used as directional statements. The users/instructors were further instructed to use the robot's perspective. The users were given no other examples of, or instructions about, how to interact with the robot.

The pairs attempted six tasks, presented to each pair in the same order; the user/instructor navigated the 'robot'/follower from the starting point (bottom right of the map) to six designated locations (pub, lab, factory, tube, Tesco, shop). The users/instructors were free to plan and modify the route as they wished. The destinations were selected to require either incrementally more instructions or the use of previously taught routes. Dialogues could run until the task was completed or the user/instructor chose to end them. At the end of the experiment, participants were debriefed and the full nature of the experimental set-up was disclosed and explained. Before this disclosure, the users/instructors were probed about their understanding of the experimental set-up. Each of them confirmed their confidence in the set-up and expressed surprise when told subsequently that they had been interacting with a human acting as the 'robot'. This gives confidence that any effects identified in the results are not a result of language adaptation by the users/instructors arising from them believing that they were instructing another person.

5. DATA ANALYSIS APPROACH
The study yielded a corpus of 160 dialogues, which comprised 3,386 turns by the participants (1,853 user/instructor turns and 1,533 'robot'/follower turns). The users/instructors produced 1,460 instruction units. Quantitative analysis of relevant data – such as the time taken to complete each task, and the number of words, turns and instructions in each dialogue – was undertaken alongside detailed qualitative discourse analysis of the dialogues, which identified the frequency of miscommunication and the type and granularity of the instructions. The approaches taken as part of the discourse analysis are outlined in the following three subsections.

5.1 User/instructor utterances: component-based analysis of instruction units
The 950 instruction turns were segmented into 1,460 instruction units (also referred to as Turn Constructional Units by Tenbrink (2007) and Minimal Information Units by Denis (1997)). The primary, initial annotation of instruction units was based on the classification schemes of Denis (1997) and Tenbrink (2007). The main distinction made in Denis's (1997) original scheme is whether the instructions contain references to landmarks.
The categories in the scheme used in this study were: (i) action prescriptions without landmarks (e.g., go forward, turn right); (ii) action prescriptions with landmarks (e.g., turn left at the pub, cross the street); and (iii) introduction/description of landmarks with descriptive verbs such as "is", "see", or "find" (e.g., you'll see a bridge on your left). Following Tenbrink (2007), we introduced a subdivision of landmarks, categorising them as: references to three-dimensional landmarks (such as buildings and bridges); two-dimensional landmarks (referred to as pathways, such as streets and junctions); or references to the destination location. An example of the analysis and tagging is shown in Table II, which is a dialogue turn comprising four instruction units.

Table II: An example of a dialogue turn comprising four instruction units with tags used – DIR denotes action statements with verbs of movement, L denotes locations, P denotes pathways, DES denotes descriptive statements with descriptive/'state of being' verbs, D denotes the destination.
Cross the bridge [DIR L] then turn right [DIR]. Turn right again at the next junction [DIR P]. The factory is to your left [DES D].

Finer-grained component analysis was then performed on the corpus of instruction units. The analysis used Allen's Communication of Route Knowledge framework (Vanetti and Allen, 1988), which considers features and smaller constituents, such as frame of reference and modifiers. This framework complements the initially used scheme in two respects. Firstly, it further divides the Pathway category into 'choice points', which include junctions, intersections and crossroads, and 'pathways', which include channels of movement (streets, roads, etc.). Secondly, it introduces delimiters – features that define the instructions and provide differentiating information about an environmental feature (i.e., a landmark). The final scheme used in this study, bringing together Denis' (1997), Tenbrink's (2007) and Vanetti and Allen's (1988) ideas, is outlined in Table III. The tags that were used are shown within the brackets.

Table III: The final instruction units classification scheme used in this study (developed from the schemes presented by Denis (1997), Tenbrink (2007) and Vanetti and Allen (1988)).

Action Type                                                                          Tag
Action only: directive based on verb of movement                                    DIR
No action: reference to environmental feature                                       DES

Environmental Feature                                                                Tag
Location                                                                             L
Pathways                                                                             P
Choice points                                                                        C
Destination                                                                          D

Delimiter                                                                            Tag
Distance designations: specify action boundary information, such as space           1
separating points of reference (i.e., 'until you see a car park', 'from the
bridge to the church')
Direction designations: specify spatial relations in terms of an intrinsic,         2
body-based frame of reference (left, right) or cardinal directions (north,
south, up, down, forward, backward)
Relational terms: prepositions to specify the spatial relationship between the      3
'robot'/follower and the environmental feature, or between environmental
features (on the left of, toward, away from, between, in front of, beside,
behind, across from)
Modifiers: adjectives to differentiate features ('turn left at the big red          4
bridge', 'take the first/second/last road on the left')
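To make the combined scheme concrete, the sketch below represents the tags of Table III as simple Python constants and shows the Table II turn annotated against them. The data structures, field names and the delimiter assignments in the example are illustrative choices made here, not the annotation tooling or gold annotations used in the study.

```python
# Illustrative sketch only: the tag names follow Table III, but the data
# structures, field names and the delimiter assignments below are our own.
from dataclasses import dataclass, field
from typing import List

ACTION_TAGS = {"DIR", "DES"}          # directive (verb of movement) / descriptive
FEATURE_TAGS = {"L", "P", "C", "D"}   # location, pathway, choice point, destination
DELIMITER_TYPES = {1: "distance", 2: "direction", 3: "relational", 4: "modifier"}

@dataclass
class InstructionUnit:
    text: str
    action: str                                   # "DIR" or "DES"
    features: List[str] = field(default_factory=list)
    delimiters: List[int] = field(default_factory=list)

# The dialogue turn from Table II, segmented into four instruction units.
# Delimiter tags (e.g. 'right' as a category 2 direction designation) are
# added here to illustrate the finer-grained scheme; the category
# assignments are our reading of Table III.
turn = [
    InstructionUnit("Cross the bridge", "DIR", features=["L"]),
    InstructionUnit("then turn right", "DIR", delimiters=[2]),
    InstructionUnit("Turn right again at the next junction", "DIR",
                    features=["P"], delimiters=[2]),
    InstructionUnit("The factory is to your left", "DES",
                    features=["D"], delimiters=[3]),
]
```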
Category 2 delimiters (such as left, right, down, forward) are the basic constituents of a route instruction since they specify the direction of movement. However, purely directional instructions are underspecified and provide minimal information to the follower (Tenbrink, 2007). Complementing the directional instructions with action boundary information (provided by category 1 delimiters), and/or terms that clarify the frame of reference (category 3 delimiters) and specify the target landmark (category 4 delimiters), increases the instruction's level of granularity and reduces referential ambiguity (Allen, 2000b; Tenbrink, 2007). To estimate the specificity and level of granularity of user instructions – of interest given the study's focus on efficiency and effectiveness of communication – the number of actions and delimiters embedded in each instruction was calculated. According to research (Denis, 1997; Michon and Denis, 2001; Fischer, 2007), the inclusion of environmental features also decreases referential ambiguity, so such components were also considered. Examples of the annotation of the instruction units and the resulting calculation of components are given in Table IV.

Table IV: Example of component-based annotation of user instructions (DIR: action directive based on verb of movement; C: choice point; L: location; numbers in the tags column signify delimiter type from Table III).

Instruction Unit                                                  Tags: Action, Delimiter and    Number of
                                                                  Environmental Feature          Components
Move forward                                                      DIR 2                          2
Move forward until you get to the first junction on your right   DIR 2 1 4 C 3                  6
Move forward until you reach a bridge                             DIR 2 1 L                      4

The annotation is illustrated by considering the most complex instruction unit in the example captured in Table IV ('Move forward until you get to the first junction on your right'): the instruction is a directive statement (DIR) based on the verb of movement, 'move'; 'forward' is a category 2 delimiter designating direction; 'until' is a category 1 delimiter, providing boundary information for the action, 'move forward'; 'first' is a category 4 delimiter specifying the target landmark, 'junction'; the 'junction' is a choice point; and the choice point is further complemented by the category 3 delimiter, 'on your right', stating its position in relation to the frame of reference. This gives six components in the instruction unit.

5.2 'Robot'/follower utterances: analysis of responses
As the study focuses on both sides of the interaction, 'robot'/follower turns were also considered in the annotation. The responses by the 'robot'/follower, immediately after a user instruction, were tagged based on whether they were statements (S) or questions (Q), and whether they contained references to locations (L), pathways (P), choice points (C), the destination (D) or simple directional designations (i.e., category 2 delimiters, such as left, right, forward, etc.). The Interactive Alignment Model (see Section 2) proposes that the tendency of interlocutors to repeat each other's lexical choices is an indication that they are aligned in terms of lexicon (Brennan and Clark, 1996). Higher alignment is associated with better understanding and dialogue success (Costa et al., 2008). Recognising matches is therefore important in making judgements on the effectiveness of the communication. To this end, the 'robot'/follower tags were compared to the corresponding tags of the user instruction and the match rates were calculated.
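As a minimal sketch of how such match rates could be computed from the annotations, the snippet below reduces each exchange to the sets of tags assigned to the instruction and to the response; the overlap rule used is our own reading of the procedure, not the study's exact annotation rule.

```python
# A minimal sketch of the 'match' / 'no match' comparison between a user
# instruction and the immediately following 'robot' response. The overlap
# criterion (any shared environmental-feature or directional tag) is our
# reading of the procedure, not the study's exact rule.
def tags_match(instruction_tags, response_tags):
    """A response 'matches' when it reuses content tagged in the instruction."""
    content_tags = {"L", "P", "C", "D", "2"}   # features + directional designations
    return bool(set(instruction_tags) & set(response_tags) & content_tags)

def match_rate(exchanges):
    """exchanges: iterable of (instruction_tags, response_tags) pairs."""
    results = [tags_match(i, r) for i, r in exchanges]
    return sum(results) / len(results) if results else 0.0

# The two exchanges from Table V (below):
print(tags_match({"DIR", "2", "1", "C"}, {"S", "C"}))          # True  -> 'match'
print(tags_match({"DIR", "2", "C", "DES", "3", "D"}, {"S"}))   # False -> 'no match'
```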
An example of tagged dialogue is shown in Table V: the first 'robot' response (2) is tagged as a 'match', repeating the user's word "junction"; whereas the second 'robot' response (4) is not a 'match'.

Table V: Examples of instructions and responses with indications of whether or not they are matched. The columns denote (from left to right): the speaker (User or Robot), the utterance number, the utterance, the annotation of the instruction, the annotation of the 'robot'/follower's response and the match between instruction and response.

Speaker  No.  Utterance                                                           Instruction Tags         Response Tags  Match
User     1    turn right until you come to the junction                           DIR 2 1 C
Robot    2    I am at the junction                                                                         S C            Yes
User     3    turn back, at the junction turn left, destination is on the left    DIR 2, DIR 2 C, DES 3 D
Robot    4    please give instructions further                                                             S              No

5.3 Annotation of miscommunication
Other judgements related to efficiency of the communication can be drawn from the identification and analysis of miscommunication. From a theoretical perspective, there are two types of miscommunication: misunderstandings and non-understandings (Hirst et al., 1994). Misunderstandings corresponded to execution errors, which refer to instances in which the 'robot'/follower failed to understand the instruction and deviated from the described route. The coordinates (x, y) of the 'robot's' position were recorded for each exchanged message and placed on the map of the town (which was defined as 1024 by 600 pixels), allowing the movements of the robot to be retraced when undertaking analysis of the dialogues. Execution errors were determined by matching the coordinates corresponding to each of the user's/instructor's utterances with those returned as a result of their execution by the 'robot'/follower. An excerpt of a dialogue containing an execution error is shown in Table VI. Figure 3 illustrates the route which the user described and the robot followed during the interaction presented in Table VI. The 'robot'/follower accurately executed the instructions in utterances 5, 6 and 7. However, the 'robot'/follower misunderstood the next instruction (utterance number 8) and ended up in an unintended location.

Table VI: An excerpt of a dialogue containing an execution error. The columns denote (from left to right): the speaker (User or Robot), the utterance number, the 'robot' coordinates and time that the utterance was sent, and the utterance.

Speaker  No.  Coordinates and Time Stamp  Utterance
User     1    1000,530 @13:37:32          Hello
Robot    2    1000,530 @13:37:36          Hello
User     3    1000,530 @13:37:42          we are going to Tesco
Robot    4    1000,530 @13:38:5           ok. directions please
User     5    1000,530 @13:38:20          go straight ahead and turn right at the junction
User     6    909,464  @13:38:47          then go straight and follow the road round the bend to the left
User     7    902,358  @13:39:12          you will pass a bridge on your right, continue going straight
User     8    675,259  @13:39:35          then cross the bridge and turn left
User     9    561,117  @13:40:8           Tesco will be on the right hand side and that is the destination

Figure 3. The 'robot's' execution of the instructions given in the dialogue presented in Table VI: the solid white line illustrates the accurately executed route; the grey long-dashed line represents the route that the instructor described but the 'robot' failed to execute; the grey dotted line shows the deviation from the intended route; the numbers in brackets along the executed route indicate the utterances communicated at that point.
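A simplified sketch of how such an execution error could be flagged from the logged positions is given below; the deviation threshold, field names and example coordinates are assumptions made for illustration, not values taken from the study.

```python
# Simplified sketch: flag an execution error when the 'robot's' next logged
# position deviates from where the described move should have ended. The
# tolerance value and the example coordinates are assumptions, not study data.
from math import hypot

MAP_SIZE = (1024, 600)   # map dimensions in pixels, as described above
TOLERANCE = 20           # acceptable deviation in pixels (assumed)

def execution_error(expected_xy, actual_xy, tolerance=TOLERANCE):
    """True when the executed position deviates from the described route."""
    dx = expected_xy[0] - actual_xy[0]
    dy = expected_xy[1] - actual_xy[1]
    return hypot(dx, dy) > tolerance

print(execution_error((500, 300), (505, 296)))   # False: within tolerance
print(execution_error((500, 300), (430, 240)))   # True: flagged as an execution error
```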
The second type of miscommunication considered in the analysis consists of the utterances by the 'robot'/follower that signalled non-understanding (typically formulated as clarification requests) (Gabsdil, 2003). The annotation of non-understandings follows the definitions provided by Hirst et al. (1994) and Gabsdil (2003). Non-understandings occur when: (i) the 'robot'/follower forms no interpretation of the user/instructor's utterance; (ii) the 'robot'/follower is uncertain about the interpretation s/he obtained; or (iii) the utterance is ambiguous to the 'robot'/follower, leading to more than one interpretation of the instruction. Table VII contains examples of utterances corresponding to these different sources of non-understanding, but it should be made clear that the analysis did not consider each source separately. Non-understandings also included cases in which the 'robot'/follower understood the meaning of the instruction but had a problem with its execution. An example of this final type of non-understanding is where the user/instructor tells the 'robot'/follower to move forward, but the instruction cannot be executed given the 'robot's' current location at a T-junction, as in example (iv) in Table VII. Allen (2000b) practically demonstrated the validity of combining deviations from the described route with instances in which followers expressed non-understanding (i.e., they did not know where to go next) into a single measure – termed 'information failure'. This approach was adopted in the study, with the two types of miscommunication (execution errors and non-understandings) being combined in one measure.

Table VII: Examples of non-understandings produced by the 'robot'/follower.

(i)    User:  Turn left.
       User:  There is a pub. The building next to you.
       Robot: Please instruct which way exactly.
(ii)   User:  You must turn to your left and go to the end of the junction. Then you turn right.
       Robot: Turn right when I can see the tree?
(iii)  User:  Go back to last location.
       Robot: Back to the bridge or back to the factory?
(iv)   User:  Go forward.
       Robot: There is a fork in the road.

6. RESULTS
This section introduces and justifies the analysis approach used and reports the results of the quantitative and qualitative analysis of the dialogues between users/instructors and 'robots'/followers.

6.1 Analysis Approach
One-way ANOVA for independent groups was performed, the factor being the pair configuration (MM, MF, FM, FF). The efficiency of the interaction was determined using the following measures: the time taken and the number of words, turns and instructions per task. The effectiveness of the interaction was established by measuring the rates of miscommunication. Component analysis of the instruction units was undertaken to provide detail on the granularity and types of instruction, which are also important in determining effectiveness and efficiency. Finally, the match rates between the 'robot'/follower responses and the user/instructor instructions were used as an indicator of alignment between partners. Two-way ANOVA was also undertaken, the factors being user and 'robot' gender. The results of the one- and two-way ANOVA were consistent for all three variables considered (time taken, the number of instructions per task, and rates of miscommunication). The high-level analysis and data for the two-way ANOVA are presented in Appendix A. The paper reports the results of the one-way ANOVA because this analysis emphasises (or 'foregrounds') the effect of the interaction of role/gender, which is expressed as the factor of group configuration (all possible combinations of role and gender). Particular caution was exercised with respect to the assumptions for the parametric tests.
For all three variables (i.e., time, instructions per task and rates of miscommunication), the shapes of the distributions were examined before performing the ANOVA. For the group of n = 5, some of the histograms did not look particularly 'normal', but the assumptions were not grossly violated, since there were no signs of outliers in the boxplots and no 'lumps' or large gaps in the distributions. As such, the data are not inconsistent with being drawn from a normally-distributed population. Secondly, Levene's test was used to ascertain equal variances between groups. Finally, the most 'conservative' (i.e., the lowest risk of type I error) post hoc test – the Scheffé test for pairwise multiple comparisons – was used to identify the levels of significance for specific differences between groups. To provide additional assurance of the suitability of adopting a parametric test, a non-parametric test was used as a comparison. The Kruskal-Wallis test – the non-parametric equivalent to one-way ANOVA, based on the ranks of scores – was performed. The Mann-Whitney test was used for post-hoc analysis. The results of the Kruskal-Wallis test were along the same lines as the parametric ANOVAs that were undertaken. The results in terms of pairwise differences were also supported by the Mann-Whitney test. The results of the non-parametric tests are given in Appendix B for completeness.

6.2 Time taken per task
The results associated with the average time taken to complete each task suggest that pair configuration has a significant impact on the speed with which the pairs completed each task, F(3,24) = 4.038, p = 0.019. The post-hoc test indicates that statistically-reliable differences are found between the FM and FF pairs (p = 0.05) and between the FM and MF pairs (p = 0.05). In particular, FM pairs were significantly quicker (306 seconds) – by almost two minutes – than FF and MF pairs (425 seconds and 409 seconds, respectively). Figure 4 shows the average completion time per task for the four pair configurations. The means and standard deviations are included in Table VIII. The additional two-way ANOVA analysis and non-parametric tests presented in Appendices A and B support the assertion that this is a pair effect.

Figure 4: Average completion time (in seconds) per task for the four pair configurations.

6.3 Number of words, turns and instructions
In order to further explore communication efficiency, the number of words, turns and instructions required to complete each task was recorded. Comparisons of the number of words and turns (by users/instructors, 'robots'/followers and totals per pair) showed no reliable differences. However, analysis of the mean number of instructions that users/instructors provided revealed an effect of pair configuration, F(3,23) = 3.771, p = 0.025. The largest difference, which provided the greatest contribution to the effect, was found between FM and FF pairs (p = 0.03), with the former using on average 40% fewer instructions to correctly reach the destination than the latter. The mean number of instructions per task and standard deviations for each pair configuration are shown in Table VIII. In brief, it seems that all interlocutors, irrespective of role and gender, were equally 'talkative' and claimed conversational ground at similar rates. However, female users/instructors in FF pairs seemed less efficient in the use of route instructions.

Table VIII: The means and standard deviations for the three variables – time (seconds), number of instructions, and miscommunication (number of execution errors and non-understandings) – per task, for the four pair configurations.

Pair Configuration  Time per Task (Mean / SD)  Instructions per Task (Mean / SD)  Miscommunication per Task (Mean / SD)
FF                  424.97 / 67.21             12.340 / 4.080                     2.180 / 1.443
FM                  306.19 / 61.27             7.380 / 2.264                      0.785 / 0.533
MF                  409.15 / 65.99             8.787 / 2.454                      1.234 / 0.671
MM                  370.14 / 73.85             8.095 / 1.959                      0.809 / 0.539
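As a minimal sketch of the test battery described in section 6.1, the snippet below applies the same family of tests to per-pair mean completion times using SciPy; the values are placeholders rather than the study's data, and the Scheffé post hoc comparisons used in the paper are not shown.

```python
# Minimal sketch of the analysis in section 6.1 using SciPy; the per-pair
# values below are placeholders, not the study's data, and Scheffé post hoc
# comparisons (used in the paper) are not included here.
from scipy import stats

# One value per pair: mean completion time (seconds) per task (placeholder data).
ff = [420.0, 510.0, 380.0, 450.0, 365.0]
fm = [300.0, 250.0, 340.0, 290.0, 310.0, 330.0, 320.0]
mf = [410.0, 380.0, 460.0, 400.0, 390.0, 430.0, 420.0, 385.0]
mm = [360.0, 330.0, 400.0, 370.0, 350.0, 420.0, 310.0, 430.0]

print(stats.levene(ff, fm, mf, mm))       # homogeneity of variances
print(stats.f_oneway(ff, fm, mf, mm))     # one-way ANOVA (parametric)
print(stats.kruskal(ff, fm, mf, mm))      # Kruskal-Wallis cross-check
print(stats.mannwhitneyu(fm, ff))         # pairwise post-hoc comparison (FM vs FF)
```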
6.4 Frequency of miscommunication
As previously noted, a combined measure of the two types of miscommunication ((i) execution errors and (ii) 'robot' turns that were tagged as expressing non-understanding) was used in this study as a measure of effectiveness. The one-way ANOVA revealed a significant effect (F(3, 23) = 3.628, p = 0.028) with respect to this combined measure of miscommunication. Post hoc analyses using the Scheffé criterion indicated that the average number of errors and non-understandings per task was higher in the FF condition (M = 2.18, SD = 1.44) than in the FM condition (M = 0.78, SD = 0.53) (p = 0.035). Marginally significant differences (p = 0.08) were also found between FF and MM (M = 0.80, SD = 0.53) pairs. These results suggest that 'robots' in FF pairs were almost three times more likely to fail to understand and execute instructions than male 'robots' paired with users of either gender. The rates of miscommunication are summarised in Table VIII.

6.5 Instruction types and granularity
The corpus of utterances contained 1,460 single instructions. Primary component analysis of the instructions revealed that the largest single type of instruction was action prescriptions without landmarks (47% of the instructions); the remaining 53% of the instruction corpus contained a reference to a location or a path entity. In particular, users/instructors employed instructions that included location references in 19.4% of the instruction instances, and pathway references accounted for 18.7% of instruction instances. Finally, destination references with action constituted the first instruction of a task (stating the destination), whereas destination references without action were used, without exception, as final instructions; destination references formed 15% of all instructions. Figure 5 shows the distribution of the instruction types in the corpus.

Figure 5: Overall distribution of instruction types (Action Only (DIR); Action + Location (DIR L); Action + Pathway (DIR P); Action + Destination (DIR D); No Action + Location (DES L); No Action + Pathway (DES P); No Action + Destination (DES D)).

Comparing the distribution of instruction types across pair configurations yielded a reliable difference (χ2(3) = 29.601, p < 0.001), showing that users/instructors in MF pairs tend to use considerably fewer simple action prescriptions than users/instructors in the other pair configurations. Only 35% of their instructions were action-only descriptions, as opposed to 50% for the other pairs. They also used more location references (27% versus 16%). The results of the analysis are schematically and numerically presented in Figure 6 and Table IX.

Figure 6: Schematic presentation of the use of each instruction category across pair configurations.
Table IX: Percentages showing the use of each instruction category across pair configurations.

Pair Configuration  Action Only  Action+Location  Action+Pathway  Action+Destination  No action+Location  No action+Pathway  No action+Destination
FF                  50.57%       15.52%           19.83%          6.03%               0.29%               0.00%              7.76%
FM                  51.29%       18.06%           15.16%          5.48%               0.32%               0.65%              9.03%
MF                  35.05%       26.80%           20.62%          9.79%               0.77%               0.00%              6.96%
MM                  51.45%       15.22%           17.39%          6.76%               0.24%               0.72%              8.21%

The analysis revealed an association between pair configuration and the level of granularity of the instructions provided (χ2(3) = 9.674, p = 0.02). Using the categories of 'instructions with two components' and 'instructions with three or more components', inspection of the frequencies showed that users/instructors in MF pairs were more likely to provide more detailed and explicit information (see Figure 7). Interestingly, approximately the same frequencies were observed across the other configurations, but 'robots'/followers in FF pairs seemed to be the least capable of dealing with reference resolution problems, under-specification and missing boundary information, as indicated by the elevated miscommunication rates (reported in Table VIII).

Figure 7: Frequency of instructions with two, and with three or more, components for the four gender pairings.

6.6 'Robot'/follower responses
Two points relating to gender differences can be identified from the 'robot'/follower response data. The first concerns the 'match'/'no match' rates – the extent to which the 'robot'/follower responses either do or do not match the linguistic content of the previous instruction (see Table V for examples). The data suggest that an association exists between the pair configuration and 'match'/'no match' rates (χ2(3) = 15.148, p = 0.002), with FF pairs most likely to use non-matching responses (see Table X). The second point relates to the use of landmarks in responses. Female participants were found to use more reference-based descriptions than males. Though no inference was possible regarding the use of landmark references by 'robot'/followers across the pair configurations, because of the large inter-subject variability that existed, it is interesting to note that female 'robot'/followers in MF pairs used relatively more references than those in FF pairs. This may reflect the earlier reported finding that male instructors in MF pairs used the largest number of landmark references.

Table X: 'No match' rates between user/instructor instruction and 'robot'/follower response for different pair configurations.

Pair Type  Percentage of 'no match' 'robot'/follower responses
FF         48.45
FM         39.32
MF         33.68
MM         39.42

7. DISCUSSION
Although gender differences in user interface design and use are of great interest to researchers and developers alike, the interaction design process usually excludes gender considerations. As a result, even today 'the user' remains genderless (Bardzell, 2010). This study helps to address this gap and can be placed within the new subfield of HCI, termed 'Gender HCI' (Beckwith et al., 2006), which focuses on the differences in how males and females interact with 'gender-neutral' systems and, by taking gender issues into account, how systems can be designed to be equally effective for both men and women (Fern et al., 2010). Research in this area includes the pioneering work by Czerwinski and her colleagues (Czerwinski et al., 2002).
Their approach was first to identify gender differences in Virtual Reality (VR) navigation and then to find solutions to offset these differences in display and VR world design (by provision of larger displays and wider views). Another example is the Gender HCI project of the EUSES consortium, which uncovered gender differences in end-user programming in terms of confidence and feature use and proposed solutions for the design of programming environments (Beckwith, 2007). Similarly, Fern et al. (2010) showed differences between male and female users, and the relation between their strategies and success in a debugging task. The position held in this body of research is that software design determines how well female problem solvers can make use of the software. Understanding how gender influences strategies, behaviours and success is the first step towards design that promotes successful behaviours and strategies by users of both genders. Along the same lines, the study reported in this paper contributes to 'Gender HCI' by detecting gender differences in the novel domains of Human-Robot Interaction (HRI) and spoken dialogue systems, which are prime examples of collaborative, goal-oriented interaction between humans and computer systems.

As noted in section 2, there is a significant amount of research on spatial cognition and language, a considerable part of which has focused on the investigation of gender differences. The novelty of the current study, however, is that it has examined gender differences using the dialogue paradigm in a naturalistic but carefully controlled spatial setting. Most existing research identifies male superiority in a range of spatial activities and domains, leading to the prediction that all-male pairs would outperform all other groups and that all-female pairs would be the least successful. Similarly, it might be expected that pairs with a male in either the user/instructor or 'robot'/follower role (i.e., MF or FM pairs) would show more efficient interactions than FF pairs. The study reported in this paper, however, reveals a more complex pattern of results. As anticipated, gender influences task performance and communication. However, the findings suggest that it is the interaction – the combination of gender and role – that has the most significant impact. In particular, in this study, pairs of female users/instructors and male 'robots'/followers (i.e., FM pairs) are associated with the fastest and most accurate completion of tasks. Female users/instructors needed to give fewer instructions, but only when the person following them was male. Male 'robots'/followers in this pair configuration are associated with the lowest rates of execution errors and non-understandings. Whereas females in FM pairs were involved in the most efficient communication, when paired with female 'robots'/followers they failed to produce similar results. In FF pairs, tasks took longer, female users/instructors gave more instructions and female 'robots'/followers faced greater difficulty in understanding and executing instructions. MF pairs were also significantly slower than FM pairs in completing the tasks. These results do not, though, imply male superiority in direction interpretation and following, since female 'robots'/followers in MF pairs performed equally well in terms of mean number of instructions and were almost as 'error-prone' as male 'robots'/followers paired with male users/instructors.
While this analysis in terms of performance-related measures identifies a picture in which FM pairs were the most successful and FF pairs the least, the dialogue-based analysis refines this view and illustrates how MF pairs achieved successful communication through alignment of spatial descriptions. In terms of instruction type (action-only versus action + reference to environmental feature), MF pairs used considerably fewer action-only instructions and a greater number of instructions incorporating landmark references (i.e., action + reference to environmental feature) compared to the other groups. Though there is ample evidence that females use landmarks as a strategy to find and describe a route, there must be a different explanation here, given that the instructor in this pair type was male. In this study, male users/instructors included significantly more landmark references only when interacting with a female as 'robot'/follower. The explanation proposed here is that the male users/instructors adapted their own linguistic choices to match the needs of the female 'robots'/followers, by incorporating more landmark references compared to when they were interacting with male 'robots'/followers. Indeed, lexical alignment between partners in MF pairs was the highest among all pair types. This fits with studies that show that speakers adapt their utterances according to the perceived needs, characteristics and spatial capabilities of their partners (Sacks et al., 1974; Schober, 1993; 2009). Purely spatial instructions, although simpler in form, are mostly underspecified and ambiguous (Tenbrink, 2007), whereas landmark references provide cues for (re-)orientation and are used to solve or prevent navigation problems (Michon and Denis, 2001). Users/instructors in MF pairs did not rely as frequently on purely spatial instructions, avoiding a source of potential miscommunication. Male users/instructors in MF pairs also employed the highest number of delimiters, thus decreasing ambiguity in their instructions and facilitating wayfinding. On the other hand, users/instructors in all other pair configurations used a greater number of simple spatial instructions, and also provided instructions at a similar level of granularity. This may be because female users/instructors did not adapt as well as male users/instructors to the needs of their female partners. This inference is further supported by the low rates of alignment of 'robot'/follower responses to instructions in FF pairs. If this interpretation of the findings is correct, it raises questions around how male users/instructors were able to perceive their partner's needs within the very unusual communication situation of (albeit simulated) human-robot interaction. This presents opportunities for further experimental investigation.

The fact that differences exist between how people provide instructions to humans compared to artificial agents in similar contexts is not counter-intuitive. However, the dimensions and extent of these differences merit additional in-depth research. Comparing the corpus collected in this study to similar corpora provides interesting insights into the subject. Studies that have used the same classification of instructions (action only, action + reference to environmental feature, etc.) across a variety of experiments and conditions report that simple action prescriptions do not exceed 19% of all instructions given (Denis, 1997; Daniel and Denis, 1998; 2001). In Muller and Prévot (2009), the rate is even lower (5%).
The common factor in all these studies, however, is that the 'follower' is a human. When the follower is a simulated robot, the proportion of action-only instructions rises – to, for example, 31% in the study by Tenbrink (2007) – suggesting that action-only instructions are less common when produced as part of navigation tasks for human participants. A likely reason for this is that people are generally naive about the linguistic and functional abilities of a robot, so they tend to employ a higher proportion of simple action-based descriptions that are not anchored on visually-recognised landmarks (see also studies by Moratz and Fischer (2000); Moratz et al. (2001)).

This section concludes with a key recommendation concerning the adaptability of the system's dialogue manager. In this study, females were not found to use landmark-based spatial descriptions, although this has been described in other research as their default wayfinding and instruction-giving strategy (in non-interactive navigation tasks). Nor was it found that gender alone predicts whether and how compound descriptions (that is, descriptions with high granularity) will be employed. However, the findings do highlight the importance of the 'input-output matching' of spatial descriptions produced by user/instructor and 'robot'/follower as a precondition for stable and successful communication. That is, although the agents initially start by using different spatial descriptions, as the dialogue progresses the most frequently used words become increasingly likely to be reused, inhibiting the other competing expressions. The process of input-output matching is rapid, often occurring in under 15 turns (i.e., soon after completion of the first task in the experiment). This phenomenon is of immediate practical concern for the design of human-computer dialogue systems and has implications for handling both user-generated and system-generated responses. Corpus-collection studies are essential for building the grammar of the dialogue manager of the system and, as the work presented in this paper exemplifies, they need to be naturalistic as well as being tuned towards the future application. When the system is deployed, the dialogue manager is initially equipped with this grammar of expressions (for instance, a grammar containing the appropriate schemata of landmark-based, simple action-based and compound descriptions), all of them equally likely to be used by either the system or the user. As the dialogue unfolds, the findings from this study suggest that the dialogue manager should record and monitor the content and structure of the user's responses so that it is able to gradually narrow down the grammar to the preferred expressions. This could contribute to the accuracy of the spatial language understanding component and, possibly, bring us closer to what makes human-human communication and collaboration successful. Moreover, dialogue systems, like robots, are also destined for long-term interaction with the user. Hence, the adaptation occurring within a single interaction should be extended to adaptation between interactions to provide more stable and aligned dialogues.
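A minimal sketch of the within-interaction narrowing described above is given below, assuming the grammar is reduced to a weighted set of description schemata; the schema labels, the weight update rule and all names are illustrative choices made here, not part of any system implemented in the study.

```python
# Illustrative sketch of the grammar-narrowing idea described above: schema
# weights start uniform and are reinforced whenever the user is observed using
# that type of description, so the system's own choices gradually converge on
# the user's preferred expressions. All names and the update rule are assumed.
import random

class AdaptiveGrammar:
    def __init__(self, schemata, boost=1.0):
        self.weights = {s: 1.0 for s in schemata}   # initially equally likely
        self.boost = boost

    def observe(self, schema):
        """Reinforce a schema each time a user utterance instantiates it."""
        if schema in self.weights:
            self.weights[schema] += self.boost

    def choose(self):
        """Pick a schema for the system's next spatial description."""
        schemata = list(self.weights)
        return random.choices(schemata,
                              weights=[self.weights[s] for s in schemata])[0]

grammar = AdaptiveGrammar(["landmark-based", "simple action-based", "compound"])
for observed in ["landmark-based", "landmark-based", "compound"]:
    grammar.observe(observed)
print(grammar.choose())   # now biased towards landmark-based descriptions
```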
This, however, was only partially supported by our study, in which 'mixed' pairs exceeded or matched the performance of all-male pairs. In particular, the results in sections 6.2 to 6.6 reveal intricate patterns showing that female users/instructors paired with male 'robots'/followers were the most successful. Male users/instructors paired with female 'robots'/followers achieved their strong performance by taking advantage of a particular interactive mechanism, aligning their expressions with those of their partner. Male users/instructors adapted to the 'needs' of their female partners by adjusting their use of landmark references, highlighting the fact that the language one produces in monologue is different from language in dialogue. The results do not challenge previous studies, but complement them by suggesting that gender differences in accurate wayfinding or direction-giving can be mitigated when females interact with males, either as instructors or followers. That is to say, if there exists a female 'disadvantage', it seems to disappear through mechanisms that emerge naturally in the interaction with males. This observation holds practical significance for the development of dialogue systems, as it points to the existence of dialogue features that benefit users of both genders equally. Because dialogue has received less attention as a research paradigm, such interactive mechanisms remain comparatively under-explored.

The outcomes of this study raise questions that present rich opportunities for further experimental investigation. In particular, a next step is to pinpoint the dialogue features and strategies that relate to improved performance and communication and to test them in isolation in a follow-up controlled dialogue study. As suggested above, a tentative hypothesis readily emerges from our current results: the coordinated use of landmark references observed in MF pairs could be the key to why they outperformed FF pairs and matched, in many respects, the performance of the other pair configurations. In this study, participants successfully coordinated in the presence of uncertainties arising from language and the environment. Thus, the element of interactive clarification becomes significant for successful communication and merits further investigation.

This study also illustrates a valid methodology for assessing the range of linguistic options that users are likely to employ in spatial Human-Robot Interaction and shows how the interplay of gender and role affects the content of the instructions. It identifies user patterns of adaptation; for instance, users appear to prefer to give short and incremental instructions, in contrast to strategies used in human-human spatial communication. The study showed a benefit deriving from partner alignment in choice of words, a strategy that was influenced by role and gender. Overall, we contend that these observations can serve to inform the requirements analysis and design of human-computer dialogue systems. From a wider perspective, these insights may also help researchers and designers to better understand how spatial information should be displayed or communicated by systems and how the availability and presentation of such information may change the behaviour and experience of users of different gender.

REFERENCES
Allen, G. L. (1997). From knowledge to words to wayfinding: Issues in the production and comprehension of route directions. In Hirtle, S. & Frank, A. (eds.), Spatial Information Theory: A Theoretical Basis for GIS.
Berlin: Springer-Verlag, p.363-372.
Allen, G. L. (2000a). Men and women, maps and minds: Cognitive bases of sex-related differences in reading and interpreting maps. In O'Nuallain, S. (ed.), Spatial Cognition: Foundations and Applications. Amsterdam: John Benjamins, p.3-18.
Allen, G. L. (2000b). Principles and practices for communicating route knowledge. Applied Cognitive Psychology, 14(4), p.333-359.
Bardzell, S. (2010). Feminist HCI: taking stock and outlining an agenda for design. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI '10). ACM, New York, USA, p.1301-1310.
Barkowsky, T., Knauff, M., Ligozat, G. & Montello, D. R. (eds.) (2007). Spatial Cognition V: Reasoning, Action, Interaction, Lecture Notes in Computer Science. Berlin: Springer.
Beatty, W. & Troster, A. (1987). Gender differences in geographical knowledge. Sex Roles, 16(11), p.565-589.
Beckwith, L. (2007). Gender HCI Issues in End-User Programming. Ph.D. Thesis, Oregon State University.
Beckwith, L., Burnett, M., Grigoreanu, V. & Wiedenbeck, S. (2006). HCI: What about the software? Computer, p.83-87.
Brennan, S. E. & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory and Cognition, 22(6), p.1482-1493.
Chen, C., Chang, W. & Chang, W. (2009). Gender differences in relation to wayfinding strategies, navigational support design, and wayfinding task difficulty. Journal of Environmental Psychology, 29, p.220-226.
Clark, H. H. (1996). Using Language. New York: Cambridge University Press.
Coluccia, E., Bosco, A. & Brandimonte, M. A. (2007). The role of visuo-spatial working memory in map drawing. Psychological Research, 71, p.359-372.
Coluccia, E. & Losue, G. (2004). Gender differences in spatial orientation: a review. Journal of Environmental Psychology, 24(3), p.329-340.
Coluccia, E., Losue, G. & Brandimonte, M. A. (2007). The relationship between map drawing and spatial orientation abilities: a study of gender differences. Journal of Environmental Psychology, 27, p.135-244.
Costa, A., Pickering, M. J. & Sorace, A. (2008). Alignment in second language dialogue. Language and Cognitive Processes, 23(4), p.528-556.
Czerwinski, M., Tan, D. S. & Robertson, G. G. (2002). Women take a wider view. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing Our World, Changing Ourselves (Minneapolis, Minnesota, USA, April 20-25, 2002). CHI '02. ACM, New York, NY, p.195-202. DOI= http://doi.acm.org/10.1145/503376.503412.
Daniel, M. P. & Denis, M. (1998). Spatial descriptions as navigational aids: A cognitive analysis of route directions. Kognitionswissenschaft, 7, p.45-52.
Daniel, M. P. & Denis, M. (2004). The production of route directions: Investigating conditions that favour conciseness in spatial discourse. Applied Cognitive Psychology, 18, p.57-75.
Fern, X., Komireddy, C., Grigoreanu, V. & Burnett, M. (2010). Mining problem-solving strategies from HCI data. ACM Transactions on Computer-Human Interaction, 17(1), p.1-22.
Filipi, A. & Wales, R. (2009). Situated analysis of what prompts shift in the motion verbs come and go in a map task. In Coventry, K.R., Tenbrink, T. & Bateman, J.A. (eds.), Spatial Language and Dialogue. Oxford: Oxford University Press, p.56-70.
Fischer, K. (2007). The Role of Users' Concepts of the Robot in Human-Robot Spatial Instruction. In Barkowsky, T., Knauff, M., Ligozat, G. & Montello, D.R.
(eds.), Spatial Cognition V: Reasoning, Action, Interaction, Lecture Notes in Computer Science. Berlin: Springer, p.76-89.
Forlizzi, J., Barley, W. C. & Seder, T. (2010). Where should I turn?: Moving from individual to collaborative navigation strategies to inform the interaction design of future navigation systems. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10-15, 2010). CHI '10. ACM, New York, NY, p.1261-1270. DOI= http://doi.acm.org/10.1145/1753326.1753516.
Gabsdil, M. (2003). Clarification in spoken dialogue systems. In Proceedings of the 2003 AAAI Spring Symposium Workshop on Natural Language Generation in Spoken and Written Dialogue. Stanford, USA.
Garrod, S. & Anderson, A. (1987). Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27, p.181-218.
Golding, J. M., Graesser, A. C. & Hauselt, J. (1996). The process of answering direction-giving questions when someone is lost on a university campus: The role of pragmatics. Applied Cognitive Psychology, 10, p.23-29.
Hirst, G., McRoy, S., Heeman, P., Edmonds, P. & Horton, D. (1994). Repairing conversational misunderstandings and nonunderstandings. Speech Communication, 15(3-4), p.213-229.
Jonsson, I., Harris, H. & Nass, C. (2008). How accurate must an in-car information system be?: Consequences of accurate and inaccurate information in cars. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems (Florence, Italy, April 05-10, 2008). CHI '08. ACM, New York, NY, p.1665-1674. DOI= http://doi.acm.org/10.1145/1357054.1357315.
Lawton, C. A. (1994). Gender differences in wayfinding strategies: Relationship to spatial ability and spatial anxiety. Sex Roles, 30, p.765-779.
Michel, D. (1997). The description of routes: A cognitive approach to the production of spatial discourse. Current Psychology of Cognition, 16(4), p.409-458.
Michon, P. E. & Denis, M. (2001). When and why are visual landmarks used in giving directions? In Montello, D.R. (ed.), Spatial Information Theory. Berlin: Springer, p.400-414.
Montello, D. R., Lovelace, K. L., Golledge, R. G. & Self, C. M. (1999). Sex-related differences and similarities in geographic and environmental spatial abilities. Annals of the Association of American Geographers, 89(3), p.515-534.
Moratz, R. & Fischer, K. (2000). Cognitively adequate modelling of spatial reference in human-robot interaction. In 12th IEEE International Conference on Tools with Artificial Intelligence, Vancouver, British Columbia, Canada, 13-15 November.
Moratz, R., Fischer, K. & Tenbrink, T. (2001). Cognitive modeling of spatial reference for human-robot interaction. International Journal on Artificial Intelligence Tools, 10(4), p.589-611.
Muller, P. & Prévot, L. (2009). Grounding information in route explanation dialogues. In Coventry, K.R., Tenbrink, T. & Bateman, J.A. (eds.), Spatial Language and Dialogue. Oxford: Oxford University Press, p.166-176.
Newcombe, N., Bandura, M. M. & Taylor, D. G. (1983). Sex differences in spatial ability and spatial activities. Sex Roles, 9, p.377-386.
Pickering, M. & Garrod, S. (2004). The interactive alignment model. Behavioural and Brain Sciences, 27(2), p.169-189.
Sacks, H., Schegloff, E. A. & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50, p.696-735.
Schober, M. F. (1993). Spatial perspective-taking in conversation. Cognition, 47, p.1-24.
Schober, M. F.
(2009). Spatial dialogue between partners with mismatched abilities. In Coventry, K.R., Tenbrink, T. & Bateman, J.A. (eds.), Spatial Language and Dialogue. Oxford: Oxford University Press, p.23-39.
Tenbrink, T. & Hui, S. (2007). Negotiating spatial goals with a wheelchair. In Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, Antwerp, Belgium, 1-2 September, p.103-110.
Vanetti, E. J. & Allen, G. L. (1988). Communicating environmental knowledge: The impact of verbal and spatial abilities on the production and comprehension of route directions. Environment and Behavior, 20, p.667-682.
Ward, S. L., Newcombe, N. & Overton, W. F. (1986). Turn left at the church, or three miles north: A study of direction giving and sex differences. Environment and Behavior, 18(2), p.192-213.

APPENDIX A: TWO-WAY ANALYSIS OF VARIANCE
This section presents the results of the two-way ANOVA, performed alongside the one-way ANOVA presented in the main body of the paper, which examined the effect of gender and role of participant on the three dependent variables: (i) time taken per task; (ii) number of instructions per task; and (iii) miscommunication per task. The between-participants factors were: (i) user gender (female users vs male users); and (ii) 'robot' gender (female 'robots' vs male 'robots'). As the data and analysis that follow show, the two-way ANOVA revealed a significant interaction effect for the instruction variable and a marginally significant interaction effect for the miscommunication variable (two of the three variables in the experiment). There was also a main effect of 'robot' gender for the third variable, time per task (male 'robots' were faster than female 'robots'). However, when interaction effects are significant, the next step is to examine the simple effects; this was done by inspecting error bar charts. It became apparent from the plots that only the groups in which male 'robots' were paired with female users differed significantly from the other groups. This meant that the results of the one-way ANOVA reported in the main body of the paper were replicated by the two-way ANOVA for all three variables. The findings and analysis in relation to each of the three dependent variables will now be presented. The raw data are also provided at the end of the appendix (see Table XXI).

A.1 Time per Task
The two-way analysis of variance revealed a main effect of 'robot' gender (F(1,24) = 9.225, p = 0.006), which indicated that the mean time per task was significantly lower for male 'robots' (M = 340.3 seconds per task, SD = 73.66) than for female 'robots' (M = 415.23 seconds per task, SD = 64.11). The main effect of user gender and the user gender X 'robot' gender interaction were not significant. This suggests that only 'robot' gender was related to completion time. The summary analysis is given in Table XI and the detailed two-way ANOVA table of between-subjects effects is given in Table XII.

Table XI: Mean time taken per task and Standard Deviations for all conditions.
User    Robot   Mean time per task   Std. Deviation   Number of pairings
F       F       424.9733             67.21813         5
F       M       306.1905             61.27250         7
F       Total   355.6833             86.20870         12
M       F       409.1542             65.99631         8
M       M       370.1458             73.85026         8
M       Total   389.6500             70.59377         16
Total   F       415.2385             64.11688         13
Total   M       340.3000             73.66592         15
Total   Total   375.0929             78.03486         28

Table XII: Time per task – two-way ANOVA table showing tests of between-subjects effects.
Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model   55150.243a                3    18383.414     4.038     .019
Intercept         3848314.805               1    3848314.805   845.283   .000
user              3908.349                  1    3908.349      .858      .363
robot             41996.727                 1    41996.727     9.225     .006
user * robot      10734.415                 1    10734.415     2.358     .138
Error             109264.636                24   4552.693
Total             4103865.120               28
Corrected Total   164414.879                27
a. R Squared = .335 (Adjusted R Squared = .252)

A.2 Number of Instructions per Task
The two-way ANOVA yielded a main effect of 'robot' gender (F(1,24) = 4.376, p = 0.047), such that female 'robots' (M = 10.15, SD = 3.5) required a higher number of instructions per task than male 'robots' (M = 8.24, SD = 2.79) (see Tables XIII and XIV). However, a significant interaction effect was also observed (F(1,24) = 5.195, p = 0.032), revealing large differences between FM pairs (M = 7.38 instructions per task, SD = 2.26) and FF pairs (M = 12.34, SD = 4.08). This result suggests that FM pairs required fewer instructions to complete a task than FF pairs (see Table XV). The presence of the interaction effect qualifies the main effect, as there was no effect of 'robot' gender when the users were male. The interaction was further investigated with t-tests (see Table XVI), which confirmed that the number of instructions that female users produced depended on the gender of the addressee (t(10) = 2.714, p = 0.022).

Table XIII: Mean number of instructions per task and Standard Deviations for all conditions.
User    Robot   Mean        Std. Deviation   Number of Pairs
F       F       12.340000   4.0802642        5
F       M       7.380952    2.2642845        7
F       Total   9.447222    3.9206127        12
M       F       8.787500    2.4542763        8
M       M       9.000000    3.1370798        8
M       Total   8.893750    2.7231577        16
Total   F       10.153846   3.5070178        13
Total   M       8.244444    2.7958775        15
Total   Total   9.130952    3.2341787        28

Table XIV: Number of instructions per task – two-way ANOVA table showing tests of between-subjects effects.
Source            Type III Sum of Squares   df   Mean Square   F         Sig.
Corrected Model   74.008a                   3    24.669        2.841     .059
Intercept         2373.057                  1    2373.057      273.277   .000
user              6.305                     1    6.305         .726      .403
robot             38.002                    1    38.002        4.376     .047
user * robot      45.112                    1    45.112        5.195     .032
Error             208.409                   24   8.684
Total             2616.898                  28
Corrected Total   282.418                   27
a. R Squared = .262 (Adjusted R Squared = .170)

Table XV: Mean number of instructions per task and Standard Deviations for male and female 'robot' interactions in the female user/instructor condition.
Robot   N   Mean        Std. Deviation   Std. Error Mean
F       5   12.340000   4.0802642        1.8247496
M       7   7.380952    2.2642845        .8558191

Table XVI: T-test table showing the analysis of simple effects to determine the differences between female 'robots' and male 'robots' in the female user/instructor condition.
Instructions, equal variances assumed: Levene's F = 1.480, Sig. = .252; t = 2.714, df = 10, Sig. (2-tailed) = .022, Mean Difference = 4.9590476, Std. Error Difference = 1.8269987, 95% CI of the Difference [.8882408, 9.0298545].
Instructions, equal variances not assumed: t = 2.460, df = 5.767, Sig. (2-tailed) = .051, Mean Difference = 4.9590476, Std. Error Difference = 2.0154745, 95% CI of the Difference [-.0212916, 9.9393868].

A.3 Number of Miscommunication Instances per Task
There was a main effect of 'robot' gender (F(1,24) = 3.933, p = 0.059), suggesting that female 'robots' (M = 1.6, SD = 1.08) were more prone to miscommunication than male 'robots' (M = 0.98, SD = 0.89) (see Tables XVII and XVIII). The user gender X 'robot' gender interaction was found to be marginally significant (F(1,24) = 3.209, p = 0.086) and showed differences between FM pairs (M = 0.78 errors/non-understandings per task, SD = 0.53) and FF pairs (M = 2.18, SD = 1.44) (see Table XIX).
Analyses of the simple effects using t-tests were performed to explore the interaction effect (see Table XX). The t-tests confirmed that the main effect should be interpreted cautiously, as male and female 'robots' were equally error-prone when paired with male users/instructors. On the other hand, female 'robots' paired with female users/instructors were roughly three times more likely to fail to understand or execute an instruction than male 'robots' in FM pairs (t(10) = 2.376, p = 0.039).

Table XVII: Mean number of miscommunications per task and Standard Deviations for all conditions.
User    Robot   Mean       Std. Deviation   N
F       F       2.180000   1.4436836        5
F       M       .785714    .5332837         7
F       Total   1.366667   1.1951924        12
M       F       1.237500   .6717514         8
M       M       1.166667   1.1270132        8
M       Total   1.202083   .8970296         16
Total   F       1.600000   1.0889172        13
Total   M       .988889    .8919985         15
Total   Total   1.272619   1.0177864        28

Table XVIII: Number of miscommunications per task – two-way ANOVA table showing tests of between-subjects effects.
Source            Type III Sum of Squares   df   Mean Square   F        Sig.
Corrected Model   5.876a                    3    1.959         2.128    .123
Intercept         48.638                    1    48.638        52.836   .000
user              .532                      1    .532          .578     .455
robot             3.621                     1    3.621         3.933    .059
user * robot      2.954                     1    2.954         3.209    .086
Error             22.093                    24   .921
Total             73.317                    28
Corrected Total   27.969                    27
a. R Squared = .210 (Adjusted R Squared = .111)

Table XIX: Mean number of miscommunications per task and Standard Deviations for male and female 'robot' interactions in the female user/instructor condition.
Robot   N   Mean       Std. Deviation   Std. Error Mean
F       5   2.180000   1.4436836        .6456349
M       7   .785714    .5332837         .2015623

Table XX: T-test table showing the analysis of simple effects to determine the differences between female 'robots' and male 'robots' in the female user/instructor condition.
Miscommunication, equal variances assumed: Levene's F = 3.814, Sig. = .079; t = 2.376, df = 10, Sig. (2-tailed) = .039, Mean Difference = 1.3942857, Std. Error Difference = .5868046, 95% CI of the Difference [.0868037, 2.7017678].
Miscommunication, equal variances not assumed: t = 2.061, df = 4.787, Sig. (2-tailed) = .097, Mean Difference = 1.3942857, Std. Error Difference = .6763666, 95% CI of the Difference [-.3678529, 3.1564243].

Table XXI: Raw Data (User and Robot coded 1 = Female, 2 = Male).
Case   User   Robot   Time per task (secs)   Instructions per task   Miscommunication instances per task
1      1      1       357        9.1667     0.5
2      1      1       425        18.8333    4.3333
3      1      1       376.67     11.5       2.6667
4      1      1       529.8      13.4       1.4
5      1      1       436.4      8.8        2
6      2      1       334.4      6          1.8
7      2      1       350.4      8.4        0.2
8      2      1       519.83     11         2
9      2      1       460.2      8.4        1.2
10     2      1       434.83     12.5       2
11     2      1       368.4      11         1.2
12     2      1       448.33     6.3333     0.5
13     2      1       356.83     6.6667     1
14     1      2       285.17     9.8333     0.6667
15     1      2       273.5      8.5        1
16     1      2       398.5      6          1
17     1      2       216.5      3.8333     0.3333
18     1      2       306.5      7          1.6667
19     1      2       371.33     10.1667    0.8333
20     1      2       291.83     6.3333     0
21     2      2       346        10         1.8
22     2      2       408        7          1
23     2      2       277.17     11         0.5
24     2      2       396.33     15.3333    3.6667
25     2      2       278.5      5          0.8333
26     2      2       422.33     7.6667     1
27     2      2       340.83     8          0.3333
28     2      2       492        8          0.2

APPENDIX B: NON-PARAMETRIC ANALYSIS
A Kruskal-Wallis one-way ANOVA (the non-parametric equivalent of the one-way ANOVA, which transforms the data to ranks before performing the analysis) was performed on the four groups (FF, FM, MF, MM). The Mann-Whitney test was used to perform pairwise comparisons. The analyses were performed for the three dependent variables: (i) time taken per task; (ii) number of instructions per task; and (iii) miscommunication per task. The raw data is provided in Table XXI.
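To make the procedure concrete, the sketch below illustrates how the Kruskal-Wallis and pairwise Mann-Whitney tests described above could be reproduced from the raw data in Table XXI using Python with pandas and scipy; the parametric two-way ANOVA of Appendix A could be reproduced analogously (for instance with statsmodels). This is not the analysis code used in the study, and the variable names and the abbreviated data frame are illustrative only.

```python
# Minimal sketch (not the authors' analysis code): non-parametric tests of Appendix B,
# assuming pandas and scipy are installed. Only a few rows of Table XXI are shown;
# fill in the remaining rows before running.
from itertools import combinations

import pandas as pd
from scipy.stats import kruskal, mannwhitneyu

# Columns: user gender, 'robot' gender and mean time per task in seconds (Table XXI).
data = pd.DataFrame(
    [("F", "F", 357.00), ("F", "F", 425.00), ("F", "M", 285.17), ("M", "M", 346.00)],
    columns=["user", "robot", "time_per_task"],
)
data["pair"] = data["user"] + data["robot"]  # e.g. "FM" = female user, male 'robot'

# Kruskal-Wallis test across the pair configurations (FF, FM, MF, MM).
groups = [g["time_per_task"].values for _, g in data.groupby("pair")]
h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_value:.3f}")

# Pairwise Mann-Whitney U tests between configurations.
for a, b in combinations(sorted(data["pair"].unique()), 2):
    u_stat, p = mannwhitneyu(
        data.loc[data["pair"] == a, "time_per_task"],
        data.loc[data["pair"] == b, "time_per_task"],
        alternative="two-sided",
    )
    print(f"{a} vs {b}: U = {u_stat:.3f}, p = {p:.3f}")
```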
The results of the Kruskal-Wallis test were along the same lines as the parametric ANOVAs: there were significant differences for the variable time per task (p = 0.035) and marginally significant differences for instructions per task (p = 0.069) and miscommunication (p = 0.080). The 'elevated' p values were anticipated since non-parametric tests are less powerful than parametric tests. The previous results in terms of pairwise differences were also supported by the Mann-Whitney test. The findings and analysis in relation to each of the three dependent variables will now be presented.

B.1 Time Taken per Task
The Kruskal-Wallis one-way ANOVA identified significant differences for the time taken per task between the different pair configurations (χ² = 8.629, p = 0.035), with mean ranks of 20.20 for FF pairs, 7.71 for FM pairs, 17.88 for MF pairs and 13.50 for MM pairs (see Table XXII). As such, FF pairs had the longest completion times whereas FM pairs had the shortest.

Table XXII: Output of the Kruskal-Wallis test.
Pair Configuration   N   Mean Rank
FF                   5   20.20
FM                   7   7.71
MF                   8   17.88
MM                   8   13.50
Time: Chi-Square = 8.629, df = 3, Asymp. Sig. = .035

Pairwise comparisons using the Mann-Whitney test revealed significant differences between FM and FF pairs (p = 0.019, U = 3.000, z = -2.355) and between FM and MF pairs (p = 0.021, U = 8.000, z = -2.315), suggesting that pairs consisting of female users/instructors and male 'robots'/followers were associated with faster completion times (see Tables XXIII and XXIV).

Table XXIII: Output of the Mann-Whitney test for FM and FF pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    9.40        47.00
FM                   7    4.43        31.00
Total                12
Time: Mann-Whitney U = 3.000, Wilcoxon W = 31.000, Z = -2.355, Asymp. Sig. (2-tailed) = .019, Exact Sig. [2*(1-tailed Sig.)] = .018a

Table XXIV: Output of the Mann-Whitney test for FM and MF pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FM                   7    5.14        36.00
MF                   8    10.50       84.00
Total                15
Time: Mann-Whitney U = 8.000, Wilcoxon W = 36.000, Z = -2.315, Asymp. Sig. (2-tailed) = .021, Exact Sig. [2*(1-tailed Sig.)] = .021a

B.2 Number of Instructions per Task
The Kruskal-Wallis one-way ANOVA showed a marginally significant difference between the pair configurations (χ² = 7.105, p = 0.069). The mean ranks were 22.00 for FF pairs, 10.21 for FM pairs, 14.00 for MF pairs and 12.07 for MM pairs (see Table XXV).

Table XXV: Output of the Kruskal-Wallis test.
Pair Configuration   N   Mean Rank
FF                   5   22.00
FM                   7   10.21
MF                   8   14.00
MM                   7   12.07
Total                27
Instructions: Chi-Square = 7.105, df = 3, Asymp. Sig. = .069

Pairwise comparisons revealed differences between FF and the FM and MM configurations (p = 0.028 and U = 4.000 for both comparisons; z = -2.192 and -2.196, respectively) and marginal differences between FF and MF (p = 0.056, U = 7.000, z = -1.908). These results suggest that the combination of gender and role influences the number of instructions necessary to complete the task (see Tables XXVI to XXVIII). An outlier was detected in the MM group (case 24) and removed prior to the analysis.

Table XXVI: Output of the Mann-Whitney test for FF and FM pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    9.20        46.00
FM                   7    4.57        32.00
Total                12
Instructions: Mann-Whitney U = 4.000, Wilcoxon W = 32.000, Z = -2.192, Asymp. Sig. (2-tailed) = .028, Exact Sig. [2*(1-tailed Sig.)] = .030a

Table XXVII: Output of the Mann-Whitney test for FF and MF pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    9.60        48.00
MF                   8    5.38        43.00
Total                13
Instructions: Mann-Whitney U = 7.000, Wilcoxon W = 43.000, Z = -1.908, Asymp. Sig. (2-tailed) = .056, Exact Sig. [2*(1-tailed Sig.)] = .065a
Table XXVIII: Output of the Mann-Whitney test for FF and MM pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    9.20        46.00
MM                   7    4.57        32.00
Total                12
Instructions: Mann-Whitney U = 4.000, Wilcoxon W = 32.000, Z = -2.196, Asymp. Sig. (2-tailed) = .028, Exact Sig. [2*(1-tailed Sig.)] = .030a

B.3 Number of Miscommunication Instances per Task
The Kruskal-Wallis one-way ANOVA yielded χ² = 6.756 with an associated probability value of p = 0.080. The groups differed marginally on the miscommunication measure, with FF pairs having the most misunderstanding problems (mean ranks were 20.60 for FF pairs, 10.43 for FM pairs, 16.00 for MF pairs and 10.57 for MM pairs) (see Table XXIX). The results of this Kruskal-Wallis analysis were again in line with the parametric ANOVAs reported in the main body of the paper.

Table XXIX: Output of the Kruskal-Wallis test.
Pair Configuration   N   Mean Rank
FF                   5   20.60
FM                   7   10.43
MF                   8   16.00
MM                   7   10.57
Total                27
Miscommunication: Chi-Square = 6.756, df = 3, Asymp. Sig. = .080

The Mann-Whitney test revealed marginally significant differences between FF and FM pairs (p = 0.061, U = 6.000, z = -1.871) and between FF and MM pairs (p = 0.051, U = 5.500, z = -1.956), suggesting that role and gender had an impact on the frequency of miscommunication (see Tables XXX and XXXI). An outlier was detected in the MM group (case 24) and removed prior to the analysis. The other results related to pairwise differences in miscommunication reported in the main body of the paper were also supported by the Mann-Whitney test.

Table XXX: Output of the Mann-Whitney test for FF and FM pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    8.80        44.00
FM                   7    4.86        34.00
Total                12
Miscommunication: Mann-Whitney U = 6.000, Wilcoxon W = 34.000, Z = -1.871, Asymp. Sig. (2-tailed) = .061, Exact Sig. [2*(1-tailed Sig.)] = .073a

Table XXXI: Output of the Mann-Whitney test for FF and MM pair configurations.
Pair Configuration   N    Mean Rank   Sum of Ranks
FF                   5    8.90        44.50
MM                   7    4.79        33.50
Total                12
Miscommunication: Mann-Whitney U = 5.500, Wilcoxon W = 33.500, Z = -1.956, Asymp. Sig. (2-tailed) = .051, Exact Sig. [2*(1-tailed Sig.)] = .048a

PRIOR PUBLICATION STATEMENT
Koulouri and Lauria's most closely related prior papers (or concurrently submitted papers) have focused on how users interact with 'robots' in a navigation task (studying the management of miscommunication, use of spatial descriptions, linguistic resources, experimental design methodologies, etc.). Though in the same domain as this work, this submission to TOCHI has a very different focus that has not been examined in any of their other papers: the submission's unique contribution is the analysis of gender differences in spatial navigation dialogues to investigate HCI/HRI (Human-Computer and Human-Robot Interaction) for route instructions. Chen and Macredie have published papers that analyse interaction/use patterns based on a number of individual differences (including gender) in a range of domain areas (most notably Hypermedia Learning Systems), but not in the analysis of route instructions/route navigation. There is therefore no direct overlap between this submission to TOCHI and any of their published or concurrently submitted papers.