Modeling the Effect of Trend Information on Human Failure Detection and Diagnosis in Spacecraft Systems

by Rachel L. Owen

B.S. Space Systems Engineering, United States Air Force Academy, Colorado Springs, CO, 2008

Submitted to the Department of Aeronautics and Astronautics in partial fulfillment of the requirements for the degree of Master of Science in Aeronautics and Astronautics at the Massachusetts Institute of Technology, June 2010.

©2010 Massachusetts Institute of Technology. All rights reserved.

Author: Department of Aeronautics and Astronautics, May 21, 2010
Certified by: R. John Hansman, Professor of Aeronautics and Astronautics, Thesis Supervisor
Certified by: Lauren J. Kessler, Principal Technical Staff, Charles Stark Draper Laboratory, Thesis Supervisor
Accepted by: Eytan H. Modiano, Associate Professor of Aeronautics and Astronautics, Chair, Committee on Graduate Students

The views expressed in this thesis are those of the author and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government.

Modeling the Effect of Trend Information on Human Failure Detection and Diagnosis in Spacecraft Systems

by Rachel L. Owen

Submitted to the Department of Aeronautics and Astronautics on May 21, 2010 in partial fulfillment of the requirements for the degree of Master of Science in Aeronautics and Astronautics.

Abstract

Systems are performing increasingly complicated tasks, made possible by significant advances in hardware and software technology. This task complexity is reflected in the system design, with a corresponding increased demand on comprehensive design efforts. Fundamental to the safety and mission success of these systems are the tradeoffs between human tasking and system tasking, and the resulting human interface. The research presented in this thesis was motivated by the development of an early-stage system design tool. This tool includes models of human decision making in order to evaluate system design tradeoffs with regard to human performance. An experiment was conducted to evaluate the effect of trend information displays on human decision making performance. Decision latency and accuracy were examined as performance metrics. To elicit information about the subjects' decision making process, the Lens model was used to gather metrics on achievement and decision consistency. The experimental results showed that both detection latency and diagnosis accuracy improved when trend information about dynamic system parameters was explicitly available to operators of spacecraft systems. The presence of this additional information also improved decision consistency. However, it made no significant difference to subjects' detection accuracy, diagnosis latency, or achievement. Other predictors of latency and accuracy included the type of failure and the spacecraft trajectory. This was expected, as both of these factors are important contributors to an operator's mental model of normal system behavior, which is critical to detecting and identifying failures.
From these results, it can be concluded that operators of spacecraft systems could benefit from the inclusion of trend information, since it improves failure detection and diagnosis performance, which in turn can improve overall mission safety and success.

Acknowledgments

This thesis was prepared at the Charles Stark Draper Laboratory under an internal research and development program. Publication of this thesis does not constitute approval by Draper Laboratory of the findings or conclusions contained herein. It is published for the stimulation and exchange of ideas.

Many special thanks to Lauren Kessler for being an excellent mentor and friend in my time at Draper and for offering me a wide range of interesting opportunities over the last two years. The balance of guidance and independence and her unwavering confidence in my ability during the course of this project have been incredibly rewarding for me. I am so grateful to have had the opportunity to work under her supervision.

Thanks to Professor John Hansman of MIT for his consistent advice and immensely valuable outside perspective on this project. His critical mix of challenging technical and theoretical questions and suggestions for alternatives were instrumental in shaping this project and thesis.

Thanks to Kevin Duda for welcoming me onto his project team at Draper and sharing his considerable expertise in human factors engineering. He has provided me with invaluable support through all stages of the project and thesis preparation, going above and beyond to help me prepare a quality product.

Special thanks to many other members of the technical staff at Draper Laboratory, including Brian Collins, Nicholas Borer, and John Irvine, for their invaluable help with the complex mathematics, modeling, and statistical analysis required for this project. Their open doors and sacrifices of time and resources were essential to my learning.

To Joel Cohen, friend and outstanding computer programmer, who made possible the coding of my experimental interface in an incredibly short amount of time. Without his generous, faithful support, this project never would have been completed on schedule.

Finally, to my parents, brother, and friends for their constant love, encouragement, and unwavering faith in me.

Rachel L. Owen

Table of Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
2 Background
2.1 Human Information Processing
2.1.1 Memory
2.1.2 Situation Awareness
2.1.3 Decision Making Heuristics
2.2 Human Decision Making
2.2.1 Decision Making Performance
3 Lens Model
3.1 History and Formulation
3.1.1 Limitations of Correlation and Skill Score
3.2 Previous Applications
3.3 Limitations
3.4 Implementation and Data Post Processing
4 Focus Area
4.1 Failure Types
4.2 Failure Detection
4.3 Failure Diagnosis
5 Methodology
5.1 Motivation
5.2 Research Questions and Hypotheses
5.2.1 Research Questions
5.2.2 Hypotheses
5.3 Experiment Overview
5.3.1 Background
5.3.2 Task
5.3.3 Subjects
5.4 Procedure
5.4.1 Apparatus and Graphical User Interface
5.4.2 Data Collection
6 Data Analysis and Results
6.1 Purpose for Analysis
6.2 Analysis and Results
6.2.1 Latency Data
6.2.2 Accuracy Data
6.3 Lens Model Achievement and Consistency Data
6.3.1 Detection Achievement and Consistency
6.3.2 Diagnosis Achievement and Consistency
6.4 Detection False Alarm Data
6.5 Discussion
6.5.1 Detection
6.5.2 Diagnosis
6.5.3 Lens Model Parameters
7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
Appendix A: Lens Model Implementation Validation
    Naval Aircraft Identification Example
Appendix B: Graphical User Interface Details
B.1 User Interface Development
B.2 Information Displayed
B.3 Variable Definitions
Appendix C: Experimental Consent Forms and Training Materials
Appendix D: Demographic Survey and Results
Appendix E: Detailed Analysis Results Tables and Plots
E.1 Detection Latency
E.2 Detection Accuracy
E.3 Diagnosis Latency
E.4 Diagnosis Accuracy
E.5 Detection Achievement and Judgment Consistency and False Alarms
E.6 Diagnosis Achievement and Judgment Consistency
Appendix F: Linear Multiple Regression Overview
References

List of Figures

Figure 1: Wickens' model of the human information processing cycle [14]
Figure 2: Endsley's model of human situation awareness
Figure 3: Representations of memory heuristics and where they act on the decision making process
Figure 4: A visual representation of Brunswik's Lens model
Figure 5: Diagram of skill score and its components
Figure 6: Diagram of the three lunar trajectories selected for the experiment, highlighting which portion of the trajectory is incorporated in the experimental task
Figure 7: Square configuration of RCS thruster pods around the lunar lander (left); full lunar lander with axis definition and close-up of an RCS thruster pod (right)
Figure 8: Conceptual schematic of the RCS, including fuel tanks, thrusters, and fuel pumps
Figure 9: Display presented to the subject at the start of each scenario
Figure 10: Failure button to be selected when subjects detected a failure
Figure 11: Failure identification panel where a subject selected a button to diagnose a failure after detection
Figure 12: Trend information display shown to each subject during half the scenarios
Figure 13: Primary flight display shown to subjects for every scenario
Figure 14: Trend display at scenario completion
Figure 15: The area of a deviation is the integral between the normal trajectory signature (top curve) and the failure case (bottom curve) from the scenario start to the time of decision, indicated by the white line
Figure 16: Data suggesting that trend information has an effect on detection latency
Figure 17: Data suggesting that trend information has no significant effect on diagnosis latency
Figure 18: Detection accuracy results showing no significant effect for trend information
Figure 19: Results indicating a significant effect on decision accuracy for order
Figure 20: Results showing the effect of failure type on detection accuracy
Figure 21: Diagnosis accuracy data showing a significant effect for trend information
Figure 22: Results showing the effect of failure type on diagnosis accuracy
Figure 23: Detection achievement data showing no effect for trend information
Figure 24: Diagnosis achievement data suggesting a possible effect for trend information
Figure 25: Results suggesting a trend towards increasing numbers of false alarms with trend information
Figure 26: Results showing the combined effects of trend information, approach, and failure type on detection latency
Figure 27: Results suggesting an increase in the number of false alarms with trend information present
Figure 28: Results of the combined effects of trend information, approach, and failure type on diagnosis latency
Figure 29: Primary flight display shown to all subjects for each trial
Figure 30: Trend information display with the same system parameters represented, shown to subjects for half of the trials
Figure 31: Example of a multiple regression plot with regression model trend line
Figure 32: Example normal histogram with distribution curve
Figure 33: Plot of residuals vs. the independent variable for a data set that shows a likely linear relationship
Figure 34: Example spread and level plot for determining equal variance

List of Tables

Table 1: Experimental test matrix for all three trajectory types
Table 2: Factor effects for detection accuracy data (significance: p < 0.05)
Table 3: Factor effects for diagnosis accuracy data (significance: p < 0.05)
Table 4: Effects of trend information on Lens model output parameters for detection (significance: p < 0.05)
Table 5: Effects of trend information on Lens model output parameters for diagnosis (significance: p < 0.05)
Table 6: Beta weights calculated for the synthetic validation case and compared to Bisantz, et al. (2000)

1 Introduction

1.1 Background and Motivation

Advances in hardware and software have created an opportunity to build systems that are capable of increasingly complicated tasks. This opportunity is accompanied by a corresponding increase in system complexity, making the design effort particularly demanding. As part of this effort, engineers are tasked to balance the demands of four drivers: reliability, safety, cost, and performance [1]. Because these drivers have aspects that are often mutually exclusive, engineers are forced to trade one characteristic for another. Evaluating these tradeoffs becomes more difficult as the system complexity increases. The complexity increases even further when human interaction coupled with high levels of automation is added to the design equation [2].
Even though humans have unique ways of processing information, and are influenced by factors such as stress, fatigue, and hunger, there are still tasks for which humans are better suited than automation. This, in turn, makes the design of the human-system interaction critical to both mission safety and success [3]. Exploring these design challenges was the driving motivation for this thesis, focused on the area of human-system interaction.

While systems can now be engineered to be highly reliable for many known conditions, the noisiness of the real world still introduces complications. Real operating environments introduce unplanned variations in operating conditions, not to mention the inherent difficulty of predicting future conditions. Real systems must cope with unexpected behavior of system components or human operators, and with system malfunctions. All of these inconsistencies mean that there will always be a set of conditions under which the system will fail or automation will reach an incorrect decision [3]. In order for the mission to continue, these failure conditions must be compensated for either by the human or by the system [11]. The possibility of failure makes human failure detection and diagnosis critical functions of human operators in both manual control and monitoring situations [4, 12]. Poor failure detection and resolution can result not only in loss of productivity but also in loss of expensive equipment and, ultimately, human lives [10]. Ensuring good failure detection and diagnosis requires understanding and evaluating the operator's decision making process.

The area of failure detection and diagnosis was further refined in this research to investigate how the addition of trend information displays for dynamic system parameters affects these decisions. Trend information on system parameters eliminates extra cognitive workload for operators who would normally be required to store this information in memory. This should allow operators to make faster, more accurate, and more consistent decisions, three key decision making performance metrics. A model of human decision making was selected from current literature to aid in examining operator decision performance in a human subject experiment [4-9]. Then, an experiment was conducted to evaluate the effect of trend information specifically in failure detection and diagnosis tasks. Improving performance in these tasks is an important issue in the design of complex systems because good failure detection and diagnosis are prerequisites for increased reliability, safety, and system availability, as well as for enhanced system performance and reduced cost [7, 10].

1.2 Research Objectives

The goal of this thesis was to select and evaluate a model of human decision making from past literature and to apply this model in a dynamic environment to investigate the effect of trend information on failure detection and diagnosis tasks. In order to select an appropriate decision making model, several basic requirements were established. These requirements were set in the hope that the model could be useful in future human-system modeling work. The requirements included:

* The model must be applicable and expandable to a variety of different decision types and task domains.
" It must be capable of switching between different human mental models and system task models in order to allow modeling of full mission phases with different actions and tasks * The model must highlight human decision making performance in both nominal and failure conditions The Lens model was selected and implemented as the model that best fulfilled these requirements. The Lens model was originally developed for the study of perceptual psychology and is unique in its ability to compare the decisions of a human judge to the true condition in the corresponding environment through a variety of correlations coefficients which are the outputs of the model[5]. Once selected and implemented, the actual implementation needed to be validated to ensure that it had been implemented accurately. Secondly, the model needed to be calibrated using actual human decision data to prove that it was a legitimate choice that satisfied the basic requirements. The validation of the model in the context of a dynamic complex system was done via human subject experimentation. The experiment was designed to investigate the possible human performance benefits of explicitly displayed trend information of system parameters for failure detection and diagnosis. Performance metrics included the speed and accuracy for each of these decisions. The Lens model was also used to post process data and two model parameters were also selected as metrics of human decision making performance. The speed and accuracy results have implications for designers as they attempt to build safe and functional complex systems. The model parameter results are useful, both to calibrate the model for future use, and to illustrate the model's use for two different decision types in a dynamic complex system domain. Since most realistic decision making environments are dynamic, it was important that the model be able to represent dynamic system behavior. However, the Lens model is primarily a static model that has not been used to incorporate dynamic information in its analysis of human decision making. Therefore, this research also extends the use of the Lens model in the area of dynamic decision making. While the Lens model has been previously applied in dynamic environments in literature[13], 17 dynamic information has not been directly inserted into the Lens model as a part of the human judgment strategy. This thesis seeks to represent dynamic trend information in the Lens model to get a more accurate representation of how this information and its dynamic characteristics affect the judgment performance of complex system operators. 1.3 * * * " " " * Thesis Organization Chapter 1, Introduction, outlines the motivation, research objectives, and provides an overview for this research. Chapter 2, Background, provides information on previous work in human performance modeling, situation awareness, judgment theory, and failure detection and diagnosis. Chapter 3, Lens Model, describes the selection and implementation of the Lens model, as well as its history and formulation. Chapter 4, Focus Area, describes the two specific decisions, failure detection and diagnosis, that this thesis is focused on. Chapter 5, Methodology, describes the procedures and design of a human performance experiment used to test the hypotheses of this research regarding the effect of trend information on failure detection and diagnosis. 
* Chapter 6, Data Analysis, Results and Discussion, details the analysis conducted on the data from the experiment, presents the results, and discusses interesting findings.
* Chapter 7, Conclusions and Future Work, reviews important conclusions drawn from the results and discusses directions for future work in this area.

2 Background

A key issue in complex system design involves understanding the way humans perceive and process system information and how they make decisions. While this thesis is focused on decision making activities, components of the other cognitive processes help to provide the foundation for decision making.

2.1 Human Information Processing

In order to examine human decision making, it is important to understand the human information processing cycle that it fits into. This cycle defines how humans interact with their environment as they perceive, process, and respond to environmental cues and stimuli. A basic framework of human-system interaction is provided in Figure 1 below.

Figure 1: Wickens' model of the human information processing cycle [14]

This model of human information processing includes the following stages: sensory processing, perception, attention, memory, and response. Although this model is an abstraction of the complex nature of human sensation and cognition, it provides a useful platform from which to approach human performance. This thesis examines the portion of the information processing cycle that begins with the outputs of perception and addresses the cognitive tasks of situation awareness and decision making. Perception is distinct from cognition in that cognition requires significantly more time, effort, and attention than perception alone, though the lines between these two processes can be blurred [15]. Cognitive processes are conscious activities which transform or retain information, are resource limited, and are highly vulnerable to disruption when attentional resources are diverted to another task. Situation awareness, which is achieved through perception and augmented by these cognitive transformations, is often the basis for appropriate response selection [15].

2.1.1 Memory

Memory is a foundational piece of the human information processing cycle. It is at work in virtually every stage of human cognition. Memory is both the place where information is stored long term and where information is integrated and transformed. There are generally thought to be two different types of memory: working memory and long-term memory. Working memory is the temporary, attention-demanding store that humans use to retain information for a short period of time until they are ready to use it or commit it to long-term storage. Working memory is also the "cognitive workbench" where we examine, evaluate, and compare different mental representations. It is where judgment and decision making take place. Working memory has limitations in both capacity and duration. Research has shown that when rehearsal is prevented, humans retain information for approximately 20 seconds. Working memory also has capacity limits which interact with time. The capacity of working memory is generally limited to 7 +/- 2 pieces of unique information. Decay occurs faster as more items are held in working memory, because rehearsal is not instantaneous.
Therefore, working memory tends to perform best when immediate rehearsal of information is possible and when the information is limited to approximately 5-9 pieces. This type of memory is significantly affected by stress, and is considered a fragile resource in the context of decision making because it can so easily be derailed by distractions [16]. Information can be encoded from working memory into the more permanent type of memory: long-term memory. Long-term memory stores facts about the world and how to do things that operators use repetitively. This is a more stable form of memory and is ideally where operators draw on their experience and pattern recognition skills to identify problems with a system or environmental conditions. However, it is largely developed through individual experience and is therefore not always reliable, particularly for novice operators or monitors who do not have the repetition necessary to have committed such information to long-term memory. As a result, the success or failure of human memory can have a major impact on the success and safety of a system.

2.1.2 Situation Awareness

Situation awareness (SA) is another metric for evaluating human performance and a critical component of human decision making [17]. In the information processing model, it generally falls somewhere on the border between perception and cognition. SA is described as an emergent property of the model, as opposed to a structural part of the model itself. It is the product and result of many different components and in practice is often treated as a "black box", where the internal workings are essentially unknown [18]. Endsley defines situation awareness as "the perception of elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future" [19]. Situation awareness is tasked with three things: recognizing and interpreting environmental cues, assessing present and future risk, and assessing the time available to make decisions [16, 20]. It is closely tied to understanding the interrelationships inside the system and the system's interaction with the environment or larger domain. Misunderstanding these relationships can result in adverse consequences. Therefore, situation awareness is critical to an operator's ability to adapt and function effectively in the environment. Endsley presents a model of situation awareness and its different levels in Figure 2 [19].

Figure 2: Endsley's model of human situation awareness

Situation awareness is a complex integration of a variety of information which begins with basic perception and ends with projection of environmental states and decision outcomes into the future. In order to plan or problem solve effectively in dynamic, changing environments, operators must have a relatively accurate awareness of the evolving situation, which is the role of situation awareness. It is a prerequisite for accurate decision making and response selection [15]. It is not purely a raw sense impression, but includes projections of future states and an integrated knowledge of the complete current picture [18]. Understanding which possible states of the world might be in effect forms the foundation of effective choice [15]. This awareness of the current system condition resides mostly in working memory, making the link between SA and working memory direct.
Therefore, awareness degrades as both memory and attentional resources are allocated to competing tasks. While accurate situation awareness depends on selective attention and memory, it is not the same thing. The process of maintaining situation awareness, supported by attention, working memory, and long-term memory, is not the same as the awareness itself. It is usually domain specific and is affected by the expertise of the operator. This means that good situation awareness must be developed for every operator and task individually, and it is critical to an operator's response in new situations. Supporting broad situation awareness has therefore been shown to be a critical component of decision making.

Situation awareness is measured using a variety of subjective, performance-based, and physiological techniques. Memory probe measurement is a common method which seeks to evaluate the contents of the user's working memory for a given task. This is generally done in the form of a questionnaire or evaluation designed to determine the overall picture that the operator had of a task situation at the current time. This is a widely used method and generally accepted as a viable measurement of SA [18]. The most common memory probe is known as the Situation Awareness Global Assessment Technique (SAGAT), which freezes a screen and queries subjects on their knowledge of the situation at given times during an experimental task. This is a knowledge-based measurement of situation awareness, which is better suited to providing a detailed theoretical assessment of an operator's awareness. However, performance-based measurements look directly at user response to realistic situations and are sometimes preferred, since knowledge-based measurement can only make inferences about what a subject would do given their current knowledge base. Some performance-based measures of situation awareness include workload, user confidence, latency, and accuracy [21].

2.1.2.1 Environmental Cues and Stimuli

The information that operators require to make good decisions and maintain SA is acquired through cues in the environment. Typically, cues are estimated initially through perception; information is then selected and integrated by attention and cognition, using long-term memory and working memory, to provide background and to update and revise different hypotheses about the exact nature of the situation. Initially, the decision maker is confronted with a series of cues, or sources of information, that bear on the true state of the world. Some or all of these cues are attended to, with the goal of using them to influence the operator's current belief in one of several alternative hypotheses about what is going on within the system. Each cue can be characterized by three properties:

- Diagnosticity: Represents how much evidence a cue can offer the decision maker regarding one or the other hypothesis. Cues may be high or low in diagnosticity and also have polarity, favoring one hypothesis over another. A perfectly diagnostic cue is a perfect predictor or indicator of a particular hypothesis and obviates the need for a decision.
- Reliability: Refers to the likelihood that the physical cue can be believed or that the information is accurate. The product of reliability and diagnosticity reflects the information value of a cue. The higher the information value of a cue, the more useful that cue is in evaluating decision alternatives.
- Physical features: Refer to the actual physical representation of the cue, such as its salience. These characteristics have a bearing on the attention and subsequent processing that any given cue receives.

Selective attention must be utilized in order to process a variety of different cues and give them a subjective weight associated with their predicted information value. First, all of the raw perceived information from any available cues must be integrated to form some sort of understanding of the current situation. Then, the operator incorporates any expectancies or prior beliefs from long-term memory and mental models, which bias one hypothesis over the others. Finally, the operator iteratively tests and retests that hypothesis in order to reach the final belief which forms the basis for choice [15]. This process can be interrupted by three vulnerabilities of human attention and cue integration that can prevent cues from effectively informing decision making:

- Missing informational cues: The operator does not have all the necessary information at hand. Good decision makers are often aware of what they do not know and seek out these missing cues.
- Information overload: Available cues are numerous. Beyond three distinct cues, people do not use the additional information to make more accurate decisions, because human operators deploy a selective filtering strategy on available information. This filtering can be time consuming and affect decision quality.
- Cue salience: How noticeable a piece of information is to the operator. This affects the attention and processing devoted to any particular cue. If one cue is highly salient, it may receive too much attention and be over-processed. For example, alerting cues (high salience) are not necessarily compatible with effective fault diagnosis, since salience should be related to the information value of a piece of information for full failure resolution, not just for detecting that a failure has occurred.

Once cues are processed and integrated, the operator must assign a weight or importance to each one to use it in their judgment strategy. Processed cues are often not weighted effectively based on their information value. More valuable cues would theoretically be weighted higher; however, this requires estimating the information value of all cues, and it requires less cognitive effort to simply weight all cues equally. Operators' subjective weighting tends to vary in an "all or none" manner, instead of as a linear function of the cue's correlation to the actual environment. When people are asked to estimate differences in the reliability of a set of cues, they can often do so accurately. However, when these estimates are to be used as part of a larger aggregate, the values become distorted. This is commonly known as the "as-if" heuristic: values are typically interpreted "as if" they were all equal in their predictive value. The relative insensitivity of humans to differences in the predictive validity or reliability of a cue implies that we are poorly equipped for performing tasks where diagnosis or prediction involves multiple cues with different information values. It has been suggested that humans should only be involved in identifying the relevant pieces of information for a task, and that machines should be tasked with the diagnosis decision [15].

2.1.3 Decision Making Heuristics

A current trend in the study of human judgment under uncertainty suggests that it is largely based on heuristic processes.
A heuristic is a mental shortcut involving a variety of psychological processes that help humans assess information without using probabilities in their reasoning [16]. Heuristics are useful because we often do not know the true probability of each possible outcome, and so we use tactics like similarity, past experience, and examples to guide our decisions and save time in dynamic situations. This leads to the view that actual human information processing is not as accurate as normative models of classical rational decision theory, which involve actual event probabilities. Heuristics explain our deviations from these optimal decision frameworks. They trade accuracy for speed and attempt to save time and cognitive resources by using shortcuts or samples of available information. However, some research has shown that well developed heuristics can be both fast and accurate. Under normative theory, the decision maker must select the course of action with the highest expected utility. Expected utility can be calculated by taking the value (or "cost") of each possible outcome of an action, multiplying it by the probability that the outcome will occur, and summing over the possible outcomes. Since humans naturally avoid using probabilities in their decision making, however, this expected utility is often difficult to determine, and normative decision making theory may not be the most accurate way to model real world human decision making [16]. Human problem solvers often select what they perceive to be the current best plan with no guarantee, or even probabilistic assurance, that it is the absolute best plan.

From an information processing standpoint, decisions represent a many-to-one mapping of information to responses. The simplest decisions are go/no-go scenarios, with more complexity entering the equation as the decision options become multiple choice. More complex decisions take more time and cognitive effort to go through the decision making process. A basic set of information processing activities and the biases and heuristics that commonly affect them is presented below. The information processing model lays out the steps of the decision making process as [22]:

1. Cue sampling
2. Situation assessment (diagnosis)
   a. Hypothesis generation
3. Choice
   a. Criterion setting
   b. Risk assessment
   c. Action generation
4. Action

Within each of these steps are specific heuristic processes that affect the output of that particular stage. Figure 3 shows a diagram of this process, with square "keys" representing the effect of different decision making biases and heuristics on each stage.

Figure 3: Representations of memory heuristics and where they act on the decision making process

Different memory heuristics affect each step of the decision making process. The first step, cue sampling, is subject to the salience bias (S). The salience bias is a human bias where people focus their attention on things that are highly salient or prominent and overlook more subtly presented, modest information in the environment [23]. Situation assessment is affected by the "as-if" heuristic (As) and the representativeness heuristic (R). The "as-if" heuristic means that operators weight all incoming information equally when making decisions, with no one piece of information being considered more important or reliable than another.
The representativeness heuristic biases operators to judge the probability of an outcome by how much it resembles available data. The inputs for situation assessment, like mental models, come from long-term memory and are affected by the availability heuristic (Av). The availability heuristic acts from long-term memory because this is where recollections of previous events, mental models, and patterns are stored. It influences the operator to judge the likelihood of a choice being optimal based on how quickly and easily they can recall situations to support that particular event [16, 24]. Other heuristics that affect the operator's decision making strategy originate from within the system. It has been suggested that automated systems introduce opportunities for new decision making heuristics and associated biases [25]. Automated systems introduce new, highly salient cues that supplant the operator's traditional reliance on assessing patterns or combinations of familiar cues. This bias, known as an automation bias, reduces situation awareness and leads to overreliance on cues provided by an automated system.

2.1.3.1 Operator Mental Models

The prevalence of heuristics in decision making is largely connected to the mental models that the operator has established about the specific task they are attempting to perform, or others similar to it. Humans form mental models about the tasks they are doing and the environment. These models help humans integrate and comprehend the meaning of an aggregate of different sources of state information. A mental model is a set of well-defined, highly organized yet dynamic knowledge structures developed over time from experience [26, 27]. These models are highly susceptible to heuristic biases and are generated from an operator's personal experience. They play a key role in operator decision making and are themselves a metric for assessing human performance. Accurate operator mental models are one of the foundations for achieving good situation awareness and subsequently making good decisions [28, 29]. Mental models are primarily developed through training and experience. This explains why novice operators have more difficulty making decisions and get overloaded with information more quickly [30]. In contrast, experienced decision makers assess and interpret the current situation (Level 1 and 2 SA) and select an appropriate action based on the different conceptual patterns stored in their long-term memory as mental models [31]. This stable source of information allows them to recognize specific situations more quickly and more consistently. Specific environmental cues activate these stored mental models in experienced operators, which then guide their decision making process. Mental models, therefore, reflect the user's experience with and understanding of a system and its environment. Accurate mental models are one representation of operator experience; they are helpful when other learned procedures fail, and they improve performance at novel tasks because they draw on similarities between the current situation and different stored models. One of the goals of training is therefore to give operators sufficient experience to create accurate mental models and facilitate efficient system operation and decision making.

2.2 Human Decision Making

Decision making is the act of choosing between alternatives under conditions of uncertainty [16].
Decision making has been studied extensively, and decision making research generally falls into one of three classes:

- Rational or normative decision making approaches are focused on how people should make decisions according to some optimal framework. This includes factoring both the value of an outcome and its subjective likelihood into the choice. This type of research looks at how and why humans depart from these optimal strategies.
- Cognitive or information processing approaches are focused on the biases and processes used in decision making related to limitations in attention, working memory, strategy, or familiar decision routines such as heuristics. This type of research examines the causes of these biases.
- Naturalistic decision making places great emphasis on how people make decisions in a realistic environment where they have expertise, and where the decisions are complex. This can be combined with other classes of research.

The decisions that are studied are typically classified according to several factors, including their level of complexity and their quality. The environment has the largest influence on the actual decision. In the context of flight planning, it has been shown that nearly half of the variability in pilots' problem solving behavior is due to environmental or domain features. The domain characterizes the type and complexity of the decision being made. Planning difficulty increases for tasks and environments where there are more choices available for action. Domains that generate good decision making, and those that are more likely to generate poor decision quality, have certain identifiable characteristics [15]. Domains that lend themselves to good decision making are dynamic, repetitive, and have feedback available to the operator. Humans tend to perform better in these environments when the decisions are about things and the problems are decomposable; examples include chess, medicine, accounting, and weather forecasting. Characteristics of poor decision making domains are just the opposite: they are static, more uncertain, and incorporate less feedback. The decisions may be about people or behavior and are not easily decomposable, as in clinical psychology, personnel selection, stock brokerage, and judicial decisions.

Aside from the domain, other key features of decision making include the amount of uncertainty in the decision, the expertise of the operator, and the time allotted to make the decision. Uncertainty is associated with the consequences of a decision and is also termed risk. Information is defined as the reduction of uncertainty. Before the occurrence of an event we are less sure of the true state of the world than after that event occurs; this reduction in uncertainty comes from the information gained from the event. In decision making, this evidence is particularly important when decisions have more than two outcomes, which is often the case in real decision making tasks. Providing the operator with as much useful information as possible in order to reduce uncertainty improves decision performance [8]. Expertise is a characteristic of the decision maker or operator that decreases the time they need to make a decision and helps alleviate time pressure. Time pressure can have a critical influence on the decision making process. The more time pressure the operator experiences, the more likely they are to use heuristic decision making methods and the more likely they are to make a rash, incorrect decision.
With time pressure, normal retrieval of relevant mental models is prevented; operators do not have time to adequately search their long-term memory. Operators are also influenced by stress and fatigue, which reduce both decision optimality and operator confidence.

Decision making can be studied from the standpoint of the actual decision or from the strategy employed to reach that decision. Judgment theory examines decision strategies using two classifications: judgments can be made using a compensatory strategy or a noncompensatory strategy. Compensatory strategies weigh both the good and bad attributes of a decision option and allow these strengths and weaknesses to "compensate" for one another. These strategies generally make better use of available information, although they struggle when the decision is being made under a high level of uncertainty. Noncompensatory judgment strategies do not consider a balance of the good and bad attributes of a choice and, as a result, may not consider all the information available to the decision maker. An example of a noncompensatory strategy is setting an absolute minimum for a certain characteristic: all alternatives that fall below that minimum are rejected as valid choices. This is a common practice in aviation and simplifies a complex decision situation by reducing the cognitive effort required to make the decision [16].

2.2.1 Decision Making Performance

2.2.1.1 Decision Quality

The quality of a decision is typically based on the outcome. This quality metric is retrospective and based on comparison to "expert" decisions. This is not necessarily an accurate measure of quality, since it has been shown that experts do not always make better decisions. It also assumes the expected value of the decision is common among all decision makers, which depends on assigning universally agreed upon values to the outcomes of a choice. Since this is not very realistic, it may be more important to look at the ways different environmental and information characteristics influence processing operations and outcomes. Decision quality is influenced by stress, which interrupts or exerts pressure on the decision maker and leads to a suboptimal decision making process, especially when the task involves a high level of spatial working memory [15]. Quality may be better determined by examining the process by which the decision was reached, and how consistently the operator applies this strategy, rather than the actual outcome.

2.2.1.2 Decision Performance

Evaluating an operator's decision making process or strategy is difficult because this information is challenging to elicit accurately. Humans are notoriously poor at accurately understanding and self-reporting cognitive activities. As a result, metrics available to examine decision making processes tend to be limited in their scope and effectiveness. While decision quality is an important consideration, other performance metrics are commonly used which are more objective in their definitions and more easily measured. These common metrics are decision latency and accuracy. Latency is the time it takes for an operator to make a decision. Accuracy is simply whether or not the decision was correct as compared to the actual environment. The ideal decision would be both fast and accurate; however, there is generally a tradeoff between these two metrics.
Pilots who are good decision makers tend to take longer to understand a situation or decision problem and acquire evidence, even though they select and execute actions more quickly [15]. Depending on the domain and the possible consequences, accuracy is generally considered the more important of the two metrics, although, in dynamic systems, excessive decision latency can also have an adverse effect on system performance, which must be taken into consideration.

3 Lens Model

The Lens model was selected as the human decision making model that both met the initial requirements and provided the greatest amount of descriptive information about the human judgment and its corresponding environment. The Lens model is traditionally a static model. For this work, it was expanded to incorporate dynamic information as part of the human judgment strategy, and implemented to post process experimental data about failure detection and diagnosis decisions in the context of a lunar landing scenario.

3.1 History and Formulation

The Lens model is part of a class of normative decision making models that was developed and proposed by Egon Brunswik, originally for the study of perceptual psychology. Brunswik believed that psychology should seek to provide descriptions of behavior that actually occurred rather than discover laws of behavior [32]. This led him to develop the theoretical foundations for the Lens model in the early 1950s. Ken Hammond was the first to apply and popularize Brunswik's work through application to judgment and decision making theory, where the model is still primarily used today. The model's foundations are based on the concepts of representative design and probabilistic functionalism [5].

Representative design is concerned with the selection and inclusion of stimulus conditions in a scientific study. The goal is to ensure that any stimulus to be tested is sampled from a representative population in the same way that subjects are sampled from a representative population. A representative population means that all facets of the population of interest that exist in the actual world are also represented in the study group. This is commonly practiced when selecting a subject pool, but experimenters typically present subjects with only a limited set of stimuli that is often unrepresentative of the true environment [32]. Brunswik believed that psychology generally focused too much on how subjects were sampled and not enough on ensuring that the environment was well represented in a variety of different contexts. The systematic design of experiments that is so common in the scientific community strives to control variables through factorial design. This allows the results to be generalized to other subjects and populations of subjects, but does not allow for generalization to other conditions or contexts within the same environment.

Probabilistic functionalism has two main premises:

1. Psychology should be concerned not just with the human organism but, more importantly, with the interrelations between the organism and its environment.
2. This relationship of organism to environment is based on uncertain or probabilistic relations among environmental variables. These variables are the proximal information, or cues, which are used to make inferences about the environment. These cues are not consistent and not always available.

This second point leads to Brunswik's theory of vicarious functioning, which is essentially the redundancy of cue information in the environment.
An organism may use any number of available means to achieve the end goal[33]. These theoretical concepts are all encompassed in the basic formulation of the Lens model, shown graphically in Figure 4.

Figure 4: A visual representation of Brunswik's Lens model, relating achievement (r_a), cue utilization, and the cues X_j

Several mathematical representations of the model have been developed; the formulation presented here is based on multiple regression and correlation techniques[34]. It begins on the left side of the model with the environmental variable, Y_e. Y_e is modeled as a linear function of a set of k cues, X_j, where j = 1, ..., k. Thus, the environment is modeled by the regression equation:

Y_e = \sum_{j=1}^{k} r_{e,j} X_j + \epsilon_e

where r_{e,j} is a vector of weights that represents the relative change of the different cues with changes in the environmental state, and \epsilon_e is the error term of the regression of Y_e on the cue values X_j. Similarly, the human judgment about the environment, denoted Y_s, is modeled in the same way, yielding:

Y_s = \sum_{j=1}^{k} r_{s,j} X_j + \epsilon_s

where r_{s,j} is a vector of weights that represents the relative change of the different cues that the human judge ascribes to the environment in their decision strategy, and \epsilon_s is the error term of the regression of Y_s on the cue values X_j.

The cue weights for the judgment, r_{s,j}, and the environment, r_{e,j}, cannot be compared to one another directly because they are in different units, depending on the cue from which each weight was generated. In order to compare cue weightings, and thereby understand the judgment strategy and the environmental prediction from a certain set of cues, the weights must be standardized. These standardized coefficients are called beta weights and are denoted \beta_{e,j} and \beta_{s,j}. They are relative weights that indicate the relative importance of one cue over another in the actual prediction of the environment and in the corresponding judgment strategy.

Given that the judgment variable and the environmental variable are based on the same set of observable cues present in the environment, a person's decisions should match the environment to the extent that the weights the judge assigns to the cues match those used in the actual model of the environment. This is equivalent to identifying the correlation between \beta_{e,j} and \beta_{s,j} for all j = 1, ..., k. The correlation between the judgment and the criterion, \rho_{Y_e Y_s}, is also called achievement and is denoted r_a. This value is the primary performance metric generated by the Lens model and is calculated from the Lens model equation[35, 36]:

r_a = G R_e R_s + C \sqrt{(1 - R_e^2)(1 - R_s^2)}

where:

r_a - Achievement: correlation between the judgment and the criterion
G - Knowledge: correlation between the linear models of the judge and the environment
R_e - Environmental predictability: multiple correlation between the environment and the cues
R_s - Consistency (the participant's consistency): multiple correlation between the cues and the judgment
C - Unmodeled knowledge: correlation between the residuals of both models

The primary values of interest in the model are the beta weights, which represent the cue influence in both the environment and the judgment strategy, and the achievement. The other Lens model parameters provide supplementary information about the performance of the model, the environment, and the performance of the judge individually. Other alternative performance measures have been defined from the achievement equation.
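Before turning to those alternative measures, the decomposition above can be made concrete with a short computational sketch. The following is illustrative only, not the thesis's MATLAB implementation described in Section 3.4; the function and variable names are assumptions.

    import numpy as np

    def lens_model_stats(cues, judgment, environment):
        """Sketch of the regression-based Lens model decomposition.

        cues        : (n, k) array of cue values X_j
        judgment    : (n,) array of human judgments Y_s
        environment : (n,) array of criterion values Y_e
        """
        X = np.column_stack([np.ones(len(cues)), cues])   # cues plus intercept

        # Fit the environment model and the judgment model on the same cues
        b_e, *_ = np.linalg.lstsq(X, environment, rcond=None)
        b_s, *_ = np.linalg.lstsq(X, judgment, rcond=None)
        Ye_hat, Ys_hat = X @ b_e, X @ b_s

        corr = lambda a, b: np.corrcoef(a, b)[0, 1]

        r_a = corr(judgment, environment)                  # achievement
        R_e = corr(environment, Ye_hat)                    # environmental predictability
        R_s = corr(judgment, Ys_hat)                       # judgment consistency
        G   = corr(Ye_hat, Ys_hat)                         # knowledge (matching)
        C   = corr(environment - Ye_hat, judgment - Ys_hat)  # unmodeled knowledge

        # The estimates satisfy r_a = G*R_e*R_s + C*sqrt((1 - R_e^2)(1 - R_s^2))
        return {"r_a": r_a, "G": G, "R_e": R_e, "R_s": R_s, "C": C}

The standardized beta weights used to compare the judgment strategy with the environment would then be obtained by rescaling the fitted coefficients by the cue and criterion standard deviations.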
For situations where the unmodeled knowledge C = 0, the human component of the achievement equation can be expressed independently as the product G R_s, named "linear cognitive ability". This captures both how well judges match the environment and how consistent they are in their strategies. Similarly, the quantity G R_e captures the validity of the judge's strategy if it were applied in a perfectly consistent manner, with R_s = 1. This is what would happen if the human judge were replaced by their regression model[34].

While regression is the most common formulation of the model, other formulations have been suggested to overcome the inherent limitations of regression analysis. The use of the Lens model in judgment theory was popularized by Ken Hammond, who is quoted as saying: "a [...] sin of commission on my part was to overemphasize the role of the multiple regression (MR) technique as a model for organizing information from multiple fallible indicators into a judgment. There is nothing within the framework of the lens model that demands that MR be the one and only model of that organizing process"[33]. Other methods that have been used to obtain these judgment weights include fast and frugal heuristics[37], non-compensatory formulations[38], and Bayesian methods[39].

3.1.1 Limitations of Correlation and Skill Score

Correlation is the basis for calculating all of the Lens model output parameters and is therefore an important piece of the formulation. The correct use and interpretation of correlation coefficients can significantly influence the interpretation of any Lens model results. Correlation is highly dependent on the assumptions made about the data being correlated and on the basic principles underlying this index of association. In situations where these assumptions are violated, the correlation coefficient yields little valuable insight into the relationship between the variables, in this case the human judgment and the corresponding environment[40]. It is unwise to assume that the correlation coefficient neatly captures the relationship between two variables in a single value; many different types of relationships can have the same correlation coefficient, and therefore the relationship should be examined more thoroughly. In order to accurately capture the relationships in a set of data, it is important that the data be sampled from a representative space so that the relationships within the whole population, and not just a segment of it, are captured[40]. Ensuring representative sampling is difficult, and its failure produces two primary biases in the calculation of correlation: scale errors and magnitude errors.

Regression bias: The regression bias is a scale error representing the degree to which the standard deviation of the human judgments is not scaled to the standard deviation of the environment under imperfect achievement. A regression bias is common in uncertain environments. This bias occurs when an operator does not appropriately regress their judgments towards the mean but is instead biased by case-specific information; they do not adaptively balance base-rate and case-specific information.

Base rate bias: A base rate bias is a magnitude error that exists when the mean of the distribution of the human judgment variable does not equal the mean of the distribution of the environment.
This occurs because a judge must reason and make predictions about a specific case while appropriately balancing case-specific information with base rate information about the whole population or environment. When this balance is not achieved, the judgment and the actual environment distributions are slightly offset, and the judge's decisions are consistently under- or over-estimated[33].

Since correlation analysis is so critical to most Lens model formulations, the limits of correlation have led to the creation of another performance metric called the skill score. The skill score originated in weather forecasting[41] and has been applied to human factors and situation awareness for judgment analysis[42, 43]. A diagram of the skill score as a measure of judgment quality is found in Figure 5.

Figure 5: Diagram of the skill score and its components

The skill score is a measure of "distance", the squared Euclidean distance, between data sets, rather than the shared shape of a distribution, which is how correlation is determined. It accounts for the regression bias and the base rate bias that are inherent in the calculation of correlation and that can make its interpretation more ambiguous. The equation for the skill score is:

SS = r_a^2 - (r_a - s_{judge}/s_{env})^2 - ((\bar{Y}_s - \bar{Y}_e)/s_{env})^2

where s_{judge} and s_{env} are the standard deviations of the judgment and the environment, and \bar{Y}_s and \bar{Y}_e are the means of the judgment and the environment[33]. High quality judgments have a high correlation between the judgment and the environment and low regression and base rate biases, which leads to a high skill score.

3.2 Previous Applications

The Lens model has been explored, enhanced, and developed for five decades. As a result, there is a wealth of previous research to draw upon in a wide variety of domains. According to a meta-analysis of Lens model studies done in 2007, the Lens model has been applied to at least 259 different task environments in 78 different papers[34]. This work synthesized the Lens model statistics presented above, found overall high levels of achievement, and noted that people can achieve similar decision performance in both noisy and predictable environments. One of its primary applications is cognitive engineering, and the Lens model has been applied in domains ranging from military command and control to music education, from laboratory experiments to field studies, including studies with an emphasis on learning. The Lens model is in a unique position to provide feedback for learning studies and to determine what the most effective form of feedback is for a particular task[44]. It has business applications in marketing, accounting, and engineering, and it continues to be used in social research areas such as the prevention of violent behavior, moods, and consciousness studies. This large body of previous work on the model and its extensions lends it credence for use in new domains. It has been used effectively on a wide variety of problems, which shows that, despite its limitations, it is a useful model for examining human judgment in an array of contexts. Few papers that contain Lens model data provide individual data, so assumptions cannot be made about variation within the different environments[34]. The model does, however, appear to be relatively robust to differences in both domain and task. The same meta-analysis provides generalized information from the aggregate of much of this past work.
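As a concrete illustration of the decomposition above, a skill score might be computed as in the following minimal sketch; this is not the thesis's MATLAB implementation, and the variable names are assumptions.

    import numpy as np

    def skill_score(judgment, environment):
        """Sketch of the skill score decomposition described above."""
        r_a = np.corrcoef(judgment, environment)[0, 1]     # achievement
        s_j, s_e = np.std(judgment), np.std(environment)
        regression_bias = (r_a - s_j / s_e) ** 2           # scale error
        base_rate_bias = ((np.mean(judgment) - np.mean(environment)) / s_e) ** 2  # magnitude error
        return r_a ** 2 - regression_bias - base_rate_bias

The two subtracted terms correspond directly to the regression bias and the base rate bias, so a judge with high achievement and small biases receives a skill score near r_a squared.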
3.3 Limitations

While the Lens model provides an interesting way to examine judgments and environments, it has four major limitations. First, it does not model a judge's adaptive use of cues not included in the model. This is true of most modeling efforts but still leaves open the question of whether the most appropriate cues have actually been included. Second, the Lens model does not inherently capture the judge's ability to adaptively respond to non-linearity in the environment and in the cue-criterion relationships. This means that any environmental variables that affect the criterion value in a nonlinear way will be represented in the quantity C, but the Lens model will not help determine how the operator might be using those nonlinear cues and their relationships to the other cues in the model. These adaptations can significantly alter an operator's decision making strategy. In light of this, it is critical to recognize that environments with highly nonlinear components will have lower achievement values, and the predictions of the Lens model will be less accurate. Third, the model does not explain any possible chance agreement between the errors of the human and environmental models; it is therefore important to understand that the Lens model does not capture or identify errors in the models. When selecting cues for the environmental model, if they can be selected or engineered (i.e., they do not occur naturally), it is critical to select the best possible linear model of the environment with which to compare judges. This can be done by statistically investigating the effects of a variety of sets of "possible" cues to determine which ones are most important to include in the model[45]. One last important limitation of the Lens model is that it is primarily a static model without the capability to investigate dynamic cues. Though it has been used to evaluate decisions in time-dependent domains[13], this is generally done by ignoring the dynamically changing nature of the information and building models from snapshots of data at the instant of a judge's decision. Dynamically changing information is implied in the results, but it is not directly investigated in the model.

3.4 Implementation and Data Post Processing

The Lens model was implemented in MATLAB (R2009a) as a generic m-file that can be used with any set of data in text, comma-separated value (CSV), or Excel spreadsheet format. The implementation requires the following input data, which was collected in CSV format from the experimental interface:

- Cues: These can be represented in any number of column vectors of data, which correspond to the different observable cues or inferred dynamic information being used to predict the judgment and the environment.

- Actual judgment: The actual judgment vector must also be a column vector that is the same length as the number of cue observations, with one judgment response for every set of cues. This data is often binary or in some other scaled integer format.

- Actual environment: The actual environment vector represents the true condition of the environment for each set of cues. It has the same requirements as the actual judgment vector and should be in the same data format (binary, integer scale, etc.).
The model provides the following outputs, one value of each for every data file that is run through the model:

- Achievement: The measure of how well the judge did at predicting the environment, or how accurate they actually were. It is essentially a correlation coefficient, but it also accounts for nonlinear components of the environment and the other parameters listed below in its calculation.

- Skill score: This metric of decision quality is described above in reference to the limits of correlation coefficients and their interpretation. It is a measure of quality that accounts for the correlation between two variables as well as the scale and magnitude errors inherent in its calculation.

- Judgment strategy: This can be discerned by examining the standardized weights assigned to each of the cue variables. These weights can be compared relative to one another to see where the human judge allocates more confidence or importance, in terms of the cues they believe best predict the environment.

- Environmental predictability and judgment consistency: These measures are the multiple correlations between the environment and the judgment and their corresponding models generated in the multiple regression. They give some idea of how well the environment can be linearly predicted, and how consistently the judge applies the same judgment strategy each time they are faced with a decision.

- Matching: This is the correlation between the judge model and the environment model. It describes how well the judge understood the linear component of the environment and adapted their judgment to match it.

- Unmodeled knowledge: This accounts for any nonlinearities in the environment that cannot be explained by the linear combination of the cues presented. It is the correlation of the model residuals for both the judge and the environment, and it gives an indication of how much outside information the judge was using in their decision making.

In this form, the model was used to post-process data from the experiment to determine each subject's achievement and decision consistency, which were also performance metrics for their decision making. The experimental data files recorded the subject's decision and the values of each system parameter at each time of decision, as well as the inferred deviation of each of these system parameters from its corresponding normal state. First, the detection and diagnosis data were separated for each subject so that the impact of the trend information could be evaluated for each decision type individually. The recorded decisions became the actual judgment values, and the recorded parameter values and inferred states became the cues used to build the regression equation for the human judgment side of the model. The true values of the environment were added to the CSV file manually by the experimenter after each session. Once the environmental data was added, the data with trend information present and without trend information present were separated for each subject and run separately through the model implementation. This generated two sets of Lens model output parameters for each subject: one achieved when trend information was present, and one without it.
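The post-processing flow just described might be sketched as follows. This is a simplified illustration in Python rather than the actual MATLAB m-file; the file layout, column names, and the lens_model_stats helper from the earlier sketch are all assumptions.

    import pandas as pd
    # lens_model_stats() is the illustrative helper sketched in Section 3.1

    def process_subject(csv_path):
        """Split one subject's log by decision type and trend condition,
        then run each slice through the Lens model (illustrative column names)."""
        log = pd.read_csv(csv_path)
        results = {}
        for decision in ("detection", "diagnosis"):
            for trend in (0, 1):                      # 0 = not present, 1 = present
                subset = log[(log["decision_type"] == decision) &
                             (log["trend_present"] == trend)]
                cues = subset[["roll_dev", "yaw_dev", "fuel_dev"]].to_numpy()
                results[(decision, trend)] = lens_model_stats(
                    cues,
                    subset["judgment"].to_numpy(),     # subject's decision
                    subset["environment"].to_numpy())  # true state, added post-session
        return results

Running this per subject would yield the two sets of Lens model parameters (with and without trend information) that were then aggregated for statistical analysis.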
These values were all aggregated into a single data file so that statistical analysis could be performed to explore the impact of the trend information on the Lens model output parameters and to compare it with the results of the other human performance metrics, such as speed and accuracy, for both failure detection and diagnosis.

4 Focus Area

In order to utilize a model of human decision making and evaluate the impact of explicitly displayed trend information for system operators, it was necessary to narrow the field of decision making to a subset of decisions applicable to complex system design. Failure resolution is a critical task in human-system interaction and was selected as the focus area for this thesis, specifically two stages of the failure resolution process: detection and diagnosis. Considerations for modeling human failure detection and diagnosis include the types of failure modes that can be modeled, the complexity of the model implementation, the decision performance, and the robustness in the presence of modeling errors[6]. In addition to these, several other studies suggest that the participatory mode of the system operator should also be a consideration. Participatory mode is the amount of interaction the human operator has with the system. It is classified into two categories, operator and monitor: one in the control loop, one outside of it[12, 46].

4.1 Failure Types

In order to understand failure detection and diagnosis, it makes sense to begin with the definition of a failure. A failure is any critical change in the system parameters or in the inherent dynamics of the system[10]. In a complex system with hundreds of parameters that could change, this definition may be too narrow. Therefore, a failure can also be defined as any unpermitted or uncommanded deviation of at least one characteristic property of a variable from acceptable behavior[7]. The types of failures that an operator may encounter depend on the specific domain and systems involved; all components in a system have some unique set of failure modes. However, there are some general failure types that have been previously investigated. Three primary types of failures, classified by the time it takes the failure to develop, are discussed below.

Step Failures: One of the most common types of failures is a "hard over" or step failure. This is a failure that instantaneously adds some kind of bias or significant error to a parameter value. It is generally sudden and well above the detection threshold for humans, and thus detection is often assumed. A step failure can be made more subtle by using an output-additive time function, which transforms it into a ramp failure by altering the mean of the observed process[9].

Ramp Failures: A more subtle type of failure is a ramp or degradation failure. This type of failure is closer to the threshold of detection and is more difficult to identify because it develops slowly over the instrument lifetime and is incipient. It is caused by the measurement error in a sensor or other instrument increasing very slowly, almost unnoticeably, over time as the sensor generating the data ages or degrades, and it manifests itself in an indicator that is inconsistent with others[9].

Intermittent Failures: An intermittent failure is a repetitive on-off failure. This could be something like a communications link going down and then coming back online, cell phone reception, or other examples in signal transmission and receipt.
It can be very difficult to detect because it is only apparent if the operator needs access to that particular piece of equipment or information at a time when the instrument is not functional. For example, you only notice that you have poor cell phone reception when you need to make a call.

Another way to classify failures is in terms of overall system impact. Up to this point, failure types have been discussed based on how they develop over time. Each of these types can have different effects depending on whether they can be resolved by the operator or the system. This "reparability" gives another dimension along which failures can be investigated. Failures can be either permanent or transient. Permanent failures are typically the result of a hardware or software issue that cannot be fixed by the operator in real time during the current mission; the operator must either operate with that failure or abort the mission. Transient failures are less serious in the sense that they are correctable during operation and are often simply a disturbance. They require time and operator cognitive resources but do not usually necessitate a mission abort.

Failure distributions can be modeled in a number of different ways. Failure times can be distributed randomly among participants or components, or they can be modeled as a stochastic process that is a function of the mean time between failures (MTBF) for that specific component or system. Failures have commonly been investigated singly, as opposed to cascading failures or multiple single-point failures, under the assumption that system failures are independent, which is unrealistic. In experimentation, failures are also modeled at higher rates than would normally occur, because true failure rates can be too low to be feasible for human subject testing. These unrealistic failure rates, however, may induce expectancy or bias in the human detector, who may look for failures more than they normally would in a real world situation.

4.2 Failure Detection

Failure detection is of primary importance to the operator of a complex system. Key to balancing the performance, reliability, and cost of a system is the ability of that system, or of the human involved, to perform accurate and timely failure detection and diagnosis. Engineering systems will eventually encounter unexpected failures and environmental disturbances, which must be detected in order to be resolved[11]. If the human is monitoring, their primary role is to be an accurate and efficient failure detection mechanism and problem solver who intervenes after a failure to ensure a safe outcome[12]. System failures can potentially result in the loss of productivity, expensive equipment, and human lives[10].

Full failure resolution in a system follows a three step process: detection, diagnosis, and compensation[10]. The work in this thesis focuses on the detection and diagnosis steps. Detection is defined as "determining if a malfunction has actually occurred in the system"[10]. Other authors incorporate the human operator, stating that failure detection is "the process whereby a human operator decides that an event has occurred"[4]. This indicates the importance of decision making in the human failure detection process. Failure detection is believed to be largely domain and task dependent, but a general framework of four different process measurements that humans typically look for in order to detect a failure has been identified[8].
These four include: the magnitude of changes between successive measurements, reversals in direction between successive measurements, the simultaneous occurrence of large magnitudes and reversals, and localized magnitude changes. Based on these types of trends in system information and their perceived accuracy, operators must decide whether they believe a failure has occurred[15].

There are circumstances where detection represents a source of uncertainty or a potential bottleneck in performance, because it is necessary to detect events that are subtle and near the threshold of human perception. Humans' ability to do this effectively is based on their sensitivity as detectors in different situations, known as the sensitivity level. It is important to understand what these levels might be in a system so that the system can be designed such that critical failures are detected as quickly as possible. A variety of factors affect the detection sensitivity of an operator. One is the number of states of categorization: the process of detection may involve a binary go/no-go decision, or it may require the human to choose between three or four possible levels of uncertainty about an event, or to detect more than one kind of event. This decreases detection performance by increasing the time it takes for the operator to evaluate multiple options. Sensitivity, like detection, is domain and task specific, influenced by operator training, display design, and a myriad of other factors such as physiological state, operator expectations, and experience. The participatory mode of the operator is another consideration; though there are conflicting results, it is generally accepted that the amount of interaction the operator has with the system as part of the control loop does have some effect on their detection performance. Sensitivity can also be impacted by the operator's level of preoccupation with secondary or side tasks, or mental workload; increased mental workload has been shown to decrease detection sensitivity[47]. However, despite all these possible limitations, humans have been shown to be relatively keen detectors when contrasted with the detection resolution of machines[15].

Aside from sensitivity, other measures of detection performance include latency, accuracy, and false alarms[6]. Detection latency is the time interval between the introduction of the failure and when the human cognitively detects its presence. It has been argued that detection time is not the best measure of detection performance and that identification and recovery times should be incorporated as well; near the operator's detection threshold, however, it is a valid and interesting parameter to evaluate[47]. Accuracy is a measure of how well a detector performs, often presented as the fraction of the failures that the human got correct. Closely tied to accuracy are false alarms, a special case of inaccuracy in detection that is of primary concern to designers in many different system domains, such as medical monitoring. A false alarm occurs when a failure is detected when it is not actually present in the environment.

4.3 Failure Diagnosis

Failure diagnosis is the next step in failure resolution and seeks to identify the cause of the failure. The more complex the system, the more difficult it is to diagnose failures and accurately evaluate system reliability, particularly where the human is involved. Diagnosis is typically a categorical decision with multiple possible outcomes.
The human must be able to discriminate between these different categories of failures, gather evidence about system performance, and receive feedback in order to accurately diagnose a failure. To do this, human operators use knowledge of cause and effect relations. The ways that faults propagate through a system and manifest themselves as observable symptoms generally follow physical cause and effect relationships. These relationships help operators identify possible causes, assuming they are familiar with the system dynamics. Since the physical properties of the different system parameters are connected quantitatively by the system dynamics and by time, these dynamics define the symptoms of each failure in a unique way. Diagnosis may require learning decision strategies and criteria for cross-checking instrumentation, as well as developing a set of rules to identify a failed component[48]. Diagnosis decisions, then, are based on the observed symptoms of the system's behavior up to the current time[7].

If the operator does not know the failure-symptom relationship prior to the failure occurring, they apply classification methods, such as neural networks, which classify symptoms into failure categories that make implicit sense given their knowledge about the system. If the causal relationship is known ahead of time, diagnostic reasoning strategies and logic are used to narrow down the possible failures. Rule-based decision makers and Boolean algebra are models that fit this type of decision making process[7].

Like detection, diagnosis is affected by fatigue, stress, and complexity. Diagnosis is often a complex decision involving a large number of cues and a large number of possible outcome categories. Difficulty in making diagnosis decisions arises when the failure symptoms are not unique, so that symptoms in the system dynamics could be attributed to multiple possible failures that look similar on multiple levels. Here, operator experience, extensive knowledge of the system dynamics and functions, and accurate mental models are incredibly valuable for accurate failure diagnosis. Pattern recognition and associative memories are effective means of dealing with this complexity, comparing the characteristics of failures with the signatures of previously known failures. Other tactics for failure identification include hardware or analytical redundancy, so that system parameters can be compared to a nominal value in order to determine what has failed[11].

Diagnosis performance is predicated on the initial step of failure detection and is likewise measured by latency and accuracy. Diagnosis latency can be calculated as the time between the detection of a failure and the identification of its cause. Inaccurate diagnoses take several forms: first, the operator both detected and diagnosed a failure that was not present; second, the operator detected a failure correctly but diagnosed it incorrectly; third, the operator detected a failure but never made a diagnosis as to what occurred; and last, the operator failed to detect or diagnose a failure that was present. Some of these mistakes are errors only in detection, some are errors only in diagnosis, and some are errors in both detection and diagnosis.
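The way these detection and diagnosis outcomes can be scored is illustrated by the following small sketch. It is only an illustrative summary of the categories above, not the scoring code used in this work, and the field names are assumptions.

    def score_trial(failure_present, detected, diagnosis, true_failure):
        """Classify one scenario's detection outcome and diagnosis correctness.

        failure_present : True if a failure was injected in the scenario
        detected        : True if the subject declared a failure
        diagnosis       : label chosen by the subject, or None
        true_failure    : actual failure type, or None for a normal scenario
        """
        if failure_present and detected:
            detection = "hit"
        elif failure_present and not detected:
            detection = "miss"
        elif not failure_present and detected:
            detection = "false alarm"
        else:
            detection = "correct rejection"

        if detection in ("hit", "false alarm"):
            if diagnosis is None:
                diagnosis_result = "no diagnosis"       # detected but never categorized
            else:
                diagnosis_result = "correct" if diagnosis == true_failure else "incorrect"
        else:
            diagnosis_result = None                     # diagnosis only scored after a detection
        return detection, diagnosis_result

A false alarm that is then diagnosed corresponds to the first error form above (diagnosing a failure that was not present), while a hit with a wrong or missing label corresponds to the second and third forms.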
5 Methodology

This chapter discusses the failure detection and diagnosis experiment that was conducted to examine the effect of trend information on human failure detection and diagnosis performance and to generate data to demonstrate the Lens model implementation.

5.1 Motivation

This experiment was designed to investigate the effect of explicitly displaying trend information about system parameters on operators' performance in failure detection and diagnosis tasks. Understanding what constitutes normal for the system is critical to failure detection and diagnosis, so that deviations from this normal state can be detected and identified. The system operator must have both good situation awareness and a mental model of what normal conditions should be in order to perform well at these tasks. Understanding the past trend of system parameter behavior is part of maintaining awareness of the current system state, as well as projecting those states into the future. Trend information also allows the operator to learn patterns and signatures for specific failure modes. These specific mental models and improved pattern recognition are important factors in accurate and timely failure diagnosis. Typically, these trends must be stored in the operator's working memory, which is known to be limited in both duration and capacity. Displaying them explicitly removes this extra cognitive workload from the operator, theoretically improving detection and diagnosis performance.

5.2 Research Questions and Hypotheses

5.2.1 Research Questions

The experiment was designed to explore the following research questions:

- Does explicitly represented parameter information, in the form of past trend behavior, improve the detection and diagnosis performance of the system operator in terms of reduced latency and increased accuracy?

- Does the addition of explicit trend information improve the achievement of system operators?

- Does the addition of trend information increase the consistency of operators' decision making?

5.2.2 Hypotheses

Adding trend information will decrease latency and increase accuracy for both the detection and diagnosis decisions. Explicit trend information was expected to decrease detection latency and increase accuracy because deviations from the nominal behavior of a system should be recognized more quickly if the past nominal behavior is immediately visible rather than stored in working memory. Similarly, for diagnosis, patterns should be identified more quickly and accurately if the past behavior is explicitly displayed. The corresponding reduction in the demands on memory was expected to help subjects be both faster and more accurate decision makers.

Trend information will improve a subject's ability to consistently use the same judgment strategy in detecting and identifying failures. The increase in consistency should occur for several reasons. First, it has been shown that additional cues (in this case trend information) increase an operator's ability to make decisions consistently, up to the point at which they become overwhelmed by information overload[49]. Second, this information provides a critical crosscheck that helps confirm or refute a subject's hypothesis about which failure has occurred. This extra evidence reduces uncertainty and therefore should increase both accuracy and consistency.

Explicit trend information will increase the achievement of subjects when compared to the recall of this information from working memory.
This makes sense in light of the hypotheses above and the equation for achievement in the Lens model: r_a = G R_e R_s + C \sqrt{(1 - R_e^2)(1 - R_s^2)}. Achievement is another measure of subjects' accuracy that takes the other model parameters into account. It has been shown that R_s and G are positively correlated, and for a dynamic task it is assumed that the trend information is an important component of the actual environment and therefore of G. Given this assumption, and the expectation that explicit historical information will increase the consistency of judges and their accuracy, an increase in achievement is expected.

5.3 Experiment Overview

5.3.1 Background

The focal point for all experimental failures was the reaction control system (RCS) in the context of a lunar landing scenario. This subsystem is responsible for the attitude control of the spacecraft in the roll and yaw directions during descent. It was selected because failures of this system would all be manifested in some way on the instruments that might be available in the cockpit.

5.3.1.1 Lunar Landing Spacecraft

Due to the nature of the detection task, it was important to present subjects with multiple trajectories in order to prevent them from becoming overly familiar with one trajectory signature and thereby making the detection task trivial. Three different trajectories were used, based on the angle of their approach to the surface: 15 degrees, 30 degrees, and 45 degrees. The 30 degree approach angle was defined as nominal because it most closely approximated the approach angle used for lunar landing during the Apollo program. The 15 and 45 degree trajectories were then defined as shallow and steep, respectively. All three trajectories had the same acceleration profile of 1.2 lunar gravities and began at a slant range of 1000 km from the desired landing point. Figure 6 shows a notional diagram of the three trajectories and depicts the portion of the trajectory that was selected for the experiment, from the end of the pitch-over maneuver to the start of the terminal descent to the surface.

Figure 6: Diagram of the three lunar trajectories (shallow, normal, and steep approaches across the braking, approach, and terminal descent phases) selected for the experiment, highlighting the portion of the trajectory incorporated in the experimental task

The trajectory is important to the operator because different trajectories exhibit different burn patterns during this phase. These burn patterns are important to an operator's understanding of "normal" trajectory conditions, which means that fuel is a primary indicator of RCS system failures. Steep trajectories burn less fuel overall but come in much higher and faster than shallow trajectories. Shallow trajectories begin closer to the lunar surface and require larger burns in order to complete the spacecraft's large rotation and pitch-over maneuvers early in the approach phase. Normal and steep trajectories have larger burns towards the end of the approach phase as they attempt to maintain the proper attitude during deceleration for the terminal descent phase.

5.3.1.2 Spacecraft Subsystems: RCS

The RCS is primarily responsible for spacecraft attitude control. Guidance receives position information from the spacecraft navigation filter and commands the RCS thrusters accordingly in order to maintain a stable attitude. Spacecraft attitude is a three dimensional vector quantity comprised of three angles: pitch, roll, and yaw.
Pitch is a forward and backward rotation around the Y-axis; in the spacecraft, positive pitch is a pitch backward, or a pull up of the "nose" of the spacecraft. Roll is a rotation about the X-axis, where positive roll is a roll to the right and negative roll is to the left. Yaw is a rotation around the Z-axis; positive yaw is defined as motion to the right, and negative yaw to the left. In the lunar lander, pitch is controlled by the spacecraft's gimbaled descent engine, called thrust-vector control (TVC), and so is largely unaffected by firings of the RCS thrusters.

The configuration for this test case was drawn from conceptual designs for the next lunar lander[50]. In this configuration, there are 16 RCS thrusters located in four clusters of four around the spacecraft. Two thrusters in each cluster are oriented along the Z-axis and can affect pitch behavior when not overshadowed by the descent engine. The other two thrusters in each cluster are canted at an angle to the X and Y axes and can affect both the roll and yaw angles of the spacecraft, which are coupled due to the dynamics and this physical orientation. In order to maintain the proper stable attitude during descent to the lunar surface, the thrusters fire in short bursts, in pairs on opposite sides of the spacecraft. The spacecraft axis convention and thruster configuration are shown in Figure 7. The thrusters are fixed and cannot be individually pointed. Each thruster can produce approximately 100 lbs of force.

Figure 7: Square configuration of RCS thruster pods around the lunar lander (left); full lunar lander with axis definitions (+X forward, +Y starboard, +Z down) and a close-up of an RCS thruster pod (right)

Fuel System: The RCS uses hypergolic fuel technology for propulsion, which consists of a liquid fuel and an oxidizer that combust on contact with each other, requiring no ignition mechanism and allowing an indefinite storage period as long as they are kept separate. The fuel for the RCS is liquid monomethylhydrazine, designated MMH. The oxidizer is nitrogen tetroxide, designated NTO. The thruster system is fed by fuel pumps in order to keep the fuel/oxidizer ratio at the correct level. A conceptual schematic of the RCS that was modeled for the experiment is shown in Figure 8.

Figure 8: Conceptual schematic of the RCS, including fuel tanks, thrusters, and fuel pumps

The system is designed to consume approximately 460 kg of total RCS fuel for both the descent to the surface and the return ascent to lunar orbit[50]. Half of this should remain at touchdown in order to make a safe return to lunar orbit. Four different failure conditions, plus normal operating conditions, were presented to subjects during the experiment and are discussed in detail below.

Fuel pump failure: This failure would result in the loss of an entire cluster of four thrusters. Each cluster of thrusters is equipped with one fuel pump installed in the fuel line, which branches to feed each of the four thrusters individually. If a fuel pump were to seize or stop working, the result would be fuel starvation for an entire cluster of thrusters, rendering them inoperable. The spacecraft would exhibit deviations in roll and yaw in the direction of the failed cluster of thrusters, as thrusters on the opposite side of the spacecraft continue to fire as planned and are no longer compensated for by the thrusters that are now non-operational.
The fuel behavior would remain normal, with the exception of a few additional burns necessary to correct the deviations in attitude that result from the loss of the four thrusters. With system weight a prime consideration, it is unlikely that there would be auxiliary pumps on board, and thus a failed fuel pump would require increased compensation from the other thrusters in order to maintain the proper landing attitude.

Fuel leak: This failure could result in a mission abort, as fuel is a critical resource for the return to lunar orbit. A leak appears only in the cockpit fuel gauge. It should not cause any attitude deviation, but it will affect the spacecraft's center of mass, which can affect its dynamics. If a leak occurred, the fuel gauges would drop linearly without any commanded burns. Typical fuel system behavior is a step function, with each thruster firing creating a small step down in the fuel level; depending on the trajectory, there may be long periods of time with no fuel decrease because no burns are occurring.

Valve failure: This failure mode could result in severe vehicle instability and possibly the loss of the mission. A valve failure is a very rare failure mode that could be caused either by the build-up of pressure pockets in the fuel lines around a valve or by a short circuit in the valve. Either of these conditions would cause intermittent thruster firings, which would induce a mild oscillation in the roll or yaw directions that slowly increases in magnitude as guidance attempts to correct it. Since the intermittent firings are random, guidance is always one step behind in its corrections, increasing the magnitude of the oscillation slowly over time. The additional thruster firings, and the compensation required to damp out the oscillation and return to stable flight, would cause fuel to decrease in a step fashion that depends on the number of random thruster firings.

Thruster failure: This failure, modeled as a continuously firing thruster, would result in a persistent deviation in roll and yaw that is compensated for by continual firings from opposing thrusters. This would cause not only attitude deviations but also a decrease in fuel similar to a fuel leak, since at a minimum the stuck thruster would be continuously burning fuel. This failure was intentionally selected to be similar in its symptoms to a fuel pump failure: deviations in roll and yaw for these two failures could look almost the same to the human operator. The distinguishing factor between the two failure modes is the fuel behavior; the gradual linear fuel decrease that occurs when a thruster is stuck on does not occur in a fuel pump failure, which burns fuel normally.

All of the data for each of these failures was taken from a high fidelity MATLAB simulation of the descent trajectory of a lunar landing spacecraft. The dynamics were altered in a simplified linear fashion to reflect the given failure mode. All of the failures could be detected and identified using three parameters: roll, yaw, and fuel quantity. RCS failures can have a significant impact on the success of the mission. Depending on the failure, the consequences can be relatively minor, such as additional compensation from other thrusters, or serious enough to require an immediate landing or abort to preserve the crew and the spacecraft.
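To make the differences between these signatures concrete, the sketch below shows one way simplified failure effects might be superimposed on nominal roll, yaw, and fuel traces. It is purely notional: it is not the high fidelity MATLAB simulation actually used to generate the experimental data, and all magnitudes, rates, and names are made-up assumptions.

    import numpy as np

    def inject_failure(t, roll, yaw, fuel, failure, t_fail):
        """Notional failure signatures on nominal traces (illustrative magnitudes).

        t : time vector [s]; roll, yaw [deg]; fuel [kg]; t_fail : failure onset time [s]
        """
        after = t >= t_fail
        dt = np.where(after, t - t_fail, 0.0)
        roll, yaw, fuel = roll.copy(), yaw.copy(), fuel.copy()

        if failure == "fuel_leak":
            fuel -= 0.5 * dt                          # linear drop, no attitude change
        elif failure == "thruster_stuck":
            roll += 2.0 * after                       # persistent roll/yaw offset
            yaw += 1.5 * after
            fuel -= 0.3 * dt                          # continuous burn, leak-like decrease
        elif failure == "valve_failure":
            grow = 0.05 * dt                          # oscillation growing in magnitude
            roll += grow * np.sin(2 * np.pi * 0.5 * dt)
            fuel -= 0.1 * np.floor(dt)                # step-wise extra burns
        elif failure == "fuel_pump":
            roll += 1.0 * after                       # drift toward the failed cluster
            yaw -= 1.0 * after                        # fuel stays essentially nominal
        return roll, yaw, fuel

The sketch reflects the distinguishing features described above: the leak and stuck thruster share a gradual fuel decrease, while the pump failure shows attitude deviations with near-normal fuel behavior.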
5.3.2 Task

The experimental task was a supervisory control task in which the subject monitored the progress of an autopilot conducting the final approach to landing of a lunar lander spacecraft at a specific landing aimpoint. Each scenario started with the subject being presented the screen shown in Figure 9.

Figure 9: Display presented to the subject at the start of each scenario

This screen allowed the subject to start each scenario when they were ready by clicking the "OK" button; the scenario would then begin to dynamically update. The subject's responsibility was to detect and identify failures occurring in the RCS, as described above. It was also possible for no failure to occur and for the system to perform normally. If a failure was detected, the subjects were instructed to press the "Failure" button indicated by the red circle in Figure 10.

Figure 10: Failure button to be selected when a subject detected a failure

Once the failure was detected, subjects were asked to diagnose it as one of the four predetermined types, or to cancel their detection decision if they felt they had made an error. This was done through a series of five buttons on the emergency procedures panel, which activated after the failure button was pressed. The emergency procedures panel is shown in the red oval in Figure 11.

Figure 11: Failure identification panel where a subject selected a button to diagnose a failure after detection

Once the failure was diagnosed, the scenario ended automatically and the next scenario for that particular subject was automatically loaded by the simulation. The subjects were returned to another screen where they could select "OK" when they were ready to begin the next scenario, as shown in Figure 9 above. A trend information display, shown in Figure 12, was displayed in half of the scenarios to compare the impact of this additional information on subjects' performance.

Figure 12: Trend information display shown to each subject during half of the scenarios

Three different trajectories were used to prevent the detection task from becoming trivial through subjects rapidly learning the failure signatures and normal trajectory pattern of a single trajectory over many scenarios. For each trajectory, 20 scenarios were developed, four of each failure condition plus normal, for a total of 60 scenarios overall. The monitoring task was 60 seconds in duration for each scenario. Scenarios were blocked in groups of 30 based on the two information conditions described below, and subjects were counterbalanced and randomized as to which condition they received first. Each subject saw all 60 scenarios. Within the blocks of 30 scenarios, scenarios were evenly divided between trajectory and failure type, such that there were two of each combination of these factors, and the scenarios were randomized within these blocks. Failures occurred randomly between 10 seconds and 50 seconds, so that subjects still had enough time to make a decision if a failure occurred late.

Below are short descriptions of the experimental factors.

Trend Information Conditions:

- Not Present: In this condition the operator was presented with only the primary flight display as a means of detecting and diagnosing system failures. The trend screen was present on the monitor but was blank grey. The primary flight display is shown in Figure 13.
Figure 13: Primary flight display shown to subjects for every scenario

- Present: The trend information was added on a secondary monitor alongside the primary flight display, which the subject could cross reference or utilize at their discretion in order to detect and diagnose failures. The trend information was updated at the same rate as the information on the primary flight display. The trend display is shown in Figure 14.

Figure 14: Trend display at scenario completion

System States:

- Failure: The system encounters one of the four failure conditions described above, with the corresponding symptoms displayed on either the primary flight display or both displays, depending on the information condition.

- Nominal: The scenario proceeds from start to finish with no failure. This is a control condition to ensure that subjects are not able to cheat the experiment by instantly declaring a failure at the start. It also attempts to remove subjects' expectancy that a failure will occur in every case, which could bias the results.

Trajectory Types:

- Shallow: The shallow trajectory is a long range, low altitude trajectory which uses more fuel in a large burn at the beginning of each scenario.

- Steep: The steep trajectory is a short range, high altitude trajectory that is much more fuel conservative, but requires an additional large burn at the end in order to maintain stability as it decelerates.

- Nominal: The nominal trajectory has a blend of these characteristics, with a small burn exhibited at the scenario outset and a larger burn towards the end, similar to the steep trajectory, in order to stabilize during deceleration.

Independent Variables:

- Trend Information Condition
  o Not Present
  o Present
- System States
  o Failure (Fuel Leak, Fuel Pump Failure, Thruster Failure, Valve Failure)
  o Normal
- Trajectory
  o Shallow
  o Steep
  o Nominal

Dependent Variables:

- Detection/Diagnosis Latency
- Detection/Diagnosis Accuracy
- Judgment Consistency
- Lens Model Achievement

Table 1 shows the test matrix for all trajectory types. There are ten cases for each trajectory type and one replication of each of these conditions, producing 60 total scenarios.

Table 1: Experimental test matrix for all three trajectory types

  Not Present    Present
  Leak           Leak
  Fuel Pump      Fuel Pump
  Thruster       Thruster
  Valve          Valve
  Normal         Normal

5.3.3 Subjects

An a priori sample size calculation for a power of 0.8 indicated that a minimum of 24 subjects was required. The total number of subjects tested was 28, which should provide robust statistical results. The subject pool included 19 males and 9 females, ages 21-57, recruited from the Charles Stark Draper Laboratory community. Because the display was based on a modern glass cockpit, subjects with little to no flight experience were preferred, in order to ensure that no experience bias affected the results. Two subjects with flight experience were tested; their data was cross checked and no significant difference was found. Subjects were also asked to report their use of flight simulators and any experience they had with spacecraft systems.

5.4 Procedure

Subjects first completed an informed consent form and a background questionnaire that gathered their demographic information. Next, they completed a 20 minute, computer-based PowerPoint tutorial that outlined the experimental tasks, explained the software interfaces, described the RCS, and covered all possible system states (see Appendix C: Experimental Consent Forms and Training Materials).
Subjects were given a cheat sheet that listed the failures and their symptoms in bullet format and were told that they could take notes on this sheet while viewing the tutorial if they chose. The subjects then completed a series of five interactive practice sessions (one with each failure type plus normal) in the experimental task environment. The practice sessions were active training in which subjects detected and diagnosed a system failure with the experimenter present for feedback, so that they could ask any questions regarding the interfaces and the information presented and be comfortable before proceeding. During practice, important functionalities of the interfaces were explained, and subjects were given the opportunity to view any of the trials again if they desired. Each subject then completed 60 full task scenarios in blocks of 30, with a 5 minute break between blocks. After the last set of scenarios, subjects took part in a post-experiment interview to gather feedback about the experiment, their judgment strategies, the information they felt was most helpful, and how much learning they felt took place. These interviews were audio recorded for later data processing. The experiment was conducted over a period of 12 days, with each session lasting approximately two hours.

5.4.1 Apparatus and Graphical User Interface

The experiment was conducted at Draper Laboratory. Participants monitored the scenarios on two 17" monitors situated directly next to one another and pressed a button on the lower portion of the primary flight display when they detected and identified a failure. A standard keyboard and mouse were available on the desk. The room contained one other desk, for the experimenter to observe, and was otherwise cleared of all other information. The data was refreshed to the screen at a rate of 25 Hz and displayed to participants using the graphical user interface described in detail in Appendix B. The primary flight display was located flush with the upper right corner of the left monitor, and the trend information display was located flush with the upper left corner of the right monitor. This was done so that the two displays would be close enough to scan but not collocated on the same screen. The graphical user interfaces (GUIs) were implemented in the Python programming language. Each GUI was implemented so that all of the available data and times were recorded in a data file for later analysis. The GUIs included button press functions that allowed the user to start each scenario at their own discretion when a pop-up box indicated that they could move to the next scenario. Buttons were also included at the bottom of the display to indicate that a failure had been detected, and then to diagnose the failure as one of the four predetermined failure types or to cancel the detection decision.

5.4.2 Data Collection

The values of each parameter presented to the subject, along with the time stamp and the decision at each decision point (indicated by a button press), were recorded in a text file for each subject and each scenario. These parameter values provide the cue data used in the Lens model for data post-processing to determine judgment consistency, achievement, and the subject's judgment strategy[5]. Detection latency is the difference between the time of failure injection, which was fixed for each individual scenario, and the time the subject identified a problem via a button press on the interface.
Detection accuracy is based on whether a failure was actually present and on the subject's response. Diagnosis latency is the time difference between the detection of a failure and its categorization as one of the four possible failure types; this is calculated from the difference in time between the two button presses on the interface. Diagnosis accuracy is determined by which button was pressed, compared to the true failure condition.

6 Data Analysis and Results

6.1 Purpose for Analysis

The purpose of the analysis was to investigate the hypotheses regarding the effect of trend information on operator failure detection and diagnosis performance. Performance is defined by operator latency and accuracy. Latency is the time difference between either the introduction of a failure or its detection and the corresponding detection or diagnosis. Accuracy is determined by whether the subject's decision matched the actual state of the environment. Lens model parameters, such as decision consistency and judgment achievement, were also evaluated as performance metrics. Improved performance corresponds to decreased latency and increased accuracy, achievement, and decision consistency.

Another purpose of the analysis was to calibrate the Lens model by using it to post-process dynamic data. This involved coding the dynamic trend information into the model, which has not been done previously, even in dynamic cases where the Lens model has been applied. In order to represent the dynamic nature of trend information in the model, it was assumed that subjects infer the size of the deviations from normal that occur during a failure from the trend display. It was assumed, and supported by experimental observations of subject behavior, that larger, more dramatic deviations from an operator's mental model of normal would typically generate a quicker and more urgent response, because they appear more serious to the operator. In order to evaluate this intermediate cognitive processing of the trend information and its impact on subject decisions, the dynamic information used in the model was reduced to the ratio of the total visual display space taken up by any deviation from normal. A figurative representation of this is shown in Figure 15; this type of display was not actually shown to subjects in the experiment.

Figure 15: The area of a deviation is the integral between the normal trajectory signature (top curve) and the failure case (bottom curve), from the scenario start to the time of decision indicated by the white line.

Each approach type has a normal signature for each parameter that it follows when no failure is introduced; these normal scenarios are the baseline for this calculation. The percentage of the display taken up by the deviation is determined by taking the integral of the difference between the current trajectory and the normal scenario. This area is divided by the area of the entire trend graph axes for that particular parameter to give the percentage of the visual area occupied by the deviation. This representation of dynamic trend information strives to capture the impact the deviation had on the operator's response.

6.2 Analysis and Results

The analysis was designed to investigate the effects of trend information on subjects' detection and diagnosis latency and accuracy, as well as on the Lens model achievement and consistency data. Analysis was conducted using a combination of the SPSS 15.0 statistical software package and MATLAB.
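The deviation-area cue described above might be computed along the following lines. This is a minimal sketch; the actual calculation was part of the post-processing, and the axis range and variable names here are assumptions.

    import numpy as np

    def deviation_area_ratio(t, observed, nominal, t_decision, y_range):
        """Fraction of the trend graph's visual area occupied by the deviation
        between the observed parameter trace and its nominal signature, from the
        scenario start up to the time of the subject's decision."""
        mask = t <= t_decision
        deviation_area = np.trapz(np.abs(observed[mask] - nominal[mask]), t[mask])
        axes_area = (t[-1] - t[0]) * y_range       # full area of the trend plot axes
        return deviation_area / axes_area

One such ratio per displayed parameter (roll, yaw, and fuel) would then serve as a dynamic cue in the Lens model regression.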
Other factors that also influenced the results included the approach type, the failure type, the order in which the subject saw the different scenario pairs, and whether they saw the trend information in the first or the second block of experimental trials. These components of the experimental design were evaluated to determine whether they had a statistical effect on the results. Each step of this analysis is discussed in detail below for each type of data collected. Additional analysis results in support of the discussion and conclusions can be found in Appendix G. 6.2.1 Latency Data The first dependent variable was latency. Initially, the raw data were examined to predict the results of the statistical analysis; these initial conclusions were then supported with the appropriate statistical tests. Since latency is a continuous variable, parametric statistical techniques are most appropriate. The primary technique was either a Univariate or a repeated measures ANOVA (Analysis of Variance), both of which are special cases of regression. First the data were cleaned so that outliers and other invalid cases did not skew the results. The data were then examined for normality, homogeneity of variance, and independence. If all of these criteria were met, a Univariate ANOVA model with multiple factors was used to determine the significant factors and their interaction effects. If the assumptions were not met, data transforms were applied to improve normality and homogeneity; if the data could not be improved sufficiently, nonparametric testing was used. If independence was violated, the simpler Univariate model was replaced with a repeated measures ANOVA model. The repeated measures ANOVA is used when each subject is exposed to multiple levels of an experimental factor, so that the observations within a factor are not independent because more than one level was observed by the same subject. 6.2.1.1 Detection Latency For detection, all misses, false alarms, and correct rejections were cleaned from the data. Misses are scenarios where a failure was present but was not detected. False alarms are defined as detecting a failure where none was present. Correct rejections are normal scenarios where no failure was present and the subject never detected a failure. Figure 16 shows a pair of box plots of the latency data, broken up by whether the data were collected with or without trend information present. From this plot there appears to be a significant effect for trend information, which appears to decrease the overall mean latency. Significant effects for the type of failure and the approach type were also expected, as these affect the subjects' perception of "normal," which is critical for developing a mental model with which to detect deviations from normal system behavior. Figure 16: Data suggesting that trend information has an effect on detection latency. The raw latency data failed to meet two primary assumptions: normality and homoscedasticity. A cube root transformation was used to give the data both a normal distribution and homoscedasticity. The results from running a Univariate ANOVA on the transformed data are presented in more detail in Appendix G. As expected, history (trend information), failure type, and approach type all had highly significant effects (p < 0.0005).
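For illustration only, the following sketch reproduces the general flow of this latency analysis on synthetic data in Python (the thesis analysis was performed in SPSS and MATLAB; the column names and effect sizes below are invented).

```python
# Illustrative version of the detection latency analysis on synthetic data
# (hypothetical column names; not the experimental data set).
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    "trend": rng.choice(["present", "absent"], n),
    "failure": rng.choice(["leak", "pump", "thruster", "valve"], n),
    "approach": rng.choice(["shallow", "nominal", "steep"], n),
})
# Synthetic skewed latencies with a small benefit when trend information is present.
df["latency"] = rng.gamma(shape=2.0, scale=6.0, size=n) - 2.0 * (df["trend"] == "present")
df["latency"] = df["latency"].clip(lower=0.5)

# Cube root transform to improve normality and homoscedasticity, then check assumptions.
df["latency_cbrt"] = np.cbrt(df["latency"])
print("Shapiro-Wilk p:", stats.shapiro(df["latency_cbrt"]).pvalue)
print("Levene p:", stats.levene(*[g["latency_cbrt"] for _, g in df.groupby("trend")]).pvalue)

# Multi-factor ANOVA on the transformed latency.
model = smf.ols("latency_cbrt ~ C(trend) * C(failure) * C(approach)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```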
The experimental block, which indicated whether subjects saw trend information first or second, did not have a significant effect (p = 0.924), and neither did the order in which the subjects saw each pair of scenarios (p = 0.624), indicating that there were no significant learning effects for detection. 6.2.1.2 Diagnosis Latency Any cases in which a diagnosis decision was not made were eliminated from the diagnosis data; these were cases where subjects detected a failure but did not make a diagnosis before the scenario time ran out. Figure 17 shows the raw diagnosis latency data broken up by whether trend information was present or not. Figure 17: Data suggesting that trend information has no significant effect on diagnosis latency. This box plot does not suggest a significant effect for trend information. Other expected effects include the failure type and the approach, which are important since some failures and trajectories were intentionally selected to be easier to diagnose than others and hence should have a significantly lower latency. The data were homogeneous, but neither normal nor independent. Because independence was violated, the Univariate model was inappropriate, and a repeated measures ANOVA was used instead. ANOVA is robust to departures from normality but not to heterogeneity of error variance, so no data transformation was necessary. Detailed results are shown in Appendix G. As expected, there was no significant effect for the presence of trend information (p = 0.722). There were significant effects for both failure type (p < 0.0005) and approach (p = 0.024). 6.2.2 Accuracy Data Accuracy is determined by whether the subjects' decisions correctly matched the actual state of the environment. Since this decision has only two possible outcomes, correct or incorrect, the accuracy data are binary, do not meet the ANOVA assumptions, and were analyzed with nonparametric tests. For simple single-factor analysis with more than two levels, such as failure type and approach type, the most appropriate test is the Kruskal-Wallis H test, which is essentially a nonparametric ANOVA. However, this test evaluates only one independent variable at a time, and it was used primarily to get a sense of what to expect from the more inclusive analysis that followed. In order to create a nonparametric model with multiple factors and interaction effects, binary logistic regression must be used. The regression model was built using a stepwise forward likelihood ratio procedure, which checks at each step for variables that add significant effects and adds them to the regression model until it converges on the optimal predictive model for the given data. 6.2.2.1 Detection Accuracy The inherent trends in these data are best observed in a bar graph of the percent correct versus the presence of trend information, seen in Figure 18. Figure 18: Detection accuracy results (approximately 80% correct both with and without trend information) showing no significant effect for trend information. Since these percentages are fairly close together whether or not trend information was present, no significant effect on detection accuracy was expected.
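As an illustration of the single-factor Kruskal-Wallis tests described above, the following sketch runs the test for each factor on synthetic accuracy data (hypothetical column names; the actual tests were run in SPSS).

```python
# Illustrative single-factor Kruskal-Wallis tests on synthetic accuracy data.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 480
df = pd.DataFrame({
    "trend": rng.choice(["present", "absent"], n),
    "failure": rng.choice(["leak", "pump", "thruster", "valve"], n),
    "approach": rng.choice(["shallow", "nominal", "steep"], n),
})
# Binary accuracy with a failure-type effect but no trend effect, purely for illustration.
base = {"leak": 0.70, "pump": 0.75, "thruster": 0.90, "valve": 0.90}
df["correct"] = (rng.random(n) < df["failure"].map(base)).astype(int)

for factor in ["trend", "failure", "approach"]:
    groups = [g["correct"].to_numpy() for _, g in df.groupby(factor)]
    h, p = stats.kruskal(*groups)
    print(f"{factor:>9s}: H = {h:5.2f}, p = {p:.3f}")
```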
Significant effects were expected for both failure type and approach type because these factors are important to how subjects differentiate between normal and failure conditions. Some failure types and approach types have specific symptoms or characteristics that make them easier or harder to detect, which would give the failure and approach type significant influence. A table of the initial Kruskal-Wallis tests performed is shown below in Table 2.

Table 2: Factor Effects for Detection Accuracy Data (Significance: p < 0.05)
    Factor          Kruskal-Wallis Significance
    Trend           0.622
    Failure Type    0.000
    Approach        0.018
    Block           0.806
    Order           0.010

As expected, trend information did not show a significant effect, while failure type and approach type were both significant. While the block was not significant, the order in which participants saw the pairs of scenarios did have a significant effect, which suggests that participants learned to detect a given failure/trajectory combination better with time; that is, they were more accurate at detection the second time they saw the same conditions, as shown in Figure 19. Figure 19: Results indicating a significant effect on detection accuracy for order. The regression model confirmed these Kruskal-Wallis results and also showed which specific trajectories and failure types were significant over the others. Figure 20: Results showing the effect of failure type on detection accuracy. Among the individual failures, thruster and valve failures were significantly easier to detect accurately than the other types of failures. The shallow trajectory appears to have a negative effect on detection accuracy, indicated by a negative coefficient estimate (B = -0.489) for this trajectory type. The regression analysis agreed with the Kruskal-Wallis tests that the trend information had no significant effect. The detailed table for the binary logistic regression, using a forward stepwise model, is presented in Appendix G. 6.2.2.2 Diagnosis Accuracy The diagnosis accuracy data show a different trend, as seen in the bar chart of the percentage correct versus the presence of trend information in Figure 21. Figure 21: Diagnosis accuracy data (66.7% correct with trend information present versus 56.8% without) showing a significant effect for trend information. The data imply a large difference between trend information conditions: when trends are present there are almost 10% more correct diagnoses, indicating that trend information has a significant positive effect on diagnosis accuracy. Significant effects for failure type and approach are expected, since some failures are intentionally easier to detect and diagnose than others. A table showing the results of the Kruskal-Wallis tests is shown below in Table 3.

Table 3: Factor Effects for Diagnosis Accuracy Data (Significance: p < 0.05)
    Factor          Kruskal-Wallis Significance
    Trend           0.000
    Failure Type    0.000
    Approach        0.092
    Block           0.079
    Order           0.079

As expected, there were significant effects for both trend information and failure type. Both the block and the order in which the scenarios were presented were marginally significant, indicating that for the diagnosis task specifically there may be some important learning effects over time.
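For reference, the forward likelihood-ratio selection procedure described in Section 6.2.2 can be sketched as follows; this is a simplified stand-in for the SPSS routine, run on synthetic data with hypothetical column names, not on the experimental data set.

```python
# Illustrative forward-selection binary logistic regression using likelihood-ratio tests.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
n = 480
df = pd.DataFrame({
    "trend": rng.choice(["present", "absent"], n),
    "failure": rng.choice(["leak", "pump", "thruster", "valve"], n),
    "approach": rng.choice(["shallow", "nominal", "steep"], n),
})
# Synthetic correctness probabilities with trend and failure-type effects.
logit_p = (0.3 + 0.5 * (df["trend"] == "present")
           - 0.9 * (df["failure"] == "pump") - 1.2 * (df["failure"] == "thruster"))
df["correct"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit_p))).astype(int)

def fit(terms):
    formula = "correct ~ " + (" + ".join(terms) if terms else "1")
    return smf.logit(formula, data=df).fit(disp=False)

candidates = ["C(trend)", "C(failure)", "C(approach)"]
selected = []
current = fit(selected)
while candidates:
    # Likelihood-ratio test for each remaining candidate term against the current model.
    tests = []
    for term in candidates:
        trial = fit(selected + [term])
        lr = 2.0 * (trial.llf - current.llf)
        p = stats.chi2.sf(lr, df=trial.df_model - current.df_model)
        tests.append((p, term, trial))
    p, term, trial = min(tests)
    if p >= 0.05:          # stop when no remaining term improves the model significantly
        break
    selected.append(term)
    candidates.remove(term)
    current = trial
    print(f"added {term} (LR p = {p:.4f})")

print(current.summary().tables[1])
```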
A table of the regression results in Appendix G shows that two failure types, thruster (B = -1.804) and pump failures (B = -1.101), had negative effects on accuracy, while valve failures were easier to diagnose accurately. This is supported by the bar chart in Figure 22. Figure 22: Results showing the effect of failure type on diagnosis accuracy. Pump and thruster failures are very similar in their symptoms, so this drop in accuracy is plausible. There are also significant two-way interactions in the model which suggest that history was significant in helping diagnose some failure types but not others, specifically pump and valve failures. 6.3 Lens Model Achievement and Consistency Data The achievement and consistency values were collected by running each subject's data, with and without trend information separately, through the implemented Lens model code. These two parameters are part of the implementation output, along with the other Lens model parameters discussed earlier. The data were then evaluated using a repeated measures ANOVA. 6.3.1 Detection Achievement and Consistency The Lens model achievement data are shown in Figure 23. There is no clear trend, which is consistent with the lack of statistical significance for detection accuracy. Figure 23: Detection achievement data showing no effect for trend information. Table 4 shows the significance of trend information from a repeated measures ANOVA on Lens model achievement, consistency, and the other output parameters.

Table 4: Effects of trend information on Lens model output parameters for detection (Significance: p < 0.05)
    Parameter               Achievement  Consistency  Predictability  Linear Knowledge  Unmodeled Knowledge  Skill Score
    RM ANOVA Significance   0.531        0.447        0.000           0.001             0.995                0.578

There was no significant effect on achievement, which is not surprising given that achievement is similar to overall accuracy and trend information did not make a difference for detection accuracy. Consistency and environmental predictability, on the other hand, did show significant improvement with the addition of trend information. 6.3.2 Diagnosis Achievement and Consistency Given that trend information had a significant effect on diagnosis accuracy, significant effects for both achievement and skill score were expected. Since trend information also seemed to improve operator judgment consistency for detection, it seemed reasonable to expect a significant effect on consistency for diagnosis as well, which is predicated on detection. Figure 24 shows the achievement data. Figure 24: Diagnosis achievement data suggesting a possible effect for trend information. Table 5 shows the repeated measures ANOVA results for the Lens model parameters for diagnosis.

Table 5: Effects of trend information on Lens model output parameters for diagnosis (Significance: p < 0.05)
    Parameter               Achievement  Consistency  Predictability  Linear Knowledge  Unmodeled Knowledge  Skill Score
    RM ANOVA Significance   0.170        0.032        0.054           0.153             0.474                0.177

It was surprising to find no effect on achievement and skill score.
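For reference, the parameters reported in Tables 4 and 5 follow from the standard Lens model regressions of the judgment and the environmental criterion on the cue set. A minimal sketch of that calculation is given below, using synthetic data and hypothetical variable names (the thesis implementation is the MATLAB m-file described in Appendix A); the final value checks the Lens model equation, which decomposes achievement into a modeled term G·Re·Rs and an unmodeled term C·sqrt(1-Re^2)·sqrt(1-Rs^2).

```python
# A minimal sketch of the standard Lens model parameter calculations on cue data
# (hypothetical names and synthetic values; not the thesis implementation).
import numpy as np
from numpy.linalg import lstsq

def lens_parameters(cues, judgments, criterion):
    """Compute basic Lens model indices from an (n_cases x n_cues) cue matrix,
    the subject's judgments, and the true environmental criterion."""
    X = np.column_stack([np.ones(len(cues)), cues])       # add intercept
    yhat_env = X @ lstsq(X, criterion, rcond=None)[0]      # linear model of the environment
    yhat_sub = X @ lstsq(X, judgments, rcond=None)[0]      # linear model of the judge

    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    Re = r(criterion, yhat_env)        # environmental predictability
    Rs = r(judgments, yhat_sub)        # consistency (cognitive control)
    G = r(yhat_env, yhat_sub)          # linear knowledge
    C = r(criterion - yhat_env, judgments - yhat_sub)   # unmodeled knowledge
    ra = r(criterion, judgments)       # achievement
    return {"achievement": ra, "Re": Re, "Rs": Rs, "G": G, "C": C,
            # Lens model equation: achievement from linear and unmodeled components.
            "ra_reconstructed": G * Re * Rs + C * np.sqrt(1 - Re**2) * np.sqrt(1 - Rs**2)}

# Synthetic example: 30 cases, 3 cues, a noisy environment, and a fairly consistent judge.
rng = np.random.default_rng(3)
cues = rng.normal(size=(30, 3))
criterion = cues @ np.array([0.6, 0.3, 0.1]) + 0.4 * rng.normal(size=30)
judgments = cues @ np.array([0.5, 0.4, 0.0]) + 0.3 * rng.normal(size=30)
print(lens_parameters(cues, judgments, criterion))
```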
For consistency there is still a significant effect, so it appears that trend information improves a subject's ability to apply the same diagnosis strategy consistently compared with no trend information. Environmental predictability was marginally significant, indicating that trend information improves the overall predictability of this decision environment. 6.4 Detection False Alarm Data The last question raised by the data concerned the number of false alarms; it is important to evaluate whether trend information lowers the false alarm rate. To answer this question, the number of false alarms for each participant was calculated with and without trend information. Graphing these values, as shown in Figure 25, suggests that there may be a significant effect on the number of false alarms: trend information actually produced an increase in the number of false alarms for 75% of subjects. Figure 25: Results suggesting a trend towards increasing numbers of false alarms with trend information. The full results of the ANOVA can be found in Appendix G. The effect of trend information on the absolute number of false alarms was marginally significant (p = 0.082). Given the overwhelming majority of subjects who experienced a negative trend, this was surprising. With more observations it is possible that this effect could become significant, showing an overall negative impact on the false alarm rate. 6.5 Discussion 6.5.1 Detection For detection, trend information explicitly displayed to the subjects was helpful in reducing detection latency. It did not make any significant difference to the accuracy of detection or the occurrence of false alarms. It is worth examining each of these results a little more closely to understand what they might suggest on a larger scale. For detection latency, the original hypothesis was supported by a significant effect. Plots of the mean latency for each failure and approach type, shown in Figure 26, indicate that trend information decreased detection latency for all failure types on steep trajectories, and for all failure types except fuel leaks on both nominal and shallow trajectories. This was surprising for nominal approaches since their fuel signatures are similar to those of a steep approach. Figure 26: Results showing the combined effects of trend information, approach, and failure type on detection latency. Steep and nominal trajectories burn less fuel and have longer periods of time in which no burns occur, which would make detecting a leak easier. Shallow trajectories, on the other hand, exhibit a large RCS burn at the very beginning of the monitored section of the trajectory and more burns in general. Although the subjects were trained that this would occur, and were reminded of it by the experimenter during practice, this was often falsely detected as a leak early in the experimental trials while subjects were still learning what constituted normal system behavior.
Leaks also appear as more gradual deviations from normal, which would have encouraged subjects to watch the historical information longer to confirm their suspicions before declaring the failure. In terms of detection accuracy, the trends did not have an effect. This indicates that participants can gain an accurate mental model of what "normal" looks like for this task, and apply it, with or without the additional information. Trend information does not improve their ability to see and understand deviations from normal, just the time it takes for that to occur. The deviations in system parameters would have been visible to the subject at the same time whether they were monitoring the trend display or the primary flight display; as long as they recognized the deviation in both cases, accuracy would not be affected. For false alarms, though the effect was only marginally significant, it is important to look at the direction of that effect. A plot of the mean number of false alarms, shown in Figure 27, suggests that the effect of trend information is negative: subjects typically had more false alarms with the trend information present than without it. Figure 27: Results suggest an increase in the number of false alarms with trend information present. This is not surprising given that in the trend information condition subjects are presented with the historical trends of each flight display parameter on a second display of equal size. This is twice as much information as they have in the condition with only the primary flight display. It is therefore possible that there are more "trigger" states which could cause them to see something they think is not right and declare that a failure has occurred even when that is not true. The additional information could cause them to over-analyze the data being presented and second-guess their default assumption of normal operation. Also, because the failure rate in this experiment must be much higher than in a real-world system in order to prevent boredom and gather valuable decision data, the subjects were primed to look for a failure condition, even though they knew that normal scenarios were present. The importance of this trend must be evaluated in light of the consequences of false alarms versus missed detections. There is a tradeoff between these two conditions in the sensitivity of any detector: a higher number of false alarms can simply indicate that the detector is more sensitive, and should result in a corresponding decrease in missed detections, which can be equally problematic depending on the system and task at hand. 6.5.2 Diagnosis Historical trend information explicitly displayed to the subjects was helpful in improving diagnosis accuracy. However, it did not make any significant difference to diagnosis latency. The significant effect on diagnosis accuracy is expected because diagnosis is a complex decision amongst several possible outcomes. Additional information that explicitly displays the symptoms of a failure over time should be valuable to an operator in correctly assessing whether or not the failure was of a particular type. Accurate diagnosis requires the human to keep track of failure symptoms and store some sort of mental record of how the system parameters have evolved over time. The trend display eliminates this extra cognitive effort, decreasing the possibility for error and improving accuracy.
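Returning briefly to the false alarm counts of Sections 6.4 and 6.5.1, the per-subject comparison can be sketched as follows (synthetic counts and hypothetical column names; the thesis reports a repeated measures ANOVA, whereas paired tests are shown here as an equivalent within-subject comparison).

```python
# Illustrative per-subject false alarm comparison on synthetic counts.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
subjects = [f"S{i:02d}" for i in range(1, 21)]
records = []
for s in subjects:
    base_rate = rng.uniform(1.0, 4.0)
    records.append({"subject": s, "trend": "absent",
                    "false_alarms": rng.poisson(base_rate)})
    records.append({"subject": s, "trend": "present",
                    "false_alarms": rng.poisson(base_rate + 0.7)})  # slight increase with trends
df = pd.DataFrame(records)

wide = df.pivot(index="subject", columns="trend", values="false_alarms")
increased = (wide["present"] > wide["absent"]).mean()
print(f"Fraction of subjects with more false alarms when trends were shown: {increased:.0%}")

# Paired tests on the within-subject counts.
t, p_t = stats.ttest_rel(wide["present"], wide["absent"])
w, p_w = stats.wilcoxon(wide["present"], wide["absent"])
print(f"paired t-test p = {p_t:.3f}, Wilcoxon signed-rank p = {p_w:.3f}")
```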
When trying to understand the lack of effect on latency, it is helpful to begin with a plot of the mean latency with and without trend information for each failure type and approach type, shown in Figure 28. Figure 28: Results of the combined effects of trend information, approach, and failure type on diagnosis latency. Latency appears to get slower when historical information is provided for shallow trajectories and for some failure types on the other trajectories. The majority of subjects commented in post-experiment interviews that they preferred the trend display over the primary flight display because they felt they better understood the evidence available. It seems logical, then, that in order to make a more accurate diagnosis they would use this extra information and observe the system behavior for a longer period of time to be certain before making a decision. However, if they observed it for too long, the scenario ended without a decision having been made, and there was the risk of the operator getting "tunnel vision" and forgetting about other tasks and constraints such as time. 6.5.3 Lens Model Parameters Lens model parameter data were derived by post-processing the cue information and the inferred deviation information through the Lens model implementation, both with and without trend information present. This was done separately for detection and diagnosis. The two key parameters were achievement and decision consistency; however, other parameters that showed significant or marginally significant effects also deserve consideration. For detection, it is not surprising that there was no effect on achievement, which is effectively another measure of decision accuracy, since there was no effect of trend information on detection accuracy. However, the significant effects on consistency and environmental predictability make a positive case for trend information. Increasing judgment consistency is important for operator training, and indicates that if operators could be trained to use the most accurate judgment strategy, they could apply it more consistently with trend information present. This improves system safety and increases the likelihood of mission success. The data from the Lens model for diagnosis showed a significant effect for consistency only; predictability, linear knowledge, achievement, and skill score were all marginally significant. The improvements in consistency and predictability are valuable: if operators can be trained to utilize a more accurate judgment model, they can predict the environment more accurately and more consistently. Since achievement measures how well the human judge does at predicting the environment, these data should have reflected results similar to the diagnosis accuracy data, and it was surprising that the effect was not significant. One possible reason is that the regression model used in the Lens model did not have a sufficient amount of data to capture the full effect of the trend information. For detection, because of the nature of the decision, there are a full 30 cases for each subject with and without trend information.
This meets the rule of thumb of five cases per cue used [33]. For diagnosis, because some cases were detected but never diagnosed, there were often fewer cases, so this regression rule of thumb was not met for many of the subjects, potentially yielding a less accurate model. 7 Conclusions and Future Work 7.1 Conclusions A representation of the Lens model was selected to investigate the performance impacts of displaying trend information to system operators during failure detection and diagnosis. The experimental results revealed that the addition of trend information has some human performance benefits for failure detection and diagnosis tasks. While both speed and accuracy can be viewed as important performance metrics, improving one side of this tradeoff while the other remains constant still yields overall performance benefits. For detection, having trend information available yielded faster decisions but did not make operators more accurate. For diagnosis, the presence of trend information led to slightly longer decision times but made operators more accurate at diagnosing the problem. These two significant results are important for several reasons. First, diagnosis is predicated on detection: the faster a failure can be detected, the sooner diagnosis, and the entire failure resolution process, can begin. Detection is the first step in the process, which suggests that faster detection could lead to overall faster failure diagnosis and resolution. In general, the faster a failure can be resolved, the smaller its impact on overall mission and system performance. Next, accurate diagnosis is critical to successful human failure resolution. If a failure is incorrectly diagnosed, incorrect procedures for resolving that failure may be executed. This can lead to other emergent system problems and fail to resolve the original failure condition, which could cause the abort of the mission or the loss of crew or equipment. Therefore, it can be argued that the performance improvements in detection latency and diagnosis accuracy achieved by explicitly providing operators with trend information about spacecraft system states are valuable results. The addition of trend information displays could improve the performance and safety of spacecraft systems with humans as monitors. One consideration in adding trend information is that it seems to produce a trend toward higher numbers of false alarms. Although the increase was not statistically significant, it resulted in a higher false alarm rate for 75% of experimental subjects. This trend towards higher false alarm rates may outweigh the human performance benefits in domains such as health care, where false alarm rates are a very important consideration. For many mechanical systems, however, such as spacecraft systems, the improvements in detection latency and diagnosis accuracy outweigh the possibility of higher false alarm rates. Lastly, the Lens model was found to be a useful and generalizable decision making model for human failure detection and diagnosis. The Lens model fulfills all of the basic modeling requirements for the project and provides more information about the decision environment and decision strategy than any of the other decision making models from the literature that were considered. It was calibrated through human subject experimentation and expanded in its ability to incorporate an operator's perception of dynamic trend information in the decision making process.
7.2 Future Work The experimental results regarding the impact of trend information on operator failure detection and diagnosis performance also suggest some interesting avenues for future work. The trend towards increased false alarms was, surprisingly, not significant, and these ambiguous results could be further examined to understand why this occurs. One possible explanation is that a higher false alarm rate is the result of an increase in cognitive workload. An increase in the amount of information available, in this case trend information, may increase the operator's workload because there is now more information to monitor and process. This increase in cognitive workload may be the cause of the noticeably higher false alarm rate when trend information is present. Conversely, a higher false alarm rate may also perpetuate a cycle by further increasing the operator's workload as they attempt to diagnose and resolve failures that are not actually present. The Lens model parameter results were also ambiguous. Parameter results such as achievement did not show significance where the human performance counterpart did show statistical significance for trend information. There are several possible explanations for this result, and one of them could become an interesting area for future work. It is likely that this was partially the result of the most valuable cues not being selected, either for presentation to the operator or for inclusion in the model. Another set of cues may have generated a better regression model and may have made the performance impact, specifically on achievement, more clear. It would be valuable to have a framework for selecting representative cues that could provide the best descriptive model of the environment, which would then theoretically provide the best possible set of information to the operator for decision making. This process is currently mostly trial and error, and often requires significant resources to conduct multiple experiments and determine the best feasible combination of cues. However, a representative set of cues is critical in order to get an accurate picture of the human decision making performance provided by the Lens model parameters. The Lens model has been used for decades in judgment theory and decision making research and is a feasible and robust model that has recently been extended to dynamic decision making tasks. Future work could utilize this model, integrated with other models of human performance and system dynamics, to examine system failure, human failure, and combinations of the two. These complex system models could provide insight into the downstream effects on overall system reliability and performance, with operator inputs treated as components of the system. They could then be used as early-stage design tools to aid in the design of intelligent complex systems, by allowing engineers to create hypothetical test cases to evaluate system tradeoffs that affect both the human operator and the electromechanical system configuration. Appendix A: Lens Model Implementation Validation The Lens model was implemented in MATLAB 2009a as a generic m-file. This implementation platform will allow it to be plugged more easily into the Simulink modeling component of MATLAB and integrated with existing system design and analysis tools for use in later closed-loop human system modeling.
In order to validate the implementation of the model, a test case was generated based on a paper published by Bisantz A. M., et al. (2000) that involved a Navy aircraft identification task in a dynamic task environment. A secondary validation was to have the actual implementation, mathematical assumptions, and validation results examined by a mathematician and MATLAB expert. The first validation involves running a test case of known work using the model code and another statistical software tool (SPSS) for comparison of results. Bisantz, et al. (2000) was one of the closest works in literature, using the Lens model, which claims a dynamic task environment. It is valuable to generate similar case data and run the mathematics of the model in both programs to see if the MATLAB model, a widely used statistical software package called SPSS, and the paper results all agree on the model coefficients and correlation outputs. Naval Aircraft Identification Example After mathematical validation, a test case from a known work was generated to validate the model code. The goal was to create a synthetic data set that was coded 110 according to the same logic and task scenario as the data presented in the paper and then compare results using both computer aided tools, and the results presented in the paper[13]. The authors coded the cues in their Lens model based on standardized data values of five different variables that described an aircraft approaching an aircraft carrier: range, radar, IFF, speed, and altitude. The data was coded in the following way[13]: " Radar: coded as -1,0, or 1 depending on whether the aircraft had a hostile, ambiguous, or friendly radar signature. If the radar was not turned on or not requested by the participant, it was coded as 0. " Altitude: coded as 0 if the aircraft was flying at 27,000 ft or higher (indicative of commercial or non hostile aircraft) and 1 if the aircraft was flying lower than 27,000 ft (indicative of a military or hostile aircraft) * IFF: coded as -1,0, or 1 depending on whether the IFF signature was hostile, ambiguous, or friendly. The IFF was coded as 0 if it was not requested by the participant or if the aircraft was identified further than 150 nm out from the aircraft when IFF is unavailable. If the IFF was not turned on and the aircraft was within 150 nm of the ship, it was coded as -1 as all aircraft exhibiting this behavior were hostile. " Speed: coded as 0 if the speed was more than 600 kn and 1 if the speed was less than 600 kn. Aircraft flying slower than 600 kn were generally commercial airlines and friendly, while aircraft exceeding 600 kn were military with the exception of helicopters. " Range: coded as 0 if the aircraft was within 150 miles of the ship and 1 if it was further out than 150 miles. This bit of data was only included because if an aircraft was within 150 miles and did not have the radar on then it was considered hostile. Since there are five cues present, following a general rule of thumb for multiple regression, 50 synthetic data cases were generated, 10 per cue, in order to ensure an accurate representation of the judgment and environment using regression analysis[33]. Once the data creation was complete, a multiple regression analysis was conducted in 111 SPSS in order to see what the regression coefficients would be. This was then compared to the results presented in the Bisantz et all (2000) paper and the output of the MATLAB model using the same data. 
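As an illustration of the kind of cue coding and regression used in this validation, the sketch below codes invented aircraft tracks according to the rules listed above and recovers standardized beta-weights by ordinary least squares. The generating weights are chosen only to resemble those in Table 6; this is neither the actual Bisantz et al. (2000) data nor the thesis validation code, which was run in SPSS and MATLAB.

```python
# Illustrative cue coding and beta-weight recovery for a synthetic validation set.
import numpy as np

rng = np.random.default_rng(5)
n_cases = 50   # ten synthetic cases per cue, following the cases-per-cue rule of thumb

def code_case(radar_sig, altitude_ft, iff_sig, speed_kn, range_nm):
    """Code one aircraft track into the five cues used in the validation example."""
    radar = {"hostile": -1, "ambiguous": 0, "friendly": 1, "off": 0}[radar_sig]
    altitude = 0 if altitude_ft >= 27000 else 1
    if iff_sig == "off":
        iff = -1 if range_nm <= 150 else 0     # an off IFF inside 150 nm was always hostile
    else:
        iff = {"hostile": -1, "ambiguous": 0, "friendly": 1}[iff_sig]
    speed = 0 if speed_kn > 600 else 1
    rng_cue = 0 if range_nm <= 150 else 1
    return [radar, altitude, iff, speed, rng_cue]

cases = [code_case(rng.choice(["hostile", "ambiguous", "friendly", "off"]),
                   rng.uniform(1000, 40000),
                   rng.choice(["hostile", "ambiguous", "friendly", "off"]),
                   rng.uniform(100, 900), rng.uniform(10, 400))
         for _ in range(n_cases)]
X = np.array(cases, dtype=float)

# Synthetic "judgment" built from the cues, so the regression has something to recover.
true_w = np.array([0.35, 0.45, 0.21, -0.20, 0.05])   # order: radar, altitude, IFF, speed, range
y = X @ true_w + 0.1 * rng.normal(size=n_cases)

# Standardized (beta) weights from ordinary least squares on z-scored variables.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
yz = (y - y.mean()) / y.std()
betas, *_ = np.linalg.lstsq(np.column_stack([np.ones(n_cases), Xz]), yz, rcond=None)
for name, b in zip(["radar", "altitude", "IFF", "speed", "range"], betas[1:]):
    print(f"{name:>8s}: beta = {b:+.3f}")
```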
Table 6 shows the beta-weights for the fully correct case calculated with each of the different methods, compared to the values presented in the paper:

Table 6: Beta-weights calculated for the synthetic validation case and compared to Bisantz, et al. (2000)
    Cue        SPSS      MATLAB model   Bisantz
    Altitude   0.450     0.450          0.45
    Speed      -0.201    -0.202         -0.2
    Range      0.047     0.048          0.05
    IFF        0.207     0.207          0.21
    Radar      0.353     0.353          0.35

The slight differences in outputs are attributed to rounding errors and to the fact that the data set that was created does not have the same cue values as the actual data from the documented experiment. Appendix B: Graphical User Interface Details B.1 User Interface Development The user interfaces used in experimentation were designed to follow the current display "convention," or standard industry practice, for aviation, which is in a sense the field "closest" to spacecraft design. This jump was considered acceptable because the astronauts in command of a lunar lander would also be aircraft pilots, so these types of displays would be familiar to them. Modern aircraft displays have also already been investigated in numerous human factors experiments in the literature, and spacecraft displays are often modifications of aircraft displays to begin with. The displays were designed to be as simple a representation of the critical system information as possible. Two user interfaces, seen below in Figure 29 and Figure 30, were implemented: a primary flight display and a display of system parameter trend information. Figure 29: Primary Flight Display shown to all subjects for each trial. Figure 30: Trend Information display with the same system parameters represented, shown to subjects for half of the trials. In order to truly test the additional benefit of trend information, all information was represented equally on both displays. If there was a question of which display had better resolution, the primary flight display was always favored to have the higher resolution, since it was the current conventional display and the effect of display type or display resolution was not the research question. Green was selected as the primary color on the display in order to provide good visible contrast against the black and grey background, and also to avoid colors such as red that are traditionally used to indicate warnings and problems. The font used on all of the displays is Verdana 9 pt, selected based on its recommendation in the Federal Aviation Administration's Human Factors Standard as the most "readable" font for digital displays. B.2 Information Displayed The information presented on the displays is dynamic and changes throughout the scenarios according to data generated from a high fidelity lunar landing simulation, which was modified in a simplified linear fashion to reflect specific failures. Six system state parameters are represented on both the primary flight display and the trend information display: Altitude, Range, Pitch, Roll, Yaw, and Fuel Quantity. The purpose of the experiment is to investigate the use of dynamic trend information in human decision making strategies. It is important here to make a distinction between implicitly and explicitly displayed trend information.
Implicitly displayed information is not directly given to the operator as a numeric value; instead it is inferred by watching the instantaneous values of a dynamic system parameter vary over time. This information is said to be "achieved" since it is mentally derived from the value of a separate, continually changing parameter. This achieved information must be held in the operator's working memory in order to build up the trend of what has happened in the system in the past. Explicitly displayed information is information that is given to the operator visually: the trend behavior is visually available to the pilot rather than inferred by watching the altimeter update its value continuously. Consider altitude as an example. The altimeter provides explicit information about the current value of the spacecraft's altitude. It also provides implicit rate information, because as the spacecraft descends the pilot can watch the values on the altimeter change and infer whether the vehicle is descending slowly, descending rapidly, or holding a steady altitude. Exactly how fast it is descending would require a more exacting mathematical calculation, but implicit rate information may still be useful to the pilot's understanding of the current state of the world. Each of these pieces of information must be stored in the pilot's short term memory, in the order the observations were made, for the pilot to be able to recall that the vehicle was approximately 200 meters higher approximately 4 seconds ago. If the operator is provided with explicit trend information, this is akin to having a profile view of the spacecraft's current descent profile. In the context of this experiment, one display contains explicit trend information and one does not; implicit trend information is always available to the operator. B.3 Variable Definitions Altitude: Altitude above the lunar surface is the altitude relative to the surface in the positive Z-direction (surface = 0 meters), determined by a radar altimeter. The unit of this measurement is meters. Altitude ranges from maximum values of approximately 1100 m for shallow trajectories to approximately 1700 m for steep trajectories. Range: Range is the horizontal, or ground, distance from the spacecraft's current location to the designated landing point in the positive X-direction. The designated landing point is the reference point at 0 meters. The unit of this measurement is meters. Shallower trajectories have longer ranges of 3800 m, while steeper trajectories have much shorter ranges of 1600 m. Pitch: Pitch is the measurement of backward-forward rotation around the Y-axis of the spacecraft. It is measured in degrees, with positive (increasing) pitch indicated to the pilot as backwards motion in the negative X-direction. During the spacecraft's pitch-over maneuver the spacecraft goes from approximately 90 degrees of pitch, on its back, to approximately 0 degrees of pitch in preparation for the terminal descent to the surface. Roll: Roll is the measurement of rocking motion about the X-axis of the spacecraft. It is measured in degrees, with positive (increasing) roll indicated to the pilot as rocking to the right, in the positive Y-direction. In general roll should remain approximately 0 and less than 2 degrees, with the exception of one large roll towards the end of the trajectory in order to give the crew a visual view of the designated landing point. Yaw: This is essentially a "heading" measurement for the spacecraft.
Yaw is the measurement of left and right motion around the Z-axis. It is measured in degrees with positive (increasing) yaw being indicated to the pilot as a swing to the right in the positive Y-direction. In general roll should remain approximately 0 and less than 2 degrees with the exception of a large yaw motion coupled with the large roll in order to view the designated landing site. Fuel: This is indicated in the cockpit as fuel remaining and is a measure of the weight amount of fuel remaining in the RCS fuel tanks in kilograms. The lander begins its descent with a total of 460 kg of RCS fuel and requires at least half of this for the return journey. Fuel should decrease in small and possibly irregular steps depending on when the spacecraft guidance computer requires burns from different sets of thrusters. Each thruster uses approximately a quarter kg of fuel for a normal firing to correct minor attitude errors. 118 Appendix C: Experimental Consent Forms and Training Materials CONSENT TO PARTICIPATE IN NON-BIOMEDICAL RESEARCH The Effect of System Parameter Time History on Human Judgment Performance You are asked to participate in a research study conducted by Professor John Hansman, Ph.D, and Rachel Owen from the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology (M.I.T.) The results of this study will be included in Rachel Owen's Master's Thesis. You were selected as a possible participant in this study because of your access to Draper Laboratory and your level of experience in the experimental domains. You should read the information below, and ask questions about anything you do not understand, before deciding whether or not to participate. . PARTICIPATION AND WITHDRAWAL Your participation in this study is completely voluntary and you are free to choose whether to be in it or not. If you choose to be in this study, you may subsequently withdraw from it at any time without penalty or consequences of any kind. The investigator may withdraw you from this research if circumstances arise which warrant doing so. . PURPOSE OF THE STUDY This study is designed to investigate how humans use explicitly displayed time history information about system parameters in order to help them detect and identify a failure in a system. It is critical that human operators be able to quickly and accurately detect system failures in complex systems for safety and reliability. This research intends to explore human judgment strategies and the effect of specific information on the accuracy, timeliness and consistency of those judgment strategies. . PROCEDURES If you volunteer to participate in this study, we would ask you to do the following things: You will start by completing an informed consent form and a background questionnaire that gathers basic demographic information. Next, you will complete a computer-based 119 PowerPoint tutorial that outlines the experimental tasks, and explains the interfaces you will be using. This tutorial should take approximately 15-20 minutes. Next you will complete a short interactive practice session in the using the computer software to ensure you are familiar with the protocol and the displays. The practice session will be an active training in which you will be asked to detect a system failure with the experimenter present so you can ask any questions they have regarding the interfaces and relationships among the information presented. 
For the experimental trial, you will be asked to monitor an automated landing approach in a lunar lander simulator. Half of the trials will contain time history information and half will not. You should use all the information available to you to make a judgment about whether or not the system is operating correctly. You will be asked to press a specific button on the interface to indicate when you detect any kind of system failure. You will then do your best to identify the failure out of four possible failures you learned during training. These two tasks must be completed as quickly as possible. You will then complete 60 full experimental task scenarios. An experimental scenario lasts approximately 1 minute. There will be a 5-10 minute break after the first 30 scenarios. You may take additional breaks as you feel necessary. After the last set of scenarios, you will take part in a post-experiment interview in order to gather feedback about the experiment, your judgment strategies, and the task scenarios which will last approximately 15 minutes. This study will last approximately 2 hours depending on how quickly you choose to go through the scenarios. . POTENTIAL RISKS AND DISCOMFORTS There are no anticipated physical or psychological risks in this study. There is the potential discomfort of dry eyes due to your being asked to monitor a computer screen for an extended period of time. You may take breaks whenever you feel it necessary. . POTENTIAL BENEFITS While there is no immediate foreseeable benefit to you as a participant in the study, your efforts will help provide insight into human judgment strategies, and how information can be used by operators and monitors of complex systems in detecting system failures. . 120 PAYMENT FOR PARTICIPATION No financial payment will be given for your participation. You will not receive a charge number but you will receive two free AMC movie tickets. . CONFIDENTIALITY Any information that is obtained in connection with this study and that can be identified with you will remain confidential and will be disclosed only with your permission or as required by law. You will be assigned a subject number which will be used on all related documentation to include databases, summaries, and results. Only one separate master list of subject names and numbers will exist, password protected on the internal server at Draper Laboratory accessible only to members of the research team. There is a post-experiment interview which will require a separate consent form. Please read and sign the attached document entitled "Consent to Participate in an Interview". . IDENTIFICATION OF INVESTIGATORS If you have any questions or concerns about the research, please feel free to contact the student investigator, Rachel Owen, by telephone at (617) 258-4494 or via email at rlowen@draper.com . Her MIT faculty sponsor is Professor John Hansman who may be contacted at (617) 253-2271, email, rjhans@mit.edu, and his address is 77 Massachusetts Avenue, Room 33-303, Cambridge, MA 02139 . EMERGENCY CARE AND COMPENSATION FOR INJURY If you feel you have suffered an injury, which may include emotional trauma, as a result of participating in this study, please contact the person in charge of the study as soon as possible. In the event you suffer such an injury, M.I.T. may provide itself, or arrange for the provision of, emergency transport or medical treatment, including emergency treatment and follow-up care, as needed, or reimbursement for such medical services. M.I.T. 
does not provide any other form of compensation for injury. In any case, neither the offer to provide medical assistance, nor the actual provision of medical services shall be considered an admission of fault or acceptance of liability. Questions regarding this policy may be directed to MIT's Insurance Office, (617) 253-2823. Your insurance carrier may be billed for the cost of emergency transport or medical treatment, if such services are determined not to be directly related to your participation in this study. 121 . RIGHTS OF RESEARCH SUBJECTS You are not waiving any legal claims, rights or remedies because of your participation in this research study. If you feel you have been treated unfairly, or you have questions regarding your rights as a research subject, you may contact the Chairman of the Committee on the Use of Humans as Experimental Subjects, M.I.T., Room E25-143B, 77 Massachusetts Ave, Cambridge, MA 02139, phone 1-617-253 6787. SIGNATURE OF RESEARCH SUBJECT OR LEGAL REPRESENTATIVE I understand the procedures described above. My questions have been answered to my satisfaction, and I agree to participate in this study. I have been given a copy of this form. Name of Subject Name of Legal Representative (if applicable) Date Signature of Subject or Legal Representative SIGNATURE OF INVESTIGATOR In my judgment the subject is voluntarily and knowingly giving informed consent and possesses the legal capacity to give informed consent to participate in this research study. Signature of Investigator 122 Date CONSENT TO PARTICIPATE IN INTERVIEW The Effect of Rate Information Display on Human Judgment Performance You have been asked to participate in a research study conducted by Rachel Owen from the Department of Aeronautics and Astronautics at the Massachusetts Institute of Technology (M.I.T.). The purpose of the study is to understand the effect of rate of change information on the humans ability to make accurate and timely decisions. The results of this study will be included in Rachel Owen's Master's thesis. You were selected as a possible participant in this study because of your access to Draper Laboratory and your level of experience in the experimental domains. You should read the information below, and ask questions about anything you do not understand, before deciding whether or not to participate. * This interview is voluntary. You have the right not to answer any question, and to stop the interview at any time or for any reason. We expect that the interview will take about 15 minutes. " You will not be compensated financially for this interview. " Unless you give us permission to use your name, title, and / or quote you in any publications that may result from this research, the information you tell us will be confidential. We would like to record this interview on audio cassette so that we can use it for reference while proceeding with this study. We will not record this interview without your permission. If you do grant permission for this conversation to be recorded on cassette, you have the right to revoke recording permission and/or end the interview at any time. e This project will be completed by 30 June 2010. All interview recordings will be stored in a secure work space until 1 year after that date. The tapes will then be destroyed. I understand the procedures described above. My questions have been answered to my satisfaction, and I agree to participate in this study. I have been given a copy of this form. 
(Please check all that apply) [ ] I give permission for this interview to be recorded on audio cassette. 123 [ ] I give permission for the following information to be included in publications resulting from this study: [ ] my name [ ] my title [ ] direct quotes from this interview Name of Subject Signature of Subject Signature of Investigator Date Date _ Please contact Rachel Owen at rlowen@draper.com or (617) 258-4944 with any questions or concerns. If you feel you have been treated unfairly, or you have questions regarding your rights as a research subject, you may contact the Chairman of the Committee on the Use of Humans as Experimental Subjects, M.I.T., Room E25-143b, 77 Massachusetts Ave, Cambridge, MA 02139, phone 1-617-253-6787 124 Interview Materials: The purpose of this interview is to thoroughly understand the decision making process and information used to make that decision. We will randomly choose one scenario from each of the experimental conditions. All questions posed by the experimenter will be in quotations and all other actions will be italicized. After the subject responds, follow on questions may be needed to clarify a subject's response; these questions cannot be predicted, but their content w2ill only be about the specific decision making process or display content. All experimenter and subject responses will be recorded. Start the audio recording. Experimenterwill state subject ID for coding responses. "For this part of the experiment, we're going to select a subset of scenario from the experiment and discuss your response and how you made your decision. We're also going to discuss your response to the experiment and any external factors you thought influenced your behavior. Before we begin it would be appropriate to let you know that these responses are confidential and it would help us greatly if you would be as honest as possible in order to help us understand and improve our judgment models." Show subjects the scenariocard and bring up the interface and simulationfor that scenario. Walk me through a scenario and what you were looking at, and thinking about as you made your decisions. (24, 19,38) - Can you tell me why you made the decisions you made (remind them what their decision was, failure, no failure and where the failure occurred)? On a scale from 1-10, where 1=no clue and 10=absolutely certain, how confident - were you in your detection of a failure? Why? On a scale from 1-10, where 1=no clue and 10=absolutely certain, how confident were you in your identification of that failure? Why? Which of the displayed information did you find most useful in making your - detection decision? Diagnosis? It seems like your general decision strategy was: - (in non technical terms the experimenter will summarize the decision strategy that the subject has - expressed). Would you agree? What difficulties did you face in the decision making process? Was there any other information you would have found particularly helpful in aiding your decision? 125 The following questions will be asked of each participantafter the scenarios are discussed. - How did you feel in terms of your overall accuracy, were you confident in your decisions? Did you feel you had enough information to make the decision? - Did you find the task to get easier over time? (learning) - Did you find that you preferred one type of information over another for your decision making? - How did you cope with failure cases that looked very similar in both display cases? 
- At what point during the experiment did you feel comfortable with what was considered "normal"? - Did you feel like there were any limitations that prevented you from performing your best? Reminder: Please no discussion of any classified material in your responses. Training Materials: Failure Detection and Identification in the Lunar Lander Experimental Domain - Domain: Lunar Lander - The experiment is investigating failure detection and identification while monitoring a lunar landing under automatic guidance and control technology. Your Task - You are an astronaut aboard a NASA mission to the lunar surface. The spacecraft lands autonomously using the autopilot and guidance computer. Your task is to monitor a portion of the descent trajectory, alert the rest of the crew, and initiate emergency procedures in the event that there is a system failure at this point in the mission. Task Details - You will be monitoring a one-minute segment of the spacecraft's approach phase as it nears its final descent to the lunar surface. - The lunar lander can approach the surface in any one of 3 ways: a steep trajectory, a nominal trajectory, or a shallow trajectory. - You will see all three trajectory types during the experiment. Spacecraft Approaches [Diagram of the steep, normal, and shallow approaches from Powered Descent Initiation (PDI) through the braking phase, approach phase, and terminal descent phase, indicating the braking and pitch-over maneuvers and the monitored portion of the trajectory.] Trajectory Details - The pitch-over maneuver indicated on the diagram is a steady pitch transition from a pitch of approximately 60 degrees to an upright orientation, indicated by a pitch of approximately 0 degrees. This is the orientation the spacecraft needs to descend to the surface. - There should be very little yaw and roll motion throughout the trajectory other than normal disturbances due to the environment. Disturbances greater than 2 degrees may indicate a failed system condition. Possible System Failures - Failures may occur in the spacecraft's fuel and reaction control system (RCS) within the last critical minute of the approach phase. It is this system, and its effects on the spacecraft, that you will be interested in observing. - The reaction control system is a set of thrusters positioned around the spacecraft that produce small impulses to control its orientation, or attitude, in space. It is fueled using a liquid methane and nitrogen tetroxide mixture. - The RCS system has 460 kg of total fuel. Only 230 kg of this is allotted for the descent portion of the mission. Possible RCS Failure Cases - Two failures will be associated with the RCS fuel system, and two with the actual thrusters themselves. - Fuel System Failures: Fuel Leak, Fuel Pump Failure. - Thruster System Failures: Thruster Failure (Stuck On), Valve Failure. - Now to look at the RCS system configuration.... Physical Thruster Configuration - The spacecraft has 4 clusters of 4 RCS thrusters positioned in a square configuration. - Thrusters are fixed and cannot tilt individually. - Failures can affect a single thruster or an entire cluster. [Diagram of the RCS thruster cluster locations.] RCS Propulsion Schematic - This system is mission critical for both the spacecraft's descent to the surface and the return to lunar orbit.
Key System Components
- Thrusters, in clusters of 4, provide propulsion for attitude adjustment.
- Fuel lines transport hydrazine to the thrusters.
- Fuel tanks store the RCS fuel.

Typical System Behavior
- Thrusters are aligned around the spacecraft to ensure that proper angles and attitude can be maintained. They typically fire on and off infrequently or at long intervals to correct spacecraft drift.
- The RCS fuel system typically sees fuel decrease in very small, short bursts as thrusters are fired for fractions of a second to make small attitude adjustments. Fuel levels usually decrease in small step increments with wide intervals where no burns occur.
- Shallow trajectories come in closer to the lunar surface and over a longer range, and hence require significantly more fuel burn initially than do steep trajectories. This does NOT necessarily constitute a failure.
- Attitude for this portion of the trajectory should be largely centered around zero for roll and yaw.
- Altitude and range should be decreasing as the spacecraft descends and moves towards its designated landing point.
- Shallow trajectories burn a large amount of fuel at the beginning and then stabilize; steeper trajectories have a larger fuel burn towards the end of the scenario. These are not fuel leaks.

Possible RCS Failure Cases
- Two failures will be associated with the RCS fuel system, and two with the actual thrusters themselves.
  - Fuel System Failures: Fuel Leak, Fuel Pump Failure
  - Thruster System Failures: Thruster Failure (Stuck On), Valve Failure

Fuel Leak
- Problem: A rupture has occurred somewhere in a fuel line or valve that is allowing fuel to leak out of the system.
- Symptoms:
  - Thrusters and fuel pumps are functioning normally; stable RCS output. As a result, spacecraft attitude remains stable and on course.
  - Fuel will decrease at a consistent rate which is faster than normal. Even if thrusters are not firing, the fuel level will still be decreasing.

Fuel Pump Failure
- Problem: A fuel pump on one cluster has seized and stopped working.
- Symptoms:
  - An entire cluster of thrusters will no longer be firing because no fuel is being provided to them.
  - Attitude will be noticeably affected because one full cluster of thrusters is non-functional. There will be changes in roll and yaw as the other thrusters overcompensate and correct for the failed cluster.
  - Fuel will actually decrease in a mostly normal pattern, with a few additional burns commanded by the guidance computer in order to correct the attitude deviations.

Single Thruster Failure
- Problem: One thruster in a cluster of 4 is stuck on; it is continuously firing.
- Symptoms:
  - Fuel will continue to decrease linearly at a faster than normal rate, both because the thruster is continually firing and because it must be compensated for by other thrusters.
  - Spacecraft attitude will show a drift in the direction of the thrust in either roll, or yaw, or both, until control counteracts the failure.

Valve Failure
- Problem: A valve has created a series of random pressure pockets that causes a thruster to provide intermittent (on/off) thrust instead of consistent thrust.
- Symptoms:
  - Attitude will slowly begin to show a subtle rocking motion, or oscillation, in whatever axes the failed thruster fires on and off.
    Because this behavior is random and unpredictable, the control algorithm will always be a step behind the failed thruster, causing the magnitude of the oscillation to increase.
  - Fuel will decrease faster than normal as the attitude oscillation becomes bigger and more thrusters, operating for longer periods, are necessary to correct the attitude.

Display: Primary Flight Screen

Key Display Components

Display Components: Avionics
- Time (s): Time remaining in the scenario is displayed by the scenario clock in the upper left corner of the display. Scenarios are 60 seconds long.
- Altitude (m): The current altitude, or vertical height above the lunar surface, is represented by the green values on the tape display as the tape moves accordingly behind the digital display.
- Range (m): The range display is set up to function exactly like the altimeter, but it provides information about the horizontal distance to the target landing site.
- Fuel (kg): The fuel gauge is a bar graph of fuel. The value of fuel remaining is displayed at the top of the gauge in the header. It is an indication of the total amount of RCS fuel remaining for the rest of the descent trajectory and also the ascent to lunar orbit at mission completion; 230 kg is necessary for the return journey.

Display Components: Attitude Indicators
- Attitude is represented in 3 dimensions: Pitch, Roll, and Yaw, all of which are measured in degrees.
- Pitch: Motion about the Y axis, tilting forwards or backwards; positive pitch is a backwards or "nose up" pitch.
- Roll: Motion about the X axis, tilting side to side; positive rotation is to the right.
- Yaw: Motion about the Z axis, swinging back and forth; positive rotation is to the right.

Display Components: Failure Indicators
- Failure Alarm: The failure alarm is the grey button labeled "Failure", and exists to alert the crew in the event of some kind of system failure. You are responsible for pushing this button whenever you detect a system failure.
- Emergency Procedure Selection Panel: Represented as a set of grey buttons in the lower right corner labeled with all possible failures. Detecting a failure is not sufficient for mission success. Failures must be identified so they can be solved. The crew is trained on emergency procedures for each of the given failure conditions. You are responsible for identifying the failure condition so that the correct emergency procedure can be implemented.

Display: Time History of Parameters
[Screenshot: the time history display, which provides additional information about spacecraft states over time alongside the avionics display.]

Key History Components: Time History of Spacecraft States

Display Components: Time History
- Time (s): The time shown on the horizontal axis is the full scenario time, 60 seconds. The current time is indicated by the last point drawn on the chart. Time remaining in the scenario is shown by the scenario clock.
- Parameter Information: The units for these parameters will be the same as they are on the original avionics display. The curve will be generated from the values on the avionics display as the scenario time moves forward, such that parameter behavior will be shown over time.
How Each Scenario Will Take Place
[Walkthrough slides, stepped through with the subject: Scenario Start, Failure Detection, Failure Identification, End of Scenario.]

Important Things to Remember
- Time history information will only be given to you on half of the experimental trials; these trials will all be grouped together and may occur either first or second.
- Failures can occur at any time during the scenario, so remain attentive and ensure you have enough evidence for your decision.
- Just as failures do not occur on every mission in real life, failures WILL NOT occur in every scenario; you must discern this each time from the system performance.
- Only one failure will occur in each scenario. You only need to click the failure alarm and proper emergency procedure ONCE. Emergency procedures simply identify the problem to the crew; once you select an appropriate emergency procedure, the simulation will terminate and you will move on to the next scenario.

QUESTIONS?

Appendix D: Demographic Survey and Results

Demographic Questionnaire

Subject number: ______    Age: ______    Gender: M / F
Are you color blind?  No / Yes
Occupation: ______________
If student (circle one): Undergrad / Masters / PhD    Department: ________    Expected year of graduation: ________
Military experience (circle one): No / Yes    If yes, which branch: ________    Years of service: ________
Do you have any flight experience?  No / Yes    If yes, how many hours? ________
Do you have any experience in flight simulators?  No / Yes    If yes, please describe briefly: ________
Do you have experience in engineering or operating spacecraft systems?  No / Yes    If yes, please describe briefly: ________
Level of education; please list your degrees and subjects: ________
Rate your comfort level with using computer programs:  Not comfortable / Somewhat comfortable / Comfortable / Very Comfortable

Subject Demographics:
Total Number of Subjects: 28 (Male: 19, Female: 9)
No Color Blindness
Age Range: 23-57; Mean Age: 38
Military: 1; Students: 4; Pilots: 2; Flight Simulators: 5; Spacecraft Experience: 10
Education: Bachelors: 28; Masters: 17; PhD: 5
Comfort with Computers: No participants circled less than "Comfortable"

The effects on performance (latency and accuracy) were investigated for the following demographics: gender, students, pilots/flight simulator users, and spacecraft experience. There were no significant effects of these demographic trends on performance.

Appendix E: Detailed Analysis Results Tables and Plots

The statistical tables and results were computed in the SPSS 15.0 statistical software package. For each data type and each decision there are detailed descriptive statistics, assumption checks, and full ANOVA results tables.

E.1 Detection Latency

Descriptive statistics for both the original latency and the cube root transform (Valid N, listwise: 1109):

                 N      Min    Max     Mean      Std. Dev.   Variance   Skewness (SE)   Kurtosis (SE)
  Latency        1109   .00    50.73   11.9204   7.94810     63.172     1.283 (.073)    2.071 (.147)
  CubeLatency    1109   .00    3.70    2.1692    .51578      .266       -.118 (.073)    .357 (.147)

Detection latency lacked both homogeneity of variance and normality when the assumptions were verified; as a result, a cube root transform was applied in order to meet these ANOVA assumptions.
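The effect of the cube root transform can be checked quickly outside of SPSS. The sketch below is illustrative only (the thesis analysis was run in SPSS 15.0) and uses a synthetic, right-skewed latency sample as a stand-in for the observed detection latencies.

    import numpy as np
    from scipy import stats

    # Synthetic stand-in for the 1109 observed detection latencies (right-skewed)
    rng = np.random.default_rng(0)
    latency = rng.gamma(shape=2.0, scale=6.0, size=1109)

    # Cube root transform, as applied to the detection latency data above
    cube_latency = np.cbrt(latency)

    print("skewness before:", stats.skew(latency))
    print("skewness after: ", stats.skew(cube_latency))
    # Homogeneity of variance across groups (e.g., failure types) can then be
    # re-checked on the transformed scale, for example with Levene's test.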
Results of a univariate ANOVA with five fixed factors: trend information (Trend), failure type (FailureType), approach type (ApproachType), the order in which the subjects saw scenarios (Order), and whether they saw the trend information first or second (Block).

Tests of Between-Subjects Effects (Dependent Variable: CubeLatency)

  Source                                               Type III SS   df     Mean Square   F           Sig.
  Corrected Model                                      101.918(a)    95     1.073         5.636       .000
  Intercept                                            4726.618      1      4726.618      24829.628   .000
  Trend                                                8.797         1      8.797         46.211      .000
  FailureType                                          18.774        3      6.258         32.874      .000
  ApproachType                                         10.058        2      5.029         26.419      .000
  Order                                                .046          1      .046          .241        .624
  Block                                                .002          1      .002          .009        .924
  Trend * FailureType                                  8.300         3      2.767         14.534      .000
  Trend * ApproachType                                 6.056         2      3.028         15.905      .000
  FailureType * ApproachType                           19.082        6      3.180         16.707      .000
  Trend * FailureType * ApproachType                   5.262         6      .877          4.607       .000
  Trend * Order                                        .002          1      .002          .010        .919
  FailureType * Order                                  .376          3      .125          .659        .577
  Trend * FailureType * Order                          .951          3      .317          1.666       .173
  ApproachType * Order                                 .445          2      .223          1.169       .311
  Trend * ApproachType * Order                         2.463         2      1.232         6.469       .002
  FailureType * ApproachType * Order                   1.063         6      .177          .931        .472
  Trend * FailureType * ApproachType * Order           2.547         6      .424          2.230       .038
  Trend * Block                                        1.638         1      1.638         8.604       .003
  FailureType * Block                                  2.057         3      .686          3.602       .013
  Trend * FailureType * Block                          .390          3      .130          .683        .563
  ApproachType * Block                                 .276          2      .138          .724        .485
  Trend * ApproachType * Block                         .237          2      .119          .623        .537
  FailureType * ApproachType * Block                   .920          6      .153          .806        .566
  Trend * FailureType * ApproachType * Block           .451          6      .075          .395        .883
  Order * Block                                        .163          1      .163          .857        .355
  Trend * Order * Block                                .042          1      .042          .223        .637
  FailureType * Order * Block                          .544          3      .181          .953        .414
  Trend * FailureType * Order * Block                  .247          3      .082          .432        .730
  ApproachType * Order * Block                         .073          2      .036          .192        .826
  Trend * ApproachType * Order * Block                 .159          2      .080          .419        .658
  FailureType * ApproachType * Order * Block           .188          6      .031          .165        .986
  Trend * FailureType * ApproachType * Order * Block   .795          6      .132          .696        .653
  Error                                                192.837       1013   .190
  Total                                                5513.059      1109
  Corrected Total                                      294.755       1108

  a. R Squared = .346 (Adjusted R Squared = .284)

Marginal means plots that show the mean cube root latency for each combination of failure type, approach type, and trend information (history) are useful in examining significant effects. For example, there is a distinct difference for all failure types for trend information on steep trajectories. This effect is not as clear for shallow and nominal trajectories.

[Plots: estimated marginal means of cube root detection latency by failure type (Leak, Pump, Thruster, Valve) and trend information (history present vs. not present), one panel for each approach type: steep, nominal, shallow.]

E.2 Detection Accuracy

Descriptive statistics:

  Accuracy with Trend
                        Frequency   Percent   Valid Percent   Cumulative Percent
  Valid     Incorrect   160         9.5       19.3            19.3
            Correct     670         39.7      80.7            100.0
            Total       830         49.2      100.0
  Missing   System      857         50.8
  Total                 1687        100.0

  Accuracy without Trend
                        Frequency   Percent   Valid Percent   Cumulative Percent
  Valid     Incorrect   172         10.2      20.2            20.2
            Correct     678         40.2      79.8            100.0
            Total       850         50.4      100.0
  Missing   System      837         49.6
  Total                 1687        100.0

Forward likelihood ratio binary logistic regression:

  Variables in the Equation
                           B        S.E.    Wald      df   Sig.   Exp(B)
  Step 1(a)   Valve        1.508    .234    41.343    1    .000   4.516
              Constant     1.200    .065    344.457   1    .000   3.322
  Step 2(b)   Thruster     1.502    .213    49.918    1    .000   4.489
              Valve        1.772    .236    56.392    1    .000   5.884
              Constant     .936     .070    178.640   1    .000   2.549
  Step 3(c)   Thruster     1.570    .214    53.655    1    .000   4.804
              Valve        1.799    .237    57.736    1    .000   6.043
              Shallow      -.486    .133    13.414    1    .000   .615
              Constant     1.095    .084    169.027   1    .000   2.989
  Step 4(d)   Thruster     1.576    .215    53.915    1    .000   4.835
              Valve        1.806    .237    58.012    1    .000   6.083
              Shallow      -.489    .133    13.481    1    .000   .613
              Order        .343     .128    7.135     1    .008   1.409
              Constant     .931     .103    81.821    1    .000   2.536

  a. Variable(s) entered on step 1: Valve.
  b. Variable(s) entered on step 2: Thruster.
  c. Variable(s) entered on step 3: Shallow.
  d. Variable(s) entered on step 4: Order.
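For readers reproducing this kind of analysis outside SPSS, the sketch below shows one way to run forward likelihood-ratio selection for a binary logistic regression. It is illustrative only: the data are synthetic, the predictor names simply mirror the dummy variables in the table above, and the 0.05 entry criterion is an assumption rather than the setting used in the thesis analysis.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy import stats

    # Synthetic stand-in data with dummy predictors named after those in the table above
    rng = np.random.default_rng(1)
    n = 800
    data = pd.DataFrame({name: rng.integers(0, 2, n)
                         for name in ["Valve", "Thruster", "Shallow", "Order"]})
    logit_p = 1.2 + 1.5 * data["Valve"] + 1.5 * data["Thruster"] - 0.5 * data["Shallow"]
    data["Correct"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

    def fit_logit(predictors):
        """Fit a logistic regression of Correct on an intercept plus the given predictors."""
        X = pd.DataFrame({"const": np.ones(len(data))})
        for name in predictors:
            X[name] = data[name]
        return sm.Logit(data["Correct"], X).fit(disp=0)

    selected, remaining = [], ["Valve", "Thruster", "Shallow", "Order"]
    while remaining:
        base = fit_logit(selected)
        # Likelihood-ratio statistic for adding each candidate predictor (df = 1)
        lr = {c: 2.0 * (fit_logit(selected + [c]).llf - base.llf) for c in remaining}
        best = max(lr, key=lr.get)
        if stats.chi2.sf(lr[best], 1) > 0.05:   # entry criterion (assumed)
            break
        selected.append(best)
        remaining.remove(best)

    final = fit_logit(selected)
    print(final.params)            # B
    print(np.exp(final.params))    # Exp(B), the odds ratios reported in the SPSS table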
E.3 Diagnosis Latency

Descriptive statistics from a repeated measures ANOVA (Valid N, listwise: 28):

             N    Min     Max     Mean      Std. Dev.   Variance   Skewness (SE)   Kurtosis (SE)
  Age        28   23.00   57.00   37.8214   11.19211    125.263    .285 (.441)     -1.466 (.858)
  Gender     28   .00     1.00    .6786     .47559      .226       -.809 (.441)    -1.456 (.858)
  Trend      28   .96     8.23    3.0148    1.82849     3.343      1.284 (.441)    1.066 (.858)
  NoTrend    28   .64     8.16    2.7908    1.94215     3.772      1.565 (.441)    1.877 (.858)

A repeated measures ANOVA was conducted on the diagnosis latency data because the independence of the observations could not be supported.

Multivariate Tests(b). Because the design yields an exact statistic, Pillai's Trace, Wilks' Lambda, Hotelling's Trace, and Roy's Largest Root produce the same F for each effect:

  Effect                       Pillai's   Wilks'   Hotelling's   Roy's    F          Hypothesis df   Error df   Sig.
  Approach                     .248       .752     .331          .331     4.297(a)   2.000           26.000     .024
  Failure                      .556       .444     1.252         1.252    7.512(a)   4.000           24.000     .000
  Trend                        .005       .995     .005          .005     .132(a)    1.000           27.000     .719
  Approach * Failure           .498       .502     .994          .994     2.484(a)   8.000           20.000     .047
  Approach * Trend             .004       .996     .004          .004     .046(a)    2.000           26.000     .955
  Failure * Trend              .318       .682     .466          .466     2.796(a)   4.000           24.000     .049
  Approach * Failure * Trend   .400       .600     .666          .666     1.666(a)   8.000           20.000     .169

  a. Exact statistic
  b. Design: Intercept. Within subjects design: Approach + Failure + History + Approach*Failure + Approach*History + Failure*History + Approach*Failure*History

Marginal means plots indicate the effects on diagnosis latency of combinations of trend information (history), failure type, and approach type. There is no clearly different effect of trend information for any one trajectory, though specific failure types show differences; for a leak, for example, the presence of history seemed to make subjects consistently slower.

[Plots: estimated marginal means of diagnosis latency by failure type (Normal, Leak, Pump, Thruster, Valve) and trend information (history), one panel for each approach type: steep, nominal, shallow.]
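As an illustration of the within-subjects analysis described above, the sketch below runs a repeated measures ANOVA in Python on synthetic data with the same factor structure (28 subjects, Approach x Failure x History). It is only a stand-in for the SPSS analysis: statsmodels' AnovaRM reports univariate F tests rather than the multivariate statistics shown in the table.

    import itertools
    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Synthetic, fully crossed within-subjects data: one mean latency per subject per cell
    rng = np.random.default_rng(2)
    rows = []
    for subj, approach, failure, history in itertools.product(
            range(28),
            ["Steep", "Nominal", "Shallow"],
            ["Normal", "Leak", "Pump", "Thruster", "Valve"],
            ["Present", "NotPresent"]):
        rows.append({"subject": subj, "Approach": approach, "Failure": failure,
                     "History": history, "latency": rng.normal(3.0, 1.5)})
    data = pd.DataFrame(rows)

    # Repeated measures ANOVA with three within-subject factors
    result = AnovaRM(data, depvar="latency", subject="subject",
                     within=["Approach", "Failure", "History"]).fit()
    print(result)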
E.4 Diagnosis Accuracy

Descriptive statistics:

  Accuracy with Trend
                        Frequency   Percent   Valid Percent   Cumulative Percent
  Valid     Incorrect   277         16.4      33.4            33.4
            Correct     553         32.8      66.6            100.0
            Total       830         49.2      100.0
  Missing   System      857         50.8
  Total                 1687        100.0

  Accuracy without Trend
                        Frequency   Percent   Valid Percent   Cumulative Percent
  Valid     Incorrect   366         21.7      43.1            43.1
            Correct     484         28.7      56.9            100.0
            Total       850         50.4      100.0
  Missing   System      837         49.6
  Total                 1687        100.0

Forward likelihood ratio binary logistic regression:

  Variables in the Equation
                           B        S.E.    Wald      df   Sig.   Exp(B)
  Step 1(a)   Thruster     -1.049   .125    70.364    1    .000   .350
              Constant     .700     .058    145.956   1    .000   2.013
  Step 2(b)   Pump         -1.208   .130    86.669    1    .000   .299
              Thruster     -1.273   .130    95.883    1    .000   .280
              Constant     1.009    .070    210.435   1    .000   2.743
  Step 3(c)   Pump         -1.962   .191    105.259   1    .000   .141
              Thruster     -1.230   .132    87.185    1    .000   .292
              HistPump     1.430    .241    35.339    1    .000   4.181
              Constant     .997     .070    205.055   1    .000   2.711
  Step 4(d)   Pump         -1.840   .192    91.759    1    .000   .159
              Thruster     -1.101   .134    67.932    1    .000   .333
              HistPump     1.442    .240    35.936    1    .000   4.227
              HistValve    1.068    .245    19.017    1    .000   2.909
              Constant     .864     .074    137.406   1    .000   2.374

  a. Variable(s) entered on step 1: Thruster.
  b. Variable(s) entered on step 2: Pump.
  c. Variable(s) entered on step 3: HistPump.
  d. Variable(s) entered on step 4: HistValve.

E.5 Detection Achievement and Judgment Consistency and False Alarms

Descriptive statistics from a repeated measures ANOVA (Valid N, listwise: 56):

                N    Min    Max   Mean    Std. Dev.
  Achievement   56   -.19   .96   .6160   .23043
  Consistency   56   .39    .91   .7178   .11250

Marginal means plots of the Lens model parameters achievement, decision consistency, and skill score indicate higher values with trend information present, but not significantly so.

[Plots: estimated marginal means of achievement, consistency, and skill score with history present vs. not present.]

False Alarms: False alarms were examined using a one-sample t-test because they are continuous data and met all the statistical assumptions. The results showed marginal significance, with the mean number of false alarms being only 1.4 out of 60 trials.

  One-Sample Statistics
               N    Mean     Std. Deviation   Std. Error Mean
  Difference   28   1.3929   4.08556          .77210

  One-Sample Test (Test Value = 0)
               t       df   Sig. (2-tailed)   Mean Difference   95% CI of the Difference (Lower, Upper)
  Difference   1.804   27   .082              1.39286           -.1914, 2.9771

E.6 Diagnosis Achievement and Judgment Consistency

Descriptive statistics from a repeated measures ANOVA (Valid N, listwise: 56):

                N    Min   Max   Mean    Std. Dev.
  Achievement   56   .12   .97   .6845   .22612
  Consistency   56   .26   .79   .5822   .13371

Marginal means plots show the direction of trends for achievement, decision consistency, and skill score. All of these seemed to increase, though not significantly, when trend information was present.

[Plots: estimated marginal means of achievement, consistency, and skill score with history present vs. not present.]
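For context, the sketch below shows how achievement and judgment consistency are typically computed in a regression-based Lens model analysis like the one used here: achievement as the correlation between the operator's judgments and the environmental criterion, and consistency as the multiple correlation of the judgments with the displayed cues. The exact definitions and cue sets follow the Lens model formulation in the earlier chapters; the data and cue names below are hypothetical.

    import numpy as np

    def lens_metrics(cues, judgments, criterion):
        """cues: (n_trials, n_cues); judgments, criterion: (n_trials,) arrays."""
        # Achievement: correlation between the operator's judgments and the criterion
        achievement = np.corrcoef(judgments, criterion)[0, 1]
        # Consistency (R_s): multiple correlation from regressing judgments on the cues
        X = np.column_stack([np.ones(len(cues)), cues])
        beta, *_ = np.linalg.lstsq(X, judgments, rcond=None)
        predicted = X @ beta
        consistency = np.corrcoef(predicted, judgments)[0, 1]
        return achievement, consistency

    # Hypothetical cue data (e.g., fuel rate, roll, yaw, range) for 60 scenarios
    rng = np.random.default_rng(3)
    cues = rng.normal(size=(60, 4))
    criterion = (cues @ np.array([0.8, 0.5, 0.5, 0.1]) > 0).astype(float)   # true system state
    judgments = 0.7 * criterion + rng.normal(0.0, 0.3, 60)                  # noisy operator judgments

    print(lens_metrics(cues, judgments, criterion))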
Appendix F: Linear Multiple Regression Overview

The primary method for determining the judgment strategy and the environmental state in the Lens model formulation is multiple linear regression. Multiple linear regression is a statistical technique that attempts to model the relationships between variables in the complex context of real world data. This mathematical model simplifies and organizes observed relationships so that investigators can better understand the actual processes at work in the data. The model relates a dependent variable, generally denoted Y, and multiple independent variables denoted X_i, where i = 1, ..., n and n represents the number of variables in the model [40]. A general two-dimensional example which includes a regression trend line is found in Figure 31 below.

[Figure 31: Example of a multiple regression plot with the regression model trend line.]

We find the formula for multiple linear regression twice in the Lens model formulation above, in the form

    Y = r_1 X_1 + r_2 X_2 + ... + r_n X_n + E.

Linear regression models can be useful for a number of applications but are largely used in prediction and forecasting for data sets and circumstances where the linear additive model is a good fit. It is also useful in determining relationships and the sensitivity of a dependent variable to a set of independent variables that can be manipulated.

In general, to get a model that is considered accurate, the rule of thumb is that five data points per independent variable are necessary, and ten is better [33]. This means that the operator must make that same decision in the same task environment, given those same cues, at least 5 times for each cue present to get a valid regression model of their decision strategy and of the environment. Other rules of thumb have been suggested. Norusis (2006) recommends 10 to 20 cases per independent variable. A third rule of thumb is N ≥ 50 + 8m, where m is the number of independent variables [51]. The bottom line appears to be that multiple regression requires a large number of cases. This is true because multiple linear regression attempts to fit a trend line, as shown in the figure above, to a set of data in multiple dimensions based on one of any number of estimation procedures. The more data the estimator is given for each independent variable, generally the more accurate the estimate of the regression coefficient will be. If the sample size is not large enough there is a significant risk that the model will be overfitted, which means it will not be generalizable to any other data sample but only applicable to the sample set that originally created it [40].

Though there are many estimation methods, only one estimation procedure is discussed below, as it is the procedure used by the mathematical tools that were used in the generation and post-processing of the data for this thesis.
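As a concrete illustration of the additive model and the cases-per-cue rules of thumb, the short sketch below fits Y = r_1 X_1 + ... + r_n X_n + E to hypothetical cue data in Python (the thesis itself used MATLAB and SPSS for this step). With four cues, 20 to 40 cases is roughly what the 5-10 rule would call for.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: 40 cases of 4 cues, i.e. roughly 10 cases per independent variable
    rng = np.random.default_rng(4)
    n_cases, n_cues = 40, 4
    X = rng.normal(size=(n_cases, n_cues))                  # cue values X_1 ... X_4
    true_weights = np.array([0.9, 0.4, -0.3, 0.0])
    y = X @ true_weights + rng.normal(0.0, 0.5, n_cases)    # dependent variable Y

    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(model.params)      # estimated intercept and regression coefficients r_i
    print(model.rsquared)    # how well the linear additive model fits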
Ordinary Least Squares: Ordinary least squares (OLS) is the simplest estimation procedure used in linear regression and thus a very common technique. It is conceptually simple and computationally straightforward. Ordinary least squares seeks to minimize the sum of the squared differences between the observed and predicted values of the dependent variable (the residuals) of a data set, and it leads to a closed-form analytical solution for the estimated values of the regression coefficients. This method of coefficient estimation produces coefficients that are considered unbiased and consistent as long as the assumptions of linear regression are satisfied. This is the method used to compute the regression coefficients in the MATLAB implementation of the Lens model discussed earlier in this thesis. It is also the estimation method used by the SPSS statistical software that was used for all of the experimental data post-processing.

This estimation procedure generates estimates for the regression coefficients. There are two types of coefficients that are usually generated in regression analysis: unstandardized and standardized. Unstandardized coefficients are the model coefficients referred to in the Lens model formulation above as r_e,i for the environment and r_s,i for the judge. The magnitude of these coefficients does not necessarily tell us how good they are at predicting the dependent variable, because the size of a coefficient depends on the units of measurement of that particular independent variable. This means that the regression coefficients cannot be compared to one another, because they all have different units of measurement.

The standardized coefficients are generally referred to as "beta" weights and are represented by β_e,i for the environment and β_s,i for the judgment strategy. The benefit of standardized coefficients is that they can be compared to one another outright. Standardized coefficients are the result of standardizing both the dependent and independent variables to have a mean of 0 and a standard deviation of 1. The value of the coefficients still depends largely on the other variables included in the model, but they can now be compared to one another within the particular model in question. These coefficients are restricted between -1 and 1 unless there are two or more correlated independent variables in the model equation [40].
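The sketch below illustrates the relationship between unstandardized and standardized coefficients on hypothetical cue data: after z-scoring the dependent and independent variables, the refit coefficients are the beta weights and can be compared directly, whereas the raw coefficients depend on each cue's units.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    # Three hypothetical cues measured on very different scales
    X = rng.normal(size=(60, 3)) * np.array([1.0, 50.0, 0.01])
    y = X @ np.array([2.0, 0.05, 120.0]) + rng.normal(0.0, 1.0, 60)

    unstandardized = sm.OLS(y, sm.add_constant(X)).fit()

    def zscore(a):
        return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

    # Standardize Y and each X_i to mean 0 and standard deviation 1, then refit
    # (no intercept is needed once everything is centered)
    standardized = sm.OLS(zscore(y), zscore(X)).fit()

    print(unstandardized.params[1:])   # unstandardized coefficients, in the units of each cue
    print(standardized.params)         # standardized beta weights, directly comparable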
Linear regression employs a number of assumptions that must be satisfied in order for this method to be used on a data set. These assumptions are:

- Independence of error terms: The independent variables, and hence their residuals, are not correlated; each observation is independent of another.
- Linearity: There is a linear relationship between the dependent variable and the independent variables.
- Normality: The error terms for each independent variable follow a normal distribution.
- Homoscedasticity: The error terms for each independent variable have the same variance.

One other issue that can be helpful to check for is multicollinearity. If the goal of the regression analysis is either inference or predictive modeling, the performance of ordinary least squares estimates can be poor if this trait is present, unless the sample size is large. Multicollinearity is a statistical phenomenon in which one or more independent variables are highly correlated. This can result in the condition of heteroscedasticity and cause the regression assumptions to be violated [40].

In order to verify that each of these assumptions holds, there are a variety of tests that must be performed on the data before any analysis with regression can be done. The first step is generally to generate the residuals of the data, which are simply the differences between the observed and the predicted values of the dependent variable. These are also dependent on the units of measurement of a given variable, so a truer reflection of the error variance can be obtained by standardizing or studentizing the residuals.

Normality: The best way to assess normality of the residuals is to plot them in a histogram. A histogram of the data will show the number of different pieces of data that fall into different buckets. If this diagram is symmetric and generally mirrors the shape of the normal distribution curve, with the majority of the data values falling in the center, the data can be assumed to be normal. See Figure 32 below for an example of a normal histogram.

[Figure 32: Example normal histogram with distribution curve.]

Linearity: In order to determine if the variables have a linear relationship to the dependent variable, it is best to simply create a scatter plot of the residuals for the dependent variable against the predicted or observed values of the independent (predictor) variables in the regression equation. If the errors seem to be distributed evenly above and below zero along the entire sample, the relationship is probably linear. It is possible to visualize this by graphing the dependent variable against each of the independent variables as well, but this will not give any indication other than a best judgment about exactly how linear the relationship actually is. An example of a data set which exhibits a linear relationship is shown in Figure 33 below.

[Figure 33: Plot of residuals vs. the independent variable for a data set that shows a likely linear relationship.]

Homoscedasticity: The easiest way to check for equality of variance is to examine the spread of the residuals over the range of the independent variable. If the spread of the residuals seems to increase as the predicted values increase, then the variance is most likely unequal. A diagram of what a spread plot would look like is shown below in Figure 34. A statistical test known as the Levene test may also be used to verify that the variances of the error terms of an independent variable are equal across all its factor levels. If this statistical test is significant, the variances are unequal and the data violate the assumptions for linear regression.

[Figure 34: Example spread and level plot for determining equal variance.]

Independence: Checking for independence requires plotting the residuals in the order they were obtained, if the observations were collected in a sequence. If there is a pattern, there may be a violation of the independence assumption. The most appropriate way to check for independence of observations, however, is the Chi-squared test. If the Chi-squared test results are significant, the null hypothesis that the variables are randomly correlated is rejected and the variables are not independent [40].
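The checks described above can be scripted. The sketch below is a minimal, hypothetical example of each diagnostic in Python: a residual histogram for normality, residuals versus fitted values for linearity and equal variance, Levene's test across factor levels, and a Chi-squared test of independence on a contingency table. It illustrates the ideas rather than reproducing the SPSS procedures used for the thesis data.

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    from scipy import stats

    # Fit a regression on hypothetical data and extract residuals and fitted values
    rng = np.random.default_rng(6)
    X = rng.normal(size=(120, 3))
    y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(0.0, 1.0, 120)
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    resid, fitted = fit.resid, fit.fittedvalues

    # Normality: histogram of the residuals (cf. Figure 32)
    plt.hist(resid, bins=15)
    plt.xlabel("Residual")
    plt.show()

    # Linearity and homoscedasticity: residuals vs. fitted values (cf. Figures 33 and 34)
    plt.scatter(fitted, resid)
    plt.axhline(0.0)
    plt.xlabel("Fitted value")
    plt.ylabel("Residual")
    plt.show()

    # Homoscedasticity: Levene's test of equal residual variance across levels of a hypothetical factor
    group = rng.integers(0, 3, 120)
    print(stats.levene(resid[group == 0], resid[group == 1], resid[group == 2]))

    # Independence: Chi-squared test on a hypothetical contingency table of two categorical variables
    table = np.array([[30, 20], [25, 45]])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(chi2, p)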
References

1. Laboratory, C.S.D., The Use of Markov and PARADyM Modeling to Evaluate and Optimize Designs for NASA Crewed Spacecraft. 2008, NASA Engineering and Safety Center: Cambridge, MA.
2. Fuld, R.B., Liu, Y., Wickens, C.D. The Impact of Automation on Error Detection: Some Results from a Visual Discrimination Task. in Human Factors Society. 1987.
3. Parasuraman, R., T.B. Sheridan, and C. Wickens, A Model for Types and Levels of Human Interaction with Automation. IEEE Transactions on Systems, Man, and Cybernetics, 2000. 30(3).
4. Rouse, W.B., Models of Human Problem Solving: Detection, Diagnosis, and Compensation for System Failures. Automatica, 1983. 19(6): p. 613-625.
5. Brunswik, E., Representative Design and Probabilistic Theory in a Functional Psychology. Psychological Review, 1955. 62(3): p. 193-217.
6. Willsky, A.S., A Survey of Design Methods for Failure Detection in Dynamic Systems. Automatica, 1976. 12(1): p. 601-611.
7. Isermann, R., Supervision, Fault-Detection and Fault-Diagnosis Methods - An Introduction. Control Engineering Practice, 1997. 5(5): p. 639-652.
8. Greenstein, J.S., Rouse, W.B., A Model of Human Decisionmaking in Multiple Process Monitoring Situations. IEEE Transactions on Systems, Man, and Cybernetics, 1982. SMC-12(2): p. 182-193.
9. Gai, E.G., Curry, R.E., A Model of the Human Observer in Failure Detection Tasks. IEEE Transactions on Systems, Man, and Cybernetics, 1976. SMC-6(2): p. 85-94.
10. Polycarpou, M.M., Vemuri, A.T., Learning Methodology for Failure Detection and Accommodation. IEEE Control Systems, 1995: p. 16-24.
11. Mehra, R., Rago, C., Seereeram, S., Autonomous Failure Detection, Identification and Fault-tolerant Estimation with Aerospace Applications. IEEE, 1998: p. 133-138.
12. Wickens, C.D., Kessel, C., The Effects of Participatory Mode and Task Workload on the Detection of Dynamic System Failures. IEEE Transactions on Systems, Man, and Cybernetics, 1979. 9(1).
13. Bisantz, A.M., et al., Modeling and Analysis of a Dynamic Judgment Task Using a Lens Model Approach. IEEE Transactions on Systems, Man, and Cybernetics, 2000. 30(6): p. 605-616.
14. Wickens, C.D. and J.G. Hollands, Engineering Psychology and Human Performance. 2000, Prentice-Hall: Upper Saddle River, NJ.
15. Wickens, C.D. and J.G. Hollands, Engineering Psychology and Human Performance. 3rd ed. 2000, Upper Saddle River, NJ: Prentice Hall.
16. O'Hare, D., Aeronautical Decision Making: Metaphors, Models, and Methods, in Principles and Practice of Aviation Psychology, P.S. Tsang, Vidulich, M.A., Editor. 2003. p. 201-237.
17. Donmez, B., Pina, P.E., Cummings, M.L. Evaluation Criteria for Human-Automation Performance Metrics. in Intelligent Systems Workshop. 2008. Gaithersburg, MD.
18. Vidulich, M.A., Mental Workload and Situation Awareness: Essential Concepts for Aviation Psychology Practice, in Principles and Practice of Aviation Psychology, P.S. Tsang, Vidulich, M.A., Editor. 2003. p. 115-146.
19. Endsley, M.R., Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995. 37(1): p. 32-64.
20. Orasanu, J. and T. Connolly, The re-invention of decision making, in Decision Making in Action: Models and Methods, G. Klein, et al., Editors. 1993, Ablex Publishing: Norwood, N.J.
21. Endsley, M., Measurement of situation awareness in dynamic systems. Human Factors, 1995. 37(1): p. 65-84.
22. Wickens, C.D., Flach, J.M., Information Processing, in Handbook of Human Factors and Ergonomics, G. Salvendy, Editor. 1988, Wiley.
23. Takano, K., Reason, J., Psychological Biases Affecting Human Cognitive Performance in Dynamic Operational Environments. Journal of Nuclear Science and Technology, 1999. 36(11): p. 1041-1051.
24. Kahneman, D., Tversky, A., On the Psychology of Prediction. Psychological Review, 1973. 80(4): p. 237-251.
25. Mosier, K.L., et al., Automation bias: decision making and performance in high-tech cockpits. The International Journal of Aviation Psychology, 1998. 8(1): p. 47-63.
26. Glaser, R., Expertise and Learning: How Do We Think About Instructional Processes Now That We Have Discovered Knowledge Structures?, in Complex Information Processing: The Impact of Herbert A. Simon, Klahr, D., Kotovsky, K., Editors. 1989, Psychology Press.
27. Kozlowski, S.W.J., Training and developing adaptive teams: Theory, principles, and research, in Decision Making Under Stress: Implications for Training and Simulation, J.A. Cannon-Bowers, Salas, E., Editor. 1998, APA Books: Washington.
28. Endsley, M.R., Jones, W.M., Situation Awareness Information Dominance and Information Warfare. 1997, Logicon Technical Services Inc: Dayton, OH.
29. Sarter, N.B., Woods, D.D., Situation Awareness: A Critical but Ill-Defined Phenomenon. The International Journal of Aviation Psychology, 1991. 1(1): p. 45-57.
30. Endsley, M. Situation Awareness, Automation, and Free Flight. in FAA/Eurocontrol Air Traffic Management R&D Seminar. 1997. Saclay, France.
31. Sefarty, D., MacMillan, J., Entin, E.E., Entin, E.B., The Decision Making Expertise of Battlefield Commanders, in Naturalistic Decision Making, C.E. Zsambok, Klein, G., Editor. 1997, Psychology Press.
32. Brannick, M.T., The Lens Model, University of Southern Florida, College of Arts and Sciences.
33. Miller, S., Judgment Analysis: The Lens Model. 2008, University of Illinois at Urbana-Champaign: Urbana, Illinois.
34. Karelaia, N., Hogarth, R.M., Determinants of Linear Judgment: A Meta-Analysis of Lens Model Studies. Psychological Bulletin, 2008. 134(5): p. 404-426.
35. Hursch, C.J., K.R. Hammond, and J.L. Hursch, Some methodological considerations in multiple-cue probability studies. Psychological Review, 1964. 71: p. 42-60.
36. Tucker, L.R., A suggested alternative formulation in the developments by Hursch, Hammond, & Hursch and by Hammond, Hursch, & Todd. Psychological Review, 1964. 71: p. 528-530.
37. Dhami, M.K., Harries, C., Fast and frugal versus regression models of human judgment. Thinking and Reasoning, 2001. 7: p. 5-27.
38. Rothrock, L., Kirlik, A., Inferring rule-based strategies in dynamic judgment tasks. IEEE Transactions on Systems, Man, and Cybernetics, 2003. 33: p. 58-72.
39. Keeley, S.M., Doherty, M.E., Probabilistic and multiple regression modeling of the biologist's decision processes and its implications for medical diagnosis. Bulletin of Mathematical Biology, 1971. 33(3): p. 439-449.
40. Norusis, M.J., SPSS 15.0 Statistical Procedures Companion. 2006, Upper Saddle River, NJ: Prentice Hall.
41. Murphy, A.H., Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review, 1988. 116(12): p. 2417-2424.
42. Kirlik, A., Strauss, R., Situation awareness as judgment I: Statistical modeling and quantitative measurement. International Journal of Industrial Ergonomics, 2006. 36(5): p. 463-474.
43. Strauss, R., Kirlik, A., Situation awareness as judgment II: Experimental demonstration. International Journal of Industrial Ergonomics, 2006. 36(5): p. 475-484.
44. Juslin, P.N., Karlsson, J., et al., Play it Again with Feeling: Computer Feedback in Musical Communication of Emotions. Journal of Experimental Psychology: Applied, 2006. 12(2): p. 79-95.
45. Miller, S., Kirlik, A., et al. Supporting Joint Human-Computer Judgment under Uncertainty. in Proceedings of the Human Factors and Ergonomics Society. 2008.
46. Masalonis, A.J., Byrne, E.A., et al. Instrument Failure Detection and Workload in Simulated General Aviation Flight During Manual and Automated Lateral Tracking. in International Symposium on Aviation Psychology. 1997. Washington, DC.
47. Ephrath, A.R., Curry, R.E., Detection by Pilots of System Failures During Instrument Landings. IEEE Transactions on Systems, Man, and Cybernetics, 1977. SMC-7(12): p. 841-848.
48. Bortolami, S.B., Duda, K.R., Borer, N.K., Markov Analysis of Human-in-the-Loop System Performance and Reliability. 2008, The Charles Stark Draper Laboratory: Cambridge, MA.
49. Cooksey, R.W., Judgment Analysis: Theory, Methods, and Applications. 1996: Academic Press.
50. Mueller, E., Bilimoria, K.D., Frost, C., Effects of Control Power and Inceptor Sensitivity on Lunar Lander Handling Qualities, in AIAA Space. 2009: Pasadena, CA.
51. Tabachnick, B.G. and L.S. Fidell, Using Multivariate Statistics. 4th ed. 2001, New York: HarperCollins. 880.