Hoffman, R. R. (2002, September). "An Empirical Comparison of Methods for Eliciting and Modeling Expert Knowledge." In Proceedings of the 46th Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society.

AN EMPIRICAL COMPARISON OF METHODS FOR ELICITING AND MODELING EXPERT KNOWLEDGE

Robert R. Hoffman, Ph.D., John W. Coffey, Ed.D., Mary Jo Carnot, M.A., and Joseph D. Novak, Ph.D.
Institute for Human and Machine Cognition, University of West Florida

The goal of this project was to apply a variety of methods of Cognitive Task Analysis (CTA) and Cognitive Field Research (CFR) to support a process going all the way from knowledge elicitation to system prototyping, and also to use this as an opportunity to empirically compare and evaluate the methods. The research relied upon the participation of expert, journeyman, and apprentice weather forecasters at the Naval Training Meteorology and Oceanography Facility at Pensacola Naval Air Station. Methods included protocol analysis, a number of types of structured interviews, workspace and work patterns analysis, the Critical Decision Method, the Knowledge Audit, Concept Mapping, and the Cognitive Modeling Procedure. The methods were compared in terms of (1) their yield of information that was useful in modeling expert knowledge, (2) their yield in terms of identification of leverage points (i.e., places where the application of new technology might bring about positive change), and (3) their efficiency. Efficiency was gauged in terms of total effort (time to prepare to run a procedure, plus time to run the procedure, plus time to analyze the data) relative to the yield (number of leverage points identified, number of propositions suitable for use in a model of domain knowledge). The CTA/CFR methods supported the identification of dozens of leverage points and also yielded behaviorally validated models of the reasoning of expert forecasters. Knowledge modeling using Concept Mapping resulted in thousands of propositions covering domain knowledge. The Critical Decision Method yielded a number of richly populated case studies with associated Decision Requirements Tables. The results speak to the relative efficiency of the various methods of CTA/CFR, and also to the strengths of each of the methods. In addition to extending our empirical base on the comparison of knowledge elicitation methods, a deliverable from the project was a knowledge model that illustrates the integration of training support and performance aiding in a single system.

INTRODUCTION

The empirical comparison of knowledge elicitation (KE) methods is nearly 20 years old, dating from Duda and Shortliffe (1983), who recognized what came to be called the "knowledge acquisition bottleneck"—that it took longer for computer scientists to interview experts and build a knowledge base than it did to actually write the software for the expert system. The first systematic comparisons of knowledge elicitation methods (i.e., Burton, Shadbolt, Hedgecock, & Rugg, 1987; Hoffman, 1987), and the first wave of psychological research on expertise (e.g., Chi, Feltovich, & Glaser, 1981; Chi, Glaser, & Farr, 1988; Glaser et al., 1985; Hoffman, 1992; Shanteau, 1992; Zsambok & Klein, 1997), resulted in some guidance concerning knowledge elicitation methodology (see Cooke, 1994; Hoffman, Shadbolt, Burton, & Klein, 1995).
In subsequent years, new methods were developed, including the Critical Decision Method (see Hoffman, Crandall, & Shadbolt, 1998) and the Cognitive Modeling Procedure (Hoffman, Coffey, & Carnot, 2000). In addition, a number of research projects have attempted to extend our empirical base on knowledge elicitation methodology, including Thorsden's (1991) comparison of Concept Mapping with the Critical Decision Method, and Evans, Jentsch, Hitt, Bowers, and Salas' (2001) comparison of Concept Mapping with methods for rating and ranking domain concepts.

One factor that has made interpretation difficult is that some studies have used college-age participants (and, of course, assessments of the sorts of knowledge that they would possess, e.g., sports, fashion). The transfer of such findings to knowledge elicitation with genuine experts in significant domains is questionable.

A second, and major, difficulty in the comparison of KE methods is the selection of dependent variables. Hoffman (1987) compared methods in terms of relative efficiency—the number of useful propositions obtained per total task minute, where total task time includes the time taken to prepare to run the KE procedure, plus the time taken to run the procedure, plus the time taken to analyze the data and cull out the useful propositions; and where the adjective "useful" was applied to any proposition that was not already contained in the first-pass knowledge base that had been constructed on the basis of a documentation analysis. (A somewhat similar metric, the number of elicited procedural rules, was utilized in the work of Burton et al., 1987.) Hoffman's initial purpose in creating an efficiency metric was to meet the need of computer scientists to assess the usefulness of results for building knowledge bases for expert systems. While a reasonable metric from the standpoint of first-generation expert systems, it would not work for all of the purposes of either computer science or experimental psychology. For their dependent variable, Evans et al. (2001) generated correlations of the similarity ratings among domain concepts. This correlation approach makes it possible to pin down the relative similarity of domain concepts and to scale the convergence among alternative methods (e.g., ranking versus Concept Mapping), but raw pairwise similarity of domain concepts glosses over the meaning and content that are necessary for the construction of models.

Another factor that clouds the interpretation of results from some studies that have used the Concept Mapping procedure is that the Concept Maps that were created (either by domain practitioners or by practitioners in collaboration with the researchers) often lack the qualities that define Concept Maps. These criteria, and their foundations in the theory of meaningful learning, have been discussed by Novak and his colleagues (e.g., Ausubel, Novak, & Hanesian, 1978; Novak, 1998). The criteria include semi-hierarchical morphology, propositional coherence, labeled links, the use of cross-links, and the avoidance of certain pitfalls that characterize Concept Maps made by unpracticed individuals (including the creation of "fans," "stacks," sentence-like "spill-overs," and other features).

A final factor that makes interpretation difficult is that some studies involve apples-and-oranges comparisons. For instance, to those who are familiar with the techniques, it would make little sense to compare a concept-sorting task to Concept Mapping in terms of their ability to yield models of expert reasoning—in fact, neither method is suited to that purpose.
One goal of the present research was to create a comparison that involved a reasonable mix of alternative methods, but also to put all of the methods on a more level playing field. Hoffman's efficiency metric was re-defined as the yield of useful propositions, useful in the sense that they could be put to a variety of uses (and not just the creation of a knowledge base for an expert system), such as the creation of models of expert knowledge or models of expert reasoning. In addition, a second metric was used to carve out the applications aspect of KE research—the yield of leverage points. A leverage point was defined as any aspect of the domain or work practice where an infusion of new tools (simple or complex) might result in an improvement in the work. Leverage points were initially identified by the researchers but were then affirmed by the domain practitioners themselves. Also, there was ample opportunity for convergence, in that leverage points could be identified in the results from more than one KE method.1

METHODS

Participants

Participants (n = 22) were senior expert civilian forecasters, junior Aerographers (i.e., Apprentices who were qualified as Observers), and senior Aerographers (i.e., Advanced Journeymen and Journeymen who were qualified as Forecasters) at the Meteorology and Oceanography Training Facility at Pensacola Naval Air Station.

Methods

The following methods of CTA/CFR were utilized:

1. Bootstrapping (documentation analysis, analysis of SOP documents, the Recent Case Walkthrough method),
2. Proficiency Scaling (participant career interviews; comparison of experience versus forecast hit rates as a measure of actual performance),
3. Client (i.e., pilots and pilot trainers) interviews,
4. Workspace Analysis (repeated photographic surveys, detailed workspace mapping),
5. Workpatterns Analysis (live and videotaped Technical Training Briefings, watchfloor observations),
6. The Knowledge Audit,
7. Decision Requirements Analysis,
8. The Critical Decision Method (CDM),
9. The Cognitive Modeling Procedure (see Hoffman et al., 2000),
10. Protocol Analysis,
11. Concept Mapping using the CMap Tools software.

RESULTS AND DISCUSSION

The conduct of some methods was relatively easy and quick. For example, the Knowledge Audit procedure took a total of 70 minutes. Others were quite time-consuming. For instance, we conducted over 60 hours of Concept Mapping sessions, and full protocol analysis of a single knowledge modeling session took a total of 18 hours to collect and analyze the data. The results for protocol analysis confirm a finding from previous studies (Burton et al., 1990; Hoffman et al., 1995): full protocol analysis (i.e., transcription and functional coding of audiotaped protocol statements, with independent coders) is so time-consuming and effortful as to have a relatively low effective yield. Knowledge models and reasoning models can be developed, refined, and validated much more efficiently (i.e., by orders of magnitude) using such procedures as Concept Mapping and the Cognitive Modeling Procedure.

The CDM

The CDM worked effectively as a method for generating rich case studies. However, the present results provide a useful qualification to previous reports on the CDM (e.g., Hoffman et al., 1998).
A lesson learned in the present project was that, in this domain and organizational context, the conduct of each CDM session had to span more than one day. On the first day the researcher would conduct the first three steps of the CDM, then retreat to the lab to enter the results into the method's boilerplate forms. The researcher returned to the workplace on a subsequent day to complete the procedure. Weather forecasting cases are rich, in part because weather phenomena can span days and usually involve dozens of data types and scores of data fields. More importantly, expert forecasters' memories of cases are often remarkably rich. Indeed, there is a tradition in meteorology of conveying important lessons by means of case reports (e.g., Buckley & Leslie, 2000; any issue of the Monthly Weather Review). The impact of this domain feature was that the conduct of the CDM was time-consuming and effortful. Previous studies had suggested that the CDM procedure takes about 2 hours, but those measurements looked only at session time. The present study involved a more inclusive measure of effort, total task time, and in the present research context the conduct of the CDM took about 10 hours per case.

Concept Mapping

We are led to qualify a conclusion of Thorsden (1991), who also used the CDM in conjunction with Concept Mapping. Thorsden argued that the strength of the CDM lies in eliciting "tacit knowledge," whereas the strength of Concept Mapping lies in supporting the domain practitioner in laying out a model of their tasks. Putting aside legitimate (and overdue) debate about the meaning of the phrase "tacit knowledge," we see the greatest strength of the CDM as the generation of rich case studies, including information about cues, hypothetical reasoning, strategies, and so on (i.e., decision requirements), all of which can be useful in the modeling of reasoning procedures or strategies. The strength of Concept Mapping lies in generating models of domain knowledge.

Concept Mapping (either paper-and-pencil or through the use of the CMap Tools software) can be used to create diagrams that look like flow diagrams or decision trees, and our experience is that it is easy for novices to see Concept Maps as flow diagrams or models of procedural knowledge. However, good Concept Maps can just as easily describe the domain in a way that is task- and device-independent. (The Concept Mapping procedure can therefore provide a window into the nature of the "true work," in the sense of Vicente, 1999.)

To put a fine point on it, our calculations of yield (number of mappable propositions generated per total task minute) place Concept Mapping right on the mark in terms of rate of gain of information for knowledge modeling. Previous guidance (Hoffman, 1987) was that the "effective" knowledge elicitation techniques yield two or more informative propositions per total task minute. (Again by comparison, full protocol analysis was calculated to yield less than one informative proposition per total task minute.) In the present research, it took about 1.5 to 2 hours to create, refine, and verify each Concept Map. (The Concept Maps contained an average of 47 propositions. Verification proceeded at about seven propositions per minute, for about seven minutes per Concept Map.) The rate of gain for Concept Mapping was just about two mappable propositions per session minute.
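Stated compactly (the symbols below are our own shorthand, not notation drawn from any of the studies cited), the efficiency metric used in these comparisons amounts to

\[
\text{efficiency} \;=\; \frac{N_{\text{useful propositions}}}{t_{\text{prepare}} + t_{\text{run}} + t_{\text{analyze}}}\ \ \text{propositions per total task minute},
\]

with the guidance being that an "effective" technique achieves a value of two or more (Hoffman, 1987).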
If one takes into account the fact that for the Concept Mapping procedure session time actually is total task time (i.e., there is no preparation time and the result of a session is the final product), it can safely be concluded that Concept Mapping is at least as efficient at generating models of domain knowledge as any other method of knowledge elicitation. Indeed, it is quite probably much more efficient.

Leverage Points

In terms of effectiveness at the identification of leverage points, 35 in all were identified. The leverage points ranged all the way from simple interventions (e.g., a tickle board to remind the forecasters of when certain tasks need to be conducted) to the very complex (e.g., an AI-enabled fusion box to support the forecaster's creation of a visual representation of their mental models of atmospheric dynamics). All of the leverage points were affirmed as being leverage points by one or more of the participating experts.2 Furthermore, all of the leverage points were confirmed by their identification in more than one method. The leverage points were placed into broad categories (e.g., decision aids for the forecaster, methods of presenting weather data to pilots, methods of archiving organizational knowledge, etc.). No one of the CTA/CFR methods resulted in leverage points that were confined to any one category.

We found it interesting that, overall, the observational methods (e.g., watchfloor observations) had a greater yield of identified leverage points. On the other hand, acquiring those leverage points took more time. For example, we observed 15 weather briefings that were presented either to pilots or to other forecasters, resulting in 15 identified leverage points, but the yield was 15/954 minutes = 0.016 leverage points per observed minute.

APPLICATION TO SYSTEM DESIGN

After the preservation of local weather forecasting expertise was identified as an organizationally relevant leverage point for a prototyping effort, the models of reasoning that were created using the Cognitive Modeling Procedure, the models of knowledge that were created using the Concept Mapping procedure, and the case studies that were created using the CDM were all integrated into a Concept Map-based Knowledge Model. This model contained 24 Concept Maps, which themselves contained a total of 1,129 propositions and 420 individual multimedia resources. This "System To Organize Representations in Meteorology-Local Knowledge" (STORM-LK) is not an expert system; instead, it uses the Concept Maps (a model of the expert's knowledge) as the interface that supports the trainee or practicing forecaster as they navigate through the work domain.

A screen shot of a Concept Map is presented in Figure 1, below. The screen shot in Figure 2 shows a Concept Map overlaid with examples of some of the kinds of resources that are directly accessible from the clickable icons appended to many of the concept-nodes. These include satellite images, charts, and digitized videos that allow the apprentice to "stand on the expert's shoulders" by viewing mini-tutorials. Also appended to concept-nodes are Concept Map icons that take one to the Concept Map indicated by the concept-node to which the icon is attached. The Top Map serves as a "Map of Maps" in that it contains concept-nodes that designate all of the other Concept Maps (e.g., cold fronts, thunderstorms, etc.). At the top node in every other Concept Map is an icon that takes one back to the Top Map and to all of the immediately associated Concept Maps. For example, the Top Map contains a concept-node for Hurricanes, and appended to that node are links to both of the Concept Maps that are about hurricanes (i.e., hurricane dynamics and hurricane developmental phases). Through the use of these clickable icons, one can meaningfully navigate from anywhere in the knowledge model to anywhere else in two clicks at most. Disorientation in webspace becomes a non-issue.
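This hub-and-spoke linking structure can be conveyed in a few lines of code. The fragment below is only an illustrative sketch of the scheme just described, not code from STORM-LK or the CMap Tools software; the map names are taken from the examples mentioned in the text, and cross-links among immediately associated maps are omitted for brevity.

from collections import deque
from itertools import permutations

# Concept Maps in the knowledge model (illustrative subset of the 24 maps).
maps = ["Top Map", "Hurricane Dynamics", "Hurricane Developmental Phases",
        "Cold Fronts", "Thunderstorms"]

# Clickable icons: the Top Map links out to every other Concept Map, and the
# top node of every other Concept Map links back to the Top Map.
links = {m: {"Top Map"} for m in maps if m != "Top Map"}
links["Top Map"] = {m for m in maps if m != "Top Map"}

def clicks(start, goal):
    """Fewest icon clicks needed to navigate from one Concept Map to another."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        here, n = frontier.popleft()
        if here == goal:
            return n
        for nxt in links[here]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, n + 1))
    return None  # unreachable; cannot happen in this hub-and-spoke structure

# The "two clicks at most" property: any map reaches any other via the Top Map.
assert all(clicks(a, b) <= 2 for a, b in permutations(maps, 2))

Because every Concept Map links back to the Top Map and the Top Map links out to every Concept Map, the search in the sketch never needs more than two steps.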
STORM-LK contains all of the information in the "Local Forecasting Handbook," and since the Concept Maps are web-enabled, they allow real-time access to actual weather data (radar, satellite imagery, computer forecasts, charts, etc.) within a context that provides the explanatory glue for the weather understanding process. STORM-LK is also intended for use in distance learning and collaboration, in accelerating the acquisition of expertise, and in knowledge preservation at the organizational level. Evaluations and extensions of STORM-LK are currently underway.

CONCLUSION

Our understanding of the strengths and weaknesses of alternative CTA/CFR methods is becoming more refined, as is our understanding that knowledge elicitation is one part of a larger process of co-creative system design and evaluation (see Hoffman & Woods, 2000; Hollnagel & Woods, 1983; Potter, Roth, Woods, & Elm, 2000; Rasmussen, 1992; Vicente, 1999), a larger process that embraces both the science and the aesthetics of the design of complex cognitive systems. However, there remains a need for more work along these lines, especially studies in domains of expertise having characteristics that differ from those of the domains that have been studied to date. Additional KE methods can be examined as well.

Footnotes

1. To be sure, other researchers might have identified leverage points other than the ones we identified.
2. We can note also that leverage-point affirmation took the form of concrete action on the basis of our recommendations. For instance, the physical layout of the watchfloor was changed.

References

Ausubel, D. P., Novak, J. D., & Hanesian, H. (1978). Educational psychology: A cognitive view (2nd ed.). New York: Holt, Rinehart and Winston.
Buckley, B. W., & Leslie, L. M. (2000). The Australian Boxing Day storm of 1998--Synoptic description and numerical simulations. Weather & Forecasting, 16, 543-558.
Burton, A. M., Shadbolt, N. R., Hedgecock, A. P., & Rugg, G. (1987). A formal evaluation of knowledge elicitation techniques for expert systems: Domain 1. In D. S. Moralee (Ed.), Research and development in expert systems, Vol. 4 (pp. 35-46). Cambridge: Cambridge University Press.
Chi, M. T. H., Feltovich, P. J., & Glaser, R. (1981). Categorization and representation of physics problems by experts and novices. Cognitive Science, 5, 121-152.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.) (1988). The nature of expertise. Mahwah, NJ: Erlbaum.
Cooke, N. M. (1994). Varieties of knowledge elicitation techniques. International Journal of Human-Computer Studies, 41, 801-849.
Duda, R. O., & Shortliffe, E. H. (1983). Expert systems research. Science, 220, 261-268.
Evans, A. W., Jentsch, F., Hitt, J. M., Bowers, C., & Salas, E. (2001). Mental model assessments: Is there convergence among different methods? In Proceedings of the Human Factors and Ergonomics Society 45th Annual Meeting (pp. 293-296). Santa Monica, CA: Human Factors and Ergonomics Society.
Glaser, R., Lesgold, A., Lajoie, S., Eastman, R., Greenberg, L., Logan, D., Magone, M., Weiner, A., Wolf, R., & Yengo, L. (1985). Cognitive task analysis to enhance technical skills training and assessment. Report, Learning Research and Development Center, University of Pittsburgh, Pittsburgh, PA.
Hoffman, R. R. (1987, Summer). The problem of extracting the knowledge of experts from the perspective of experimental psychology. The AI Magazine, 8, 53-67.
Hoffman, R. R. (Ed.). (1992). The psychology of expertise: Cognitive research and empirical AI. New York: Springer-Verlag.
Hoffman, R. R., Coffey, J. W., & Carnot, M. J. (2000, November). Is there a "fast track" into the black box?: The Cognitive Models Procedure. Poster presented at the 41st Annual Meeting of the Psychonomic Society, New Orleans, LA.
Hoffman, R. R., Crandall, B., & Shadbolt, N. (1998). A case study in cognitive task analysis methodology: The Critical Decision Method for the elicitation of expert knowledge. Human Factors, 40, 254-276.
Hoffman, R. R., Shadbolt, N., Burton, A. M., & Klein, G. A. (1995). Eliciting knowledge from experts: A methodological analysis. Organizational Behavior and Human Decision Processes, 62, 129-158.
Hoffman, R. R., & Woods, D. D. (2000). Studying cognitive systems in context. Human Factors, 42, 1-7.
Hollnagel, E., & Woods, D. D. (1983). Cognitive systems engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18, 583-600.
Novak, J. D. (1998). Learning, creating, and using knowledge. Mahwah, NJ: Erlbaum.
Potter, S. S., Roth, E. M., Woods, D. D., & Elm, W. C. (2000). Bootstrapping multiple converging cognitive task analysis techniques for system design. In J. M. Schraagen & S. F. Chipman (Eds.), Cognitive task analysis (pp. 317-340). Mahwah, NJ: Erlbaum.
Rasmussen, J. (1992). Use of field studies for design of workstations for integrated manufacturing systems. In M. Helander & N. Nagamachi (Eds.), Design for manufacturability: A systems approach to concurrent engineering and ergonomics (pp. 317-338). London: Taylor and Francis.
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational Behavior and Human Decision Processes, 53, 252-266.
Thorsden, M. L. (1991). A comparison of two tools for cognitive task analysis: Concept Mapping and the Critical Decision Method. In Proceedings of the Human Factors Society 35th Annual Meeting (pp. 283-285). Santa Monica, CA: Human Factors Society.
Vicente, K. (1999). Cognitive work analysis: Toward safe, productive, and healthy computer-based work. Mahwah, NJ: Erlbaum.
Zsambok, C. E., & Klein, G. (Eds.) (1997). Naturalistic decision making. Mahwah, NJ: Erlbaum.

Figure 1. A screen shot from STORM-LK showing a Concept Map.
Figure 2. A screen shot from STORM-LK showing example resources.