21st International Symposium on Military Operational Research
September 2nd, 2004

An Introduction to the Development, Application, Uses and Limitations of STAMPER: Systematic Task Analysis for Measuring Performance and Evaluating Risk

Eugenia Kalantzis
Operations Research Analyst
Director General Land Combat Development
Fort Frontenac, PO Box 17000 Station Forces
Kingston, Ontario, Canada K7K 7B4
(613) 541-5010 x 2469
Kalantzis.E@forces.gc.ca

Abstract

A central activity within the Canadian Army's combat development process is a series of seminar wargames used to evaluate experimental force structures and capabilities. In support of this activity, the Director General Land Combat Development (DGLCD) operational research team was tasked with providing analytical support and guidance to the seminar process and with developing a robust methodology to measure the performance of competing systems. Working closely with the client, the team developed a methodology based on the evaluation of a system's performance of a predetermined list of tasks. This methodology is referred to as STAMPER, or Systematic Task Analysis for Measuring Performance and Evaluating Risk. The methodology involved the development of a set of task lists, as well as software-supported tools for the measurement of performance and risk, and for the computation of overall measures of performance for each of the five operational functions of Command, Sense, Act, Shield and Sustain. These measures of performance are based on a simple model that, although useful in providing indications of general trends, must be interpreted with extreme caution. This paper describes the steps involved in the development and application of the STAMPER process. In particular, it explains the crucial role of the client-analyst collaboration, describes the simple models used to assign measures of risk and to calculate overall performance measures, and lists the strengths and limitations of these models along with examples of possible misuse and misinterpretation of the results.

BACKGROUND

In June 2003, the Director of Army Doctrine (DAD) was tasked with the development of a force employment concept for the Interim Army. In support of this initiative, DAD directed the design and execution of a series of wargames to assess the evolving concepts. Warfighting was selected as the focus of the initial seminar wargames; however, in due course, the seminar series will expand to include peace support, non-combatant evacuation and emergency domestic operations.

The first seminar wargame, Force Employment Wargame 0401, was held in February 2004 and examined the performance of a Main Contingency Force brigade group and battle group. The Canadian force structure for the first seminar was based on equipment available at the time, including Leopard tanks and M109 medium guns. The aim of the first wargame was to establish a baseline of performance against which the impact of proposed organizational changes may be assessed, with the intent of furthering the force employment concept for the Interim Army.

The second seminar wargame, Force Employment Wargame 0402, was held in May 2004 and was modelled on the baseline seminar wargame. The purpose of this wargame was to evaluate the performance of a proposed Interim Army force structure and to compare these results with those obtained in the baseline wargame.
To support this aim, the scenarios and the Red Force remained essentially unchanged; however, the Blue Force structure was modified to include weapons systems and equipment that are programmed to be in place for the Interim Army. Of particular note was the removal of tanks and M109s and the insertion of the Mobile Gun System, the TOW missile system on a LAV chassis and the Multi-Mission Effects Vehicle Version 1, all part of the direct fire system. A mobile artillery vehicle using a 105 mm gun mounted on the bed of a variant of the Mobile Support Vehicle System was also introduced.

SPONSOR OBJECTIVES

The ongoing objectives of the sponsor are to assess the impact of changes in structures, equipment and capabilities that will come into effect during the Interim Army timeframe, with a view to further refining the force employment concept in preparation for field trials at the Canadian Manoeuvre Training Centre and eventual incorporation into doctrine, including Tactics, Techniques and Procedures (TTPs) and Standard Operating Procedures (SOPs). Insights and judgments resulting from these seminar wargames are intended to guide follow-on seminar wargame iterations, to further the Interim Army Force Employment Concept development process, and to prioritize operational research activities by identifying spin-off issues that would be most effectively dealt with using more traditional computer-assisted wargaming techniques.

SEMINAR WARGAME SCENARIOS

The scenario used for the wargame series was based on the Department of National Defence's Force Planning Scenario 11, but in a time frame situated 10 years after the original conflict. The scenario included an emphasis on urban operations and the 'three block war'. Six vignettes were wargamed: three by the brigade group and three by the battle group. On Day 1, conducted in open terrain, the Canadian brigade group and battle group were tasked with capturing objectives and destroying enemy forces established in a fairly conventional defensive position. On Day 2, conducted in urban terrain, operations involved seizing key nodes within the city core. On Day 3, set three days after the cessation of formal hostilities, operations included a mission in mountainous terrain, concurrent with stability operations over a large sector.

OPERATIONAL RESEARCH TEAM ROLES AND OBJECTIVES

The operational research (OR) team was responsible for defining the problem and scope of the exercise, selecting and designing the appropriate methodologies and criteria for investigating the issues, balancing the methodologies against constraints of time and resources, implementing the data collection plan, and collecting, extracting and presenting the results. In addition to these traditional OR responsibilities, the specific objectives for this exercise were to develop a comprehensive and robust methodology to collect quantitative and qualitative measures of performance, and to design a process by which performance and risk could be evaluated quantitatively and compared across vignettes and seminar wargames. This latter objective included the design of a final product that was both simple and easy to interpret by a diverse audience within the Canadian Army.

DESIGN OF THE DATA COLLECTION METHODOLOGY

The importance of the client-analyst collaboration was evident in the first stage of the project, i.e. the design of the data collection methodology.
The design of an appropriate methodology required a clear understanding of the sponsor's objectives and requirements, as well as an understanding of the constraints imposed by the seminar wargame process itself. Given the very tight timelines, the design of this methodology was performed in parallel with the sponsor's design of the seminar wargame process, and daily interaction and close collaboration were essential to ensure the two activities converged to form a well-knit process.

In the design of the data collection methodology, the following points were of particular importance:

- The nature of a seminar wargame is such that there is much hidden complexity in the verbal interplay. Essentially, every discussion potentially contained an argument related to strengths or limitations buried within it. If not actively recorded, these points risked being lost.
- Particularly in the case of the Interim Army, there is imperfect or incomplete knowledge related to capabilities, equipment and structures. As such, any model designed must not require input in greater depth than what is available. Stated simply, the model should not be over-designed.
- The transient nature of the wargaming participants, particularly the combat developers that make up the core team, requires that the methodology be simple and quick for new players to adopt.
- Participants are classified as part of the core team, as subject matter experts, or as observers. Different mechanisms must be put into place to record and classify observations from each of these groups.
- The task loading for the core team, i.e. the combat developers, and the time constraints imposed by the seminar wargame process necessitate a data collection scheme that is relatively quick and easy.

In consideration of the aforementioned points, a two-tier approach was taken in the development of the analysis methodology to address both the qualitative nature of the seminar wargame format and the requirement for quantitative results. First, the qualitative data collection process involved the conduct of formal judgment and insight sessions, the submission of observation sheets from all participants, and the compilation of strengths, weaknesses and issues matrices by the combat developers for each of the five operational functions. Qualitative data was collected and categorized in an ACCESS database designed to allow the analyst to selectively filter the observations and to produce reports that could be channelled to an appropriate member of the staff for further action.
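As an illustration only, the short Python sketch below shows one way such categorized observations could be filtered and routed; the record fields and example entries are hypothetical and do not reflect the actual ACCESS database schema.

```python
# Illustration only: hypothetical record structure and filter for categorized
# qualitative observations; the actual tool was an ACCESS database whose
# schema is not described in this paper.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Observation:
    source: str      # "core team", "subject matter expert" or "observer"
    function: str    # Command, Sense, Act, Shield or Sustain
    category: str    # e.g. "strength", "weakness" or "issue"
    text: str

def filter_observations(records: List[Observation],
                        function: Optional[str] = None,
                        category: Optional[str] = None) -> List[Observation]:
    """Return the observations matching the requested operational function and category."""
    return [r for r in records
            if (function is None or r.function == function)
            and (category is None or r.category == category)]

# Example: pull all Sustain weaknesses for the Sustain combat developer.
records = [
    Observation("observer", "Sustain", "weakness", "Resupply lagged the advance."),
    Observation("core team", "Command", "strength", "Orders cycle was rapid."),
]
for obs in filter_observations(records, function="Sustain", category="weakness"):
    print(f"{obs.source}: {obs.text}")
```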
Qualitative data collection will not be discussed further in this report. Quantitative data analysis was conducted using the STAMPER methodology, as described herein.

STAMPER – SYSTEMATIC TASK ANALYSIS FOR MEASURING PERFORMANCE AND EVALUATING RISK

The purpose of the STAMPER methodology was to provide a systematic framework to measure performance and identify risk factors, and to compare variations in system performance between different vignettes, as well as from one seminar wargame iteration to the next. This was done with a series of survey instruments used to elicit the subjective judgements of a team of assessors in the evaluation of the system's performance of essential tasks. In addition, these task lists provided a framework for discussion during the seminar wargame sessions, as each combat developer assessed the performance of the tasks under his operational function.

In the selection of the models to calculate overall performance and to assess risk, a simple model was deemed most appropriate because of the incomplete knowledge of the innovative, and often exploratory, concepts and equipment introduced during the course of this seminar wargame series. The expected model fidelity was matched to the question at hand, and the model was designed to provide a level of detail commensurate with, and supported by, the fidelity of the inputs available to the model itself.

The development of the STAMPER methodology began three months prior to the baseline seminar wargame and consisted of the following activities: the development of the DGLCD Task Lists; the creation of the survey instruments and automated tools; and the development of a measurement methodology to identify risk factors and assign appropriate scores to individual tasks and to the overall operational functions.

DGLCD Task List

During the task list development stage, a comprehensive task list analysis exercise was performed by the DAD combat developers and their staff. The objective of this exercise was to produce a complete list of tasks under the responsibility of the Army, divided into the five operational functions of Command, Sense, Act, Shield and Sustain. These are referred to as the DGLCD Task Lists (DTLs). The DTLs draw their origins from recognized task lists such as the Canadian Joint Task List, the subsequently developed Canadian Army Task List and the Canadian Brigade Battle Task Standards, as well as other informally developed task lists. For each operational function, Level 1, Level 2 and Level 3 tasks were identified, with each level depicting finer granularity than the previous. Figure 1 depicts a sample of the Command task list breakdown.

Figure 1 Sample of the Command Task List Breakdown
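To make the three-level breakdown concrete, the sketch below shows one possible in-memory representation of a DTL fragment; the Python structure, task names and nesting are illustrative assumptions and are not drawn from the actual DGLCD Task Lists.

```python
# Illustrative only: one way to represent a DTL fragment in memory. The task
# names and nesting below are placeholders, not actual DGLCD Task List entries.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    level: int                          # 1, 2 or 3; each level is finer-grained than the previous
    subtasks: List["Task"] = field(default_factory=list)

command_fragment = Task("Plan operations", level=1, subtasks=[
    Task("Develop courses of action", level=2, subtasks=[
        Task("Conduct mission analysis", level=3),
        Task("War-game courses of action", level=3),
    ]),
])

def level3_tasks(task: Task) -> List[Task]:
    """Collect the Level 3 (finest-granularity) tasks nested under a task."""
    if task.level == 3:
        return [task]
    return [leaf for sub in task.subtasks for leaf in level3_tasks(sub)]

print([t.name for t in level3_tasks(command_fragment)])
```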
Survey Instruments

The DTLs form the foundation of the survey instruments used to facilitate the elicitation of expert opinion, and to measure performance and evaluate risk factors observed during each of the seminar wargames. In the completion of the survey instruments, assessors were asked to evaluate tasks along two dimensions: the performance level, and the impact/importance of the task on the completion of the higher-level task. Table 1 presents the two questions that appeared on the survey instruments, the response options available to the participants, and the category of tasks to which each question applies. Of particular note is that participants were required to score the performance of Level 3 tasks only. The performance of Level 2 and Level 1 tasks was automatically calculated from the performance scores of Level 3 tasks in combination with the impact scores of Level 3, Level 2 and Level 1 tasks.

Table 1 Survey Instrument Questions and Response Options

Question 1: Performance
Given your experience and the discussions held during the seminar wargame, please indicate the level of overall performance you believe we would most likely achieve in the completion of this task during this mission, given the capability-set assumed for this seminar wargame.
Response options:
1. Unacceptable: task incomplete
2. Undesirable: task completed below standard
3. Acceptable: task completed to minimum standards
4. Superior: task completed to above minimum standards
Action required: respond to Question 1 for Level 3 tasks only.

Question 2: Impact
For this particular mission, how would you characterize the impact or importance of this task in the overall performance of the higher-level task it belongs to?
Response options:
1. Critical: highly correlated with the successful completion of the task
2. Important: correlated with the successful completion of the task
3. Minor: low correlation with the successful completion of the task
4. Negligible: minimally affecting the successful completion of the task
Action required: review the default Question 2 results for Level 3, Level 2 and Level 1 tasks; update if necessary.

Calculating Performance Scores and Assigning Risk

For each of the Level 2 and Level 1 tasks, as well as the overall operational functions, a measure of performance was calculated based on Question 1 and Question 2 results for Level 3 tasks, in combination with Question 2 results for Level 2 and Level 1 tasks. The calculation of performance is based on an inner-product rule in which each task is attributed points as a function of performance and impact. Points are attributed as per the Point Allocation Matrix in Figure 2. Essentially, the measure of performance of a Level 2 task is the weighted average of the performance scores assigned to the Level 3 tasks belonging to it. Weights were assigned to tasks based on the survey responses to Question 2 on impact; tasks estimated to be of higher importance were assigned a higher weight. As such,

Performance score of a Level 2 task = (sum of points for the Level 3 tasks belonging to it) / (sum of weights for the Level 3 tasks belonging to it)

In a similar fashion, the measure of performance of a Level 1 task is the weighted average of the performance of the Level 2 tasks belonging to it, and the measure of performance of an operational function is the weighted average of the performance of the Level 1 tasks belonging to it. However, in the case of the roll-up for Level 1 tasks and for overall operational function performance scores, the points assigned to each sub-task are calculated as the product of the sub-task performance score, as calculated in the previous step, and the appropriate weight assigned as a function of the response to Question 2 on impact.

Figure 2 Point Allocation Matrix

For each of the Level 3 tasks, a measure of risk was obtained using the results of Question 1 and Question 2. A risk indicator was assigned to each task following the Risk Assessment Matrix in Figure 3; the lower the score, the higher the risk associated with that task. Low scores, colour-coded in red and yellow, identify tasks that would endanger mission success, whereas high scores, colour-coded in green and blue, identify tasks that would contribute to mission success. The assigned risk indicators were not rolled up as in the analysis of the overall performance measures. This was done to ensure that the visibility of high-risk tasks was maintained throughout the exercise, and that this information was not obscured when results were merged to obtain measures of performance at higher levels. Instead, these scores would be compared with those collected in future iterations, and changes would be examined at this level of granularity.

Figure 3 Risk Assessment Matrix
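A minimal Python sketch of the roll-up and risk logic described above follows. The impact weights and the risk mapping used here are assumptions standing in for the Point Allocation Matrix (Figure 2) and the Risk Assessment Matrix (Figure 3), whose actual values are not reproduced in this paper, and scores are left on the raw 1-4 survey scale rather than the rescaled values reported by the tool.

```python
# Sketch only: weighted-average roll-up and per-task risk indicator.
# IMPACT_WEIGHT and the risk product below are assumptions; the actual values
# come from the Point Allocation Matrix (Figure 2) and the Risk Assessment
# Matrix (Figure 3), which are not reproduced here.
from typing import List, Tuple

# Survey coding: performance 1 (Unacceptable) .. 4 (Superior);
# impact 1 (Critical) .. 4 (Negligible).
IMPACT_WEIGHT = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0}   # assumed: Critical carries the most weight

def rolled_up_performance(children: List[Tuple[float, int]]) -> float:
    """Weighted average of child scores.
    children: (score, impact) pairs, where score is a Level 3 performance
    rating, or an already rolled-up Level 2 / Level 1 score for the
    higher-level roll-ups."""
    points = sum(score * IMPACT_WEIGHT[impact] for score, impact in children)
    weights = sum(IMPACT_WEIGHT[impact] for _, impact in children)
    return points / weights

def risk_indicator(performance: int, impact: int) -> int:
    """Assumed risk mapping: lower values flag higher risk, i.e. poor expected
    performance (low Question 1 score) on a critical task (low Question 2 score)."""
    return performance * impact

# Example: a Level 2 task with three Level 3 subtasks, given as (performance, impact).
level3 = [(3, 1), (2, 2), (4, 3)]
print(round(rolled_up_performance(level3), 2))       # 2.89 on the 1-4 scale
print([risk_indicator(p, i) for p, i in level3])     # low values (3 and 4) flag the tasks to watch
```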
Automated Tool to Calculate Performance, Risk and Deltas

In preparation for the seminar wargame, an Excel-based analysis tool was designed to automatically assemble responses from participants, to assign a measure of risk as a function of performance and impact scores, to calculate a score of overall performance for Level 2 and Level 1 tasks, and to roll up these scores into an overall measure of performance at the operational function level. In addition, following the second seminar wargame, the tool was updated to automatically display the change in performance and impact/importance ratings, as well as the change in calculated performance and risk scores. This analysis was performed for each of the six vignettes, and for each of the five operational functions. Figures 4 and 5 depict snapshots of the tool as it is used to automatically calculate Level 2 and Level 1 scores, and to display changes in performance scores from one seminar wargame to the next, respectively.

Figure 4 Automatic Calculation of Performance Scores for Level 2 and Level 1 Tasks

Figure 5 Automatic Calculation of Change in Performance Scores from Baseline to Iteration 1
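As a rough illustration of the delta view (the actual tool was Excel-based, as shown in Figure 5), the snippet below compares calculated scores for the same tasks across two iterations at the same level of granularity; the task names, scores and flagging threshold are invented for the example.

```python
# Illustration only: comparing calculated performance scores between the
# baseline wargame and Iteration 1. Task names, scores and the 0.5 flagging
# threshold are invented; deltas are pointers for further investigation,
# not conclusions in themselves.
baseline = {"Conduct mission analysis": 3.2, "War-game courses of action": 2.4}
iteration1 = {"Conduct mission analysis": 3.0, "War-game courses of action": 3.1}

for task in baseline:
    if task in iteration1:
        delta = iteration1[task] - baseline[task]
        flag = "  <- investigate" if abs(delta) >= 0.5 else ""
        print(f"{task}: {delta:+.2f}{flag}")
```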
Strengths

Among others, the use of the STAMPER methodology presented the following benefits:

- The development of the process itself required close collaboration between the sponsor and the analyst. This collaboration engaged the participants fully and instilled a sense of ownership in the process that was essential to the success of the exercise.
- The process provided a framework for a structured evaluation of performance across a wide and complete range of tasks. The quantitative results complemented the results extracted from the qualitative data collection.
- As a visualization tool, the methodology was successful in providing a quick identification of performance and risk results, as well as changes in these measures from one iteration to the next.
- The automated tool ensured the availability of near real-time results. These results were then available at end-of-day judgments & insights sessions, to be used as required.

Limitations

Limitations inherent to the STAMPER methodology are as follows:

- Quantitative results are currently based on the opinion of a small pool of subject matter experts.
- Sensitivity analysis results for performance scores vary across operational functions. As such, performance scores should not be compared across these functions. However, within a given operational function, comparisons can be made across scenarios/vignettes (e.g. open vs. urban) and across capability sets (i.e. across seminar wargames).
- Changes in performance levels should be interpreted with care. These deltas should be used as indicators or pointers to an underlying phenomenon requiring further investigation.
- The quantitative results alone do not provide a complete picture of the issue, nor do they provide an explanation of the reason for the level of performance achieved or the change in this score. These results should therefore be interpreted jointly with the qualitative results to provide a complete picture.

Examples of Misuse

Very often, quantitative results such as these can be misinterpreted or used out of context, and the results tend to take on a life of their own. A common mistake in the use of this model is the tendency to focus on the higher-level performance scores and to ignore the performance scores of individual sub-tasks. For example, if an operational function performance score of 80 is obtained in a particular situation, there is a tendency to assume that all sub-tasks within this function were performed at an equally acceptable level. This is not necessarily true: particular sub-tasks within that operational function may have been performed at less than acceptable levels. Only an investigation of performance scores at all task levels would reveal the complete picture.

Another possible misuse of the model is the comparison of performance scores across operational functions. For example, in the case where the Sustain and Command functions obtained scores of 70 and 80, respectively, there is a tendency to conclude that one function was performed better than the other. Although this may be true, the level of fidelity inherent in this model cannot justify this type of broad statement.

CONCLUSION

The seminar format provided a suitable setting for discussion of organizational strengths and weaknesses, as well as the identification of more specific issues warranting further study. The STAMPER methodology provided quantitative results that supported the qualitative results collected in seminar discussions and judgements & insights sessions; the two-tier approach to data collection yielded a complete picture of events. In addition, it delivered a final product that was both simple and easy to interpret by a diverse audience within the Canadian Army.

As with other simple models, results should be interpreted with caution. Conclusions drawn from the results should match the level of fidelity inherent to the model itself, and should not go beyond it. This being said, the tool is useful in highlighting changes in levels of performance and indicating the presence of underlying phenomena warranting further study.