In Search of the Root Cause

By Dr. John Robert Dew

An operator in a steam plant shuts off a critical valve, causing a system malfunction and a million-dollar plant shutdown. The captain of an oil tanker fails to turn his vessel at the proper time, causing it to run aground and spill its cargo. The cockpit crew of an airplane does not realize the wing flaps are not properly set before takeoff, leading to a crash and tragic loss of life.

These scenarios sound like stories on the evening news. In every case, there would be an investigation into the cause of the incident. Some people would study the situation to find out what actually happened. Others might be primarily concerned with who was to blame for the problem and who is liable for the damages. A few might suggest finding the root cause of the incident to ensure that it doesn’t happen again.

Defining a problem

In root-cause analysis, a problem is defined as a situation where the performance of a system does not meet expectations. For instance, you might expect your washing machine to run through all its cycles, but instead you find that it frequently goes out of balance in the spin-dry setting.

Finding the immediate cause

When you have a problem, it is important to identify the immediate cause. The immediate cause is the action, or series of actions, that directly creates the difference between expected and actual performance.

In some cases, the immediate cause will be simple to identify. If a valve on a water system is left open and an operator freely admits that he mistakenly left it open, we know the immediate cause of the problem.

In other cases, the immediate cause of a problem might be hidden and require some investigation. In an airplane crash, there might not be an obvious cause. Investigators will need a questioning process to help them gather the important facts about the accident before they develop theories of possible causes.
Effective questioning defines the problem and narrows the scope of possible causes by describing what the problem is, where it exists, when it occurs, and whether it is increasing in scope, staying the same, or decreasing in scope. One of the best questioning methods for identifying the immediate cause of a problem was developed by Chuck Kepner and Ben Tregoe through their research on how scientists ask questions to solve problems.(1)

Taking action

Once the immediate cause of a problem is known, an appropriate action can be chosen. Certain actions provide short-term, temporary solutions that are inexpensive and require knowing only the immediate cause of the problem. Permanent, corrective action, however, requires knowing the root cause.

At home, you might opt for a short-term solution to the problematic washing machine by placing a block of wood under one side of the machine so that it no longer stops during the spin-dry cycle. When you choose this type of action, you might congratulate yourself for finding a creative solution and avoiding the repairman’s bill.

Likewise, managers might opt for the short-term fix to a problem in the work environment. For example, some copper tubing that supplies coolant to a motor might need repair. Proper procedure requires that an engineer be contacted to diagnose the reason for the tubing failure. However, to save time and money and avoid organizational red tape, the manager might have the local maintenance crew take care of the job and not even bother recording that the work was performed. The rationale might be “After all, it saved money, and everyone who needed to know about the repair work was present when the decision was made to take the small shortcut.” The manager might pat himself on the back, just as you would at home.

The short-term fix can be seductive. In the long run, however, these quick fixes can create new problems and make us fail to see significant inadequacies in our operating processes.
The most seductive part of the short-term fix is needing to know only the immediate cause of the problem. This means not having to extend the questioning into the search for the root cause. This is a reward, since looking for root causes often means asking embarrassing questions about how the organization functions. The longer an organization goes without asking these tough questions, the more ingrained its systemic problems will become.

Looking for the root cause

The root cause is the most basic causal factor or factors that, if corrected or removed, will prevent the recurrence of the situation. It is important to understand where to look for root causes. In nature, roots are found in the soil. In organizations, the soil is the systemic factors that deal with how management plans, organizes, controls, and provides assurance of quality and safety in five key areas that will be discussed later.

The search for the root cause is a questioning process, and, because root-cause analysis means asking difficult and sometimes embarrassing questions about how an organization is managed, it might be ignored for internal political reasons. Irving Janis has conducted extensive research on the way people in an organization reinforce their own beliefs and behaviors through mutual rationalization.(2) According to Janis’ studies, any effort to question these insulated views of reality will be considered an assault by a hostile force. The questioner will be subject to both indirect and direct pressure not to rock the boat.

Tools for questioning

There are three commonly used tools for structuring the questioning process. The first is an event and causal factor diagram, which visually displays the sequence of events. The second describes the problem in terms of protective safeguards that have failed. The third focuses on changes that have occurred between the expected and the real performance of a process.
The event and causal factor diagram

The event and causal factor diagram establishes a chronological sequence of events leading to the problem. Once the relevant events have been identified and placed in their proper sequence, the investigator looks at each step or action and asks, “What allowed this to happen?” The diagram provides a visual tool for analyzing the actions relevant to the problem and for tracing the actions back to their roots. Figure 1 shows what a basic process diagram looks like.

[Figure 1: basic process diagram]

Figure 2 is an event and causal factor diagram for a car accident investigation. After identifying the stages leading up to the accident, it is important to ask, “What caused this?” When you learn that the truck was moving too slowly for highway conditions, ask why again. When you learn that the truck was overloaded, continue to ask why.

[Figure 2: event sequence for the car accident]
1. Driver pulls up behind a slow-moving truck
2. Driver peeks around vehicle to pass
3. Driver pulls out to pass
4. Driver sees oncoming motorcycle

Safeguard analysis

Safeguard analysis provides a structured way to envision the events related to a system failure or the creation of a problem. Safeguard analysis identifies the safeguards and controls that will remove or reduce hazards, enforce compliance with procedures, and make targets invulnerable to hazards. Figure 3 illustrates the source of a problem, a target or victim of the situation, and barriers that are supposed to be in place to protect the target from the source of the problem.

[Figure 3: source of a problem, barriers, and target]

One example of safeguard analysis is to consider what problems cause an oil tanker accident. The source of the potential problem is the crude oil being shipped in the tanker. The potential target or victim is the marine life and shorelines in the area affected by the spill.
There are several possible safeguards that will keep the oil from affecting the marine life and minimize the impact of a spill: design barriers, training and qualification barriers, and containment and clean-up procedures and equipment.

To provide additional safeguards, an oil tanker might be designed with a double hull to prevent rupturing, and the oil within might be stored in several internal holds. A tanker designed with a single hull and a small number of holds will be more efficient in passage but ineffective as a barrier against an environmental problem.

A specific number of staff, having special qualifications to handle certain valves or navigate certain waters, might be required on the tanker. However, if an inadequate crew is allowed to operate the ship and an underqualified helmsman pilots the tanker, there will be no effective safeguards against collision or grounding.

Even if an oil tanker runs aground and begins to spill its cargo, strategically located containment and clean-up equipment can minimize the damage. However, if the equipment is not available or the operators are not well trained, the clean-up operation might be an inadequate safeguard to protect the victim.

Safeguards might be physical requirements, special procedures, or implemented assurance activities. An organization can impose upon itself administrative and verification controls that serve as safeguards against potential problems. For instance, an organization might require that designs be independently reviewed by a qualified engineer, separate from the original designer, to ensure the quality of the design. An organization might place stringent controls on how it handles drawings and how drawings will be numbered, stored, and used in the field. There might be well-defined steps for changing drawings and for providing controlled drawings to the people who use them.

Training is another form of assurance activity.
It is important that operators and maintenance people are properly trained and that the organization has evidence of the proper design, delivery, and completion of the training programs. Training assurance is often as important as the training itself.

However, when using safeguard analysis to determine what object or assurance activity either failed or was missing from the process, it is important to keep asking questions. It is vital to discover why a critical safeguard was left out or why the management system allowed inadequacies in safeguards to exist. Knowing which safeguard was faulty does not explain why it was faulty; that is the embarrassing question that needs to be asked.

Change analysis

The third tool for conducting a root-cause analysis is referred to as change analysis, in which the questioner compares the present state of the system (the real, nonfunctioning situation) with the prior state of the system (when it was working properly). The objective is to identify what has changed in the system between the time it worked and the time it failed. Investigating these changes will determine whether they had a significant effect.

Change analysis must be followed by additional questioning to determine how the changes were permitted to occur. It is always important to continue the questioning process into the area of systemic factors.

Factor         Status at time of event    Status prior to event    Changes
Personnel
Supervision
Policies
Procedures
Equipment
Supplies
Facilities
Maintenance
Training
Technology
Software
Environment
Vendors

Systemic factors

Systemic issues concern how the management of the organization plans, organizes, controls, and provides quality assurance and safety in five key areas: personnel, procedures, equipment, material, and the environment. These five areas constitute the total work system. Each area can be broken into several parts.
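The change analysis comparison described above can be sketched in code. This is a minimal, hypothetical Python illustration rather than part of the method itself: the factor names come from the change analysis list in the article, while the `change_analysis` function, the `Change` record, the "not recorded" placeholder, and the sample statuses are assumptions made for the example.

```python
from dataclasses import dataclass

# Factor names taken from the change analysis list in the article.
FACTORS = [
    "Personnel", "Supervision", "Policies", "Procedures", "Equipment",
    "Supplies", "Facilities", "Maintenance", "Training", "Technology",
    "Software", "Environment", "Vendors",
]


@dataclass
class Change:
    """One factor whose status differs between the two snapshots."""
    factor: str
    prior: str
    at_event: str


def change_analysis(prior, at_event):
    """Compare two status snapshots and return the factors that changed.

    `prior` and `at_event` map factor names to a short status description.
    A factor missing from a snapshot is treated as "not recorded"
    (an assumption of this sketch).
    """
    changes = []
    for factor in FACTORS:
        before = prior.get(factor, "not recorded")
        after = at_event.get(factor, "not recorded")
        if before != after:
            changes.append(Change(factor, before, after))
    return changes


# Hypothetical snapshots for illustration only.
prior_status = {
    "Personnel": "two qualified operators on shift",
    "Procedures": "valve checklist, current revision",
    "Maintenance": "monthly inspection performed",
}
event_status = {
    "Personnel": "one operator on shift",
    "Procedures": "valve checklist, current revision",
    "Maintenance": "inspection skipped",
}

for change in change_analysis(prior_status, event_status):
    # Each detected change is only a starting point for further questioning.
    print(f"{change.factor}: '{change.prior}' -> '{change.at_event}'")
    print("  Follow-up: how was this change permitted to occur?")
```

The sketch only lists what differs between the two states. Deciding whether a change had a significant effect, and why it was allowed to happen, still depends on the questioning process.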
The personnel systemic factors include issues of internal communication, training, and human factors such as physical health, mental health, substance abuse, and mental attention. If an organization involves people in technical tasks but has no way of ensuring they are functionally literate, it is setting itself up for major problems. If training is casual, not rigorously planned and executed, employees might have significant gaps in their knowledge of how to do their work. Training should include both general familiarization with the overall organization and specific performance-based instruction to master work tasks.

The organization must also consider how it deals with other human factors. For instance, are employees required to work extensive overtime and, if so, what effect does it have on safety and quality? How does the organization ensure that substance abuse and other personal problems are identified and that counseling and rehabilitation are available?

The procedural systemic factors concern how the organization chooses to develop and handle procedures to establish how the work should be done. Are the procedures clearly written? Are they reviewed for accuracy by someone other than the person who wrote them? Are they up to date? Are they available to the people performing the work?

The equipment systemic factors address how equipment is designed, selected, operated, and maintained. Has the organization selected the proper type of equipment for a task? Was the equipment properly designed, manufactured, and installed? Were the maintenance people properly trained to maintain the equipment? Was the equipment periodically maintained? Are there surveillance systems in place to ensure that proper maintenance is performed? Were the operators trained to use the equipment? Are the operators using statistical control limits to measure their work, or are they making adjustments as they see fit?
The materials systemic issues relate to how the process’s raw materials are being used. How do you specify which materials should be purchased? How does the organization ensure that the materials meet specifications? Does the organization require evidence of statistical control from its suppliers? Does the organization rely on on-site inspection of the suppliers’ facilities, or does it inspect suppliers’ products before use?

Once an organization receives materials, how does it ensure they are properly used? Does the organization have a system for controlling the identification of materials to ensure they do not get mixed up and misused? When materials are stored, how does the organization ensure they do not become lost or damaged?

The same questions about incoming materials for use in a process also apply to the end product of your organization. How do you ensure its quality and protect it during storage and shipping?

The environmental systemic issues involve how to handle natural and man-made environmental conditions in the work area. How does the organization protect its people, equipment, and materials from adverse environmental factors such as rain, snow, extreme temperatures, and humidity? How does the organization protect its people, equipment, and materials from man-made hazards such as radiation, corrosive gases and liquids, high temperatures, and explosive conditions? What systemic methods are in place to deal with these hazards?

Environmental issues also include the system an organization has in place to deal with the by-products of its primary manufacturing system. How does the organization control air contaminants, hazardous chemicals, toxic substances, dangerous fumes, and radioactive waste?

Asking embarrassing questions

How do you know when you are on the right track with your root-cause analysis? When you ask embarrassing questions that people in the organization normally do not discuss.

How far should you go in pursuing the root cause?
You have certainly gone too far when you start discussing theology. The root cause will be in the soil of the organization, which consists of the systemic issues of how we choose to manage our business, governmental, and nonprofit organizations.

References

1. Charles Kepner and Ben Tregoe, Problem Analysis and Decision Making (Princeton, NJ: Princeton Research Press, 1979).
2. Irving L. Janis, Victims of Groupthink (New York: Houghton Mifflin Co., 1972).

Dr. John Robert Dew
205-348-9831
jdew@aalan.ua.edu