In Search of the Root Cause

By Dr. John Robert Dew
An operator in a steam plant shuts off a critical valve, causing a
system malfunction and a million-dollar plant shutdown. The captain of an
oil tanker fails to turn his vessel at the proper time, causing it to run aground
and spill its cargo. The cockpit crew of an airplane does not realize the wing
flaps are not properly set before takeoff, leading to a crash and tragic loss of
life.
These scenarios sound like stories on the evening news. In every
case, there would be an investigation into the cause of the incident. Some
people would study the situation to find out what actually happened. Others
might be primarily concerned with who was to blame for the problem and
who is liable for the damages. A few might suggest finding the root cause of
the incident to ensure that it doesn’t happen again.
Defining a problem
In root-cause analysis, a problem is defined as a situation where the
performance of a system does not meet expectations. For instance, you
might expect your washing machine to run through all its cycles, but instead you
find that it frequently goes out of balance in the spin-dry setting.
Finding the immediate cause
When you have a problem, it is important to identify the immediate
cause. The immediate cause will be the action, or series of actions, that
directly creates the difference between expected and actual performance. In
some cases, the immediate cause will be simple to identify. If a valve on a
water system is left open and an operator freely admits that he mistakenly
left the valve open, we know the immediate cause of the problem.
In other cases, the immediate cause of a problem might be hidden and
require some investigation. In an airplane crash, there might not be an
obvious cause. Investigators will need a questioning process to help them
gather the important facts about the accident before they develop theories of
possible causes.
Effective questioning defines the problem and narrows the scope of
possible causes by describing what the problem is, where it exists, when it
occurs, and whether it is increasing in scope, staying the same, or decreasing
in scope. One of the best questioning methods for identifying the immediate
cause of a problem was developed by Chuck Kepner and Ben Tregoe through their
research on how scientists ask questions to solve problems.(1)
Taking action
Once the immediate cause of a problem is known, an appropriate
action can be chosen. Certain actions often provide short-term, temporary
solutions that are inexpensive and require knowing only the immediate cause
of the problem. Permanent, corrective action, however, requires knowing
the root cause.
At home, you might opt for a short-term solution to the problematic
washing machine by placing a block of wood under one side of the machine
so that it no longer stops during the spin-dry cycle. When you choose this
type of action, you might congratulate yourself for finding a creative
solution and avoiding the repairman’s bill.
Likewise, managers might opt for the short-term fix to a problem in
the work environment. For example, some copper tubing that supplies
coolant to a motor might need repair. Proper procedure requires that an
engineer be contacted to diagnose the reason for the tubing failure.
However, to save time and money and avoid organizational red tape, the
manager might have the local maintenance crew take care of the job and not
even bother recording that the work was performed. The rationale might be
“After all, it saved money, and everyone who needed to know about the
repair work was present when the decision was made to take the small
shortcut.” The manager might pat himself on the back, just as you would at
home.
The short-term fix can be seductive. In the long run, however, these
quick fixes can create new problems and make us fail to see significant
inadequacies in our operating processes. The most seductive part of the
short-term fix is needing to know only the immediate cause of the problem.
This means not having to extend the questioning into the search for the root
cause. This is a reward, since looking for root causes often means asking
embarrassing questions about how the organization functions. The longer an
organization goes without asking these tough questions, the more ingrained
systemic problems will become.
Looking for the root cause
The root cause is the most basic causal factor or factors that, if
corrected or removed, will prevent the recurrence of the situation. It is
important to understand where to look for root causes.
In nature, roots are found in the soil. In organizations, the soil is the
systemic factors that deal with how management plans, organizes,
controls, and provides assurance of quality and safety in five key areas that
will be discussed later.
The search for the root cause is a questioning process, and, because
root-cause analysis means asking difficult and sometimes embarrassing
questions about how an organization is managed, it might be ignored for
internal political reasons. Irving Janis has conducted extensive research on
the way people in an organization reinforce their own beliefs and behaviors
through mutual rationalization.(2) According to Janis’ studies, any effort to
question these insulated views of reality will be considered an assault by a
hostile force. The questioner will be subject to both indirect and direct
pressure not to rock the boat.
Tools for questioning
There are three commonly used tools for structuring the questioning
process. One tool is an event and causal factor diagram to visually display
the sequence of events. A second tool describes the problem in terms of
protective safeguards that have failed, and the third tool focuses on changes
that have occurred relative to the expected and real performance of a
process.
The event and causal factor diagram
The event and causal factor diagram establishes a chronological
sequence of events leading to the problem. Once the relevant events have
been identified and placed in their proper sequence, the investigator looks at
each step or action and asks, “What allowed this to happen?” The diagram
provides a visual tool for analyzing the actions relevant to the problem and
for tracing the actions back to their roots. Figure 1 shows what a basic
process diagram looks like.
Figure One
Figure 2 is an event and causal factor diagram for a car accident
investigation. After identifying the stages leading up to the accident, it is
important to ask, “What caused this?” When you learn that the truck was
moving too slowly for highway conditions, ask why again. When you learn
that the truck was overloaded, continue to ask why.
Figure Two. Sequence of events: driver pulls up behind a slow-moving truck; driver peeks around vehicle to pass; driver pulls out to pass; driver sees oncoming motorcycle.
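The repeated "why" questioning in the car accident example can be sketched in code. The following Python fragment is only an illustration, not part of the original method: the names CausalFactor and why_chain are hypothetical, and the chain contains only the two answers given above (the truck was moving too slowly, and the truck was overloaded). Each factor points to the answer to the next "why," and the chain ends where the investigator has not yet found an answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CausalFactor:
    """One answer to the question 'What allowed this to happen?'"""
    description: str
    cause: Optional["CausalFactor"] = None  # answer to the next 'why', if known


def why_chain(factor: CausalFactor) -> List[str]:
    """Follow the causes from an event back as far as they are known."""
    chain = []
    current: Optional[CausalFactor] = factor
    while current is not None:
        chain.append(current.description)
        current = current.cause
    return chain


# Only the causes named in the text are recorded; nothing further is assumed.
overloaded = CausalFactor("Truck was overloaded")
too_slow = CausalFactor(
    "Truck was moving too slowly for highway conditions", cause=overloaded
)
first_event = CausalFactor(
    "Driver pulls up behind a slow-moving truck", cause=too_slow
)

chain = why_chain(first_event)
for depth, description in enumerate(chain):
    # Indent each level to show one more 'why' having been asked.
    print("  " * depth + description)
```

The last entry printed is not necessarily the root cause; it is simply the point at which the questioning must continue.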
Safeguard analysis
Safeguard analysis provides a structured way to envision the events
related to system failure or the creation of a problem. Safeguard analysis
identifies the safeguards and controls that will remove or reduce hazards,
enforce compliance with procedures, and make targets invulnerable to
hazards. Figure 3 illustrates the source of a problem, a target or victim of
the situation, and barriers that are supposed to be in place to protect the
target from the source of the problem.
Figure Three
One example of safeguard analysis is to consider what problems cause
an oil tanker accident. The source of the potential problem is the crude oil
being shipped in the tanker. The potential target or victim is the marine life
and shorelines in the area affected by the spill.
There are several possible safeguards that will keep the oil from
affecting the marine life and minimize the impact of a spill: design barriers,
training and qualification barriers, and containment and clean-up procedures
and equipment.
To provide additional safeguards, an oil tanker might be designed
with a double hull to prevent rupturing, and the oil within might be stored in
several internal holds. A tanker designed with a single hull and a small
number of holds will be more efficient in passage but ineffective as a
barrier against an environmental problem.
A specific number of staff, having special qualifications to
handle certain valves or navigate certain waters, might be required on the
tanker. However, if an inadequate crew is allowed to operate the ship and if
an underqualified helmsman pilots the tanker, there will be no effective
safeguards against collision or grounding.
Even if an oil tanker runs aground and begins to spill its cargo,
strategically located containment and clean-up equipment can minimize the
damage. However, if the equipment is not available or the
operators are not well trained, the clean-up operation might be an inadequate
safeguard to protect the victim.
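The oil tanker example can be written down as a small data structure: a source, a target, and the safeguards between them. The Python sketch below is an assumption-laden illustration rather than a prescribed method. The names Safeguard and failed_safeguards are hypothetical, and the effective/ineffective findings are invented purely to show how the analysis would be recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Safeguard:
    name: str
    effective: bool  # hypothetical finding, invented for illustration


def failed_safeguards(safeguards: List[Safeguard]) -> List[str]:
    """Return the names of safeguards that did not protect the target."""
    return [s.name for s in safeguards if not s.effective]


source = "Crude oil shipped in the tanker"
target = "Marine life and shorelines in the affected area"

# The three safeguard types come from the text; the findings do not.
safeguards = [
    Safeguard("Design barriers (double hull, several internal holds)", False),
    Safeguard("Training and qualification of the crew", False),
    Safeguard("Containment and clean-up procedures and equipment", True),
]

failed = failed_safeguards(safeguards)
print(f"Source: {source}")
print(f"Target: {target}")
for name in failed:
    # Naming the failed safeguard is not the root cause; keep asking why.
    print(f"Failed safeguard: {name}. Why was this allowed?")
```

Listing the failed safeguards only identifies where to look. Each entry still needs the follow-up question of why the management system allowed the gap.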
Safeguards might be physical requirements, special procedures, or
implemented assurance activities. An organization can impose upon itself
administrative and verification controls that serve as safeguards against
potential problems. For instance, an organization might require that designs
be independently reviewed by a qualified engineer, separate from the
original designer, to ensure the quality of the design. An organization might
place stringent controls on how it handles drawings and how drawings will
be numbered, stored, and used in the field. There might be well-defined
steps for changing drawings and for providing controlled drawings to the
people who use them.
Training is another form of assurance activity. It is important that
operators and maintenance people are properly trained and that the
organization has evidence of the proper design, delivery, and completion of
the training programs. Training assurance is often as important as the
training itself.
However, when using the safeguard analysis to determine what object
or assurance activity either failed or was missing from the process, it is
important to keep asking questions. It is vital to discover why a critical
safeguard was left out or why the management system allowed inadequacies
in safeguards to exist. Knowing which safeguard was faulty does not
explain why it was faulty; that is the embarrassing question that needs to be
asked.
Change analysis
The third tool for conducting a root-cause analysis is referred to as
change analysis, in which the questioner compares the present state of the
system (the real, nonfunctioning situation) with the prior state of the system
(when it was working properly). The objective is to identify what has
changed in the system between the time it worked and the time it failed.
Investigating these changes will determine whether they had a significant
effect.
Change analysis must be followed by additional questioning to
determine how the changes were permitted to occur. It is always important
to continue the questioning process into the area of systemic factors.
Factor        Status at time of event   Status prior to event   Changes
Personnel
Supervision
Policies
Procedures
Equipment
Supplies
Facilities
Maintenance
Training
Technology
Software
Environment
Vendors
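The comparison at the heart of change analysis can be sketched as a simple difference between two records of the system. The Python fragment below is a minimal illustration under stated assumptions: the function name find_changes is hypothetical, and the status entries are invented examples, not data from any real investigation. The factor names are taken from the worksheet above.

```python
from typing import Dict, Tuple


def find_changes(
    prior: Dict[str, str], at_event: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return the factors whose status differs, as (prior, at event) pairs."""
    changes = {}
    for factor in sorted(set(prior) | set(at_event)):
        before = prior.get(factor, "not recorded")
        after = at_event.get(factor, "not recorded")
        if before != after:
            changes[factor] = (before, after)
    return changes


# Invented example statuses, for illustration only.
prior = {
    "Personnel": "Experienced operator on shift",
    "Procedures": "Current revision in use",
    "Equipment": "Original valve installed",
}
at_event = {
    "Personnel": "Newly assigned operator on shift",
    "Procedures": "Current revision in use",
    "Equipment": "Original valve installed",
}

changes = find_changes(prior, at_event)
for factor, (before, after) in changes.items():
    # Each change found is a prompt for the next question:
    # how was this change permitted to occur?
    print(f"{factor}: '{before}' -> '{after}'")
```

A change that appears in the output is a candidate cause, not a conclusion. Whether it had a significant effect, and how it was permitted, still has to be investigated.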
Systemic factors
Systemic issues concern how the management of the organization
plans, organizes, controls, and provides quality assurance and safety in five
key areas: personnel, procedures, equipment, materials, and the environment.
These five areas constitute the total work system. Each area can be broken
into several parts. The personnel systemic factors include issues of
internal communication, training, and human factors such as physical health,
mental health, substance abuse, and mental attention. If an organization
involves people in technical tasks but has no way of ensuring they are
functionally literate, it is setting itself up for major problems. If training is
casual, not rigorously planned and executed, employees might have
significant gaps in their knowledge of how to do their work. Training
should include both general familiarization with the overall organization and
specific performance-based instruction to master work tasks. The
organization must also consider how it deals with other human factors. For
instance, are employees required to work extensive overtime and, if so, what
effect does it have on safety and quality? How does the organization ensure
that substance abuse and other personal problems are identified and
counseling and rehabilitation are available?
The procedural systemic factors concern how the organization
chooses to develop and handle procedures to establish how the work should
be done. Are the procedures clearly written? Are they reviewed for
accuracy by someone other than the person who wrote them? Are they up to
date? Are they available to the people performing the work?
The equipment systemic factors address how equipment is
designed, selected, operated, and maintained. Has the organization selected
the proper type of equipment for a task? Was the equipment properly
designed, manufactured, and installed? Were the maintenance people
properly trained to maintain the equipment? Was the equipment periodically
maintained? Are there surveillance systems in place to ensure that proper
maintenance is performed? Were the operators trained to use the
equipment? Are the operators using statistical control limits to measure their
work, or are they making adjustments as they see fit?
The materials systemic issues relate to how the process’ raw
materials are being used. How do you specify which materials should be
purchased? How does the organization ensure that the materials meet
specifications? Does the organization require evidence of statistical control
from its suppliers? Does the organization rely on on-site inspection of the
suppliers’ facilities, or does it inspect suppliers’ products before use? Once
an organization receives materials, how does it ensure they are properly
used? Does the organization have a system for controlling the identification
of materials to ensure they do not get mixed up and misused? When
materials are stored, how does the organization ensure they do not become
lost or damaged? The same questions about incoming materials for use in a
process also apply to the end product of your organization. How do you
ensure its quality and protect it during storage and shipping?
The environmental systemic issues involve how to handle natural
and man-made environmental conditions in the work area. How does the
organization protect its people, equipment, and materials from adverse
environmental factors such as rain, snow, extreme temperatures, and
humidity? How does the organization protect its people, equipment, and
materials from man-made hazards such as radiation, corrosive gases and
liquids, high temperatures, and explosive conditions? What systemic
methods are in place to deal with these hazards? Environmental issues also
include the system an organization has in place to deal with the by-products
of its primary manufacturing system. How does the organization control air
contaminants, hazardous chemicals, toxic substances, dangerous fumes, and
radioactive waste?
Asking embarrassing questions
How do you know when you are on the right track with your root-cause analysis? When you ask embarrassing questions that people in the
organization normally do not discuss. How far should you go in pursuing
the root cause? You have certainly gone too far when you start discussing
theology. The root cause will be in the soil of the organization, which
consists of the systemic issues of how we choose to manage our business,
governmental, and nonprofit organizations.
References
1. Charles Kepner and Ben Tregoe, Problem Analysis and Decision Making
(Princeton, NJ: Princeton Research Press, 1979).
2. Irving L. Janis, Victims of Groupthink (New York: Houghton Mifflin
Co., 1972).
Dr. John Robert Dew
205-348-9831
jdew@aalan.ua.edu