"Introduction" to Normal Accidents,

advertisement
“Introduction” to Normal Accidents
by Charles Perrow
Welcome to the world of high-risk technologies.
You may have noticed that they seem to be
multiplying, and it is true. As our technology
expands, as our wars multiply, and as we invade
more and more of nature, we create systems—
organizations, and the organization of
organizations—that increase the risks for the
operators, passengers, innocent bystanders, and
for future generations. In this book we will review
some of these systems—nuclear power plants,
chemical plants, aircraft and air traffic control,
ships, dams, nuclear weapons, space missions,
and genetic engineering. Most of these risky
enterprises have catastrophic potential, the
ability to take the lives of hundreds of people in
one blow, or to shorten or cripple the lives of
thousands or millions more. Every year there are
more such systems. That is the bad news.
The good news is that if we can understand
the nature of risky enterprises better, we may be
able to reduce or even remove these dangers. I
have to present a lot of the bad news here in
order to reach the good, but it is the possibility of
managing high-risk technologies better than we
are doing now that motivates this inquiry. There
are many improvements we can make that I will
not dwell on, because they are fairly obvious—
such as better operator training, safer designs,
more quality control, and more effective
regulation. Experts are working on these
solutions in both government and industry. I am
not too sanguine about these efforts, since the
risks seem to appear faster than the reduction of
risks, but that is not the topic of this book.
Rather, I will dwell upon characteristics of
high-risk technologies that suggest that no
matter how effective conventional safety devices
are, there is a form of accident that is inevitable.
This is not good news for systems that have high
catastrophic potential, such as nuclear power
plants, nuclear weapons systems, recombinant
DNA production, or even ships carrying highly
toxic or explosive cargoes. It suggests, for
example, that the probability of a nuclear plant
meltdown with dispersion of radioactive materials
to the atmosphere is not one chance in a million
a year, but more like one chance in the next
decade.
Most high-risk systems have some special
characteristics, beyond their toxic or explosive or
genetic dangers, that make accidents in them
inevitable, even “normal.” This has to do with the
way failures can interact and the way the system
is tied together. It is possible to analyze these
special characteristics and in doing so gain a
much better understanding of why accidents
occur in these systems, and why they always will.
If we know that, then we are in a better position
to argue that certain technologies should be
abandoned, and others, which we cannot
abandon because we have built much of our
society around them, should be modified. Risk
will never be eliminated from high-risk systems,
and we will never eliminate more than a few
systems at best. At the very least, however, we
might stop blaming the wrong people and the
wrong factors, and stop trying to fix the systems
in ways that only make them riskier.
The argument is basically very simple. We
start with a plant, airplane, ship, biology
laboratory, or other setting with a lot of
components (parts, procedures, operators). Then
we need two or more failures among
components that interact in some unexpected
way. No one dreamed that when X failed, Y
would also be out of order and the two failures
would interact so as to both start a fire and
silence the fire alarm. Furthermore, no one can
figure out the interaction at the time and thus
know what to do. The problem is just something
that never occurred to the designers. Next time
they will put in an extra alarm system and a fire
suppressor, but who knows, that might just allow
three more unexpected interactions among
inevitable failures. This interacting tendency is
characteristic of a system, not of a part or an
operation; we will call it the “interactive
complexity” of the system.
For some systems that have this kind of
complexity, such as universities or research and
development labs, the accident will not spread
and be serious because there is a lot of slack
available, and time to spare, and other ways to
get things done. But suppose the system is also
“tightly coupled,” that is, processes happen very
fast and can’t be turned off, the failed parts
cannot be isolated from other parts, or there is
no other way to keep the production going safely.
Then recovery from the initial disturbance is not
possible; it will spread quickly and irretrievably
for at least some time. Indeed, operator action or
the safety systems may make it worse, since for
a time it is not known what the problem really is.
Probably many production processes started
out this way—complexly interactive and tightly
coupled. But with experience, better designs,
equipment, and procedures appeared, and the
unsuspected interactions were avoided and the
tight coupling reduced. This appears to have
happened in the case of air traffic control, where
interactive complexity and tight coupling have
been reduced by better organization and
“technological fixes.” We will also see how the
interconnection between dams and earthquakes
is beginning to be understood. We now know
that it involves a larger system than we originally
thought when we just closed off a canyon and let
it fill with water. But for most of the systems we
shall consider in this book, neither better
organization nor technological innovations
appear to make them any less prone to system
accidents. In fact, these systems require
organizational structures that have large internal
contradictions, and technological fixes that only
increase interactive complexity and tighten the
coupling; they become still more prone to certain
kinds of accidents.
If interactive complexity and tight coupling—
system characteristics—inevitably will produce
an accident, I believe we are justified in calling it
a normal accident, or a system accident. The
odd term normal accident is meant to signal that,
given the system characteristics, multiple and
unexpected interactions of failures are inevitable.
This is an expression of an integral characteristic
of the system, not a statement of frequency. It is
normal for us to die, but we only do it once.
System accidents are uncommon, even rare; yet
this is not all that reassuring, if they can produce
catastrophes.
The best way to introduce the idea of a normal
accident or a system accident is to give a
hypothetical example from a homey, everyday
experience. It should be familiar to all of us; it is
one of those days when everything seems to go
wrong.
A Day in the Life
You stay home from work or school because you
have an important job interview downtown this
morning that you have finally negotiated. Your
friend or spouse has already left when you make
breakfast, but unfortunately he or she has left the
glass coffeepot on the stove with the heat on.
The coffee has boiled dry and the glass pot has
cracked. Coffee is an addiction for you, so you
rummage about in the closet until you find an old
drip coffeemaker. Then you wait for the water to
boil, watching the clock, and after a quick cup
dash out the door. When you get to your car you
find that in your haste you have left your car keys
(and the apartment keys) in the apartment.
That’s okay, because there is a spare apartment
key hidden in the hallway for just such
emergencies. (This is a safety device, a
redundancy, incidentally.) But then you
remember that you gave a friend the key the
other night because he had some books to pick
up, and, planning ahead, you knew you would not
be home when he came. (That finishes that
redundant pathway, as engineers call it.)
Well, it is getting late, but there is always the
neighbor’s car. The neighbor is a nice old gent
who drives his car about once a month and
keeps it in good condition. You knock on the
door, your tale ready. But he tells you that it just
so happened that the generator went out last
week and the man is coming this afternoon to
pick it up and fix it. Another “backup” system has
failed you, this time through no connection with
your behavior at all (uncoupled or independent
events, in this case, since the key and the
generator are rarely connected). Well, there is
always the bus. But not always. The nice old
gent has been listening to the radio and tells you
the threatened lock-out of the drivers by the bus
company has indeed occurred. The drivers
refuse to drive what they claim are unsafe buses,
and incidentally want more money as well. (A
safety system has foiled you, of all things.) You
call a cab from your neighbor’s apartment, but
none can be had because of the bus strike.
(These two events, the bus strike and the lack of
cabs, are tightly connected, dependent events,
or tightly coupled events, as we shall call them,
since one triggers the other.)
You call the interviewer’s secretary and say,
“It’s just too crazy to try to explain, but all sorts of
things happened this morning and I can’t make
the interview with Mrs. Thompson. Can we
reschedule it?” And you say to yourself, next
week I am going to line up two cars and a cab
and make the morning coffee myself. The
secretary answers “Sure,” but says to himself,
“This person is obviously unreliable; now this
after pushing for weeks for an interview with
Thompson.” He makes a note to that effect on
the record and searches for the most
inconvenient time imaginable for next week, one
that Mrs. Thompson might have to cancel.
Now I would like you to answer a brief
questionnaire about this event. Which was
the primary cause of this “accident” or foulup?
1. Human error (such as leaving the heat on
under the coffee, or forgetting the keys in the
rush)? Yes______ No______ Unsure ______
2. Mechanical failure (the generator on the
neighbor’s car)? Yes______ No______ Unsure
______
3. The environment (bus strike and taxi
overload)? Yes______ No______ Unsure
______
4. Design of the system (in which you can lock
yourself out of the apartment rather than having
to use a door key to set the lock; a lack of
emergency capacity in the taxi fleet)?
Yes______ No______ Unsure ______
5. Procedures used (such as warming up coffee
in a glass pot; allowing only normal time to get
out on this morning)? Yes______ No______
Unsure ______
If you answered “not sure” or “no” to all of the
above, I am with you. If you answered “yes” to
the first, human error, you are taking a stand on
multiple failure accidents that resembles that of
the President’s Commission to Investigate the
Accident at Three Mile Island. The Commission
blamed everyone, but primarily the operators.
The builders of the equipment, Babcock and
Wilcox, blamed only the operators. If you
answered “yes” to the second choice,
mechanical error, you can join the Metropolitan
Edison officials who run the Three Mile Island
plant. They said the accident was caused by the
faulty valve, and then sued the vendor, Babcock
and Wilcox. If you answered “yes” to the fourth,
design of the system, you can join the experts of
the Essex Corporation, who did a study for the
Nuclear Regulatory Commission of the control
room.
The best answer is not “all of the above” or
any one of the choices, but rather “none of the
above.” (Of course I did not give you this as an
option.) The cause of the accident is to be found
in the complexity of the system. That is, each of
the failures—design, equipment, operators,
procedures, or environment—was trivial by itself.
Such failures are expected to occur since
nothing is perfect, and we normally take little
notice of them. The bus strike would not affect
you if you had your car key or the neighbor’s car.
The neighbor’s generator failure would be of little
consequence if taxis were available. If it were not
an important appointment the absence of cars,
buses, and taxis would not matter. On any other
morning the broken coffeepot would have been
an annoyance (an incident, we will call it), but
would not have added to your anxiety and
caused you to dash out without your keys.
Though the failures were trivial in themselves,
and each one had a backup system or redundant
path to tread if the main one were blocked, the
failures became serious when they interacted. It
is the interaction of the multiple failures that
explains the accident. We expect bus strikes
occasionally, we expect to forget our keys with
that kind of apartment lock (why else hide a
redundant key?), we occasionally loan the extra
key to someone rather than disclose its hiding
place. What we don’t expect is for all of these
events to come together at once. That is why we
told the secretary that it was a crazy morning, too
complex to explain, and invoked Murphy’s law to
ourselves (if anything can go wrong, it will).
That accident had its cause in the interactive
nature of the world for us that morning and in its
tight coupling—not in the discrete failures, which
are to be expected and which are guarded
against with backup systems. Most of the time
we don’t notice the inherent coupling in our
world, because most of the time there are no
failures, or the failures that occur do not interact.
But all of a sudden, things that we did not realize
could be linked (buses and generators, coffee
and a loaned key) became linked. The system is
suddenly more tightly coupled than we had
realized. When we have interactive systems that
are also tightly coupled, it is “normal” for them to
have this kind of an accident, even though it is
infrequent. It is normal not in the sense of being
frequent or being expected—indeed, neither is
true, which is why we were so baffled by what
went wrong. It is normal in the sense that it is an
inherent property of the system to occasionally
experience this interaction. Three Mile Island
was such a normal or system accident, and so
were countless others that we shall examine in
this book. We have such accidents because we
have built an industrial society that has some
parts, like industrial plants or military adventures,
that have highly interactive and tightly coupled
units. Unfortunately, some of these have high
potential for catastrophic accidents.
Our “day in the life” example introduced some
useful terms. Accidents can be the result of
multiple failures. Our example illustrated failures
in five components: in design, equipment,
procedures, operators, and environment. To
apply this concept to accidents in general, we will
need to add a sixth area—supplies and
materials. All six will be abbreviated as the
DEPOSE components (for design, equipment,
procedures, operators, supplies and materials,
and environment). The example showed how
different parts of the system can be quite
dependent upon one another, as when the bus
strike created a shortage of taxis. This
dependence is known as tight coupling. On the
other hand, events in a system can occur
independently as we noted with the failure of the
generator and forgetting the keys. These are
loosely coupled events, because although at this
time they were both involved in the same
production sequence, one was not caused by the
other.
One final point, which our example cannot
illustrate: it isn’t the best case of a normal
accident or system accident, as we shall use
these terms, because the interdependence of the
events was comprehensible for the person or
“operator.” She or he could not do much about
the events singly or in their interdependence, but
she or he could understand the interactions. In
complex industrial, space, and military systems,
the normal accident generally (not always)
means that the interactions are not only
unexpected, but are incomprehensible for some
critical period of time. In part this is because in
these human-machine systems the interactions
literally cannot be seen. In part it is because,
even if they are seen, they are not believed. As
we shall find out and as Robert Jervis and Karl
Weick have noted, seeing is not necessarily
believing; sometimes, we must believe before we
can see.
Variations on the Theme
While basically simple, the idea that guides this
book has some quite radical ramifications. For
example, virtually every system we will examine
places “operator error” high on its list of causal
factors—generally about 60 to 80 percent of
accidents are attributed to this factor. But if, as
we shall see time and time again, the operator is
confronted by unexpected and usually
mysterious interactions among failures, saying
that he or she should have zigged instead of
zagged is possible only after the fact. Before the
accident no one could know what was going on
and what should have been done. Sometimes
the errors are bizarre. We will encounter
“noncollision course collisions,” for example,
where ships that were about to pass in the night
suddenly turn and ram each other. But careful
inquiry suggests that the mariners had quite
reasonable explanations for their actions; it is
just that the interaction of small failures led them
to construct quite erroneous worlds in their
minds, and in this case these conflicting images
led to collision.
Another ramification is that great events have
small beginnings. Running through the book are
accidents that start with trivial kitchen mishaps;
we will find them on aircraft and ships and in
nuclear plants, having to do with making coffee
or washing up. Small failures abound in big
systems; accidents are not often caused by
massive pipe breaks, wings coming off, or
motors running amok. Patient accident
reconstruction reveals the banality and triviality
behind most catastrophes.
Small beginnings all too often cause great
events when the system uses a “transformation”
process rather than an additive or fabricating
one. Where chemical reactions, high
temperature and pressure, or air, vapor, or water
turbulence is involved, we cannot see what is
going on or even, at times, understand the
principles. In many transformation systems we
generally know what works, but sometimes do
not know why. These systems are particularly
vulnerable to small failures that “propagate”
unexpectedly, because of complexity and tight
coupling. We will examine other systems where
there is less transformation and more fabrication
or assembly, systems that process raw materials
rather than change them. Here there is an
opportunity to learn from accidents and greatly
reduce complexity and coupling. These systems
can still have accidents—all systems can. But
they are more likely to stem from major failures
whose dynamics are obvious, rather than the
trivial ones that are hidden from understanding.
Another ramification is the role of
organizations and management in preventing
failures—or causing them. Organizations are at
the center of our inquiry, even though we will
often talk about hardware and pressure and
temperature and the like. High-risk systems have
a double penalty; because normal accidents
stem from the mysterious interaction of failures,
those closest to the system, the operators, have
to be able to take independent and sometimes
quite creative action. But because these systems
are so tightly coupled, control of operators must
be centralized because there is little time to
check everything out and be aware of what
another part of the system is doing. An operator
can’t just do her own thing; tight coupling means
tightly prescribed steps and invariant sequences
that cannot be changed. But systems cannot be
both decentralized and centralized at the same
time; they are organizational Pushme-pullyous,
straight out of the Dr. Dolittle stories, trying to go
opposite directions at once. So we must add
organizational contradictions to our list of
problems.
Even aside from these inherent
contradictions, the role of organizations is
important in other respects for our story. Time
and time again warnings are ignored,
unnecessary risks taken, sloppy work done,
deception and downright lying practiced. As an
organizational theorist I am reasonably unshaken
by this; it occurs in all organizations, and it is a
part of the human condition. But when it comes
to systems with radioactive, toxic, or explosive
materials, or those operating in an unforgiving,
hostile environment in the air, at sea, or under
the ground, these routine sins of organizations
have very nonroutine consequences. Our ability
to organize does not match the inherent hazards
of some of our organized activities. Better
organization will always help any endeavor. But
the best is not good enough for some that we
have decided to pursue.
Nor can better technology always do the job.
Besides being a book about organizations (but
painlessly, without the jargon and the sacred
texts), this is a book about technology. You will
probably learn more than you ever wanted to
about condensate polishers, buffet boundaries,
reboilers, and slat retraction systems. But that is
in passing (and even while passing you are
allowed a considerable measure of
incomprehension). What is not in passing but is
essential here is an evaluation of technology and
its “fixes.” As the saying goes, man’s reach has
always exceeded his grasp (and of course that
goes for women too). It should be so. But we
might begin to learn that of all the glorious
possibilities out there to reach for, some are
going to be beyond our grasp in catastrophic
ways. There is no technological imperative that
says we must have power or weapons from
nuclear fission or fusion, or that we must create
and loose upon the earth organisms that will
devour our oil spills. We could reach for, and
grasp, solar power or safe coal-fired plants, and
the safe ship designs and industry controls that
would virtually eliminate oil spills. No catastrophic
potential flows from these.
It is particularly important to evaluate
technological fixes in the systems that we cannot
or will not do without. Fixes, including safety
devices, sometimes create new accidents, and
quite often merely allow those in charge to run
the system faster, or in worse weather, or with
bigger explosives. Some technological fixes are
error-reducing: the jet engine is simpler and
safer than the piston engine; fathometers are
better than lead lines; three engines are better
than two on an airplane; computers are more
reliable than pneumatic controls. But other
technological fixes are excuses for poor
organization or an attempt to compensate for
poor system design. The attention of authorities
in some of these systems, unfortunately, is hard
to get when safety is involved.
When we add complexity and coupling to
catastrophe, we have something that is fairly new
in the world. Catastrophes have always been
with us. In the distant past, the natural ones
easily exceeded the human-made ones. Human-made catastrophes appear to have increased
with industrialization as we built devices that
could crash, sink, burn, or explode. In the last
fifty years, however, and particularly in the last
twenty-five, to the usual cause of accidents—
some component failure, which could be
prevented in the future—was added a new
cause: interactive complexity in the presence of
tight coupling, producing a system accident. We
have produced designs so complicated that we
cannot anticipate all the possible interactions of
the inevitable failures; we add safety devices that
are deceived or avoided or defeated by hidden
paths in the systems. The systems have become
more complicated because either they are
dealing with more deadly substances, or we
demand they function in ever more hostile
environments or with ever greater speed and
volume. And still new systems keep appearing,
such as gene splicing, and others grow ever
more complex and tightly tied together. In the
past, designers could learn from the collapse of a
medieval cathedral under construction, or the
explosion of boilers on steamboats, or the
collision of railroad trains on a single track. But
we seem to be unable to learn from chemical
plant explosions or nuclear plant accidents. We
may have reached a plateau where our learning
curve is nearly flat. It is true that I should be wary
of that supposition. Reviewing the wearisome
Cassandras in history who prophesied that we
had reached our limit with the reciprocating
steam engine or the coal-fired railroad engine
reminds us that predicting the course of
technology in history is perilous. Some well-placed warnings will not harm us, however.
One last warning before outlining the chapters
to come. The new risks have produced a new
breed of shamans, called risk assessors. As with
the shamans and the physicians of old, it might
be more dangerous to go to them for advice than
to suffer unattended. In our last chapter we will
examine the dangers of this new alchemy where
body counting replaces social and cultural
values and excludes us from participating in
decisions about the risks that a few have decided
the many cannot do without. The issue is not
risk, but power.
Fast Forward
Chapter 1 will examine the accident at Three
Mile Island (TMI) where there were four
independent failures, all small, none of which the
operators could be aware of. The system caused
that accident, not the operators. Chapter 2 raises
the question of why, if these plants are so
complex and tightly coupled, we have not had
more TMIs. A review of the nuclear power
industry and some of its trivial and its serious
accidents will suggest that we have not given
large plants of the size of TMI time to express
themselves. The record of the industry and the
Nuclear Regulatory Commission is frightening,
but not because it is all that different from the
records of other industries and regulatory
agencies. It isn’t. It is frightening because of the
catastrophic potential of this industry; it has to
have a perfect performance record, and it is far
from achieving that.
We can go a fair distance with some loosely
defined concepts such as complexity, coupling,
and catastrophe, but in order to venture further
into the world of high-risk systems we need
better definitions, and a better model of systems
and accidents and their consequences. This is
the work of Chapter 3, where terms are defined
and amply illustrated with still more accident
stories. In this chapter we explore the
advantages of loose coupling, map the industrial,
service, and voluntary organizational world
according to complexity and coupling, and add a
definition of types of catastrophes. Chapter 4
applies our complexity, coupling, and
catastrophe theories to the chemical industry. I
wish to make it clear that normal accidents or, as
we will generally call them, system accidents, are
not limited to the nuclear industry. Some of the
most interesting and bizarre examples of the
unanticipated interaction of failures appear in this
chapter—and we are now talking about a quite
well-run industry with ample riches to spend on
safety, training, and high-technology solutions.
Yet chemical plants mostly just sit there,
though occasionally they will send a several
hundred pound missile a mile away into a
community or incinerate a low flying airplane. In
Chapter 5 we move out into the environment and
examine aircraft and flying, and air traffic control
and the airports and airways. Flying is in part a
transformation system, but largely just very
complex and tightly coupled. Technological fixes
are made continuously here, but designers and
airlines just keep pushing up against the limits
with each new advance. Flying is risky, and
always will be. With the airways system, on the
other hand, we will examine the actual reduction
of complexity and coupling through
organizational changes and technological
developments; this system has become very
safe, as safety goes in inherently risky systems.
An examination of the John Wayne International
Airport in Orange County, California, will remind
us of the inherent risks.
With marine transport, in Chapter 6, the
opposite problem is identified. No reduction in
complexity or coupling has been achieved.
Horrendous tales are told, three of which we will
detail, about the needless perils of this system.
We will analyze it as one that induces errors
through its very structure, examining insurance,
shipbuilders, shippers, captains and crews,
collision avoidance systems, and the
international anarchy that prevents effective
regulation and encourages cowboys and hot
rodders at sea. One would not think that ships
could pile up as if they were on the Long Island
Expressway, but they do.
Chapter 7 might seem to be a diversion since
dams, lakes, and mines are not prone to system
accidents. But it will support our point because
they are also linear, rather than complex
systems, and the accidents there are
foreseeable and avoidable. However, when we
move away from the individual dam or mine and
take into account the larger system in which they
exist, we find the “eco-system accident,” an
interaction of systems that were thought to be
independent but are not because of the larger
ecology. Once we realize this we can prevent
future accidents of this type; in linear systems we
can learn from our mistakes. Dams, lakes, and
mines also simply provide tales worth telling. Do
dams sink or float when they fail? Could we
forestall a colossal earthquake in California by a
series of mammoth chiropractic spinal
adjustments? How could we lose a whole lake
and barges and tugs in a matter of hours? (By
inadvertently creating an eco-system accident.)
Chapter 8 deals with far more esoteric
systems. Space missions are very complex and
tightly coupled, but the catastrophic potential was
small and now is smaller. More important, this
system allows us to examine the role of the
operator (in this case, extraordinarily well-trained
astronauts) whom the omniscient designers and
managers tried to treat like chimpanzees. It is a
cautionary tale for all high-technology systems.
Accidents with nuclear weapons, from dropping
them to firing them by mistake, will illustrate a
system so complicated and error-prone that the
fate of the earth may be decided more by
inadvertence than anger. The prospects are, I
am afraid, terrifying. Equally frightening is the
section in this chapter on gene splicing, or
recombinant DNA. In this case, in the unseemly
haste for prizes and profits, we have abandoned
even the most elementary safeguards, and may
loose upon the world a rude beast whose time
need not have come.
In the last chapter we shall examine the new
shamans, the risk assessors, and their
inadvertent allies, the cognitive psychologists.
Naturally, as a sociologist, I will have a few sharp
words to say about the latter, but point out that
their research has really provided the grounds for
a public role in high-risk decision making, one
that risk assessors do not envisage. Finally, we
will add up the credits and deficits of the systems
we examined, and I will make a few modest
suggestions for complicating the lives of some
systems—and shutting others down completely.