
the e-risks of e-commerce
Professor Ken Birman
Dept. of Computer Science
Cornell University
 If it stops ticking when it takes a
licking… your e-commerce company
could tank
 So you need to know that your
technology base is reliable
 It does what it should do, does it when
needed, does it correctly, and is
accessible to your customers.
A Quiz
 Q: When and why did Sun Microsystems
have a losing quarter?
Ken Birman:
Mr. Birman,
Sun experienced a loss in Q4FY89 (June 1989). This was the quarter in which we
transitioned to a new manufacturing, order processing, and inventory control system.
Andrew Casey
Manager, Investor Relations
Sun Microsystems, Inc.
(650) 336-0761
[email protected]
Typical Web Session
[Diagram: the browser resolves the URL through the DNS hierarchy (root, node, and leaf servers), often via a caching proxy, obtains an IP address, and then contacts the web server.]
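To make the diagram concrete, here is a minimal Python sketch of the same two steps (the hostname www.example.com is only an illustrative placeholder): resolve the name through DNS, then fetch the page from the web server at the resulting IP address.

    import socket
    import urllib.request

    # Step 1: DNS resolution -- the resolver may consult a caching proxy
    # and the root, node, and leaf DNS servers to find the IP address.
    host = "www.example.com"  # illustrative placeholder hostname
    ip_address = socket.gethostbyname(host)
    print(f"{host} resolved to {ip_address}")

    # Step 2: contact the web server at that address and fetch the page.
    with urllib.request.urlopen(f"http://{host}/", timeout=5) as response:
        page = response.read()
    print(f"fetched {len(page)} bytes from {host}")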
The Web’s dark side
Netscape error: web server
... not responding. Server may
have crashed or is overloaded.
Right URL, but the
request times out. Why?
The web server could be down
Your network connection may have failed
There could be a problem in the “DNS”
There could be a network routing problem
The Internet may be experiencing an overload
Your web caching proxy may be down
Your PC might have a problem, or your version of Netscape (or
Explorer), or the file system you are using, or your LAN
 The URL itself may be wrong
 A router or network link may have failed and the Internet may
not yet have rerouted around the problem
E-Trade computers
crash again -- and again
Edupage Editors <[email protected]> Sun, 07 Feb 1999 10:28:30 -0500
The computer system of online securities firm
E-Trade crashed on Friday for the third
consecutive day. "It was just a software
glitch. I think we were all frustrated by
it," says an E-Trade executive. Industry
analyst James Mark of Deutsche Bank
commented “…it's the application on a
large scale. As soon as E-Trade's
volumes started spiking up, they had
the same problems as others…."
Reliable Distributed Computing:
Increasingly urgent, yet unsolved
 Distributed computing has swept the world
Impact has become revolutionary
Vast wave of applications migrating to networks
Already as critical a national infrastructure as water,
electricity, or telephones
 Yet distributed systems remain
Unreliable, prone to inexplicable outages
Insecure, easily attacked
Difficult (and costly) to program, bug-prone
A National Imperative
 Potential for catastrophe cited by
Presidential Commission on Critical Infrastructure
Protection (PCCIP)
National Academy of Sciences study Trust in Cyberspace
 These experts warn that we need a quantum
improvement in technologies
 Meanwhile, your e-commerce venture is at
grave risk of stumbling – just like many others
A Business Imperative
E-projects Often Fail
 e-commerce revolves around computing
 Even business and marketing people are
at the mercy of these systems
 When your company’s computing
systems aren’t running, you’re out of business
Big and Little
 It is too easy to understand “reliability” as a
narrow technical issue
 In fact, many systems and companies stumble
by building
Unreliable technologies, because of
A mixture of poor management and poor technical decisions
 Reliable systems demand a balance between
good management and good technology
A Glimpse of Some
“Unreliable Systems”
 Quick review of some failed projects
 These were characterized by poor
reliability of the final product
 But the issues were not really technical
 As future managers you need to
understand this phenomenon!
Tales from the
Software Crypt

NYC Control of 10,000 Traffic Lights
Univac, based on experience in Baltimore and Toronto
Started in late 1960’s
Scrapped 2-3 years later
Spent: ?
Second system effect: new radio control system, new software and algorithms
Earlier systems were 100x smaller
Incommensurate scaling

California Dept. of Motor Vehicles
Vehicle registration and driver’s licenses
Started in 1987
Scrapped 1994
Spent: $44M
Underestimated cost by a factor of 3
Slower than the 1965 system it replaced
Governor fired the whistleblower
DMV blames Tandem, Tandem blames DMV

United Airlines/UNIVAC
Automated reservations, ticketing, flight schedules, fuel delivery, kitchens, and general administration
Started in late 1960’s
Scrapped early 1970’s
Spent: $50M
Second system effect: tried to automate everything, including the kitchen sink
Ditto: Burroughs/TWA
Delta currently planning to build something similar, but they will use the web

Confirm (Hilton, Marriott, Budget, American Airlines)
Hotel reservations, links to Wizard and Sabre
Started: 1988
Scrapped: 1992
Spent: $125M
Second system; “magic bullet” concept
Very dull tools (machine language)
Bad-news diode
Today this is done over the web, and it works well
See CACM October 1994 for details

Source: Jerry Saltzer, Keynote address, SOSP 1999
Tales from the
Software Crypt

SACSS (California)
State-wide system for automated child support tracking
Started 1991 ($99M)
“On hold” 1997
Spent: $300M
Lockheed and HWDC disagree on what the system contains and which part of it isn’t working
“Departments shouldn’t deploy a system to additional users if it is not working”

Taurus (British Stock Exchange)
Replacement settlement system for the British Stock Exchange
Started 1980’s
Scrapped 1993
Spent: $600M
“Massive complexity of the back-end settlement systems…”
Delays and cost overruns
Even the question of how to represent numbers wasn’t settled

IBM Workplace OS for PC
Mach 3.0 + binary compatibility with Pink, AIX, DOS, OS/400 + new clock mgt. + new RPC + new I/O + new CPU
Started in 1991
Scrapped 1996
Spent: $2B
400 staff on kernel, 1500 elsewhere
“Sheer complexity of the class structure proved to be overwhelming”
Early design choices and compatibility decisions doomed the project

Advanced Automation System (FAA)
Replacement for “en route” air traffic control
Started 1982
Scrapped 1994
Spent: more than $6B
Management misestimated size and length of project
Project goals constantly changed
Poor technology choices
Run by gov’t. bureaucrats

Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Standish Group Study
[Chart: share of projects over budget, behind schedule, or lacking functions, versus on time, on budget, and on function.]
The typical troubled project came in at 2x budget, 2x completion time, and only 2/3 of planned functionality.
Source: Jerry Saltzer, Keynote address, SOSP 1999
A strange picture
 Many technology projects fail
For lots of reasons
 But some succeed
Today we do web-based hotel reservations
all the time, yet “Confirm” failed
French air traffic project was a success yet
US project lost $6 billion
 Is there a pattern?
Recurring Problems
Incommensurate scaling
Too many ideas
Mythical man-month
Bad ideas included
Modularity is hard
Bad-news diode
Best people are far more productive than average
 New is better, not-even-available yet is best
 Magic bullet syndrome
Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Study of Tandem
Computer Systems
 77% of failures are software problems.
 Software fault-tolerance techniques can
overcome about 75% of detected faults.
 Loose coupling between primary and backup is
important for software fault tolerance.
 Over two-thirds (72%) of measured software
failures are recurrences of previously reported bugs.
Source: Jerry Saltzer, Keynote address, SOSP 1999
A Buggy Aside
 Q: What are the two main categories of
software bugs called?
 A: Bohrbugs and Heisenbugs
 Q: Why?
Bohr Model of Atom
 Bohr argued that the
nucleus was a little ball
 A Bohrbug is a nasty but well-defined thing
 Your technical people
can reproduce it, so they
can nail it
Heisenberg Model of Atom
 Heisenberg modeled the atom as a cloud of electrons around a cloud-like nucleus
 The closer you look, the more it wiggles
 A Heisenbug moves when your people try to pin it down. They won’t find it easy to fix.
 Bohrbugs tend to be deterministic errors –
outright mistakes in the code
 Once you understand what triggers them they
are easy to search for and fix
 Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, the effect may show up long after the bug first “occurs”. They are hard to fix because at the time the mistake happened, nothing obvious went wrong.
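As an illustrative sketch (my own example, not from the lecture): the first function below contains a Bohrbug that fails identically on every run, while the second contains a Heisenbug, a race condition whose effect depends on thread timing and may vanish when you slow the program down to observe it.

    import threading

    # Bohrbug: a deterministic, reproducible mistake in the code.
    # Every call with month=12 fails the same way, so it is easy to nail.
    def days_in_month(month):
        days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30]  # December missing!
        return days[month - 1]  # IndexError for month=12, every single time

    # Heisenbug: a race condition. Two threads increment a shared counter
    # without a lock; the read-modify-write steps can interleave, silently
    # losing updates -- but only sometimes, depending on thread timing.
    counter = 0

    def increment_many(n):
        global counter
        for _ in range(n):
            counter += 1  # not atomic: load, add, store can interleave

    threads = [threading.Thread(target=increment_many, args=(100_000,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # may be less than 200000, and varies run to run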
Why Systems fail
 Mostly, because something crashes
 Usually, software or a human error
 Mean time to failure improves with age but
software problems remain prevalent
 Every kind of software system is prone to
failures. Failure to plan for failures is the most
common way for e-systems to fail.
 We want e-commerce solutions to be reliable…
but what should this mean?
Fast enough?
Accessible to customers?
 Deliver critical services when needed, where
needed, in a correct, timely manner
Costs of a Failure
Minimizing Downtime
 Idea is to design critical parts of your
system to survive failures
 Two basic approaches
Recoverable systems are designed to restart
without human intervention – but may wait
until outage is repaired
Highly available systems are designed to
keep running during failure
 The technology is called “transactions”
We’ll discuss this next time, but…
 Main issue is the time needed to restart the system
 For a large database, half an hour or
more is not at all unusual
 Faster restart requires a “warm standby”
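A toy sketch of a recoverable design, under my own simplifying assumptions (a single hypothetical log file, no compaction): each update is forced to disk before being applied, so the system can restart without human intervention by replaying the log; replay time is exactly what makes restarting a big database slow.

    import json
    import os

    LOG = "updates.log"  # hypothetical on-disk log file

    def write(key, value):
        # Append the update durably *before* applying it in memory,
        # so a crash can never lose an acknowledged update.
        with open(LOG, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        state[key] = value

    def recover():
        # On restart, rebuild the in-memory state by replaying the log.
        # For a large database, this replay is what makes restart slow.
        recovered = {}
        if os.path.exists(LOG):
            with open(LOG) as f:
                for line in f:
                    record = json.loads(line)
                    recovered[record["key"]] = record["value"]
        return recovered

    state = recover()   # automatic: no human intervention needed
    write("cart-42", "3 items")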
High Availability
 Idea is to have a way to keep the system
running even while some parts are down
 For example, a backup that takes over if
primary fails
 Backup is kept “warm”
This involves replicating information
As changes occur, backup may lag behind
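A toy sketch of a warm standby, again under assumptions of my own: the primary applies each update immediately and ships it to the backup asynchronously, so the backup stays nearly current but can lag, and a failover may lose the last few updates still in flight. That lost window is the price paid for not slowing the primary down.

    import queue
    import threading

    class WarmStandby:
        """Toy primary/backup pair: updates are replicated asynchronously,
        so the backup is kept 'warm' but may lag the primary slightly."""

        def __init__(self):
            self.primary = {}
            self.backup = {}
            self._log = queue.Queue()   # replication channel to the backup
            self._shipper = threading.Thread(target=self._replicate,
                                             daemon=True)
            self._shipper.start()

        def write(self, key, value):
            self.primary[key] = value   # applied immediately on the primary
            self._log.put((key, value)) # shipped to the backup in background

        def _replicate(self):
            while True:
                key, value = self._log.get()  # backup applies updates later,
                self.backup[key] = value      # so it can lag the primary

        def fail_over(self):
            # On primary failure, promote the backup. Updates still queued
            # in the log at that instant are lost -- the cost of asynchrony.
            self.primary = self.backup

    store = WarmStandby()
    store.write("order-17", "shipped")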
Complexity
 The looming threat to your e-commerce solution, no matter what it may be
 Even simple systems are hard to make reliable
 Complex systems are almost impossible to
make reliable
 Yet innovative e-commerce projects often
require fairly complex technologies!
Two Side-by-Side Case Studies
 American Advanced Automation System
Intended as replacement for air traffic control
Needed because Pres. Reagan fired many
controllers in 1981
But project was a fiasco, lost $6B
 French Phidias System
Similar goals, slightly less ambitious
But rolled out, on time and on budget, in 1999
 Air traffic control systems are using
1970’s technology
 Extremely costly to maintain and
impossible to upgrade
 Meanwhile, load on controllers is rising
 Can’t easily reduce load
[Diagram: one air traffic control site: a team of controllers working against the air traffic database (flight plans, etc.).]
 Government wanted to upgrade the
whole thing, solve a nagging problem
 Controllers demanded various
simplifications and powerful new tools
 Everyone assumed that what you use at
home can be adapted to the demands of
an air traffic control center
 IBM bid the project, proposed to use its
own workstations
 These aren’t super reliable, so they
proposed to adapt a new approach to fault-tolerance
 Idea is to plan for failure
Detect failures when they occur
Automatically switch to backups
Core Technical Issue?
 Problem revolves around high availability
 Waiting for restart was not seen as an option: the goal
was at most 10 seconds of downtime in 10 years
 So IBM proposed a replication scheme much
like the “load balancing” approach
 IBM had primary and backup simply do the
same work, keeping them in the same state
[Diagram: conceptual flow of IBM’s fault-tolerant process-pair scheme: the primary and backup each perform the same identify and lookup steps in parallel.]
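A bare-bones sketch of the process-pair idea, under my own assumptions rather than IBM’s actual design: primary and backup both execute every request, deterministically and in the same order, so their states stay identical and the backup can take over instantly. (The flight identifier below is a made-up example.)

    class Replica:
        """One member of a process pair: applies requests to local state."""

        def __init__(self, name):
            self.name = name
            self.state = {}

        def apply(self, request):
            # Deterministic processing: identical inputs in identical order
            # yield identical state on both replicas.
            key, value = request
            self.state[key] = value

    class ProcessPair:
        """Primary and backup do the same work on every request, so the
        backup's state always matches and failover is nearly instant."""

        def __init__(self):
            self.primary = Replica("primary")
            self.backup = Replica("backup")

        def submit(self, request):
            # Both replicas process each request in lock step.
            self.primary.apply(request)
            self.backup.apply(request)

        def primary_failed(self):
            # The backup already holds identical state: promote it.
            self.primary, self.backup = self.backup, self.primary

    pair = ProcessPair()
    pair.submit(("flight-UA93", "altitude 31000"))
    pair.primary_failed()  # backup takes over with no lost state

The catch, as the next slides show, is that keeping two processes in perfect lock step under real-time constraints is far harder than this sketch suggests.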
Why is this Hard?
 The system has many “real-time” constraints
on it
Actions need to occur promptly
Even if something fails, we want the human
controller to continue to see updates
 IBM’s technology
Based on a research paper by Flaviu Cristian
But had never been used except for proof of
concept purposes, on a small scale in the laboratory
 IBM’s proposal sounded good…
… and they were the second lowest bidder
… and they had the most aggressive schedule
 So the FAA selected them over the competing bidders
 IBM took on the whole thing all at once
Disaster Strikes
 Immediate confusion: all parts of the system
seemed interdependent
To design part A I need to know how part B, also
being designed, will work
 Controllers didn’t like early proposals and
insisted on major changes to design
 Fault-tolerance idea was one of the reasons
IBM was picked, but made the system so
complex that it went on the back burner
Summary of the Rescue Plan
 Focus on some core components
 Postpone worry about fault-tolerance until later
 Try to build a simple version that can be
fleshed out later
… but the simplification wasn’t enough. Too
many players kept intruding with requirements
Crash and Burn
 The technical guys saw it coming
Probably as early as one year into the effort
But they kept it secret (“bad news diode”)
Anyhow, management wasn’t listening
(“they’ve heard it all before – whining engineers!”)
 The fault-tolerance scheme didn’t work
Many technical issues unresolved
 The FAA kept out of the technical issues
But a mixture of changing specifications and serious
technical issues were at the root of the problems
What came out?
 In the USA, nothing.
 The entire system was useless – the
technology was of an all-or-nothing style
and nothing was ready to deploy
 British later rolled out a very limited
version of a similar technology, late, with
many bugs, but it does work…
Contrast with French
 They took a very incremental approach
Early design sought to cut back as much as possible
If it isn’t “mandatory” don’t do it yet
Focus was on console cluster architecture
and fault-tolerance
 They insisted on using off-the-shelf components
Contrast with French
 Managers intervened in technology decisions
For example, the vendor wanted to do a
home-brew fault-tolerance technology
French insisted on a specific existing
technology and refused to bid out the work
until vendors accepted
A critical “good call” as it worked out
Learning by Doing
 To gain experience with technology
They tested, and tested, and tested
Designed simple prototypes and played with them
Discovered that large cluster would perform poorly
But found a “sweet spot” and worked within it
This forced project to cut back on some goals
 9/10ths of the time and expense on any
system is in testing and debugging
 Many projects overlook this
 French planned conservatively
Software Bugs
Figure one bug per 10 lines in new code
But still as many as one bug per 250 lines in old code
Bugs show up under stress
Trick is to run a system in an unstressed mode
French identified “stress points” and designed
to steer far from them
 Their design also assumed that components
would fail and automated the restart
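Taking the slide’s rough bug-density figures at face value, a quick back-of-the-envelope calculation shows why “plan for failure” is sound advice; the system sizes below are made-up examples, not data from the lecture.

    # Rough bug-density figures from the slide (estimates, not measurements):
    BUGS_PER_LINE_NEW = 1 / 10    # new code: about one bug per 10 lines
    BUGS_PER_LINE_OLD = 1 / 250   # old, shaken-out code: one per 250 lines

    # Hypothetical e-commerce system: mostly mature code plus a new feature.
    old_lines = 500_000
    new_lines = 20_000

    expected = old_lines * BUGS_PER_LINE_OLD + new_lines * BUGS_PER_LINE_NEW
    print(f"Expected latent bugs: about {expected:,.0f}")
    # -> about 4,000, with 2,000 lurking in the "stable" code alone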
All of this worked!
 Take-aways from French project?
Complex technical issues at the core of the system
But they managed to break the big project into pieces
Do the critical core first, separately, and focus
exclusively on it
Test, test, test
Don’t build anything you can possibly buy
Management was technically sophisticated enough
to make some critical “calls”
Your Problem
 e-commerce systems are at e-risk
 These e-risks take many forms:
System complexity
Failure to plan for failures
Poor project management
 We ignore this at our peril, as we’ve seen
 But how can we learn to do better?
Keys to Reliability
 Know the basic technologies
 Realize that software is buggy and failures will occur
Design to treat failure as a mundane event
Failure to plan for failure is the biggest e-risk!
 Complexity is a huge threat. Use your naiveté
as an advantage: if you can’t understand it,
why assume that “they” can understand it?
 The network and associated services
 Databases
 Web servers
 “Scripts” – the glue your people use to
tie it all together
Next Lecture
 Look at some realistic e-commerce scenarios
 Ask ourselves where to start first, if we
need to convince ourselves that the
system will be reliable enough
 The trick is to balance system complexity
against adequate risk coverage