A Layered Naming Architecture for the Internet

Towards an Internet that “Never Fails”
Hari Balakrishnan
MIT
Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru
What We Should Aim Toward
• Carrier airlines (2002 FAA Fact Book)
  - 41 accidents, 6.7 million flights (five “nines” availability)
• 911 phone service (1993 NRIC report)
  - 29 minutes downtime per year per line (four “nines” availability)
• Standard phone service (various sources)
  - 53 minutes downtime per year per line (four “nines” availability)
• The Internet?
  - One to two “nines”
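The "nines" figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming (as the slide does) that accidents per flight and downtime minutes per year can both be treated as an unavailability fraction; the function names are illustrative and not from the talk.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def nines(unavailability: float) -> float:
    """Number of 'nines' as a continuous value: 0.001 unavailability -> 3.0."""
    return -math.log10(unavailability)


def downtime_minutes_per_year(n_nines: float) -> float:
    """Yearly downtime budget implied by a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** (-n_nines)


# Figures quoted on the slide.
airline = nines(41 / 6.7e6)               # accidents per flight
phone_911 = nines(29 / MINUTES_PER_YEAR)  # 29 minutes of downtime per year
phone_std = nines(53 / MINUTES_PER_YEAR)  # 53 minutes of downtime per year

for name, value in [("airlines", airline), ("911", phone_911), ("phone", phone_std)]:
    print(f"{name}: about {value:.1f} nines")

# The slide's estimate for the Internet, expressed as downtime.
print(downtime_minutes_per_year(2) / 60, "hours per year at two nines")
print(downtime_minutes_per_year(1) / (60 * 24), "days per year at one nine")
```

Rounded to the nearest whole number, the three computed values agree with the five, four, and four nines quoted above. At one to two nines, the yearly downtime budget is measured in days rather than minutes.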
Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004
NANOG List Failure “Analysis”
More than 70% of threads discussing failures related to router configuration or route announcement problems
[Bar chart: number of threads over the stated period (y-axis, 0 to 90) by failure category (Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes), grouped by period (1994-1997, 1998-2001, 2001-2004)]
Note: Only includes problems openly discussed on this list.
Faults and Failures
• Fault = Underlying defect in a component that causes it to violate a specification
  - Latent or Active (i.e., cause errors)
• Unmasked faults (errors) cause failures
  - Failure of subsystem (spec violation) causes fault in system
• Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
• Internet failure: A cannot communicate with B
Three Directions
• Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems (and stubs)
Today: Reactive Operation
“What happens if I tweak this policy…?”
• Problems cause downtime
• Problems often not immediately apparent
Coping with Complexity
• View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks
• Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR03]
• Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP:F+04]
  - Use static analysis and sandboxing tools
Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Diagram: workflow Configure, Detect Faults (rcc), Deploy. Inside rcc: distributed router configurations (single AS) are turned into a normalized representation, the correctness specification is mapped to constraints, and rcc reports faults.]
• Represent complex, distributed configuration
• Define a correctness specification
• Map specification to constraints
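To make the three steps concrete, here is a minimal sketch of checking one constraint against a normalized representation. The `Session` record, the filter names, and the router names are hypothetical and chosen only for illustration; this is not rcc's actual data model or code.

```python
from dataclasses import dataclass


# Hypothetical normalized representation: one record per BGP session,
# collected from every router's configuration in the AS.
@dataclass(frozen=True)
class Session:
    router: str         # router whose config declares the session
    neighbor: str       # neighbor router or external peer
    import_filter: str  # name of the filter applied to incoming routes


def undefined_filters(sessions, defined_filters):
    """Constraint: every filter a session references must be defined on the
    router that references it. Returns the violations (faults)."""
    faults = []
    for s in sessions:
        if s.import_filter not in defined_filters.get(s.router, set()):
            faults.append((s.router, s.neighbor, s.import_filter))
    return faults


sessions = [
    Session("r1", "peer-a", "PEER-IN"),
    Session("r2", "peer-a", "PEER-IN"),
    Session("r2", "cust-b", "CUST-IN"),
]
# Filters actually defined in each router's configuration.
defined_filters = {"r1": {"PEER-IN"}, "r2": {"CUST-IN"}}

faults = undefined_filters(sessions, defined_filters)
for router, neighbor, name in faults:
    print(f"{router}: session to {neighbor} references undefined filter {name}")
```

Because the check runs over the configurations of all routers at once, it can catch a fault before deployment, which is the point of moving from reactive to proactive operation.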
Correctness Specification
Path Visibility
Every destination with a usable path has a route advertisement
If there exists a path, then there exists a route
Example violation: Signaling partition
Route Validity
Every route advertisement corresponds to a usable path
If there exists a route, then there exists a path
Example violation: Routing loop
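The two properties can be illustrated with small checks. This is a sketch under simplifying assumptions that are not from the talk: iBGP is modeled as a full mesh with no route reflectors (so any missing session can hide routes), and forwarding state is a hypothetical per-router next-hop table toward a single destination.

```python
from itertools import combinations


def missing_ibgp_sessions(routers, sessions):
    """Path-visibility check under a simplifying assumption: a full iBGP mesh
    with no route reflectors, so every pair of routers needs a session."""
    have = {frozenset(pair) for pair in sessions}
    return [pair for pair in combinations(sorted(routers), 2)
            if frozenset(pair) not in have]


def has_forwarding_loop(next_hop, source, destination):
    """Route-validity check: follow next hops from source and report whether
    a router is revisited before the destination is reached."""
    seen = set()
    node = source
    while node != destination:
        if node in seen:
            return True
        seen.add(node)
        if node not in next_hop:
            return False  # no route at this router: a blackhole, not a loop
        node = next_hop[node]
    return False


routers = {"r1", "r2", "r3"}
ibgp = [("r1", "r2"), ("r2", "r3")]  # the r1-r3 session is missing
print(missing_ibgp_sessions(routers, ibgp))

# Hypothetical next hops toward destination "d": r1 -> r2 -> r3 -> r1.
next_hop = {"r1": "r2", "r2": "r3", "r3": "r1"}
print(has_forwarding_loop(next_hop, "r1", "d"))
```

The first check flags a case where a usable path may exist but no route is learned; the second flags a route that exists without a usable path behind it.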
Results: Faults across 17 ASes
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration
[Bar chart: number of ASes (y-axis, 0 to 10) exhibiting each fault type, grouped into Path Visibility and Route Validity faults. Fault types shown: Incomplete Filter, Undefined Filter, Transit Between Peers, Inconsistent Import, Inconsistent Export, Incomplete iBGP Session, Duplicate Loopback, iBGP Signaling Partition]
Three Directions
• Configuration as programming
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems
Prefixes are too coarse-grained
Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures not visible in BGP [FABK03]
…but they are also too fine-grained!
• ~70% of discontiguous prefix pairs from the same AS are announced from the same location
• Allocation explains about 60% of these cases:
  - Registries often allocate discontiguous address blocks to a single AS on the same day
• Routes for these prefixes will “flap” together.
  - 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
Route objects should correspond to an “atom” of hosts that share fate
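The Agere and Lucent prefixes above show why prefixes are too fine-grained: the two blocks share fate but cannot be named by one prefix. A short check using Python's standard `ipaddress` module, offered only as an illustration:

```python
import ipaddress

agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

# The two blocks do not overlap...
print(agere.overlaps(lucent))

# ...and they are not adjacent, so aggregation leaves them as two routes.
aggregated = list(ipaddress.collapse_addresses([agere, lucent]))
print(aggregated)
```

Since aggregation cannot merge them, a routing system built on prefixes has to carry and update two route objects for what is, operationally, a single unit.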
Proposal: Atomic Interdomain Protocol (AIP)
• Exterminate prefixes
• Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
• Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)
• End-system addresses have the form [AD : LocalID]
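A self-certifying identifier can be sketched in a few lines. The choices below are assumptions made for illustration and are not specified in the talk: SHA-256 as the hash, a hex string as the encoding, a placeholder byte string standing in for a real public key, and a made-up LocalID.

```python
import hashlib


def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier as a hash of the domain's public key.
    SHA-256 and the hex encoding are assumptions made for this sketch."""
    return hashlib.sha256(public_key).hexdigest()


def address(ad: str, local_id: str) -> str:
    """Format an end-system address in the [AD : LocalID] form."""
    return f"[{ad} : {local_id}]"


def claims_match(ad: str, presented_key: bytes) -> bool:
    """Self-certification: anyone can check a presented key against the
    AD ID without consulting a registry."""
    return ad_id(presented_key) == ad


key = b"placeholder bytes standing in for a real public key"
ad = ad_id(key)
print(address(ad, "host-17"))               # "host-17" is a made-up LocalID
print(claims_match(ad, key))                # key matches the identifier
print(claims_match(ad, b"some other key"))  # a different key does not
```

Matching the hash only shows that a key belongs to the name. A complete design would also need the domain to prove possession of the corresponding private key, for example by signing its routing messages.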
Summary
It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers