Routing Measurements:
Three Case Studies
Jennifer Rexford
Motivations for Measuring the Routing System
• Characterizing the Internet
– Internet path properties
– Demands on Internet routers
– Routing convergence
• Improving Internet health
– Protocol design problems
– Protocol implementation problems
– Configuration errors or attacks
• Operating a network
– Detecting and diagnosing routing problems
– Traffic shifts, routing attacks, flaky equipment, …
Techniques for Measuring Internet Routing
• Active probing
– Inject probes along path through the data plane
– E.g., using traceroute
• Passive route monitoring
– Capture control-plane messages between routers
– E.g., using tcpdump or a software router
– E.g., dumping the routing table on a router
• Injecting network events
– Cause failure/recovery at planned time and place
– E.g., BGP route beacon, or planned maintenance
Challenges in Measuring Routing
• Data vs. control plane
– Understand relationship between routing protocol
messages and the impact on data traffic
• Cause vs. effect
– Identify the root cause for a change in the
forwarding path or control-plane messages
• Visibility and representativeness
– Collect routing data from many vantage points
– Across many Autonomous Systems, or within
• Large volume of data
– Many end-to-end paths
– Many prefixes and update messages
Measurement Tools: Traceroute
• Traceroute tool exploits TTL-limited probes
– Observation of the forwarding path
• Useful, but introduces many challenges
– Path changes
– Non-participating nodes
– Inaccurate, two-way measurements
– Hard to map interfaces to routers and ASes
[Figure: the source sends probes with TTL=1, TTL=2, … toward the destination; each probe expires at a successive router, which returns a “time exceeded” message]
Send packets with TTL=1, 2, 3, … and record the source of each “time exceeded” message
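The TTL mechanics above can be sketched as a toy model (the router names and simulated path are hypothetical; a real traceroute needs raw sockets and typically root privileges, so this only illustrates the logic):

```python
# Toy model of TTL-limited probing. Not a real traceroute: it simulates
# a fixed router path instead of sending packets.

def probe(path, ttl):
    """Return (replying_node, message) for a probe sent with the given TTL.

    path is the ordered list of routers between source and destination.
    """
    if ttl <= len(path):
        # TTL reaches zero at router path[ttl - 1], which sends an ICMP
        # "time exceeded" message back to the source.
        return path[ttl - 1], "time exceeded"
    # The probe reaches the destination; e.g., a UDP probe to an unused
    # port elicits "port unreachable", telling traceroute to stop.
    return "destination", "port unreachable"

def traceroute(path):
    """Record the source of each reply for TTL = 1, 2, 3, ..."""
    hops = []
    ttl = 1
    while True:
        node, msg = probe(path, ttl)
        hops.append(node)
        if msg == "port unreachable":
            return hops
        ttl += 1
```

This makes the challenge of path changes concrete: if `path` changes between probes, consecutive hops come from different paths, which is why the study required seeing the same path repeatedly.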
Measurement: Intradomain Route Monitoring
• OSPF is a flooding protocol
– Every link-state advertisement is sent on every link
– Very helpful for simplifying the monitor
• Can participate in the protocol
– Shared media (e.g., Ethernet)
• Join multicast group and listen to LSAs
– Point-to-point links
• Establish an adjacency with a router
• … or passively monitor packets on a link
– Tap a link and capture the OSPF packets
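On the passive side, a captured OSPF packet begins with a fixed 24-byte common header; a minimal sketch of decoding its fixed fields (layout per RFC 2328; the sample bytes in any usage would be hypothetical):

```python
import struct
import socket

# OSPFv2 packet types from RFC 2328 (type 4, LS Update, carries the LSAs
# a monitor cares about).
OSPF_TYPES = {1: "Hello", 2: "DB Description", 3: "LS Request",
              4: "LS Update", 5: "LS Ack"}

def parse_ospf_header(data):
    """Decode the fixed fields of the OSPFv2 common header.

    data: raw bytes of an OSPF packet, e.g. captured off a tapped link
    (OSPF packets are sent to multicast group 224.0.0.5 on shared media).
    """
    version, ptype, length, router_id, area_id, _checksum, _autype = \
        struct.unpack(">BBH4s4sHH", data[:16])
    return {
        "version": version,
        "type": OSPF_TYPES.get(ptype, "unknown"),
        "length": length,                       # total packet length in bytes
        "router_id": socket.inet_ntoa(router_id),
        "area_id": socket.inet_ntoa(area_id),
    }
```

The flooding property mentioned above is what makes this simple: one tap (or one adjacency) sees every LSA in the area.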
Measurement: Interdomain Route Monitoring
• Option 1: talk to operational routers using SNMP or telnet at the command line
– (+) Table dumps show all alternate routes
– (-) BGP table dumps are expensive
– (-) Update dynamics lost
– (-) Restricted to interfaces provided by vendors
• Option 2: establish a “passive” BGP session (over TCP) from a workstation running BGP software
– (+) Table dumps do not burden operational routers
– (+) Update dynamics captured
– (+) Not restricted to interfaces provided by vendors
– (-) Receives only best routes from BGP neighbor
Collect BGP Data From Many Routers
[Figure: nationwide backbone map (Seattle, San Francisco, Chicago, New York, Washington D.C., Houston, Orlando, …) with a single Route Monitor attached]
• BGP is not a flooding protocol
– So the monitor must collect routes from many routers, not just one
Two Kinds of BGP Monitoring Data
• Wide-area, from many ASes
– RouteViews or RIPE-NCC data
– Pro: available from many vantage points
– Con: often just one or two views per AS
• Single AS, from many routers
– Abilene and GEANT public repositories
– Proprietary data at individual ISPs
– Pro: comprehensive view of a single AS
– Con: limited public examples, mostly research nets
Measurement: Injecting Events
• Equipment failure/recovery
– Unplug/reconnect the equipment 
– Packet filters that block all packets
– Knowing when planned event will take place
– Shutting down a routing-protocol adjacency
• Injecting route announcements
– Acquire some blocks of IP addresses
– Acquire a routing-protocol adjacency to a router
– Announce/withdraw routes on a schedule
– Beacons: http://psg.com/~zmao/BGPBeacon.html
Two Papers for Today
• Both early measurement studies
– Initially appeared at SIGCOMM’96 and ’97
– Both won the “best student paper” award 
– Early glimpses into the health of Internet routing
– Early wave of papers on Internet measurement
• Differences in emphasis
– Paxson96: end-to-end active probing to measure
the characteristics of the data plane
– Labovitz97: passive monitoring of BGP update
messages from several ISPs to characterize
(in)stability of the interdomain routing system
Paxson Study: Forwarding Loops
• Forwarding loop
– Packet returns to same router multiple times
• May cause traceroute to show a loop
– If loop lasted long enough
– So many packets traverse the loopy path
• Traceroute may reveal false loops
– Path change that leads to a longer path
– Causing later probe packets to hit same nodes
• Heuristic solution
– Require traceroute to return same path 3 times
Paxson Study: Causes of Loops
• Transient vs. persistent
– Transient: routing-protocol convergence
– Persistent: likely configuration problem
• Challenges
– Appropriate time boundary between the two?
– What about flaky equipment going up and down?
– Determining the cause of persistent loops?
• Anecdote on recent study of persistent loops
– Provider has static route for customer prefix
– Customer has default route to the provider
Paxson Study: Path Fluttering
• Rapid changes between paths
– Multiple paths between a pair of hosts
– Load balancing policies inside the network
• Packet-based load balancing
– Round-robin or random
– Multiple paths for packets in a single flow
• Flow-based load balancing
– Hash of some fields in the packet header
– E.g., IP addresses, port numbers, etc.
– To keep packets in a flow on one path
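The flow-based scheme can be sketched as a stable hash over the 5-tuple (the MD5 choice and field names are illustrative; real routers use vendor-specific hardware hash functions):

```python
import hashlib

# Sketch of flow-based load balancing: hash the 5-tuple so every packet
# of a flow maps to the same path, while different flows spread across
# the available paths.
def pick_path(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    # A stable hash (unlike Python's salted built-in hash()) so the
    # mapping is identical across restarts and line cards.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths
```

Packet-based balancing, by contrast, would pick a path per packet (round-robin or random), which is exactly what lets a single flow flutter across multiple paths.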
Paxson Study: Routing Stability
• Route prevalence
– Likelihood of observing a particular route
– Relatively easy to measure with sound sampling
– Poisson arrivals see time averages (PASTA)
– Most host pairs have a dominant route
• Route persistence
– How long a route endures before a change
– Much harder to measure through active probes
– Look for cases of multiple observations
– Typical host pair has path persistence of a week
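Prevalence reduces to a frequency count over route observations (a sketch, assuming probes are sent at Poisson-spaced times so that PASTA makes the sample fraction an unbiased time-average estimate):

```python
from collections import Counter

def route_prevalence(observations):
    """Fraction of probes that saw each route.

    observations: list of route identifiers, one per (Poisson-spaced) probe
    for a single host pair.
    """
    counts = Counter(observations)
    total = len(observations)
    return {route: n / total for route, n in counts.items()}

def dominant_route(observations):
    """The route observed most often for this host pair."""
    prev = route_prevalence(observations)
    return max(prev, key=prev.get)
```

Persistence is harder precisely because this sampling says nothing about how long each route lasted between probes.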
Paxson Study: Route Asymmetry
• Hot-potato (early-exit) routing
– Each provider hands traffic to the next AS at the nearest of multiple peering points
• Other causes
– Asymmetric link weights in intradomain routing
– Cold-potato routing, where an AS requests that traffic enter at a particular place
• Consequences
– Lots of asymmetry
– One-way delay is not necessarily half of the round-trip time
[Figure: Customer A and Customer B connected through Provider A and Provider B, which peer at multiple points; early-exit routing sends each direction through a different peering point]
Labovitz Study: Interdomain Routing
• AS-level topology
– Destinations are IP prefixes (e.g., 12.0.0.0/8)
– Nodes are Autonomous Systems (ASes)
– Links are connections & business relationships
[Figure: AS-level graph of seven ASes (1–7) connecting a client to a Web server]
Labovitz Study: BGP Background
• Extension of distance-vector routing
– Support flexible routing policies
– Avoid count-to-infinity problem
• Key idea: advertise the entire path
– Distance vector: send distance metric per dest d
– Path vector: send the entire path for each dest d
[Figure: AS 1 connects to destination d and advertises “d: path (1)” to AS 2; AS 2 advertises “d: path (2,1)” to AS 3; data traffic flows along the reverse of the advertised paths]
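The loop-avoidance benefit of advertising the entire path can be sketched in a few lines: an AS discards any route whose AS path already contains its own number, which is how path-vector routing sidesteps count-to-infinity:

```python
# Sketch of path-vector loop detection (AS numbers are illustrative).

def accept_route(my_as, as_path):
    """Reject a route whose AS path already contains our AS number."""
    return my_as not in as_path

def advertise(my_as, as_path):
    """Prepend our AS number before passing the route to a neighbor."""
    return [my_as] + as_path
```

A distance-vector protocol, seeing only a metric, has no way to make this check.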
Labovitz Study: BGP Background
• BGP is an incremental protocol
– In theory, no update messages in steady state
• Two kinds of update messages
– Announcement: advertising a new route
– Withdrawal: withdrawing an old route
• Study saw an alarming number of updates
– At the time, Internet had around 45,000 prefixes
– Routers were exchanging 3-6 million updates/day
– Sometimes as high as 30 million in a day
• Placing a very high load on the routers
Labovitz Study: Classifying Update Messages
• Analyze update messages
– For each (prefix, peer) tuple
– Classify the kinds of routing changes
• Forwarding instability
– WADiff: explicit withdraw, replaced by alternate
– AADiff: implicit withdraw, replaced by alternate
• Pathological
– WADup: explicit withdraw, and then reannounced
– AADup: duplicate announcement
– WWDup: duplicate withdrawal
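A simplified sketch of this taxonomy for one (prefix, peer) update stream (the paper's exact rules compare full route attributes; here a "route" is any comparable value, and an announcement followed by its first withdrawal is labeled plainly):

```python
# Classify BGP updates per the Labovitz taxonomy (simplified sketch).
# updates: list of ("A", route) announcements or ("W", None) withdrawals
# for a single (prefix, peer) pair, in arrival order.

def classify_updates(updates):
    labels = [None]                      # the first update has no predecessor
    prev_kind, prev_route = updates[0]
    last_announced = prev_route if prev_kind == "A" else None
    for kind, route in updates[1:]:
        if prev_kind == "A" and kind == "A":
            # Implicit withdrawal: new announcement replaces the old one.
            labels.append("AADup" if route == prev_route else "AADiff")
        elif prev_kind == "W" and kind == "A":
            # Re-announcement after an explicit withdrawal.
            labels.append("WADup" if route == last_announced else "WADiff")
        elif prev_kind == "W" and kind == "W":
            labels.append("WWDup")       # pathological duplicate withdrawal
        else:                            # "A" then "W": a plain withdrawal
            labels.append("Withdrawal")
        if kind == "A":
            last_announced = route
        prev_kind, prev_route = kind, route
    return labels
```

Running this over millions of updates per day is what surfaced WWDups as the dominant pathology in the study.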
Labovitz Study: Duplicate Withdrawals
• Time-space trade-off in router implementation
– Common system building technique
– Trade one resource for another
– Can have surprising side effects
• The gory details
– Ideally, you should not send a withdrawal if you
never sent a neighbor a corresponding
announcement
– Requires remembering what update message you
sent to each neighbor
– Easier to just send everyone a withdrawal when
your route goes away
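The trade-off can be sketched as two withdrawal strategies (the class and message tuples are illustrative, not an actual BGP implementation):

```python
# Sketch of the time-space trade-off behind duplicate withdrawals:
# a stateful speaker remembers what it announced to each neighbor and
# suppresses needless withdrawals; a stateless one just tells everyone.

class StatefulSpeaker:
    def __init__(self, neighbors):
        self.neighbors = neighbors
        # Memory cost: set of announced prefixes per neighbor.
        self.sent = {n: set() for n in neighbors}

    def announce(self, neighbor, prefix):
        self.sent[neighbor].add(prefix)
        return (neighbor, "A", prefix)

    def withdraw(self, prefix):
        """Withdraw only from neighbors that actually heard an announcement."""
        msgs = []
        for n in self.neighbors:
            if prefix in self.sent[n]:
                self.sent[n].discard(prefix)
                msgs.append((n, "W", prefix))
        return msgs

def stateless_withdraw(neighbors, prefix):
    # No per-neighbor memory: every neighbor gets a withdrawal, even those
    # that never received an announcement -- the source of WWDups.
    return [(n, "W", prefix) for n in neighbors]
```

The stateless variant saves memory on the sender at the cost of extra update load on every receiver.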
Labovitz Study: Practical Impact
• “Stateless BGP” is compliant with the standard
– But, it forces other routers to handle more load
– So that you don’t have to maintain state
– Arguably very unfair, and bad for global Internet
• One router vendor was largely at fault
– Router vendor modified its implementation
– ISPs then deployed the updated software
Labovitz Study: Still Hard to Diagnose Problems
• Despite having very detailed view into BGP
– Some pathologies were very hard to diagnose
• Possible causes
– Flaky equipment
– Synchronization of BGP timers
– Interaction between BGP and intradomain routing
– Policy oscillation
• These topics were studied in follow-up studies
– Example: study of BGP data within a large ISP
– http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
ISP Study: Detecting Important Routing Changes
• Large volume of BGP update messages
– Around 2 million/day, and very bursty
– Too much for an operator to manage
• Identify important anomalies
– Lost reachability
– Persistent flapping
– Large traffic shifts
• Not the same as root-cause analysis
– Identify changes and their effects
– Focus on mitigation, rather than diagnosis
– Diagnose causes if they occur in/near the AS
Challenge #1: Excess Update Messages
• A single routing change
– Leads to multiple update messages
– Affects routing decision at multiple routers
[Figure: BGP updates collected from the border routers feed a “BGP Update Grouping” stage, which outputs Events and a list of Persistent Flapping Prefixes]
Group updates for a prefix with inter-arrival < 70 seconds,
and flag prefixes with changes lasting > 10 minutes.
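The grouping rule can be sketched directly from the two thresholds above (the 70-second gap and 600-second duration come from the study; timestamps here are illustrative):

```python
# Sketch of BGP update grouping: updates for one prefix with inter-arrival
# gaps under 70 seconds form one event; events lasting longer than
# 600 seconds are flagged as persistent flapping.

def group_events(timestamps, gap=70, max_duration=600):
    """timestamps: sorted update arrival times (seconds) for one prefix.

    Returns (events, flapping), where each event is a (start, end) pair.
    """
    events = []
    start = prev = timestamps[0]
    for t in timestamps[1:]:
        if t - prev < gap:
            prev = t                 # same event continues
        else:
            events.append((start, prev))
            start = prev = t         # a new event begins
    events.append((start, prev))
    flapping = [e for e in events if e[1] - e[0] > max_duration]
    return events, flapping
```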
Determine “Event Timeout”
[Plot: cumulative distribution of BGP update inter-arrival times, with BGP beacon data for comparison; 98% of inter-arrival gaps fall within 70 seconds, motivating the 70-second event timeout]
Event Duration: Persistent Flapping
[Plot: complementary cumulative distribution of event duration; only 0.1% of events last longer than 600 seconds, and these long events are treated as persistent flapping]
Detecting Persistent Flapping
• Significant persistent flapping
– 15.2% of all BGP update messages
– … though a small number of destination prefixes
– Surprising, especially since flap dampening is used
• Types of persistent flapping
– Conservative flap-damping parameters (78.6%)
– Policy oscillations, e.g., MED oscillation (18.3%)
– Unstable interface or BGP session (3.0%)
Example: Unstable eBGP Session
[Figure: AT&T border routers (AE, BE, CE, DE) connected to a customer and a peer; the customer’s unstable eBGP session repeatedly announces and withdraws prefix p]
Challenge #2: Identify Important Events
• Major concerns of network operators
– Changes in reachability
– Heavy load of routing messages on the routers
– Flow of the traffic through the network
[Figure: Events feed an “Event Classification” stage that outputs “Typed” Events]
Classify events by the type of impact they have on the network:
• No Disruption
• Internal Disruption
• Single External Disruption
• Multiple External Disruption
• Loss/Gain of Reachability
Event Category – “No Disruption”
[Figure: AT&T border routers (AE, BE, CE, DE, EE) reaching prefix p through AS1 and AS2; no traffic shift at any router]
“No Disruption”: each of the AT&T border routers has no traffic shift
Event Category – “Internal Disruption”
[Figure: AT&T border routers reaching prefix p through AS1 and AS2; the only traffic shifts are internal]
“Internal Disruption”: all of the traffic shifts are internal to AT&T
Event Type: “Single External Disruption”
[Figure: AT&T border routers reaching prefix p through AS1 and AS2; an external traffic shift moves traffic from one exit point to others]
“Single External Disruption”: traffic at one exit point shifts to other exit points
Statistics on Event Classification
Category                       Events    Updates
No Disruption                  50.3%     48.6%
Internal Disruption            15.6%      3.4%
Single External Disruption     20.7%      7.9%
Multiple External Disruption    7.4%     18.2%
Loss/Gain of Reachability       6.0%     21.9%
Challenge #3: Multiple Destinations
• A single routing change
– Affects multiple destination prefixes
[Figure: “Typed” Events feed an “Event Correlation” stage that outputs Clusters]
Group events of same type that occur close in time
Main Causes of Large Clusters: BGP Resets
• External BGP session resets
– Failure/recovery of external BGP session
– E.g., session to another large tier-1 ISP
– Caused “single external disruption” events
– Validated by looking at syslog reports on routers
[Figure: AT&T border routers connected to AS1 and AS2; an eBGP session reset at one border router disrupts routes to prefix p]
Main Causes of Large Clusters: Hot Potatoes
• Hot-potato routing changes
– Failure/recovery of an intradomain link
– E.g., leads to changes in IGP path costs
– Caused “internal disruption” events
– Validated by looking at OSPF measurements
[Figure: ISP with egress routers AE, BE, and CE toward prefix P; IGP path costs of 11, 9, and 10 determine which egress is closest]
“Hot-potato routing” =
route to closest egress point
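Hot-potato egress selection can be sketched in a few lines (the costs in the usage mirror the figure's 9/10/11 and are illustrative):

```python
# Sketch of hot-potato (early-exit) routing: among the egress points that
# can reach the destination, pick the one with the smallest IGP path cost.

def closest_egress(igp_costs, egress_points):
    """igp_costs: IGP cost from this router to each candidate egress."""
    return min(egress_points, key=lambda e: igp_costs[e])
```

A small IGP cost change near the current egress can flip this choice, which is why an intradomain link failure causes the "internal disruption" events described above.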
Challenge #4: Popularity of Destinations
• Impact of event on traffic
– Depends on the popularity of the destinations
[Figure: Clusters feed a “Traffic Impact Prediction” stage, driven by Netflow data from the border routers, which outputs Large Disruptions]
Weight the group of destinations by the traffic volume
ISP Study: Traffic Impact Prediction
• Traffic weight
– Per-prefix measurements from Netflow
– 10% of prefixes account for 90% of traffic
• Traffic weight of a cluster
– The sum of “traffic weight” of the prefixes
• Flag clusters with heavy traffic
– A few large clusters have large traffic weight
– Mostly session resets and hot-potato changes
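A sketch of the weighting step (the prefix names and traffic shares are hypothetical, and the flagging threshold is an assumed operator knob, not a value from the study):

```python
# Sketch of traffic-impact prediction: a cluster's weight is the sum of
# its prefixes' traffic shares measured from Netflow; clusters above a
# threshold are flagged for operator attention.

def cluster_weight(cluster_prefixes, traffic_share):
    """traffic_share: fraction of total traffic per prefix (from Netflow)."""
    return sum(traffic_share.get(p, 0.0) for p in cluster_prefixes)

def flag_heavy_clusters(clusters, traffic_share, threshold=0.01):
    return [c for c in clusters
            if cluster_weight(c, traffic_share) >= threshold]
```

Because traffic is so skewed across prefixes, most clusters fall below any reasonable threshold, leaving only a handful of large disruptions to report.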
ISP Study: Summary
[Figure: measurement pipeline — BGP updates from the border routers (~10^6/day) pass through BGP Update Grouping into Events (~10^5), flagging Persistent Flapping Prefixes (~10^1); Event Classification yields “Typed” Events; Event Correlation yields Clusters (~10^3) and Frequent Flapping Prefixes (~10^1); Traffic Impact Prediction, using Netflow data, yields Large Disruptions (~10^1)]
Three Studies, Three Approaches
• End-to-end active probes
– Measure and characterize the forwarding path
– Identify the effects on data traffic
• Wide-area passive route monitoring
– Measure and classify BGP routing churn
– Identify pathologies and improve Internet health
• Intra-AS passive route monitoring
– Detailed measurements of BGP within an AS
– Aggregate data into small set of major events