Active Measurements on the AT&T IP Backbone
Len Ciavattone, Al Morton, Gomathi Ramachandran
AT&T Labs

Colleagues on This Project
• Nicole Kowalski
• Ron Kulper
• George Holubec
• Shashi Pulakurti

Measurements for Large Networks
Must be:
• Easily understood
• Able to estimate or assess customer performance
• Useful for alarming and associated actions
• Not likely to generate false positives
• As close as possible to real-time notification
• Part of the traditional fault/passive management system

Traditional Measurements
Fault
• Triggered by hard failures (link, card, router, etc.)
• Near real-time alarms
Passive
• Element level monitoring
• Traffic, drops, device health, and card performance monitored
• Performance alarming possible per interface
What can be added to traditional measurements?
• Path level performance information
• Delay and delay variation measurements
• Indication of customer degradation (except hard failures)

Active Measurements
Active measurements introduce synthetic traffic into the network.
Advantages:
• Traffic flow follows a sampled customer path
• Delay, delay variation, and sampled loss are directly measurable
• Possible to estimate the customer impact of element level degradation
• A well-designed sampling methodology allows sound estimation of the levels of degradation seen
• Can be used to give customers a sense of network behavior (e.g. AT&T's Network Status Site, http://www.att.com/ipnetwork)
Disadvantages:
• Need to introduce traffic into the network
• Based on sampling, not customer traffic

Practical Considerations
From a practical standpoint, what limits the measurements?
• Amount of data generated
• Desire to use a standard/unmodified UNIX kernel
• Expense of bigger and more powerful servers
• Cost of deploying new servers in COs
• Difficulty of acquiring an appropriate GPS feed
Measurement Design
Each 24 hours is divided into 15-minute test cycles. Two probe sequences run in each cycle:

                        Poisson Sequence     Periodic Sequence
Duration                15 minutes           1 minute
Sending pattern         λ = 0.3 pkts/sec     20 ms spacing, random start time
Type                    UDP                  UDP, IPv4
Packet size             278 bytes total      60 bytes total
Packet loss threshold   min of 3 s           min of 3 s

Presented at the IETF 50 IPPM meeting by Al Morton

Sampling and Event Detection
Poisson sequence
• All 15 minutes are tested, with an average inter-arrival time of 3.33 s
• Assume 10 s congestion events (minimum length)
• If λ = (Number of probe packets) / (Test Cycle Length), the probability of detection by one or more packets is

\[ P(\text{detection}) = 1 - e^{-\lambda \cdot \text{cong\_event\_length}} = 1 - e^{-0.3 \cdot 10} \approx 0.95 \]

Sampling and Event Detection
Periodic sequence
• 1-min test in a 15-min test cycle (2 if considering RT processes)
• Assume 10 s congestion events (minimum length), and assume 1 event per test cycle

\[ P(\text{detection}) = \frac{\text{Test length} + \text{Congestion event length}}{\text{Test Cycle Length}} = \frac{60 + 10}{15 \cdot 60} = 0.0778 \quad \text{(one-way events)} \]

\[ P(\text{detection}) = \frac{2 \cdot 60 + 10}{15 \cdot 60} = 0.144 \quad \text{(RT event)} \]

• Consider that only recurring events are actionable: average number of cycles to detection (one-way) = 1/0.0778 ≈ 13 test cycles
• The Poisson Probe sequence detects accurately; the Periodic Probe sequence is used to characterize recurring events

Metrics
• Round Trip (RT) Loss
• RT Delay (std dev, 95th percentile, min, mean)
• Inter-Packet Delay Variation (IPDV) and DV jitter
• Out of sequence events (non-reversing sequence definition, up for consideration in the IETF IPPM)
• Approximate one-way loss
• Degraded seconds or minutes
• Loss pattern (number of consecutive losses)
• Distributions of delay variations
• Traceroutes performed at the beginning of each test
• 85 Metrics kept indefinitely
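The detection probabilities quoted on the two Sampling and Event Detection slides can be reproduced with a few lines of Python. This is only a sketch of the arithmetic on the slides, not the measurement system itself; the function and parameter names are illustrative assumptions, and the periodic formula simply restates the slide's ratio, capped at 1.

```python
import math


def poisson_detection_probability(rate_pps: float, event_len_s: float) -> float:
    """Probability that at least one Poisson probe falls inside an event.

    rate_pps is the probe rate (lambda, packets per second) and
    event_len_s is the assumed congestion event length in seconds.
    """
    return 1.0 - math.exp(-rate_pps * event_len_s)


def periodic_detection_probability(test_len_s: float, event_len_s: float,
                                   cycle_len_s: float, tests_per_cycle: int = 1) -> float:
    """Slide formula: (tests * test length + event length) / test cycle length.

    Assumes one event per test cycle; the result is capped at 1.0.
    """
    return min(1.0, (tests_per_cycle * test_len_s + event_len_s) / cycle_len_s)


# Values taken from the slides: lambda = 0.3 pkts/sec, 10 s events,
# a 60 s periodic test in a 15-minute (900 s) test cycle.
p_poisson = poisson_detection_probability(0.3, 10)
p_one_way = periodic_detection_probability(60, 10, 15 * 60)
p_round_trip = periodic_detection_probability(60, 10, 15 * 60, tests_per_cycle=2)
cycles_to_detect = 1.0 / p_one_way

print(f"Poisson:              {p_poisson:.3f}")
print(f"Periodic (one-way):   {p_one_way:.4f}")
print(f"Periodic (RT):        {p_round_trip:.3f}")
print(f"Cycles to detection:  {cycles_to_detect:.1f}")
```

The gap between roughly 0.95 and 0.078 is the reason the deck assigns detection to the Poisson sequence and characterization of recurring events to the periodic one.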
IPDV Definition and Example
IPDV is a measure of transfer delay variation.
For Packet n, IPDV(n) = Delay(n) - Delay(n-1)
If the nominal transfer time is 10 msec, and packet 2 is delayed in transit for an additional 5 msec, then two IPDV values will be affected:
• IPDV(2) = 15 - 10 = 5 msec
• IPDV(3) = 10 - 15 = -5 msec
• IPDV(4) = 10 - 10 = 0 msec
[Figure: Tx, Rcv, and Playout timelines for packets 1 to 4, showing time spent in transit and in the receive buffer. The inter-packet arrival time before packet 2 is longer than the send interval.]

IP Packet Sequence
• Arriving packets are compared with the "next expected" RefNum.
• Packet 2 arrives Out-of-Sequence, since Packet 3 has arrived and the "next expected" packet is Packet 4.
• Packet 2 is Offset by 1 packet, or Late by the arrival time of Packet 2 - Packet 3 = t
[Figure: Src, Dst, and Playout timelines for packets 1 to 4, showing time spent in transit and in the receive buffer, and the tolerance on R2 arrival with a 2-packet buffer.]

Common Problems Detected
• Route changes
• Card degradation
• Low-level fiber errors
• Effects of maintenance (card swaps, etc.)

Examples of Detection
• Bit errors that cause low-level (~0.03%) loss can be detected accurately using this method and can be fixed before customers feel the impact
• Typically in such cases the degradation is subtle enough that traditional IP alarms do not show the problem clearly
• Customers aren't complaining... yet
• In the case shown, no customer complaints were made and the problem was fixed proactively
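The IPDV definition and the "next expected" sequence check from the two preceding definition slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, sequence numbers are assumed to start at 1, and a packet is flagged as out-of-sequence whenever its number is below the next expected value, which only moves forward.

```python
from typing import List, Sequence


def ipdv(delays_ms: Sequence[float]) -> List[float]:
    """IPDV(n) = Delay(n) - Delay(n-1), for packets 2..N in send order."""
    return [curr - prev for prev, curr in zip(delays_ms, delays_ms[1:])]


def out_of_sequence(arrival_order: Sequence[int], first_seq: int = 1) -> List[int]:
    """Return the sequence numbers that arrive out of sequence.

    Each arriving packet is compared with the "next expected" number.
    A packet at or above it is in sequence and advances the expectation
    to one past itself; a packet below it is out of sequence and leaves
    the expectation unchanged (it never moves backwards).
    """
    next_expected = first_seq
    late = []
    for seq in arrival_order:
        if seq >= next_expected:
            next_expected = seq + 1
        else:
            late.append(seq)
    return late


# Slide example: nominal delay 10 msec, packet 2 delayed by an extra 5 msec.
print(ipdv([10, 15, 10, 10]))          # IPDV(2), IPDV(3), IPDV(4)

# Slide example: packet 3 arrives before packet 2.
print(out_of_sequence([1, 3, 2, 4]))   # packet 2 is flagged
```

With the slide's numbers, the first call yields 5, -5, and 0 msec, and the second flags only packet 2, because packet 3 has already advanced the next expected number to 4.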
Increasing Bit Errors
[Figure: Percentage Loss (0 to 0.2) versus time, 02/05/2002 to 03/05/2002. Annotations: "Single packet loss per Periodic test"; "Two packet losses per Periodic test"; "More occasional Loss was seen with the Poisson Probe Sequence"; "Fiber span taken out of service".]

Detection of Route Changes

Time, h:m        0:49      1:07      1:09       1:18      1:30
Type             Poisson   Poisson   Periodic   Poisson   Poisson
Lost Packets     5 consec  5 consec  54 consec  4 consec  5 consec
Burst Duration   23 sec    15 sec    1.04 sec   26 sec    28 sec
Route Dur, m:s   4:32      <2:00     (return)   2:10      2:03

[Figure: RT Delay versus Time for the Periodic Sequence, 1:00 to 1:15, with the events at 1:07 and 1:09 marked.]

Poisson Probe Route change detection
[Figure: min, mean, and 95% Delay (ms, 0 to 80) and Loss (%) versus time, 19:45:08 on 11/28/2001 to 3:15:08 on 11/29/2001.]

Periodic probe (same incident)
[Figure: min, mean, and 95% Delay (ms, 0 to 70) and %Loss versus time, 19:38:03 on 11/28/2001 to 3:10:11 on 11/29/2001.]
The "Blenders"
• First shown by Steve Casner et al. at the NANOG 22 conference (May 20-22, 2001, "A Fine-Grained View of High Performance Networking", http://www.nanog.org/mtg-0105/agenda.html)
• Seem to be properties of route loops
• Rare events, but interesting as they may shed light on some properties of route convergence

Simple Blender
• 88 packets arrive within 64 ms
• 79 OOS packets, 9 in sequence
• 7 sequence discontinuities
• Zero loss
• Delay and IPDV actually describe this event best
[Figure: RtDelay(ms) and TxSeqNo series; vertical axis 0 to 2000, horizontal axis 0 to 18000.]

Simple Blender Magnified
[Figure: RtDelay(ms) and TxSeqNo series; vertical axis 0 to 2000, horizontal axis 17000 to 19000.]

Blender 2
• Scattered loss throughout
• 250 packets in event
• 10 separate sequence discontinuities
• Delay of first packet 6 s
[Figure: RtDelay(ms) and TxSeqNo series; vertical axis 0 to 7000, horizontal axis 0 to 70000.]

Blender 2
[Figure: RtDelay(ms) and TxSeqNo series, magnified; vertical axis 0 to 7000, horizontal axis 53000 to 61000.]

Summary
Active measurements:
• Can provide a view of customer performance
• Can be used to alert maintenance personnel proactively
• Can provide insight into network behavior
• Can be used to improve planned maintenance