Active Measurements on the AT&T IP Backbone Len Ciavattone, Al Morton, Gomathi Ramachandran

AT&T Labs
Colleagues on This Project
• Nicole Kowalski
• Ron Kulper
• George Holubec
• Shashi Pulakurti
L. Ciavattone, A. Morton, G. Ramachandran
AT&T Labs
Page 2
Measurements for Large Networks
• Must be:
  • Easily understood
  • Able to estimate or assess customer performance
  • Useful for alarming and associated actions
  • Not likely to generate false positives
  • As close as possible to real-time notification
  • Part of the traditional fault/passive management system
Traditional Measurements
• Fault
  • Triggered by hard failures (link, card, router, etc.)
  • Near real-time alarms
• Passive
  • Element level monitoring
  • Traffic, drops, device health, card performance monitored
  • Performance alarming possible per interface
• What can be added to traditional measurements?
  • Path level performance information
  • Delay and delay variation measurements
  • Indication of customer degradation (except hard failures)
Active Measurements
• Active measurements introduce synthetic traffic into the network
• Advantages:
  • Traffic flow follows a sampled customer path
  • Delay, delay variation and sampled loss directly measurable
  • Possible to estimate customer impact of element level degradation
  • Well designed sampling methodology will allow sound estimation of levels of degradation seen
  • Can be used to give customers a sense of network behavior (e.g. AT&T’s Network Status Site, http://www.att.com/ipnetwork)
• Disadvantages:
  • Need to introduce traffic into the network
  • Based on sampling, not customer traffic
Practical Considerations
• From a practical standpoint, what limits the measurements?
  • Amount of data generated
  • Desire to use a standard/unmodified UNIX kernel
  • Expense of bigger and more powerful servers
  • Cost of deployment of new servers in COs
  • Difficulty of acquiring appropriate GPS feed
Measurement Design
Test cycles of 15 minutes repeat throughout each 24 hours. Each cycle carries two probe sequences:
• Poisson Sequence
  • 15 minute duration
  • λ = 0.3 pkts/sec
  • Type UDP
  • 278 bytes total
  • Packet loss threshold is a min of 3 s
• Periodic Sequence
  • 1 minute duration
  • Random Start Time
  • 20 ms spacing
  • Type UDP, IPv4
  • 60 bytes total
  • Packet loss threshold is a min of 3 s
Presented at the IETF 50 IPPM meeting by Al Morton
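The two probe schedules described above can be sketched as follows. This is a minimal illustration, not the authors' tool: the function names are hypothetical, and placing the periodic test's random start uniformly within the 15-minute cycle is an assumption (the slide only says "Random Start Time").

```python
import random

def poisson_send_times(rate_pps=0.3, duration_s=15 * 60, seed=None):
    """Send times (s) for a Poisson sequence: exponential gaps with mean 1/rate_pps."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_pps)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_pps)
    return times

def periodic_send_times(spacing_s=0.020, duration_s=60, cycle_s=15 * 60, seed=None):
    """Send times (s) for a periodic sequence starting at a random offset in the cycle.

    Assumption: the start is drawn uniformly so the whole test fits in the cycle.
    """
    rng = random.Random(seed)
    start = rng.uniform(0, cycle_s - duration_s)
    count = round(duration_s / spacing_s)  # 60 s at 20 ms spacing -> 3000 packets
    return [start + i * spacing_s for i in range(count)]

poisson = poisson_send_times(seed=1)
periodic = periodic_send_times(seed=1)
print(len(periodic))  # 3000
```

With λ = 0.3 pkts/sec the Poisson sequence averages about 270 packets per 15-minute cycle, while the periodic sequence sends 3000 packets in its one-minute test.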
Sampling and Event Detection
• Poisson Sequence
  • All 15 minutes tested with average inter-arrival time of 3.33 s
  • Assume 10 s congestion events (minimum length)
  • If

\[ \lambda = \frac{\text{Number of probe packets}}{\text{Test Cycle Length}} \]

  • Probability of Detection by one or more packets:

\[ P(\text{detection}) = 1 - e^{-\lambda \cdot \text{cong\_event\_length}} = 1 - e^{-0.3 \times 10} = 0.95 \]
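As a quick numeric check of the Poisson detection probability, the sketch below evaluates 1 - e^(-λ·T) for the values on the slide. The function name is hypothetical.

```python
import math

def poisson_detection_probability(rate_pps, event_length_s):
    """Probability that at least one Poisson probe falls inside an event of the given length."""
    return 1.0 - math.exp(-rate_pps * event_length_s)

p = poisson_detection_probability(rate_pps=0.3, event_length_s=10)
print(round(p, 2))  # 0.95
```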
Sampling and Event Detection
• Periodic sequence
  • 1-min test in a 15-min test cycle (2 if considering RT processes)
  • Assume 10 s congestion events (minimum length), assume 1 event per test cycle

\[ P(\text{detection}) = \frac{\text{Test length} + \text{Congestion event length}}{\text{Test Cycle Length}} \]

\[ \frac{60 + 10}{15 \times 60} = 0.0778 \quad \text{(one-way events)} \]

\[ \frac{2 \times 60 + 10}{15 \times 60} = 0.144 \quad \text{(RT event)} \]

• Consider that only recurring events are actionable: average number of cycles to detection (one-way) = 1/0.0778 ≈ 13 test cycles

The Poisson Probe sequence detects accurately; the Periodic Probe sequence is used to characterize recurring events.
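The periodic-sequence arithmetic can be reproduced with a few lines. This is only a sketch of the slide's formula; the function and the `tests_per_cycle` parameter are hypothetical names for the "2 if considering RT processes" case.

```python
def periodic_detection_probability(test_length_s, event_length_s, cycle_length_s,
                                   tests_per_cycle=1):
    """(tests_per_cycle * test length + event length) / test cycle length."""
    return (tests_per_cycle * test_length_s + event_length_s) / cycle_length_s

one_way = periodic_detection_probability(60, 10, 15 * 60)
round_trip = periodic_detection_probability(60, 10, 15 * 60, tests_per_cycle=2)
cycles_to_detect = 1 / one_way  # mean number of cycles until a recurring event is seen

print(round(one_way, 4))        # 0.0778
print(round(round_trip, 3))     # 0.144
print(round(cycles_to_detect))  # 13
```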
Metrics
• Round Trip (RT) Loss
• RT Delay (std dev, 95th percentile, min, mean)
• Inter-Packet Delay Variation (IPDV) and DV jitter
• Out of sequence events (non-reversing sequence definition -- up for consideration in the IETF IPPM)
• Approximate one-way loss
• Degraded seconds or minutes
• Loss pattern (number of consecutive losses)
• Distributions of delay variations
• Traceroutes performed at the beginning of each test
• 85 Metrics kept indefinitely
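Two of the metrics listed above, the RT delay summary and the loss pattern, might be computed along these lines. This is a hedged sketch: the function names are hypothetical, and the nearest-rank percentile and population standard deviation are assumptions, since the slides do not say which definitions were used.

```python
import statistics

def rt_delay_summary(delays_ms):
    """Min, mean, std dev and 95th percentile of round-trip delays.

    Assumptions: nearest-rank percentile, population standard deviation.
    """
    ordered = sorted(delays_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return {
        "min": ordered[0],
        "mean": statistics.fmean(ordered),
        "std_dev": statistics.pstdev(ordered),
        "p95": ordered[rank - 1],
    }

def consecutive_loss_runs(received_flags):
    """Lengths of each run of consecutive lost packets (False = lost)."""
    runs, run = [], 0
    for received in received_flags:
        if received:
            if run:
                runs.append(run)
            run = 0
        else:
            run += 1
    if run:
        runs.append(run)
    return runs

print(consecutive_loss_runs([True, False, False, True, False]))  # [2, 1]
```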
IPDV Definition and Example
IPDV is a measure of transfer delay variation.
For Packet n,
IPDV(n) = Delay(n) - Delay(n-1)

If the nominal transfer time is 10 msec, and packet 2 is delayed in transit for an additional 5 msec, then two IPDV values will be affected.
IPDV(2) = 15 - 10 = 5 msec
IPDV(3) = 10 - 15 = -5 msec
IPDV(4) = 10 - 10 = 0 msec

[Figure: Tx, Rcv and Playout timelines for packets 1 to 4, showing time spent in transit and in the receive buffer. The inter-packet arrival time ahead of packet 2 is longer than the send interval by Δt.]
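The IPDV definition and the worked example above translate directly into code. A minimal sketch, with a hypothetical function name:

```python
def ipdv(delays_ms):
    """IPDV(n) = Delay(n) - Delay(n-1) for each packet after the first."""
    return [delays_ms[n] - delays_ms[n - 1] for n in range(1, len(delays_ms))]

# Packets 1..4 with a nominal 10 msec transfer time; packet 2 is delayed by 5 msec.
delays = [10, 15, 10, 10]
print(ipdv(delays))  # [5, -5, 0]  -> IPDV(2), IPDV(3), IPDV(4)
```

A single delayed packet produces a positive IPDV followed by an equal negative one, which is why two values are affected.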
IP Packet Sequence
Arriving Packets are compared with the “next expected” RefNum.

Packet 2 arrives Out-of-Sequence, since Packet 3 has arrived and the “next expected” packet is Packet 4.

Packet 2 is Offset by 1 packet, or Late by the arrival time of Packet 2 - Packet 3 = Δt.

[Figure: Src, Dst and Playout timelines for packets 1 to 4, showing time spent in transit and in the receive buffer, and the tolerance on the arrival of packet 2 with a 2 Packet Buffer.]
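The "next expected" comparison can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the function name and tuple layout are hypothetical, the offset is taken as the number of already-arrived packets with a higher sequence number, and lateness is measured against the earliest such arrival.

```python
def classify_arrivals(arrivals):
    """Classify packets given as (seq_no, arrival_time_s) in arrival order.

    Returns (seq_no, status, offset_packets, late_by_s) per packet.
    """
    next_expected = None
    seen = []      # (seq_no, arrival_time_s) of everything received so far
    results = []
    for seq, t in arrivals:
        if next_expected is None or seq >= next_expected:
            # In sequence: the expected number only ever moves forward.
            next_expected = seq + 1
            results.append((seq, "in-sequence", 0, 0.0))
        else:
            # Out of sequence: some higher-numbered packet already arrived.
            later = [(s, ts) for s, ts in seen if s > seq]
            late_by = t - later[0][1] if later else 0.0
            results.append((seq, "out-of-sequence", len(later), late_by))
        seen.append((seq, t))
    return results

# Arrival order 1, 3, 2, 4 as in the slide's example (times are illustrative).
for row in classify_arrivals([(1, 0.00), (3, 0.04), (2, 0.05), (4, 0.06)]):
    print(row)
```

In this example packet 2 is reported as out-of-sequence with an offset of 1 packet, late by the difference between its arrival time and that of packet 3.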
Common Problems Detected
• Route Changes
• Card degradation
• Low-level fiber errors
• Effects of Maintenance (Card swaps etc.)
Examples of Detection
• Bit errors that cause low-level (~0.03%) loss can be detected accurately using this method and can be fixed before customers feel the impact
  • Typically in such cases the degradation is subtle enough that traditional IP alarms do not show the problem clearly
  • Customers aren’t complaining ... yet
  • In the case shown, no customer complaints were made and the problem was fixed proactively
Increasing Bit Errors
[Figure: Percentage Loss (0 to 0.2) versus time, 02/05/2002 to 03/05/2002. Annotations: "Single packet loss per Periodic test"; "Two packet losses per Periodic test"; "More occasional Loss was seen with the Poisson Probe Sequence"; "Fiber span taken out of service".]
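The loss levels in this example can be related to the probe design with some simple arithmetic. The sketch below assumes that percentage loss is computed per test over all packets sent in that test, which the slides do not state; the packet counts follow from the 1-minute, 20 ms periodic test and the 15-minute, 0.3 pkts/sec Poisson sequence.

```python
def loss_percent(lost, sent):
    """Loss as a percentage of packets sent in one test."""
    return 100.0 * lost / sent

periodic_packets = round(60 / 0.020)   # 1-minute test at 20 ms spacing
poisson_packets = round(0.3 * 15 * 60) # expected packets per 15-minute Poisson sequence

print(periodic_packets)                             # 3000
print(round(loss_percent(1, periodic_packets), 4))  # 0.0333
print(round(loss_percent(2, periodic_packets), 4))  # 0.0667
print(poisson_packets)                              # 270
print(round(loss_percent(1, poisson_packets), 2))   # 0.37
```

Under that assumption, a single lost packet in a periodic test corresponds to about 0.03% loss, while a single lost Poisson probe is a much coarser step of about 0.37%.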
Detection of Route Changes
Time, h:m   Type       Lost Packets   Burst Duration   Route Dur, m:s
0:49        Poisson    5 consec       23 sec           4:32
1:07        Poisson    5 consec       15 sec           <2:00
1:09        Periodic   54 consec      1.04 sec         (return)
1:18        Poisson    4 consec       26 sec           2:10
1:30        Poisson    5 consec       28 sec           2:03

[Figure: RT Delay versus Time, 1:00 to 1:15, with the Periodic Sequence and the events at 1:07 and 1:09 marked.]
Poisson Probe Route change detection
[Figure: min, mean and 95% Delay (ms, 0 to 80) and Loss (%) versus time, 19:45:08 on 11/28/2001 to 3:15:08 on 11/29/2001.]
Periodic probe (same incident)
[Figure: min, mean and 95% Delay (ms, 0 to 70) and %Loss versus time, 19:38:03 on 11/28/2001 to 3:10:11 on 11/29/2001.]
The “Blenders”
• First shown by Steve Casner et al. at the NANOG 22 conference (May 20-22, 2001, “A Fine-Grained View of High Performance Networking”, http://www.nanog.org/mtg-0105/agenda.html)
• Seem to be properties of route loops
• Rare events, but interesting as they may shed light on some properties of route convergence
Simple Blender
• 88 packets arrive within 64 ms
• 79 OOS packets, 9 in sequence
• 7 sequence discontinuities
• Zero Loss
• Delay and IPDV actually describe this event best

[Figure: RtDelay (ms), 0 to 2000, versus TxSeqNo, 0 to 18000.]
Simple Blender Magnified
[Figure: RtDelay (ms), 0 to 2000, versus TxSeqNo, 17000 to 19000.]
Blender 2
• Scattered loss throughout
• 250 packets in event
• 10 separate sequence discontinuities
• Delay of first packet 6 s

[Figure: RtDelay (ms), 0 to 7000, versus TxSeqNo, 0 to 70000.]
Blender 2
[Figure: RtDelay (ms), 0 to 7000, versus TxSeqNo, 53000 to 61000.]
Summary
Active measurements:
• Can provide a view of customer performance
• Can be used to alert maintenance personnel proactively
• Can provide insight into network behavior
• Can be used to improve planned maintenance