Exposing server performance to network managers through passive network measurements
Jeff Terrell
Dept. of Computer Science
University of North Carolina at Chapel Hill
October 19, 2008
[Diagram: a server farm (web, secure web, mail, portal, storage, databases) behind an aggregation point connected to the Internet]
Problem:
Monitor server performance
1. passively (no instrumenting or probes)
2. pervasively (all servers)
3. in real-time
Monitoring Traffic
[Diagram: packets between clients and the server farm pass through a router/firewall, the aggregation point (pervasive); a span port (passive) mirrors TCP/IP headers to a gigabit, streaming (real-time) analysis that produces the output]
Monitoring Traffic
[Diagram: a link tap (passive) on the border link between the local network and the Internet, an aggregation point (pervasive), delivers TCP/IP headers to the analysis]
Output includes:
• real-time display
• alarms/notifications
• forensic analysis
[Diagram: the same server farm behind the aggregation point, noting SSL traffic (i.e. no accessible payload)]
TCP Connection Vectors
A connection vector is a representation of the application-level dialog in a TCP connection.
For example:
[Diagram: a timeline of the dialog in one connection: the client sends request ADUs (A), the server replies with response ADUs (B); the quiet times before server ADUs are response times, and the quiet times (T) before client ADUs are client-side think-times]
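To make this concrete, here is a minimal Python sketch of a connection vector as an ordered dialog of ADUs with their quiet times. The field names are illustrative assumptions for this talk, not adudump's internal representation:

# Illustrative sketch only: a connection vector as an ordered list of ADUs.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ADU:
    direction: str                       # '<' = client-to-server, '>' = server-to-client
    size: int                            # bytes of application data in this unit
    quiet_time: Optional[float] = None   # seconds of quiet before this ADU:
                                         # a server response time or a client think-time

@dataclass
class ConnectionVector:
    client: str                          # "ip.port" of the client
    server: str                          # "ip.port" of the server
    adus: List[ADU] = field(default_factory=list)

    def server_response_times(self) -> List[float]:
        """Quiet times preceding server-to-client ADUs, i.e. server response times."""
        return [a.quiet_time for a in self.adus
                if a.direction == '>' and a.quiet_time is not None]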
Constructing connection vectors
[Diagram: a packet timeline between client, monitor, and server. The monitor observes the SYN, SYN/ACK, ACK handshake, then data segments and acknowledgments (e.g. SEQ=372, ACK=1460, SEQ=2920, SEQ=3030, ACK=3030, SEQ=712, ACK=3730) until the connection closes with a FIN. From the sequence and acknowledgment numbers the monitor infers ADU boundaries and sizes, and from packet timestamps it infers think-times relative to the monitor.]
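A rough single-pass sketch of the idea, in Python: ADU boundaries fall where the direction of data flow changes, and quiet times are measured at the monitor. (This is a simplification; it ignores retransmission, reordering, and pipelined exchanges that the real analysis must handle.)

# Sketch: split a connection's data segments into ADUs by direction changes.
# Input: (timestamp, direction, payload_len) per observed TCP segment, in order.
def segments_to_adus(segments):
    """Yield (direction, size, quiet_time) per ADU; quiet_time is the gap,
    measured at the monitor, between the previous ADU's last segment and
    this ADU's first segment (a response time or a think-time)."""
    cur_dir = None
    cur_size = 0
    cur_start = None
    prev_end = None
    last_ts = None
    for ts, direction, payload_len in segments:
        if payload_len == 0:              # pure ACKs carry no application data
            continue
        if direction != cur_dir:          # the other side starts talking: new ADU
            if cur_dir is not None:
                quiet = None if prev_end is None else cur_start - prev_end
                yield cur_dir, cur_size, quiet
                prev_end = last_ts
            cur_dir, cur_size, cur_start = direction, 0, ts
        cur_size += payload_len
        last_ts = ts
    if cur_dir is not None:               # flush the final ADU at connection close
        quiet = None if prev_end is None else cur_start - prev_end
        yield cur_dir, cur_size, quiet

Applied to the example above, the client's and server's data segments are grouped into alternating ADUs, with the quiet times before server ADUs corresponding to server response times.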
Needed Measurements
Application-level measurements from TCP/IP headers:
• server response time
• count of application-level requests/responses
  • per server (i.e. server load)
  • per connection (i.e. dialog length)
• size of application-level requests/responses
• connection duration
Viability of Netflow
What can Netflow provide?
• server response time - No
• count of application-level requests/responses - No
  • per server (i.e. server load) - sort of
  • per connection (i.e. dialog length) - No
• size of application-level requests/responses - No
• connection duration - sort of
Previous approach
Previous work by Felix Hernandez-Campos on building connection vectors.
[Diagram: packets captured on the link between the local network and the Internet are written to disk as packet header traces; offline analysis later converts these traces into connection vectors, also on disk]
Our approach
Our innovation: build connection vectors online, with a single pass.
[Diagram: a single capture/analysis process (1.4 GHz Xeon, 1.5 GB RAM) monitoring a 1 Gbps fiber link between the local network and the Internet emits connection vectors directly, with no intermediate files]
• No packet loss
• Capability for continuous measurement
• Elements of connection vectors available immediately
• Capability for online understanding of server performance
adudump
The tool we wrote to do this is called adudump. Here's the output of adudump for an example connection:
TYPE TIMESTAMP         LOCAL_HOST       DIR REMOTE_HOST           OTHER_INFO
SYN: 1202706002.650917 190.40.1.180.443 < 221.151.95.184.62015
RTT: 1202706002.651967 190.40.1.180.443 > 221.151.95.184.62015 0.001050
SEQ: 1202706002.681395 190.40.1.180.443 < 221.151.95.184.62015
ADU: 1202706002.688748 190.40.1.180.443 < 221.151.95.184.62015 163 SEQ 0.000542
ADU: 1202706002.733813 190.40.1.180.443 > 221.151.95.184.62015 2886 SEQ 0.045041
ADU: 1202706002.738254 190.40.1.180.443 < 221.151.95.184.62015 198 SEQ 0.004441
ADU: 1202706002.801408 190.40.1.180.443 > 221.151.95.184.62015 59 SEQ
END: 1202706002.821701 190.40.1.180.443 < 221.151.95.184.62015
adudump computes all of this in real time, producing contextual records (SYN, RTT, SEQ, END) as well as the ADUs themselves.
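Because each record is one line, the stream is easy to consume downstream. A small Python sketch (assuming the field layout shown above, which is an inference from this example) that pulls server response times out of the '>' ADU records:

import sys

def server_response_times(lines):
    """Yield (timestamp, response_time) from adudump ADU records.
    Assumption: the trailing value on a server-to-client ('>') ADU line is the
    quiet time preceding it, i.e. the server's response time."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "ADU:":
            continue
        ts, direction = float(fields[1]), fields[3]
        if direction == ">" and len(fields) >= 8:
            yield ts, float(fields[7])

if __name__ == "__main__":
    for ts, rt in server_response_times(sys.stdin):
        print(f"{ts:.6f} {rt:.6f}")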
Data
                          For this paper    Overall
Duration                  66 days           180 days
Size (uncompressed)       1.54 TB           3.35 TB
Requests and responses    16.8 billion      34.8 billion
Connections               1.6 billion       4.0 billion
Case study: the incident
[Diagram: the same server farm behind the aggregation point]
Thursday, April 10th, 4:28 pm
Representative, though manual, analysis
Case study: the issue
[Plot: average response time over 15-minute intervals (s), 0 to 3.5 s, from Friday through Thursday, comparing this week against last week and two weeks ago]
Case study: the issue
[Plot: CDF of the server's response time (s), log scale from 0.0001 to 100. Response times under 10 ms comprise roughly 35% of the distribution; that is, 35% of response times are < 10 ms.]
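For reference, a small Python sketch of how such an empirical CDF is computed from a set of response times (the sample values here are made up for illustration):

import numpy as np

def empirical_cdf(samples):
    """Return (sorted values, cumulative probabilities) of an empirical CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

# Illustrative values only, not measured data.
response_times = [0.002, 0.004, 0.008, 0.05, 0.3, 1.2]
x, p = empirical_cdf(response_times)
frac_under_10ms = np.searchsorted(x, 0.010, side="left") / len(x)
print(f"{frac_under_10ms:.0%} of these response times are under 10 ms")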
Case study: the issue
[Plot: the same CDF; roughly 95% of response times are < 1 s.]
Case study: the issue
[Plot: the same CDF of the server's response time (s), unannotated]
Case study: the issue
[Plot: CDF of the server's response time (s), comparing two historical distributions: all data prior to the incident, and the Thursday 3:28-4:28 pm hour. So, historically, responses during this Thursday-afternoon hour are generally faster (i.e. lower response times) than overall.]
Case study: the issue
[Plot: the same comparison with the distribution from the hour of the incident added. So, responses are generally slower during the hour of the incident.]
Case study: investigation
What could cause this incident?
Larger requests (more processing required)
Larger responses (implying more processing)
More requests per connection (more work)
More requests per time unit (more work)
Case study: investigation
[Plot: CDF of the client's request size (bytes), 1 to 10,000 (log scale), comparing "Before - all", "Before - hours", and "Hour of incident"]
Case study: investigation
What could cause this incident?
Larger requests (more processing required)
Larger responses (implying more processing)
More requests per connection (more work)
More requests per time unit (more work)
Case study: investigation
[Plot: CDF of the server's response size (bytes), 1 to 1e+06 (log scale), comparing "Before - all", "Before - hours", and "Hour of incident"]
Case study: investigation
What could cause this incident?
Larger requests (more processing required)
Larger responses (implying more processing)
More requests per connection (more work)
More requests per time unit (more work)
Case study: investigation
[Plot: CDF of epochs per connection, 0.1 to 100 (log scale), comparing "Before - all", "Before - hours", and "Hour of incident"]
Case study: investigation
What could cause this incident?
Larger requests (more processing required)
Larger responses (implying more processing)
More requests per connection (more work)
More requests per time unit (more work)
Case study: investigation
The point is that this is an application-level measurement of load, NOT that we couldn't have detected the high load via other mechanisms.
[Plot: number of requests per hour, roughly 0 to 35,000, from Mar 13 through Apr 10]
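A rough sketch of deriving this load metric from the adudump stream, again assuming the record layout shown earlier and treating '<' ADU records as client requests:

from collections import Counter

def requests_per_hour(lines):
    """Count client-to-server ('<') ADU records in each hour of adudump output."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "ADU:" and fields[3] == "<":
            hour = int(float(fields[1])) // 3600 * 3600   # truncate the timestamp to the hour
            counts[hour] += 1
    return counts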
Case study: investigation
[Diagram: the response times within a connection, in time order, grouped by ordinal position into Bin 1, Bin 2, Bin 3, ...]
Case study: investigation
[Plot: median response time (s), 0 to 14, versus response time ordinal, 0 to 30, for the historical data ("median w/ Q1 and Q3 - historical"). Each point is the median of that bin (e.g. the point at ordinal 3 is the median of bin 3), and the error bars are the first and third quartiles.]
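A small Python sketch of the binning analysis: group each connection's response times by ordinal position and summarize each bin with its quartiles. This grouping is a plausible reconstruction, not necessarily the exact procedure used here:

import numpy as np
from collections import defaultdict

def quartiles_by_ordinal(connections):
    """connections: iterable of per-connection response-time lists, in dialog order.
    Returns {ordinal: (Q1, median, Q3)}, where ordinal 1 is a connection's first
    response, ordinal 2 its second, and so on."""
    bins = defaultdict(list)
    for response_times in connections:
        for ordinal, rt in enumerate(response_times, start=1):
            bins[ordinal].append(rt)
    return {k: tuple(np.percentile(v, [25, 50, 75])) for k, v in bins.items()}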
Case study: investigation
[Plot: the same axes, comparing "median w/ Q1 and Q3 - historical" against "median w/ Q1 and Q3 - incident"]
Case study: investigation
[Plot: the same comparison, annotated: 94% of connections have < 3 responses (~55% have exactly 3)]
Conclusions
Achieved monitoring of server performance:
for all servers, of any type
in real-time, at gigabit speeds,
on older hardware,
completely passively.
adudump data provides diagnostic insight into
performance issues.
Questions?