Exposing server performance to network managers through passive network measurements
Jeff Terrell
Dept. of Computer Science, University of North Carolina at Chapel Hill
October 19, 2008

Slide 2: The problem
[diagram: a server farm (web, secure web, mail, portal, storage, databases) behind an aggregation point connected to the Internet]
Problem: monitor server performance
1. passively (no instrumenting or probes)
2. pervasively (all servers)
3. in real time

Slide 3: Monitoring traffic
[diagram: clients send packets through a router/firewall to the server farm; a span port at the aggregation point copies TCP/IP headers to a gigabit streaming analysis, which produces output]
- span port: passive
- aggregation point: pervasive
- streaming analysis: real time

Slide 4: Monitoring traffic
[diagram: a link tap on the border link between the local net and the Internet feeds TCP/IP headers to the analysis at the aggregation point]
Output includes:
• real-time display
• alarms/notifications
• forensic analysis

Slide 5: The challenge
[server farm diagram, as in slide 2]
Much of the traffic is SSL (i.e., no accessible payload).

Slide 6: TCP connection vectors
A connection vector is a representation of the application-level dialog in a TCP connection. For example:
[diagram: the client sends application data units (ADUs) A to the server, which replies with ADUs B; the gaps on the server side are response times T, and the gaps on the client side are client-side think times]

Slide 7: Constructing connection vectors
[ladder diagram: the monitor sits between client and server and observes the SYN/SYN-ACK/ACK handshake, data segments (SEQ=372, SEQ=2920, SEQ=3030, ...), acknowledgments (ACK=1460, ACK=3030, ACK=3730, ...), and finally the FIN as the connection closes; advances in the sequence numbers of each direction delimit ADUs, and think times are measured relative to the monitor]

Slide 8: Needed measurements
Application-level measurements from TCP/IP headers:
- server response time
- count of application-level requests/responses
  - per server (i.e., server load)
  - per connection (i.e., dialog length)
- size of application-level requests/responses
- connection duration

Slide 9: Viability of NetFlow
What can NetFlow provide?
- server response time: no
- count of application-level requests/responses:
  - per server (i.e., server load): sort of
  - per connection (i.e., dialog length): no
- size of application-level requests/responses: no
- connection duration: sort of

Slide 10: Previous approach
Previous work by Felix Hernandez-Campos on building connection vectors:
[diagram: capture on the Internet/local-net link writes packet header traces to disk; offline analysis then produces connection vectors, also on disk]

Slide 11: Our approach
Our innovation: build connection vectors online, in a single pass.
[diagram: capture and analysis run together on the 1 Gbps border fiber link, emitting connection vectors directly, with no intermediate files]
- hardware: 1.4 GHz Xeon, 1.5 GB RAM; no packet loss
- capability for continuous measurement
- elements of connection vectors available immediately
- capability for online understanding of server performance

Slide 12: adudump
The tool we wrote to do this is called adudump. Here is its output for an example connection:

    TYPE  TIMESTAMP          LOCAL_HOST        DIR  REMOTE_HOST           OTHER_INFO
    SYN:  1202706002.650917  190.40.1.180.443  <    221.151.95.184.62015
    RTT:  1202706002.651967  190.40.1.180.443  >    221.151.95.184.62015  0.001050
    SEQ:  1202706002.681395  190.40.1.180.443  <    221.151.95.184.62015
    ADU:  1202706002.688748  190.40.1.180.443  <    221.151.95.184.62015  163 SEQ 0.000542
    ADU:  1202706002.733813  190.40.1.180.443  >    221.151.95.184.62015  2886 SEQ 0.045041
    ADU:  1202706002.738254  190.40.1.180.443  <    221.151.95.184.62015  198 SEQ 0.004441
    ADU:  1202706002.801408  190.40.1.180.443  >    221.151.95.184.62015  59 SEQ
    END:  1202706002.821701  190.40.1.180.443  <    221.151.95.184.62015

[speaker note: adudump computes all kinds of things in real time, capturing contextual information as well as ADUs]
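To make the record format concrete, here is a minimal sketch of how such records could be parsed into per-connection vectors for downstream analysis. This is not part of adudump itself; the field layout is inferred from the example output above, and the field names and interpretation of the trailing ADU value are illustrative assumptions.

    # Minimal sketch: parse adudump-style output into per-connection records.
    # Field layout inferred from the example output; names are assumptions.
    from collections import defaultdict

    def parse_record(line):
        f = line.split()
        rec = {
            "type": f[0].rstrip(":"),  # SYN, RTT, SEQ, ADU, or END
            "time": float(f[1]),       # timestamp (seconds since the epoch)
            "local": f[2],             # local (server) endpoint, e.g. 190.40.1.180.443
            "dir": f[3],               # '<' = data toward the local host, '>' = away from it
            "remote": f[4],            # remote (client) endpoint
        }
        if rec["type"] == "ADU":
            rec["size"] = int(f[5])    # application data unit size in bytes
            if len(f) > 7:
                # assumed: quiet time preceding this ADU (a server response
                # time for server-side ADUs, a think time for client-side ones)
                rec["gap"] = float(f[7])
        elif rec["type"] == "RTT":
            rec["rtt"] = float(f[5])   # round-trip-time estimate from the handshake
        return rec

    sample = """\
    ADU: 1202706002.688748 190.40.1.180.443 < 221.151.95.184.62015 163 SEQ 0.000542
    ADU: 1202706002.733813 190.40.1.180.443 > 221.151.95.184.62015 2886 SEQ 0.045041"""

    # Group parsed records into per-connection vectors keyed by endpoint pair.
    connections = defaultdict(list)
    for line in sample.splitlines():
        rec = parse_record(line.strip())
        connections[(rec["local"], rec["remote"])].append(rec)
    print(connections)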
Slide 13: Data

                 For this paper           Overall
    duration     66 days                  180 days
    size         1.54 TB (uncompressed)   3.35 TB (uncompressed)
    ADUs         16.8 billion requests    34.8 billion requests
                 and responses            and responses
    connections  1.6 billion              4.0 billion

Slide 14: Case study: the incident
[server farm diagram, as in slide 2]
Thursday, April 10th, 4:28 pm.
Representative, though manual, analysis.

Slide 15: Case study: the issue
[plot: average response time per 15-minute interval (s), Fri through Thu, for this week vs. last week vs. two weeks ago]

Slide 16: Case study: the issue
[plot: CDF of the server's response time (s), log scale from 0.0001 to 100]
Response times < 10 ms comprise ~35% of the distribution; that is, 35% of response times are < 10 ms.

Slide 17: Case study: the issue
[same CDF] ~95% of response times are < 1 s.

Slide 18: Case study: the issue
[same CDF, unannotated]

Slide 19: Case study: the issue
[plot: CDFs of the server's response time: historical (all prior to incident) vs. historical (Thursday 3:28-4:28 pm)]
So, generally faster responses (i.e., lower response times) on Thursday afternoons.

Slide 20: Case study: the issue
[same plot, with the hour of the incident overlaid]
So, generally slower responses during the hour of the incident.

Slide 21: Case study: investigation
What could cause this incident?
- larger requests (more processing required)
- larger responses (implying more processing)
- more requests per connection (more work)
- more requests per time unit (more work)

Slide 22: Case study: investigation
[plot: CDF of the client's request size (bytes): before (all), before (same hours), hour of incident]

Slide 23: Case study: investigation
[the hypothesis list from slide 21, repeated; moving past larger requests]

Slide 24: Case study: investigation
[plot: CDF of the server's response size (bytes): before (all), before (same hours), hour of incident]

Slide 25: Case study: investigation
[the hypothesis list repeated; moving past larger responses]

Slide 26: Case study: investigation
[plot: CDF of epochs (request/response exchanges) per connection: before (all), before (same hours), hour of incident]

Slide 27: Case study: investigation
[the hypothesis list repeated; moving past requests per connection]

Slide 28: Case study: investigation
[plot: # requests per hour, Mar 13 through Apr 10]
[speaker note: the point is that this is an application-level measurement of load, NOT that we couldn't have figured out high load via other mechanisms]

Slide 29: Case study: investigation
[diagram: responses binned by ordinal within each connection; the first response of every connection goes into bin 1, the second into bin 2, the third into bin 3, and so on]
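As a sketch of how these bins could be summarized (assuming the per-connection response times have already been extracted, in order, from adudump's ADU records; the input format and function names here are illustrative, not the dissertation's actual code):

    # Sketch of the binning analysis: the i-th response time of every
    # connection goes into bin i, and each bin is summarized by its median
    # and first/third quartiles (the error bars on the next slide).
    from statistics import quantiles

    def bin_by_ordinal(conn_response_times, max_ordinal=30):
        bins = [[] for _ in range(max_ordinal)]
        for times in conn_response_times:       # one ordered list per connection
            for i, t in enumerate(times[:max_ordinal]):
                bins[i].append(t)
        return bins

    def summarize(bins):
        for i, b in enumerate(bins, start=1):
            if len(b) < 2:
                continue                        # quantiles needs at least two samples
            q1, med, q3 = quantiles(b, n=4)
            print(f"ordinal {i}: median={med:.3f}s Q1={q1:.3f}s Q3={q3:.3f}s")

    # Example: three connections with their successive response times (s).
    summarize(bin_by_ordinal([[0.01, 0.2, 1.5], [0.02, 0.3], [0.05, 0.1, 2.0]]))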
Slide 30: Case study: investigation
[plot: median response time (s) vs. response-time ordinal, historical; i.e., point 3 is the median of bin 3, and the error bars are the first and third quartiles]

Slide 31: Case study: investigation
[the same plot, with the hour of the incident overlaid]

Slide 32: Case study: investigation
[same plot] 94% of connections have ≤ 3 responses (~55% have exactly 3).

Slide 33: Conclusions
Achieved monitoring of server performance:
- for all servers, of any type
- in real time, at gigabit speeds, on older hardware
- completely passively
adudump data provides diagnostic insight into performance issues.

Slide 34: Questions?
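A closing sketch of the empirical-CDF comparison that drives the case study (historical baseline vs. incident hour). The input lists below are placeholders; in practice the response times would come from adudump's ADU records for the server in question.

    # Sketch: empirical CDFs of server response times, baseline vs. incident.
    # Inputs are placeholder values, not measured data.
    def ecdf(samples):
        xs = sorted(samples)
        n = len(xs)
        return [(x, (i + 1) / n) for i, x in enumerate(xs)]

    def fraction_below(samples, threshold):
        return sum(1 for x in samples if x < threshold) / len(samples)

    historical = [0.004, 0.008, 0.02, 0.15, 0.9]   # placeholder response times (s)
    incident   = [0.02, 0.3, 1.2, 2.5, 4.0]        # placeholder response times (s)

    # The "35% of response times are < 10 ms" style of statement:
    print(f"historical < 10 ms: {fraction_below(historical, 0.01):.0%}")
    print(f"incident   < 10 ms: {fraction_below(incident, 0.01):.0%}")
    print(ecdf(incident))                          # (value, cumulative probability) pairs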