GRIDPROBE: Multi-Protocol Network Monitoring and Analysis
Computer Laboratory, Systems Research Group

Christopher Clark, Tim Granger, James Hall, Euan Harris, Evangelia Kalyvianaki, Christian Kreibich, Andrew Moore and Ian Pratt
e-mail Firstname.Lastname@cl.cam.ac.uk
GRIDprobe develops the Computer Laboratory's prototype Nprobe monitor to investigate network data capture and processing at rates of 1, 10 and 40 Gbps. The range of protocols recorded by Nprobe will be extended to other common
network, transport and application protocols used on the GRID. GRIDprobe is a scalable network monitoring architecture that can monitor high-speed network links and collect detailed information about the traffic flowing through them. Unlike other
network probes that simply record packet headers, GRIDprobe collects and integrates information across all levels of the protocol stack, enabling accurate understanding of network performance, protocol interaction, and even user behaviour. The
information can be used to highlight problems with current protocols, and to answer 'what if' questions about the impact of new network technologies and protocols. GRIDprobe writes the collected data to log files, which are then post-processed to complete
the analysis. More detailed information about Nprobe can be found in the Computer Laboratory Technical Report "Multi-Layer Network Monitoring and Analysis", obtainable at http://www.cl.cam.ac.uk/TechReports/UCAM-CL-TR-table.html
and through the project Web pages at http://www.cl.cam.ac.uk/Research/SRG/netos/nprobe/.
[Figure labels: protocol modules HTTP, RTSP, FTP, DNS, UDP, IP and Other; Data; Method and State Attachment.]
Because data collection may be tailored to specific research tasks, and may be extended,
the format of the saved data may vary between monitoring runs. Each log file may
contain data collected for a range of studies.
Scalable: To keep pace with high-bandwidth traffic,
packets from the network are striped across multiple
data extraction processes, possibly running on
separate machines. A filter at the network interface
directs each packet to the appropriate process on that
machine, or rejects it if it is to be processed by
another machine in the monitoring cluster.
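The striping described above can be sketched as follows. This is a minimal illustration, not the Nprobe filter itself: it assumes that packets are assigned by hashing a direction-independent flow key, so that both directions of a connection reach the same process. The names `flow_key`, `assign` and `nic_filter`, and the use of CRC32 as the hash, are assumptions made for the example.

```python
import zlib

def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Build a direction-independent key so both halves of a flow hash alike."""
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{lo[0]}:{lo[1]}-{hi[0]}:{hi[1]}".encode()

def assign(src_ip, src_port, dst_ip, dst_port, n_machines, procs_per_machine):
    """Return the (machine, process) indices responsible for this packet."""
    slot = zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port))
    slot %= n_machines * procs_per_machine
    return divmod(slot, procs_per_machine)

def nic_filter(packet, this_machine, n_machines, procs_per_machine):
    """Return the local process index, or None to reject the packet."""
    machine, proc = assign(*packet, n_machines, procs_per_machine)
    return proc if machine == this_machine else None
```

With this scheme every machine sees every packet, but exactly one machine accepts it; the others reject it at the filter without further processing.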
Based upon readily available commodity hardware.
The data is more complex, and contains a wider and more complex set of
interrelationships.
Although the data, and the information produced during
analysis, may vary between research exercises, many
analysis tasks share common elements. GRIDprobe
therefore provides an analysis toolkit designed to enable
fast design, verification and execution of analysis
processes. The toolkit, which is written in the Python
programming language, provides:
[Figure: block diagram of the monitor and the analysis toolkit. Labels include Network, Filter, Rx Buffer Pool, Kernel, User, Control Thread, MAC, Protocol Modules, State Pool, Extracted Data, Reference to State, Log Buffer, RAID File System, Log, Data Retrieval, Data Selection, Summary and Selection, Analysis, Results, and Tools (Presentation, Visualisation, Statistical).]
The principal software components of the
GRIDprobe monitor.
For the sake of clarity only one GRIDprobe process is
illustrated, and data extraction and state reference are
shown only for the TCP, HTTP, and HTML protocol
modules.
Stateful: The syntactic or semantic units employed by
protocols above the transport layer may span many
packets. Data extraction modules must therefore
maintain sufficient state to interpret and associate
data throughout the lifetime of the connections
monitored.
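As an illustration of why per-connection state is needed, the sketch below accumulates TCP payloads for each connection until a complete HTTP header block has arrived. It is a simplified assumption-laden example: the class name `HeaderAssembler` is hypothetical, and payloads are assumed to be delivered in sequence order (a real monitor must also handle loss and reordering).

```python
class HeaderAssembler:
    """Collects payload bytes per connection until an HTTP header block ends."""

    def __init__(self):
        self.buffers = {}  # connection id -> bytearray of unparsed bytes

    def feed(self, conn_id, payload):
        """Add one segment's payload; return header lines once complete, else None."""
        buf = self.buffers.setdefault(conn_id, bytearray())
        buf += payload  # bytearray is extended in place
        end = buf.find(b"\r\n\r\n")  # blank line terminates the header block
        if end < 0:
            return None  # header still spans further packets
        header = bytes(buf[:end])
        del buf[:end + 4]  # keep any body bytes for later processing
        return header.decode("latin-1").split("\r\n")

    def close(self, conn_id):
        """Discard state when the connection ends."""
        self.buffers.pop(conn_id, None)
```

State is held only for the lifetime of the connection; `close` releases it so that memory use tracks the number of live connections.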
Data Analysis Components
[Figure: diagram of the Python retrieval interface object. Labels include Python Methods, C Data Structures, C Method Functions, Meta Data, Fast Accessor Functions, Data Conversion Methods, Object Constructor, Read Functions, Data Manipulation Functions, Analysis Methods, Object Instantiation, Trace File, Log Buffer and RAID File System.]
New algorithms are being developed to
synchronise time stamps generated by
distributed monitors.
All significant events observed on the
network (e.g. packet arrivals, application-level events) are associated with a precise
time stamp generated by the monitor's
network interface as packets arrive.
A set of protocol-specific analysis classes, and further classes associating,
integrating and analysing protocol-spanning data.
[Figure: Nprobe processes running on CPU 1 to CPU N, each with an Rx Buffer fed from a NIC through a filter. Legend: Method Invocation, Data Flow.]
Data retrieval from Log Files
GRIDprobe scalability.
The simple NIC filter rejects packets which are to
be processed by other probes in the cluster;
accepted packets are placed in the buffer pool
associated with the GRIDprobe process which
will extract the required data.
TCP modelling reveals that server latency increased sharply at around 12.15
and reached several hundred milliseconds.
[Plot: calculated server latency (ms) against time of day, 11.30am to 1.30pm.]
Network round trip times also increased, but only by some tens of
milliseconds.
[Figure: visualisation of a small web page download.]
Classes implementing analysis techniques developed to make use of the
rich data set collected. A TCP modelling class, for instance, builds a
model of TCP connection activity which uses knowledge of higher-level
activity to distinguish time components contributed by network, transport
and application-level activity. By determining the causality relationship
between packets and between TCP-level and application-level events,
delays contributed by applications, round trip times, loss and network
path performance characteristics can be inferred. The precise
characterisation of connection and application behaviour obtained may
be used as the input to further simulation classes which can be used to
investigate `what if' scenarios by varying parameters such as network
transit times, application latencies or loss rates.
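One way to picture the separation of network and application time is the small sketch below. It is an assumption for illustration only, not the TCP modelling class itself: it takes the monitor's time stamps for a request and for the first byte of the response, and subtracts the round trip time between the monitor and the server to leave an estimate of the delay contributed by the server. The function and parameter names are hypothetical.

```python
def server_latency(t_request, t_response, server_rtt):
    """Estimate server application delay in seconds.

    t_request:  time the request passed the monitor
    t_response: time the first response byte passed the monitor
    server_rtt: round trip time between monitor and server
    """
    # The observed gap includes one monitor-to-server round trip;
    # whatever remains is attributed to the server application.
    return max(0.0, (t_response - t_request) - server_rtt)
```

Varying `server_rtt` while holding the estimated latency fixed is a simple instance of the 'what if' style of question mentioned above.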
Data visualisation and presentation tools which present the collected data
in readily comprehensible form and aid reasoning about the data and the
relationships it contains. The results of analysis are presented alongside the
raw data, so that their derivation can be examined.
Existing analysis and infrastructure classes can be subtyped to introduce the new or modified functionality
appropriate to specific studies. As new protocol extraction
modules are added to the data collection architecture, new
analysis classes are written to enhance the existing analysis
repertoire.
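The subtyping approach can be sketched in Python as below. The class names and the dictionary-based connection records are hypothetical, chosen only to show how a study-specific class might override one method of an existing analysis class; they are not the toolkit's actual classes.

```python
class ConnectionAnalysis:
    """Base analysis: mean duration over all selected connections."""

    def __init__(self):
        self.durations = []

    def select(self, conn):
        """Decide whether a connection record is of interest (default: all)."""
        return True

    def add(self, conn):
        """Feed one connection record into the analysis."""
        if self.select(conn):
            self.durations.append(conn["end"] - conn["start"])

    def result(self):
        """Return the mean duration, or None if nothing was selected."""
        if not self.durations:
            return None
        return sum(self.durations) / len(self.durations)


class PortAnalysis(ConnectionAnalysis):
    """Study-specific subtype: restrict the analysis to one server port."""

    def __init__(self, port):
        super().__init__()
        self.port = port

    def select(self, conn):
        return conn["server_port"] == self.port
```

Only the selection rule changes in the subtype; accumulation and result calculation are inherited unchanged.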
[Plot: calculated pRTT (ms) against time of day, 11.30am to 1.30pm.]
An analysis control and result collection and aggregation infrastructure.
Log file records for the period between 11.30am and 1.50pm (when the server
was expected to become busy as users start browsing during their lunch hour)
were analysed in detail:
A data retrieval interface automatically generated from the monitor
configuration used to collect the log file.
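The idea of generating a reader from the collection configuration can be sketched as follows. This is a pure-Python illustration under stated assumptions: the configuration is taken to be a list of (field name, `struct` format code) pairs, and `make_reader` is a hypothetical helper. The actual log format and interface are not described here.

```python
import struct
from collections import namedtuple

def make_reader(record_name, field_defs):
    """Build a record layout and a reader function from field definitions.

    field_defs is a list of (name, struct format code) pairs, standing in
    for the monitor configuration used when the log was written.
    """
    names = [name for name, _ in field_defs]
    # "!" selects network byte order with no padding between fields.
    layout = struct.Struct("!" + "".join(code for _, code in field_defs))
    Record = namedtuple(record_name, names)

    def read(buf, offset=0):
        """Decode one record at offset; return (record, next offset)."""
        values = layout.unpack_from(buf, offset)
        return Record._make(values), offset + layout.size

    return layout, read
```

Because the reader is derived from the same definitions that drove collection, a change in the collected fields changes the reader automatically.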
Modular and extensible: Data extraction is carried
out by protocol-specific modules which can be
configured to collect only the data of interest. New
protocol modules can be added as required and data
collection tailored to the needs of specific studies.
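A modular arrangement of this kind might look like the sketch below. The registry, the decorator and the field names are all hypothetical; the example only illustrates protocol-specific modules that are registered by name and configured to extract a chosen subset of fields.

```python
MODULES = {}  # protocol name -> module class (hypothetical registry)

class ProtocolModule:
    """Base class: extracts only the configured fields from parsed data."""
    fields = ()  # fields this protocol module is able to extract

    def __init__(self, wanted):
        unknown = set(wanted) - set(self.fields)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        self.wanted = tuple(wanted)

    def extract(self, parsed):
        """Keep just the fields of interest from a parsed protocol unit."""
        return {name: parsed[name] for name in self.wanted}

def register(name):
    """Class decorator that adds a module to the registry."""
    def decorator(cls):
        MODULES[name] = cls
        return cls
    return decorator

@register("tcp")
class TcpModule(ProtocolModule):
    fields = ("seq", "ack", "flags", "window")

@register("http")
class HttpModule(ProtocolModule):
    fields = ("method", "url", "status", "content_length")

def configure(config):
    """Instantiate only the modules named in a study's configuration."""
    return {name: MODULES[name](wanted) for name, wanted in config.items()}
```

Adding a new protocol then amounts to defining and registering one more class; existing modules and configurations are unaffected.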
GRIDprobe was used to monitor all traffic between the University and the site for a 24-hour
period. The monitor was configured to collect full details of all TCP packets carrying HTTP
traffic to and from the site; information relevant to all HTTP transactions observed
(gathered by reassembling the TCP byte stream and parsing the HTTP headers carried); and all
of the links contained in the HTML documents returned, together with their type (e.g. links to in-lined images
or page decorations, style sheets, frames or other pages) and the time of their arrival.
Calculated pRTT ms
The data is more heterogeneous as it spans a wider range of protocols.
Users of a popular news web site commonly experience frustrating
delays when downloading pages from the site.
Using the data gathered at the TCP, HTTP and HTML levels, the
patterns of network, browser and server activity involved in downloading
entire pages were reconstructed, giving the times shown. Average download
times increased from approximately 6 seconds before lunchtime to a peak of
approximately 27 seconds. The increases in server latency and network round
trip times alone do not account for a rise of this magnitude.
[Plot: page download time (ms) against time of day, 11.30am to 1.30pm.]
The loss probability for a data-carrying TCP segment did not rise
significantly over the period, but that of a SYN segment (used to open a
connection) rose sharply. The implication is that the server became
overwhelmed and shed load by refusing connections; this introduced
considerable delay as the browser will wait for 3 seconds before trying again
(the period doubling for each subsequent retry).
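The cost of refused connections follows directly from the retry schedule described above. The short sketch below works through the arithmetic, assuming an initial wait of 3 seconds that doubles on each retry, as stated in the text; the function name is hypothetical.

```python
def syn_retry_delay(lost_syns, initial_timeout=3.0):
    """Total seconds spent waiting before the SYN that finally succeeds.

    lost_syns is the number of consecutive SYN segments that went
    unanswered; each wait is double the previous one.
    """
    return sum(initial_timeout * 2 ** k for k in range(lost_syns))
```

One lost SYN therefore adds 3 seconds to a connection, two add 9 seconds (3 + 6), and three add 21 seconds (3 + 6 + 12), which is large compared with the latency and round trip time increases reported above.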
Probability of Retransmission
GRIDprobe is:
The log files collected by GRIDprobe are post processed to extract
the information that they contain. Because the contained data is
more comprehensive than that produced by more conventional
monitors, post-collection analysis presents new challenges:
Page download time ms
GRIDprobe passively monitors the network(s) to which it is
attached, typically through a fibre-optic splitter or port monitoring
at a network switch. Because GRIDprobe is designed to collect data
from high-bandwidth sources across a range of protocols, verbatim
recording of packet contents would generate an unacceptable volume
of data; the data of interest must therefore be extracted and stored in a compact form.
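The difference between verbatim recording and extraction can be illustrated with a TCP segment. The sketch below parses the standard 20-byte fixed TCP header and keeps only a few fields in a compact record; the choice of fields and the `COMPACT` record layout are assumptions made for the example, not the GRIDprobe log format.

```python
import struct

# Fixed part of the TCP header: ports, sequence and acknowledgement numbers,
# data offset byte, flags byte, window, checksum, urgent pointer.
TCP_HEADER = struct.Struct("!HHIIBBHHH")

# Hypothetical compact record: ports, seq, ack, flags, payload length.
COMPACT = struct.Struct("!HHIIBH")

def extract_tcp(segment):
    """Reduce a whole TCP segment to a fixed-size record of selected fields."""
    sport, dport, seq, ack, offset, flags, _win, _csum, _urg = \
        TCP_HEADER.unpack_from(segment)
    header_len = (offset >> 4) * 4  # data offset is counted in 32-bit words
    payload_len = len(segment) - header_len
    return COMPACT.pack(sport, dport, seq, ack, flags, payload_len)
```

In this example a segment carrying 1000 bytes of payload is reduced to a 15-byte record, while the information needed to model the connection (sequence space, flags and payload size) is retained.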
GRIDprobe in use
Calculated server latency ms
Post-collection data analysis
[Plot: probability of retransmission for data and SYN segments against time of day, 11.30am to 1.30pm.]
Page download time ms
Data Collection
Downloads of the same pages were simulated, but using persistent HTTP
connections on which multiple page components are requested. In this case the
server's capacity to shed load by refusing connections is limited to only a few
instances per page; although page download times would have risen, they
would peak at the much lower time of approximately 9 seconds.
[Plot: simulated page download time (ms) against time of day, 11.30am to 1.30pm.]