Internet End-to-end Performance Monitoring – Bandwidth (IEPM-BW)

Tasks

Develop documentation

Develop web site with goals, benefits, desired outcome

Add reports for existing SLAC measurements

Add technical information

Develop distributable versions of code for bandwidth and network performance measurements, to include:

Choice of amount of network bandwidth to be used

Choice of tools to make measurements:

Very low network traffic bandwidth estimator (ABwE)

TCP memory-to-memory throughput (iperf)

Bulk data throughput applications (bbftp, bbcp, GridFTP)

Ping, traceroute

Enable choice of security requirements at remote hosts (ssh vs. run servers)

Integrated traceroute recording, analysis and reporting

Integrated anomalous event detection and reporting

Infrastructure management tools for monitoring sites

Code distribution tools
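The choices listed above (bandwidth budget, measurement tools, security mode at the remote host) could be captured in a small per-host configuration record. The following is only a sketch under stated assumptions: the class `SiteConfig`, its field names, and the lowercase tool identifiers are hypothetical and are not taken from the IEPM-BW code.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the tools named in the task list;
# this is not an actual IEPM-BW configuration schema.
KNOWN_TOOLS = {"abwe", "iperf", "bbftp", "bbcp", "gridftp", "ping", "traceroute"}
ACCESS_MODES = {"ssh", "server"}


@dataclass
class SiteConfig:
    """Illustrative measurement settings for one remote host (all fields are assumptions)."""
    remote_host: str
    max_bandwidth_mbps: float  # cap on the network bandwidth measurements may use
    tools: list = field(default_factory=lambda: ["ping", "traceroute"])
    access_mode: str = "ssh"   # "ssh" account, or a long-running "server" at the remote host

    def validate(self):
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        if self.max_bandwidth_mbps <= 0:
            problems.append("max_bandwidth_mbps must be positive")
        unknown = sorted(set(self.tools) - KNOWN_TOOLS)
        if unknown:
            problems.append("unknown tools: " + ", ".join(unknown))
        if self.access_mode not in ACCESS_MODES:
            problems.append("access_mode must be 'ssh' or 'server'")
        return problems
```

A site could, for example, start with `SiteConfig("remote.example.org", 10.0, ["abwe", "ping"])` and add heavier tools such as iperf once the bandwidth budget is agreed with the remote site.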

Establish relations with support people at CMS and ATLAS tier 0 and 1 sites

Contact responsible people at sites

Provide information on goals, benefits and desired outcome from the project

Establish Point Of Contact (POC) person at each site for project

Get ssh accounts at sites, or set up servers with checks to restart when necessary
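Where servers are run instead of ssh accounts, the "checks to restart when necessary" might look like the sketch below. It assumes the server accepts TCP connections on a known port; the function names and the caller-supplied `restart` callback are hypothetical, since the plan does not say how a server would actually be restarted.

```python
import socket


def server_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_and_restart(host, port, restart):
    """Call restart() only when the server is not reachable.

    `restart` is a caller-supplied function (hypothetical); the actual
    restart mechanism depends on the site and is not specified here.
    Returns True if a restart was triggered.
    """
    if server_is_listening(host, port):
        return False
    restart()
    return True
```

Such a check would typically be run periodically (for example from cron) so that a measurement server that has died is noticed without manual intervention.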

For each monitoring site in turn:

Install code, provide initial configuration template

Work with POC at each tier 0 and 1 site to assist in tuning configuration so tier 0 and 1 sites can monitor each other (as required by site)

On request, provide guidance as tier 0 and 1 sites set up to monitor chosen tier 2 sites of interest

Upgrade code at tier 0 and 1 sites as new versions become available

Possibly: scheduling

Deliverables

A small focused infrastructure of 10-20 self-managed sites with regular active bandwidth performance measurements. Initial sites to include:

CERN (LHC tier 0 site)

BNL, FNAL (ATLAS and CMS tier 1 sites)

Caltech, U Michigan, SDSC (LHC tier 2 sites)

SLAC (BaBar tier 0 site)

Network sites (ESnet, StarLight)

BaBar tier 1 sites (INFN/Padova, IN2P3, RAL)

A new production quality anomalous event detection toolkit

Publication on algorithm and its performance

Production-level, distributable code to implement the algorithm

Integrated into IEPM-BW toolkit

Distributable code

Lightweight bandwidth estimation (ABwE)

Robust toolkit for making regular end-to-end throughput performance measurements, archiving the data, analyzing and reporting on the results

Automated detection of anomalous events

Access to data:

Interactively via the web in various easy-to-use formats (e.g. CSV)

Upon demand for large volumes of data

Via a prototype web services interface

Milestones

Year 1, Q1:

Develop initial documentation on goals, benefits, desired outcome, suitable for presentation to contacts at ATLAS and CMS tier 0 sites.

Extend the IEPM-BW home web site to add information on the current project

Identify contacts at tier 0 and 1 sites.

Reach the contacts by phone and email, explain what is proposed, and identify POCs.

Set up ssh accounts at two tier 0 and 1 sites (e.g. CERN and FNAL)

Polish up the IEPM-BW toolkit to improve its distributability

Year 1, Q2:

Install IEPM-BW toolkit at the two tier 0 and 1 sites

Provide initial configuration for IEPM-BW toolkit at two sites

Improve documentation on setting up and managing monitoring sites

As needed, visit selected sites and provide hands-on guidance to the POC on adding the site's own chosen tier 2 sites and on using the management tools effectively

Extend IEPM-BW web site to provide pointers to new sites

Identify contacts at further tier 0 and 1 sites (e.g. BNL)

Solicit feedback for needed improvement to IEPM-BW toolkit

Year 1, Q3:

Understand and analyze requests; decide which are most cost-effective to add

Study and set up a CVS repository for the IEPM-BW toolkit to enable others to contribute to its development

Define required changes and identify who will make them

Install IEPM-BW at BNL

Finish implementing a traceroute analysis toolkit

Study, develop and provide rough implementations for algorithms to detect anomalous behavior (events) in measurements (bandwidth, throughput and RTT)
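As one example of the kind of "rough implementation" this milestone refers to, a simple detector can compare each new measurement with a baseline built from the preceding samples. This is a minimal sketch of a generic rolling-baseline test, not the algorithm the project selected; the function name, window size, and threshold are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def detect_drops(samples, window=10, threshold=3.0, min_std=1e-9):
    """Return indices of samples that fall more than `threshold` standard
    deviations below the mean of the preceding `window` samples.

    `samples` is a time-ordered list of measurements (for example
    throughput in Mbits/s). `min_std` avoids division by zero when the
    history is constant.
    """
    events = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_std)
        if (mu - samples[i]) / sigma > threshold:
            events.append(i)
    return events
```

Only drops are flagged here because a sudden fall in bandwidth or throughput is usually the event of interest; for RTT the sign of the comparison would be reversed, since an anomaly shows up as an increase.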

Year 1, Q4:

Coordinate activities to improve IEPM-BW across multiple sites

Make selected changes to IEPM-BW toolkit

As appropriate add more measurement tools such as GridFTP

Compare the performance of various implementations of anomalous event detection algorithms and decide on effectiveness and applicability

Year 2, Q1:

Tune parameters of chosen anomalous event algorithm(s) and document guidance for users

Identify further sites for IEPM-BW deployment

Add traceroute analysis and presentation to toolkit

Design new IEPM-BW toolkit to simplify the security requirements (remove need for ssh accounts) and make much less intensive use of the network
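A basic building block for the traceroute analysis mentioned above is comparing two recorded routes to the same host and reporting where they first diverge. The sketch below is illustrative only: the function name, the representation of a route as a list of router address strings, and the treatment of a non-responding hop (`"*"`) as a wildcard are all assumptions, not the toolkit's actual design.

```python
def first_route_change(previous_hops, current_hops):
    """Return the 0-based hop index where two routes first differ,
    or None if no difference is found.

    Each route is a list of router addresses as strings. A hop recorded
    as "*" (no response) matches anything, so that a single lost probe
    is not reported as a route change (an assumption of this sketch).
    """
    for i, (old, new) in enumerate(zip(previous_hops, current_hops)):
        if "*" in (old, new):
            continue
        if old != new:
            return i
    # Same hops so far, but one route is longer than the other.
    if len(previous_hops) != len(current_hops):
        return min(len(previous_hops), len(current_hops))
    return None
```

Reporting the hop index rather than a yes/no answer makes it easier to tell a change near the measurement host from one near the remote site.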

Year 2, Q2:

Make contact with further sites for IEPM-BW deployment

Identify POCs, train POCs, set up ssh accounts

Make anomalous event detection code distributable and integrate with IEPM-BW toolkit

Document anomalous event detection and add to web site

Start implementation of a new lightweight version of IEPM-BW (lighter in both bandwidth used and security requirements)

Design filters for anomalous events to add hysteresis and reduce the amount of “noise”.

Test out anomalous event alert notification on developers and network admins at SLAC
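The hysteresis filter mentioned in this quarter can be illustrated with two thresholds: an alert is raised when a measurement falls below a lower threshold and is cleared only once it recovers above a higher one. This is a minimal sketch of the general technique, assuming fixed thresholds supplied by the caller; the function name and the threshold values in the usage note are hypothetical.

```python
def hysteresis_alerts(samples, low, high):
    """Return a list of (index, state) transitions, state being "alert" or "clear".

    An alert starts when a sample falls below `low` and ends only when a
    sample rises above `high` (with low < high), so values hovering
    around a single threshold do not generate repeated notifications.
    """
    if low >= high:
        raise ValueError("low must be less than high")
    in_alert = False
    transitions = []
    for i, value in enumerate(samples):
        if not in_alert and value < low:
            in_alert = True
            transitions.append((i, "alert"))
        elif in_alert and value > high:
            in_alert = False
            transitions.append((i, "clear"))
    return transitions
```

With `low=60` and `high=80`, the series `[100, 55, 62, 58, 65, 59, 90, 95]` produces one alert and one clear, whereas a single threshold at 60 would have toggled several times while the values oscillated around it.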

Year 2, Q3:

Develop distributable version of lightweight IEPM-BW (IEPM-LITE)

Set up first IEPM-LITE remote site

Extend alert notification to other beta sites

Publish paper on anomalous event notification techniques and experiences

Year 2, Q4:

Add anomalous event notification to the IEPM-BW toolkit

Develop web services prototype access to bandwidth, throughput and RTT data

Develop documentation on how to access IEPM-BW data interactively, or via web services

Year 3, Q1:

Investigate providing a central archive of IEPM-BW data

Develop web services prototype access to traceroute data
