Develop documentation
Develop web site with goals, benefits, desired outcome
Add reports for existing SLAC measurements
Add technical information
Develop distributable versions of code for bandwidth and network performance measurements, to include:
Choice of amount of network bandwidth to be used
Choice of tools to make measurements:
o Very low network traffic bandwidth estimator (ABwE)
o TCP memory to memory throughput (iperf)
o Bulk data throughput applications (bbftp, bbcp, GridFTP)
o Ping, traceroute
Enable choice of security requirements at remote hosts (ssh vs. run servers)
Integrated traceroute recording, analysis and reporting
Integrated anomalous event detection and reporting
Infrastructure management tools for monitoring sites
Code distribution tools
Establish relations with support people at CMS and ATLAS tier 0 and 1 sites
Contact responsible people at sites
Provide information on goals, benefits and desired outcome from the project
Establish Point Of Contact (POC) person at each site for project
Get ssh accounts at sites, or set up servers with checks to restart when necessary
For each monitoring site in turn:
Install code, provide initial configuration template
Work with POC at each tier 0 and 1 site to assist in tuning configuration so tier 0 and 1 sites can monitor each other (as required by site)
On request, provide guidance as tier 0 and 1 sites set up to monitor chosen tier 2 sites of interest
Upgrade code at tier 0 and 1 sites as new versions become available
Possibly: scheduling
A small focused infrastructure of 10-20 self-managed sites with regular active bandwidth performance measurements. Initial sites to include:
CERN (LHC tier 0 site)
BNL, FNAL (ATLAS and CMS tier 1 sites)
Caltech, U Michigan, SDSC (LHC tier 2 sites)
SLAC (BaBar tier 0 site)
Network sites (ESnet, StarLight)
BaBar tier 1 sites (INFN/Padova, IN2P3, RAL)
A new production quality anomalous event detection toolkit
Publication on algorithm and its performance
Production level, distributable code to implement the algorithm
Integrated into IEPM-BW toolkit
Distributable code
Lightweight bandwidth estimation (ABwE)
Robust toolkit for making regular end-to-end throughput performance measurements, archiving the data, analyzing and reporting on the results
Automated detection of anomalous events
Access to data:
Interactively via the web in various easy to use formats (e.g. CSV)
Upon demand for large volumes of data
Via a prototype web services interface
Year 1, Q1:
Develop initial documentation on goals, benefits, desired outcome, suitable for presentation to contacts at ATLAS and CMS tier 0 sites.
Extend the IEPM-BW home web site to add information on the current project
Identify contacts at tier 0 and 1 sites.
Reach contacts by phone and email, explain what is proposed, identify POCs.
Set up ssh accounts at two tier 0 and 1 sites (e.g. CERN and FNAL)
Polish up the IEPM-BW toolkit to improve its distributability
Year 1, Q2:
Install IEPM-BW toolkit at the two tier 0 and 1 sites
Provide initial configuration for IEPM-BW toolkit at two sites
Improve documentation on setting up and managing monitoring sites
As needed, visit selected sites and provide hands-on guidance to the POC to add the site's own chosen tier 2 sites and to use the management tools effectively
Extend IEPM-BW web site to provide pointers to new sites
Identify contacts at further tier 0 and 1 sites (e.g. BNL)
Solicit feedback for needed improvement to IEPM-BW toolkit
Year 1, Q3:
Understand and analyze requests, decide which are most cost-effective to add
Study and set up a CVS repository for the IEPM-BW toolkit to enable others to contribute to its development
Define required changes and identify who will make them
Install IEPM-BW at BNL
Finish implementing a traceroute analysis toolkit
Study, develop and provide rough implementations for algorithms to detect anomalous behavior (events) in measurements (bandwidth, throughput and RTT)
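One of the simplest candidates for such a rough implementation is a sliding-window baseline test. The sketch below is illustrative only and is not the project's chosen algorithm: the function name find_anomalies, the window of 20 samples, and the 3 standard deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=20, threshold=3.0):
    """Return indices of samples that drop sharply below the recent baseline.

    A sample is flagged when it lies more than `threshold` standard
    deviations below the mean of the preceding `window` samples.
    Both parameters are illustrative assumptions, not tuned values.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        if spread == 0:
            # Perfectly flat history: no meaningful deviation scale.
            continue
        if (baseline - samples[i]) / spread > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical throughput series (Mbits/s) with one sudden drop at index 30.
throughput = [100.0 + (i % 5) for i in range(30)] + [40.0] + [100.0 + (i % 5) for i in range(5)]
print(find_anomalies(throughput))
```

The same test could in principle be applied to RTT by flagging increases instead of drops. Comparing alternatives of this kind is the subject of the Year 1, Q4 evaluation.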
Year 1, Q4:
Coordinate activities to improve IEPM-BW across multiple sites
Make selected changes to IEPM-BW toolkit
As appropriate add more measurement tools such as GridFTP
Compare the performance of various implementations of anomalous event detection algorithms and decide on effectiveness and applicability
Year 2, Q1:
Tune parameters of chosen anomalous event algorithm(s) and document guidance for users
Identify further sites for IEPM-BW deployment
Add traceroute analysis and presentation to toolkit
Design new IEPM-BW toolkit to simplify the security requirements (remove need for ssh accounts) and make much less intensive use of the network
Year 2, Q2:
Make contact with further sites for IEPM-BW deployment
Identify POCs, train POCs, set up ssh accounts
Make anomalous event detection code distributable and integrate with IEPM-BW toolkit
Document anomalous event detection and add to web site
Start implementation of new lightweight version of IEPM-BW (lightweight in bandwidth used and in security requirements)
Design filters for anomalous events to add hysteresis and reduce the amount of “noise”.
Test out anomalous event alert notification on developers and network admins at SLAC
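The hysteresis filter mentioned above can be illustrated with a two-threshold scheme: an alert is raised when a measurement falls below one level and is cleared only when it recovers above a second, higher level. This is a minimal sketch under assumed names and values (hysteresis_alerts, trigger_below, clear_above, and the sample numbers are all hypothetical), not the filter design the project will adopt.

```python
def hysteresis_alerts(values, trigger_below, clear_above):
    """Return (index, event) pairs, where event is "raise" or "clear".

    An alert is raised when a value drops below `trigger_below` and is
    cleared only once a value rises above `clear_above`. The gap between
    the two thresholds suppresses flapping around a single cutoff.
    Thresholds are illustrative assumptions.
    """
    if clear_above <= trigger_below:
        raise ValueError("clear_above must be greater than trigger_below")
    alerting = False
    events = []
    for i, value in enumerate(values):
        if not alerting and value < trigger_below:
            alerting = True
            events.append((i, "raise"))
        elif alerting and value > clear_above:
            alerting = False
            events.append((i, "clear"))
    return events


# Hypothetical throughput readings (Mbits/s) hovering around 60 before recovering.
readings = [100, 58, 62, 59, 63, 85, 100]
print(hysteresis_alerts(readings, trigger_below=60, clear_above=80))
```

With a single cutoff at 60 these readings would change alert state four times. With the two thresholds there is one raise and one clear, which is the reduction in notification noise the filter is meant to provide.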
Year 2, Q3:
Develop distributable version of lightweight IEPM-BW (IEPM-LITE)
Set up first IEPM-LITE remote site
Extend alert notification to other beta sites
Publish paper on anomalous event notification techniques and experiences
Year 2, Q4:
Add anomalous event notification to the IEPM-BW toolkit
Develop web services prototype access to bandwidth, throughput and RTT data
Develop documentation on how to access IEPM-BW data interactively, or via web services
Year 3, Q1:
Investigate providing a central archive of IEPM-BW data
Develop web services prototype access to traceroute data