Monitoring Grid Services

Yin Chen

1. Issues of Grid monitoring

1.1 What are the goals of Grid monitoring

Propagate errors to users/management

Performance monitoring to

- tune the application

- use the Grid more efficiently

The question is

- NOT how to measure resources

- but how to deliver information to end-users and system/Grid

1.2 What are the characteristics of a Grid system

Complex distributed system => unexpectedly low performance is often observed

- Where is the bottleneck?

- application

- operating system

- disks

- network adapters on either the sending or the receiving host

- network switches, routers

- Experience of the Netlogger group

- 40% network, 40% application, 20% host problems

- application: 50% client, 50% server process problems

Dynamic environment

World-wide distributed environment with

- high latency

- frequent faults

- very heterogeneous resources

Security (authentication, authorisation, encoding)

1.3 What information may need to be monitored

Disk space, processor speed, network bandwidth, specialised device time and CPU time, CPU load, memory load, network load, network communication time (including both TCP/IP protocol-processing time and raw network transmission time), number of parallel streams, stripes, TCP/IP buffer size, and disk access time (including the time to copy data to or from the local hard disk on the server). [2][3]

Some of this information is relatively static, while the rest is dynamic run-time information.

1.4 What are the characteristics of performance-monitoring data

Run-time monitoring data goes "old" quickly:

- While information is being accessed and transported through the network, the state of the system component may already have changed, potentially rendering the information invalid.

- Producers should be located near the entities they monitor.

- Information should be transported from producer to consumer as rapidly and efficiently as possible.

- The age of information should be made explicit, for example through timestamps and time-to-live metadata.
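As an illustration of explicit timestamps and time-to-live metadata, the following minimal sketch attaches both to every sample. The class name, field names and the 30-second default are assumptions made for this example and do not come from any of the systems discussed here.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    """One monitoring sample that carries its own freshness metadata."""
    source: str                      # entity that was measured (hypothetical naming)
    metric: str                      # e.g. "cpu_load"
    value: float
    timestamp: float = field(default_factory=time.time)  # seconds since the epoch
    ttl: float = 30.0                # seconds the value may be trusted (assumed default)

    def age(self, now: Optional[float] = None) -> float:
        """Seconds elapsed since the sample was taken."""
        now = time.time() if now is None else now
        return now - self.timestamp

    def is_stale(self, now: Optional[float] = None) -> bool:
        """True once the sample has outlived its time-to-live."""
        return self.age(now) > self.ttl
```

A consumer can then discard a sample whose is_stale() returns True instead of acting on outdated state.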

Updates are frequent:

- Dynamic performance information is typically updated more frequently than it is read.

Performance information is often stochastic

- It is often impossible to characterise the performance of a resource or an application component by using a single value. Thus, dynamic performance information may carry quality-of-information metrics quantifying its accuracy, distribution, lifetime, and so on, which may need to be calculated from the raw data. [1][4]
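A minimal sketch of how such metrics might be calculated from raw data is given below. The function name and the particular choice of statistics are assumptions for illustration; a real system may report different quality-of-information metrics.

```python
import statistics
from typing import Dict, Sequence


def summarise(samples: Sequence[float]) -> Dict[str, float]:
    """Describe a set of raw samples by its distribution, not a single value."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        # The sample standard deviation needs at least two points.
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }
```

For example, summarise([1.0, 2.0, 3.0, 4.0]) reports a mean and median of 2.5 together with the spread, which tells a consumer how far a single reading can be trusted.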

1.5 Related work

1.5.1 MDS

- The Monitoring and Discovery Service (MDS) is the Grid information service used in the Globus Toolkit and built on top of the Lightweight Directory Access Protocol (LDAP).

- MDS is primarily used to address the resource selection problem, namely, how does a user identify the host or set of hosts on which to run an application?

- It has a decentralised structure that allows it to scale, and it can handle static or dynamic data about resources, queues and the like.

MDS Architecture

MDS has a hierarchical structure that consists of three main components:

- A Grid Index Information Service (GIIS) provides an aggregate directory of lower level data.

- A Grid Resource Information Service (GRIS) runs on a resource and acts as a modular content gateway for a resource.

- Information Providers (IPs) interface from any data collection service and then talk to a GRIS.


- Each service registers with others using a soft-state protocol that allows dead resources to be cleaned up dynamically. Each level also caches data, which minimises re-transfer of data that is not yet stale and reduces network overhead. [5][7]
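The idea of soft-state registration can be sketched as follows: a registration is only valid for a limited lifetime and disappears unless it is refreshed. This is an illustration of the general technique, not the MDS implementation; the class name, method names and the lifetime value are assumptions.

```python
from typing import Dict, List


class SoftStateRegistry:
    """Registrations expire unless the registered service refreshes them."""

    def __init__(self, lifetime: float) -> None:
        self.lifetime = lifetime              # seconds a registration stays valid
        self._expiry: Dict[str, float] = {}   # service name -> expiry time

    def register(self, name: str, now: float) -> None:
        """Create or refresh a registration."""
        self._expiry[name] = now + self.lifetime

    def live(self, now: float) -> List[str]:
        """Drop expired entries and return the names that are still registered."""
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A service that dies simply stops refreshing, so no explicit de-registration message is needed to clean it up.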


1.5.2 GMA

- The Grid Monitoring Architecture (GMA) is defined within the Global Grid Forum (GGF).

- GMA is an architecture for monitoring components that specifically addresses the characteristics of Grid platforms.

- The GMA consists of three components: Consumers, Producers, and a Registry. Producers register themselves with the Registry, and Consumers query the Registry to find out what types of information are available and to locate the corresponding Producers. The Consumer can then contact a specific Producer directly.

- As currently defined, the GMA does not specify the protocols or the underlying data model to be used. [5][1]
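The interaction between the three GMA components can be sketched in a few lines. Since the GMA does not fix protocols or a data model, everything below (in-process method calls, event types as plain strings, the method names) is an assumption chosen only to show the register, lookup and direct-query steps.

```python
from typing import Callable, Dict, List


class Producer:
    """Publishes values for one or more event types."""

    def __init__(self, name: str, sources: Dict[str, Callable[[], float]]) -> None:
        self.name = name
        self.sources = sources   # event type -> function returning the current value

    def query(self, event_type: str) -> float:
        return self.sources[event_type]()


class Registry:
    """Directory that maps event types to the Producers offering them."""

    def __init__(self) -> None:
        self._by_type: Dict[str, List[Producer]] = {}

    def register(self, producer: Producer) -> None:
        for event_type in producer.sources:
            self._by_type.setdefault(event_type, []).append(producer)

    def event_types(self) -> List[str]:
        return sorted(self._by_type)

    def lookup(self, event_type: str) -> List[Producer]:
        return list(self._by_type.get(event_type, []))


class Consumer:
    """Finds Producers through the Registry, then contacts them directly."""

    def __init__(self, registry: Registry) -> None:
        self.registry = registry

    def collect(self, event_type: str) -> Dict[str, float]:
        return {p.name: p.query(event_type) for p in self.registry.lookup(event_type)}
```

Note that the Registry only holds the directory information; the measurement data itself flows directly from Producer to Consumer.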


1.5.3 R-GMA

- The European Data Grid Relational Grid Monitoring Architecture (R-GMA) is an implementation of the Grid Monitoring Architecture (GMA).

- It is based on Relational Database Management System (RDBMS) [8] and Java Servlet technologies.

- Its main use is the notification of events, that is, a user can subscribe to a flow of data with specific properties directly from a data source. For example, a user can subscribe to a load-data stream, and create a new Producer/Consumer pairing to allow notification when the load reaches some maximum or minimum.

R-GMA Architecture

- To register with a Registry, a Producer advertises a table name and the row(s) of a table to the


- The Producer module communicates with a ProducerServlet, which registers the information to the RDBMS in the Registry.

- The RDBMS holds the information for all the Producers (the registered table name, the identity, and the values of those fixed attributes) and the descriptions of each Producer's tables.

- Consumers can issue SQL queries against a set of supported tables.

- The ConsumerServlet consults the Registry to find suitable Producers. Then the ConsumerServlet, acting on behalf of the Consumer, issues new queries to the located Producers to request the data and return it to the Consumer.

- The ProducerServlet and ConsumerServlet are usually distributed and may run on machines remote from where the Producer or Consumer is located. [5][6]
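To give a feel for the relational view, the sketch below uses an in-memory SQLite database as a stand-in for a table advertised by Producers, and lets a Consumer query it with SQL. This is not the R-GMA API; the table name, column names and helper functions are hypothetical.

```python
import sqlite3
from typing import List


def make_table() -> sqlite3.Connection:
    """Create the (hypothetical) table that Producers publish into."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cpu_load (node TEXT, load_avg REAL, measured_at INTEGER)"
    )
    return conn


def publish(conn: sqlite3.Connection, node: str, load_avg: float, measured_at: int) -> None:
    """Producer side: add one row to the advertised table."""
    conn.execute(
        "INSERT INTO cpu_load VALUES (?, ?, ?)", (node, load_avg, measured_at)
    )


def overloaded_nodes(conn: sqlite3.Connection, threshold: float) -> List[str]:
    """Consumer side: an SQL query selecting nodes above a load threshold."""
    rows = conn.execute(
        "SELECT DISTINCT node FROM cpu_load WHERE load_avg > ? ORDER BY node",
        (threshold,),
    )
    return [row[0] for row in rows]
```

In R-GMA the rows would come from distributed Producers rather than a single local database, but the Consumer's view is the same: it asks a question in SQL and receives matching tuples.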


1.5.4 Hawkeye

- Hawkeye is a tool developed by the Condor group and designed to automate problem detection.

- The main use case was being able to offer warnings (e.g., high CPU load, low disk space, or resource failure). It also allows for easier software maintenance within a distributed system.

Architecture of Hawkeye

- Hierarchical architecture that consists of four major components: Pool, Manager, Monitoring Agent, and Module.

- A Pool is a set of computers, in which one computer serves as the Manager and the remaining computers serve as Monitoring Agents.

- A Manager is the head computer in the Pool that collects and stores monitoring information from each Agent registered to it. It is also the central target for queries about the status of any Pool member.

- A Monitoring Agent is a distributed information service component that collects data from its Modules and integrates them into a single Startd object. At fixed intervals, the Agent sends the Startd object to its registered Manager. An Agent can also directly answer queries about a particular Module.

- A Module is simply a sensor. [5][10]
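The Pool structure described above can be sketched as follows. A plain dictionary stands in for the Startd object, and all class and method names are assumptions for illustration rather than Hawkeye's actual interfaces.

```python
from typing import Callable, Dict, List


class MonitoringAgent:
    """Collects readings from its Modules (sensors) into one report."""

    def __init__(self, host: str, modules: Dict[str, Callable[[], float]]) -> None:
        self.host = host
        self.modules = modules   # module name -> sensor function

    def collect(self) -> Dict[str, object]:
        # A dict stands in for the single integrated object sent to the Manager.
        report: Dict[str, object] = {"host": self.host}
        for name, sensor in self.modules.items():
            report[name] = sensor()
        return report


class Manager:
    """Stores the latest report from each Agent and answers status queries."""

    def __init__(self) -> None:
        self.latest: Dict[str, Dict[str, object]] = {}

    def receive(self, report: Dict[str, object]) -> None:
        self.latest[str(report["host"])] = report

    def hosts_where(self, predicate: Callable[[Dict[str, object]], bool]) -> List[str]:
        return sorted(h for h, r in self.latest.items() if predicate(r))
```

A warning such as "low disk space" then becomes a query against the Manager, for example hosts_where(lambda r: r["disk_free_gb"] < 5.0).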


1.5.5 HBM

- The Globus Heartbeat Monitor (HBM) was a simple but reliable mechanism for detecting and reporting the failure (and state changes) of monitored processes.

- A daemon ran on each host gathering local process status information.

- A client was required to register each process that needed monitoring.

- Periodically, the daemon reviewed the status of all registered client processes, updated its local state and transmitted a report (on a per-process basis) to a number of specific external data collection daemons.

- The data collecting daemons provided local repositories that permitted knowledge of the availability of monitored components based on the received status reports.

- The daemons also recognised status changes and notified applications that registered an interest in a particular process.

- The HBM was capable of process status monitoring and fault detection.

- The HBM was unable to monitor resource performance. [11][12]
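The heartbeat mechanism can be illustrated with a small collector that marks a process as failed when no report has arrived within a timeout, and notifies interested parties on every status change. This is a sketch of the general technique only; the names and the timeout policy are assumptions, not the HBM's design.

```python
from typing import Callable, Dict, List, Tuple


class HeartbeatCollector:
    """Tracks the last report of each process and detects status changes."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                      # seconds without a report => failed
        self._last_seen: Dict[str, float] = {}
        self._status: Dict[str, str] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        """Register interest in status changes (process id, new status)."""
        self._listeners.append(callback)

    def report(self, process_id: str, now: float) -> None:
        """Record a heartbeat report for one process."""
        self._last_seen[process_id] = now

    def check(self, now: float) -> List[Tuple[str, str]]:
        """Re-evaluate every process and notify listeners of changes only."""
        changes: List[Tuple[str, str]] = []
        for process_id, seen in self._last_seen.items():
            status = "alive" if now - seen <= self.timeout else "failed"
            if self._status.get(process_id) != status:
                self._status[process_id] = status
                changes.append((process_id, status))
        for process_id, status in changes:
            for callback in self._listeners:
                callback(process_id, status)
        return changes
```

As the notes point out, such a mechanism says whether a process is alive but nothing about how well the resource is performing.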


1.5.6 NWS

- The Network Weather Service (NWS) [4] is a distributed system that periodically monitors and forecasts the performance that various network and computational resources can deliver over a given time interval.

- The NWS has been developed to provide statistical quality of service (QoS) information and to support dynamic schedulers.

NWS Architecture

- NWS includes sensors for TCP performance (bandwidth and latency), and available CPU and memory.

- Information taken from the NWS sensors provides data for the current system conditions to be forecast based on numerical models.

- Sensors measure the current performance of different resources.

- To achieve scalability, sensors are organised into sets (cliques), which are then ordered hierarchically. Each clique is configurable and has only one leader (determined by a distributed election protocol).

- The sensor controller persistently stores sensor measurements in plain text, time stamped strings.

- The NWS forecaster utilises this information to predict network performance over a specified time interval.

- NWS has a name server that provides a system wide directory service for NWS processes.

- All NWS processes are required to periodically refresh registration data with the name server.

NWS processes are stateless.

- The NWS provides a mechanism for monitoring current resource conditions and forecasting of future conditions.

- The NWS provides a scalable, extensible and non-intrusive means of monitoring resources.

- The NWS uses non-standard message formats, its name server and forecaster are centralised, and it appears to have no specialised security built in. [13][11]
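Forecasting from a series of time-stamped measurements can be sketched as below. The notes do not say which numerical models the NWS uses, so the three predictors here (last value, mean, median) and the rule of picking the one with the lowest accumulated error on past data are assumptions made for illustration.

```python
import statistics
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical predictors: each maps the history so far to a guess for the next value.
PREDICTORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "last": lambda history: history[-1],
    "mean": lambda history: statistics.fmean(history),
    "median": lambda history: statistics.median(history),
}


def forecast(series: Sequence[float]) -> Tuple[str, float]:
    """Return the best predictor on past data and its forecast for the next value."""
    if len(series) < 2:
        raise ValueError("at least two measurements are required")
    errors = {name: 0.0 for name in PREDICTORS}
    # Replay the series: predict each point from the points before it.
    for i in range(1, len(series)):
        history = series[:i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - series[i])
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](series)
```

On a steadily rising series the "last" predictor wins, while on a series with an isolated spike the median is more robust.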


1.5.7 GridRM[11]

- Based on the GMA; designed to monitor Grid resources rather than the executing applications.

- The Global Layer of GridRM

- The Local GridRM Layer


1.5.8 Summary and Conclusion

- A variety of systems exist for monitoring and managing distributed Grid-based resources and applications. Each system has its own strengths and weaknesses.

- Most systems tend to use standard and open components, taking advantage of the GGF-advocated architecture (GMA) and security to bind together the monitored resources.

- The similarities in architecture:

- At the lowest level, most approaches have a sensor or other program that generates a piece of data.

- At the resource level, some systems gather the data from several information collectors into a single component.

- Some systems allow data to be aggregated from a set of resources.

- Some systems have a directory component.

- Most systems have a decentralised hierarchical structure, which gives better fault tolerance.

- The systems differ in whether they use a push or a pull mechanism to transfer data, e.g. the MDS allows only a pull model, whereas R-GMA supports both the pull and the push models. [5]

- One study [5] found strong advantages in caching or pre-fetching the data, as well as a need to place primary components at well-connected sites because of the high load seen by all systems.

2 Project Proposal

2.1 Goal

- Realisation

- Lightweight and simple design

- Reliability and Robustness

2.2 Requirement ???

The detailed requirements still need to be gathered; one possible use case is the following:

Monitoring use case: Use Performance Monitoring for Management


A farm has several hundred nodes, and the site administrators need to collect the status of each node. The statistics can be used to monitor the usage of computing resources by Grid organisations. From these statistics the site manager can tell whether a subsystem has reached its maximum load, and can use that to justify a new purchase. The load information for different nodes also shows which type of compute node serves users best, which helps in selecting the hardware to buy.

Performance events required:

Configuration information: node name, domain, IP address, MAC address, gateway, rack number, position, brand, hardware type, network card type, CPU identification number, OS version, kernel version, memory size, swap space, home directory, local disk space.

Node status: machine uptime, CPU load (5-, 10- and 15-minute averages), memory load, disk load, and network load.
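A minimal sketch of a node-status sample, using only the Python standard library, is shown below. It covers just a few of the listed items, and the dictionary keys are assumptions. Note that os.getloadavg() reports 1-, 5- and 15-minute load averages (Unix-like systems only), which differs slightly from the intervals listed above.

```python
import os
import shutil
import socket
import time
from typing import Dict


def node_status(path: str = os.path.abspath(os.sep)) -> Dict[str, object]:
    """Collect one snapshot of node status for the file system containing `path`."""
    status: Dict[str, object] = {
        "timestamp": time.time(),
        "hostname": socket.gethostname(),
    }
    # os.getloadavg() is unavailable on some platforms and may fail with OSError.
    try:
        status["load_1_5_15"] = os.getloadavg()
    except (AttributeError, OSError):
        status["load_1_5_15"] = None
    usage = shutil.disk_usage(path)
    status["disk_used_fraction"] = usage.used / usage.total if usage.total else 0.0
    return status
```

A daemon on each node could call node_status() at the chosen sampling interval and forward the result to the collector.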


How the performance information will be used: Overall utilisation of the farm should be reported periodically to upper management.

Utilisation of individual nodes should be reported periodically to upper management to show which hardware performs best.

Management decisions concerning the Linux nodes can then be made.

Access needed: Streaming of data,

Summary of the data stream.

Requires access to historical information.

Archive database should be published.

Size of data to be gathered: Individual statistics will be small if all that is needed is a <timestamp, value> tuple.

Historical archives may become large after years of monitoring.
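A rough estimate shows how quickly the archive grows. The figures used in the example (300 nodes, 6 metrics per node, 16 bytes per <timestamp, value> tuple) are assumptions for illustration; only the 10-minute sampling interval comes from the current test system.

```python
def archive_bytes(nodes: int, metrics: int, interval_minutes: int,
                  days: int, bytes_per_tuple: int = 16) -> int:
    """Uncompressed archive size, ignoring any database or index overhead."""
    samples_per_day = (24 * 60) // interval_minutes
    return nodes * metrics * samples_per_day * days * bytes_per_tuple


# One year for the assumed farm: about 1.5 GB of raw tuples.
one_year = archive_bytes(nodes=300, metrics=6, interval_minutes=10, days=365)
```

Under these assumptions the raw data is modest, so the disk space requirement is driven mainly by how many metrics are kept and for how many years.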

Overhead constraints: Daemon needs to run on each node to collect machine status.

Machine status will be sent through local network.

Large amount of disk space is needed to save the log information.

None of these activities should interfere with normal system operation.

Frequency data will be updated: As often as possible without adding significant overhead to the local host and network.

Scalability should be considered. The current test system takes a data sample every 10 minutes.

Frequency data will be accessed: Every month, a report needs to be generated.

How timely does data need to be: The sampling interval should be long enough to smooth out short-term fluctuations.

Data need to be archived. The data will be compressed every six months.

Scale issues: There are at least 17 grid users who could simultaneously access the tool.

Security requirements: The facility managers.

Consistency or failure concerns: The sensitive data will be saved in the database and on a mirror site in case of failure.

Duration of the logging:

If cumulative measurements are taken daily, logging can continue for several years before old data has to be removed from the archives. Because space is limited, the log data will be compressed every half year, i.e. only one of every two data samples in the database will be kept.
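The compression step can be sketched as follows, assuming the archive is a time-ordered list of (timestamp, value) tuples and that only samples older than a cutoff are thinned. The function name and the cutoff parameter are assumptions for this example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]   # (timestamp, value)


def compress_archive(samples: List[Sample], cutoff: float) -> List[Sample]:
    """Keep every second sample older than `cutoff`; keep all newer samples.

    `samples` must be sorted by timestamp.
    """
    old = [s for s in samples if s[0] < cutoff]
    recent = [s for s in samples if s[0] >= cutoff]
    return old[::2] + recent
```

Applying the function again half a year later halves the oldest data once more, so resolution decreases gradually with age.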

Platforms : ??

Prototype of monitoring report: ???

2.3 Architecture

In this project I will attempt a pull model.

2.3.1 What is the pull model

- The monitor sends requests to the service for information. This implies repeated queries of resource attributes over some time period at a specific frequency.

- In a push model, by contrast, the service sends notifications to a subscribed sink.
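The difference between the two models can be sketched with a single service that supports both. The class and function names are assumptions for illustration; in a real deployment the calls would be network requests rather than method calls.

```python
from typing import Callable, Dict, List


class Service:
    """A monitored service that can be polled (pull) or can notify sinks (push)."""

    def __init__(self) -> None:
        self.value = 0.0
        self._sinks: List[Callable[[float], None]] = []

    def read(self) -> float:
        """Pull: the monitor asks for the current value."""
        return self.value

    def subscribe(self, sink: Callable[[float], None]) -> None:
        """Push: a sink registers to be notified of every update."""
        self._sinks.append(sink)

    def update(self, value: float) -> None:
        self.value = value
        for sink in self._sinks:
            sink(value)


def pull(services: Dict[str, Service]) -> Dict[str, float]:
    """One polling round initiated by the monitor."""
    return {name: service.read() for name, service in services.items()}
```

A pull round sees only the value that is current when the request arrives, whereas a subscribed sink sees every update.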

2.3.3 What are the benefits

- Less network traffic: collection is initiated only from the top.

- No time synchronisation problem: data is collected from all resources at the same time.

- According to Globus, the "push" model "generates a considerable amount of data and results in constant updates to the MDS. Standard LDAP databases are not designed to handle frequent updates. Furthermore, although this information is useful to applications, it is not used frequently."

- In a pull model, control rests with the server accepting the data. The server can determine the size of the file, select the appropriate alternate server that can best handle the data, and passively control the bandwidth and storage space. This is a far simpler management model.

- In a scheduled request, a client attempts to reserve a certain amount of space at some bandwidth for a future transfer operation. In a push model, the storage domain must make sure that the resources are available at the scheduled time whether or not the resources will actually be used. Recovery is also more difficult both in terms of not meeting the resource requirement as well as transfer error recovery. The pull model allows the storage domain to be in complete control. Thus, resource allocation is simplified and error recovery is confined to the storage domain.

- Autonomic computing: The pull model is based on distributing intelligence to the asset site, so that it becomes automated. Using machine-to-machine communication with connected sensors, the asset performs self-diagnostics, maintains and repairs itself, re-routes energy flows, schedules non-routine maintenance, and reports any out-of-the-ordinary activity that poses a security threat. IBM has invested many hundreds of millions in its project eLiza to create chips that let its computers carry out self-diagnostic and self-healing activities. IBM calls this autonomic computing: machine-to-machine communication takes place to optimise the performance of computing and network resources.

2.3.4 What might be the problem

- Current measurements must be gathered from all resources.

- If the real-time data volume is large, collection may become a bottleneck.

- It may not be useful for fault detection, since heartbeat events are valid only for a short time interval and must be delivered within that interval.

- It may not be useful for dynamic sensor management.

- "...Scripts running on the queue nodes would simply ship the results when finished, rather than the job-manager having to pull them or "poll" for job completion."

- Some other studies report that the push model is the most efficient in terms of bandwidth, because no requests are sent, only responses from the service.
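The bandwidth argument can be made concrete by counting messages. The sketch assumes that every poll costs one request and one response per resource, that every push costs one notification, and that all messages are of similar size; the figures in the example (300 nodes, 144 samples per day) are illustrative.

```python
def messages_pull(polls: int, resources: int) -> int:
    """Each poll is a request plus a response for every resource."""
    return 2 * polls * resources


def messages_push(updates: int, resources: int) -> int:
    """Each update is a single notification from every resource."""
    return updates * resources


# One day at a 10-minute interval (144 samples) for an assumed 300-node farm.
pull_per_day = messages_pull(polls=144, resources=300)
push_per_day = messages_push(updates=144, resources=300)
```

At the same sampling rate, pull needs twice as many messages as push. Pull only comes out ahead when the monitor polls less often than the resources would otherwise push updates.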

2.4 Specification ??

2.5 Implementation ???

2.6 Testing Plan ????

2.7 Timetable See separate page


