Monitoring service in distributed systems: review of INCA


Harshad Joshi

Abstract:

Distributed computing is currently one of the most widely exploited computing platforms, thanks to emerging techniques such as cloud computing. It is therefore becoming increasingly important to understand all aspects of distributed computing. From setting up the hardware to running applications on distributed systems, probably the most important supporting factor is the "monitoring service". In this survey, the Inca monitoring service, as implemented on the TeraGrid computing facility, is analyzed in detail along with other services.

Keywords: monitoring, distributed computing, Grid Computing, publish/subscribe systems

Introduction:

As the size of a distributed system increases, controlling its operation and maintenance becomes increasingly difficult. For a geographically distributed system (such as TeraGrid, a high-performance scientific computing facility spread across the US), monitoring and control are quite a challenge. The main challenge arises from the requirement that some operations be real-time or quasi real-time [1]. In this paper, some standard publish/subscribe middleware candidates, specifically designed and developed for the Grid, are examined for their architecture and functionality, and their advantages and disadvantages are discussed.

Distributed systems (DS) are becoming very popular, and in the near future a growing number of services based on DS concepts will become routine for many operations.

Previously, when CPU power and/or memory was limited, the main driving force for DS was to build systems with larger compute power and large amounts of (shared) memory to tackle more compute-intensive tasks (mainly scientific and engineering problems). This also led to supercomputer and cluster architectures. With advances in hardware technology, however, not only in CPUs (Moore's law is still driving the growth rate of CPU power [2]) but also in memory and GPU computing, cluster computing is achieving petascale performance with clusters of modest size [3]. With the Internet and other technologies such as mobile computing, new concepts have materialized in the last decade, such as geographically distributed computing and cloud computing, and these have attracted both academia (which has constructed large-scale scientific computing facilities such as TeraGrid [4]) and industry [5-7]. These systems are highly dispersed across different physical locations. On one hand, these advances are very attractive: they make computing boundaries invisible and make what is now called the "Internet of Things" a reality [8]. On the other hand, the ever-increasing complexity makes controlling and monitoring these systems, to make sure that everything works as intended, a challenging task [9].

Monitoring a DS typically requires producing data that can be collected remotely and updated frequently. This calls for algorithms that work efficiently even over slow connections, accurately yielding real-time (or at least quasi real-time) control over the system of interest. For example, if a compute node of a remote supercomputer has been switched on but does not respond for a long time, it will be considered malfunctioning. A real-time system does not need to be very fast, but it should be stable and respond within a reasonable predefined time limit. For a monitoring system to be considered a distributed real-time monitoring system, most of the monitoring data should be received within such a time limit. Traditional monitoring systems are highly centralized and may not scale very well.
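As a toy illustration of the response-limit idea above, a monitor can flag a node as malfunctioning once its last heartbeat is older than a predefined limit. The node names and the 30-second limit below are illustrative assumptions, not part of any particular monitoring system:

```python
import time

# Illustrative response limit: seconds a node may stay silent before
# being flagged as malfunctioning (assumed value, not from any real system).
RESPONSE_LIMIT = 30.0

# Hypothetical nodes: node-a just reported, node-b has been silent for 2 minutes.
last_heartbeat = {"node-a": time.time(), "node-b": time.time() - 120.0}

def malfunctioning(now=None):
    """Return the nodes whose last heartbeat exceeds the response limit."""
    now = time.time() if now is None else now
    return [n for n, t in last_heartbeat.items() if now - t > RESPONSE_LIMIT]

print(malfunctioning())  # only node-b exceeds the limit
```

A real distributed monitor would collect these heartbeats over the network; the point here is only that "real-time" means bounded, predictable response, not raw speed.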

Detailed reviews of what a monitoring service should look like can be found in the literature (e.g., ref [11]). Here we restrict the discussion to stating that a publish/subscribe (pub/sub) system seems to be the best solution for monitoring services, due both to its ability to disseminate data many-to-many and to the highly distributed nature of a DS. Publishers publish data, and subscribers receive the data they are interested in. Publishers and subscribers in a pub/sub system are independent and need to know nothing about each other. The middleware not only delivers data to its destination but also provides higher-level functionality such as data discovery, dissemination, filtering, persistence, and reliability. A subscriber can be automatically notified when new data becomes available. Compared to a traditional centralized client/server communication model, a pub/sub system is asynchronous and is usually distributed and scalable.
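The decoupling described above can be sketched with a minimal topic-based broker. This is an illustrative toy, not the API of any Grid middleware; real pub/sub systems add the discovery, filtering, persistence, and reliability features mentioned above on top of this core idea:

```python
from collections import defaultdict

class Broker:
    """Minimal topic-based publish/subscribe broker (illustrative sketch)."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, data):
        # Publishers know nothing about subscribers; the broker decouples them.
        for cb in self._subs[topic]:
            cb(data)

broker = Broker()
received = []
broker.subscribe("node.status", received.append)  # subscriber registers interest
broker.publish("node.status", {"node": "compute-01", "state": "up"})
```

Note that the publisher never references the subscriber: adding more subscribers, or more publishers on the same topic, requires no changes to either side, which is what makes the model scale to many-to-many dissemination.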

A variety of monitoring and discovery services exist: ZenOSS, VMware vCloud, xCAT, MonALISA, and Inca, to name a few [12]. In the following, we review the features of Inca, as successfully implemented on the TeraGrid computing platform [13].

Inca Architecture and Features

Inca was developed at SDSC to provide the monitoring system for the TeraGrid portal. Inca is deployed on a wide variety of production Grids, such as TeraGrid, GEON, TEAM, University of California (UC Grid), ARCS, DEISA, NGS, and ZIH. Inca has also been used to monitor Open Source DataTurbine deployments on CREON and GLEON, as well as to execute and collect performance data from IPM-instrumented applications. Inca offers a variety of web status pages, from cumulative summaries to reporter execution details and result histories [see Fig. 1]. While other Grid monitoring tools provide system-level information on the utilization of Grid resources, the Inca system provides user-level Grid monitoring with periodic, automated, user-level testing of the software and services required to support Grid operation. Thus, Inca can be used by Grid operators, system administrators, and application users to identify, analyze, and troubleshoot user-level Grid failures, thereby improving Grid stability. User-level Grid monitoring provides Grid infrastructure testing and performance measurement from a generic, impartial user's perspective. The goal of user-level monitoring is to detect and fix Grid infrastructure problems before users notice them; user complaints should not be the first indication of Grid failures. A successful user-level Grid monitoring system needs to include the following features (cf. Inca technical reports from the Inca website):

• Runs from a standard user account in order to reflect regular user experiences.

• Executes with a standard user GSI credential mapped to a standard user account when tests or performance measurements require authentication to Grid services.

• Emulates a regular user by using tests and performance measurements created and configured based on user documentation, rather than on system administrator knowledge (of hostnames, ports, pathnames, etc.). In cases where documentation and tests are developed simultaneously during pre-production, test development should be closely coordinated with the documentation as it is written.

• Centrally manages the configuration of user-level tests or performance measurements in order to ensure consistent testing across resources.

• Easily updates and maintains user-level tests and performance measurements. This is important because tests and measurements are often updated when Grid infrastructure changes. Also, multiple iterations of test development are often required to determine whether a detected test failure stems from a faulty test, incomplete user documentation, or a failed Grid resource.

• Provides a representative indication of Grid status by testing documented user commands and individual Grid software components.

• Automates the periodic execution of user-level tests or performance measurements to understand Grid behavior over time.

• Executes locally on Grid resources to verify user accessible Grid access points. Executes from each resource to every other resource (all-to-all) to detect site-to-site configuration errors such as authentication problems.

The Inca implementation provides these features, resulting in a user-level Grid monitoring system.
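The all-to-all requirement in the last bullet above can be sketched as generating every ordered (source, destination) pair of resources; ordered pairs matter because problems such as broken authentication can be one-directional. The resource names below are hypothetical:

```python
from itertools import permutations

# Hypothetical Grid resources; each one tests every other one (all-to-all).
resources = ["sdsc", "ncsa", "psc"]

# Ordered pairs: ("sdsc", "ncsa") and ("ncsa", "sdsc") are distinct tests,
# since e.g. authentication may work in one direction but fail in the other.
test_pairs = list(permutations(resources, 2))
```

For n resources this yields n(n-1) tests, so the cost of all-to-all testing grows quadratically; this is one reason Inca lets administrators tune execution frequency to limit the impact on resources.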

Inca Features

Inca is a system that provides user-level monitoring of Grid functionality and performance. It was designed to be general, flexible, scalable, and secure, in addition to being easy to deploy and maintain. Inca benefits Grid operators who oversee the day-to-day operation of a Grid, system administrators who provide and manage resources, and users who run applications on a Grid.

The Inca system (taken from Inca user manual [15]):

1. Collects a wide variety of user-level monitoring results (from simple test data to more complex performance benchmark output).

2. Captures the context of a test or benchmark as it executes (e.g., executable name, inputs, source host, etc.) so that system administrators have enough information to understand the result and can troubleshoot system problems without having to know the internals of Inca.

3. Eases the process of writing tests or benchmarks and deploying them into Inca installations.

4. Provides means for sharing tests and benchmarks between Inca users.

5. Easily adapts to new resources and monitoring requirements in order to facilitate maintenance of a running Inca deployment.

6. Stores and archives monitoring results (especially any error messages) in order to understand the behavior of a Grid over time. The results are available through a flexible querying interface.

7. Securely manages short-term proxies for testing of Grid services using MyProxy.

8. Measures the system impact of tests and benchmarks executing on the monitored resources in order to tune their execution frequency and reduce the impact on resources as needed.

Figure 1 shows the architecture of Inca, which incorporates three core components (highlighted boxes): the agent, the depot, and the reporter managers. The agent and reporter managers coordinate the execution of tests and performance measurements on the Grid resources, and the depot stores and archives the results. The inputs to Inca are one or more reporter repositories, which contain user-level tests and benchmarks called reporters, and a configuration file describing how to execute them on the Grid resources. This configuration is normally created using a GUI administration tool called incat (Inca Administration Tool). The output, or results collected from the resources, is queried by data consumers and displayed to users. The following steps describe how an Inca administrator would deploy user-level tests and/or performance measurements to their resources.

1. The Inca administrator either writes reporters to monitor the user-level functionality and performance of their Grid or uses existing reporters in a published repository.

2. The Inca administrator creates a deployment configuration file that describes the user-level monitoring for their Grid using incat and submits it to the agent.

3. The agent fetches reporters from the reporter repository, creates a reporter manager on each resource, and sends the reporters and instructions for executing them to each reporter manager.

4. Each reporter manager executes reporters according to its schedule and sends data to the depot.

5. Data consumers display collected data by querying the depot.
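Step 4 above can be sketched as a loop in which a reporter manager runs each scheduled reporter and forwards the result to the depot. The reporters, the flat schedule, and the in-memory depot below are simplified stand-ins for Inca's real components (which use cron-like schedules and a depot server):

```python
# Hypothetical reporters: each one tests a single aspect of a resource.
def cpu_reporter():
    return {"name": "cpu_check", "body": "ok"}

def ssh_reporter():
    return {"name": "ssh_check", "body": "ok"}

# In Inca the agent sends each reporter manager its reporters and a schedule;
# here a flat list stands in for that per-resource schedule.
schedule = [cpu_reporter, ssh_reporter]

# Stand-in for the depot server that stores and archives reports.
depot = []

def run_cycle():
    """One scheduling cycle: execute each reporter and send data to the depot."""
    for reporter in schedule:
        depot.append(reporter())

run_cycle()
```

A data consumer (step 5) would then query the depot rather than the resources directly, which is what lets Inca archive results and show behavior over time.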

A reporter is an executable program that tests or measures some aspect of the system or installed software.

A report is the output of a reporter and is an XML document complying with the reporter schema.

A suite specifies a set of reporters to execute on selected resources, their configuration, and frequency of execution.

A reporter repository contains a collection of reporters and is available via a URL.

A reporter manager is responsible for managing the schedule and execution of reporters on a single resource.

An agent is a server that implements the configuration specified by the Inca administrator.

incat is a GUI used by the Inca administrator to control and configure the Inca deployment on a set of resources.

A depot is a server that is responsible for storing the data produced by reporters.

A data consumer is typically a web page client that queries a depot for data and displays it in a user-friendly format.
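To make the reporter/report distinction above concrete, here is a toy reporter that measures one aspect of the system (the running Python version) and emits its result as XML. The element names are a simplified illustration and do not follow the actual Inca reporter schema:

```python
import platform
from xml.etree import ElementTree as ET

def python_version_reporter():
    """Toy reporter: test one aspect of the system and emit the report as XML.

    The <report>/<name>/<body> layout is an assumed simplification,
    not the real Inca reporter schema.
    """
    root = ET.Element("report")
    ET.SubElement(root, "name").text = "python.version"
    ET.SubElement(root, "body").text = platform.python_version()
    return ET.tostring(root, encoding="unicode")

xml = python_version_reporter()
```

The reporter is the executable (`python_version_reporter`); the report is the XML string it produces, which a reporter manager would forward to the depot for storage and archiving.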

Fig. 1. Inca architecture (taken from Ref. [14]).

Fig. 2. Inca provides detailed test results useful both for users and for system administrators (taken from Ref. [14]).

Conclusions:

Inca has proved to be a highly successful monitoring service and is at the core of the TeraGrid facility. It detects Grid infrastructure problems by executing periodic, automated, user-level tests of Grid software and services. It emulates a Grid user by running tests under a standard user account, and its centralized test configuration ensures consistent testing across resources. Inca manages and collects a large number of results through a GUI interface (incat). It measures the resource usage of tests and benchmarks to help Inca administrators balance data freshness against system impact. Data is collected by reporters, executables that measure particular aspects of the system and output the results as XML. Multiple types of data can be collected, since Inca offers a number of prewritten test scripts (reporters) for monitoring Grid health, and the reporter APIs make it easy to create new Inca tests. By storing and archiving complete monitoring results, Inca allows system administrators to debug detected failures using archived execution details. Inca offers a variety of Grid data views, from cumulative summaries to reporter execution details and result histories. Inca components communicate using SSL, making it a secure monitoring service for DS service testing.

References:

[1] C. Huang, P. R. Hobson, G. A. Taylor, P. Kyberd, "A Study of Publish/Subscribe Systems for Real-Time Grid Monitoring," 2007 IEEE International Parallel and Distributed Processing Symposium, 2007, p. 360.

[2] http://news.cnet.com/New-life-for-Moores-Law/2009-1006_3-5672485.html

[3] S. C. Glotzer, B. Panoff, S. Lathrop, "Challenges and Opportunities in Preparing Students for Petascale Computational Science and Engineering," Computing in Science & Engineering, 2009, vol. 11, no. 5, pp. 22-27.

[4] https://www.teragrid.org/

[5] Y. Wei, M. B. Blake, "Service-Oriented Computing and Cloud Computing: Challenges and Opportunities," IEEE Internet Computing, vol. 14, no. 6, pp. 72-75, Nov./Dec. 2010, doi:10.1109/MIC.2010.147.

[6] Microsoft Azure: http://www.microsoft.com/windowsazure/

[7] Amazon cloud service: http://aws.amazon.com/ec2/

[8] N. Gershenfeld, R. Krikorian, D. Cohen, "The Internet of Things," Scientific American, vol. 291, pp. 76-81, October 2004.

[9] G. A. Taylor, M. R. Irving, P. R. Hobson, C. Huang, P. Kyberd, R. J. Taylor, "Distributed Monitoring and Control of Future Power Systems via Grid Computing," IEEE PES General Meeting 2006, Montreal, Quebec, Canada, 18-22 June 2006.

[10] http://www.globus.org/grid_software/monitoring

[11] A. Goscinski, M. Brock, "Toward dynamic and attribute based publication, discovery and selection for cloud computing," Future Generation Computer Systems, vol. 26, pp. 947-970, 2010.

[12] http://sites.google.com/site/cloudcomputingsystem/research/monitoring

[13] http://inca.sdsc.edu/drupal/

[14] S. Smallen, K. Ericson, J. Hayes, C. Olschanowsky, "User-level Grid Monitoring with Inca 2."

[15] http://inca.sdsc.edu/releases/2.5/guide/
