OxGrid, a campus grid for the University of Oxford
David C. H. Wallom, Anne E Trefethen
Oxford e-Research Centre, University of Oxford,
7 Keble Road, Oxford OX2 6NN
david.wallom@oerc.ox.ac.uk
Abstract
The volume of computationally and data intensive research in a leading university can only increase; the same cannot be said of funding, so it is essential that every penny of useful work is extracted from existing systems. The University of Oxford has therefore invested in creating a campus-wide grid. This connects all large-scale computational resources and provides a uniform access method for 'external' resources such as the National Grid Service (NGS) and the Oxford Supercomputing Centre (OSC).
The backbone of the campus grid is built from standard middleware and tools, but the value-added services, including resource brokering, user management and accounting, have been provided by software designed in-house.
Since the system first started in November 2005 we have attracted six significant users from five different departments. These use a mix of bespoke and licensed software and had run ~6300 jobs by the end of July 2006. All users currently have access to ~1000 CPUs, including NGS and OSC resources. With approximately two new users a week approaching the e-Research Centre, the current limitation on the rate of uptake is the amount of time that must be spent with each user to make their interactions as successful as possible.
1. Introduction
Within a leading university such as Oxford there may be as many as 30 separate clustered systems, purchased through a variety of sources such as grant income, donations and central university funding. It is becoming increasingly important that full use is made of these. The same is true not only of these specialist systems but of all ICT resources throughout an organisation, including shared desktop computers in teaching laboratories and personal systems on staff desks.
There are also a significant number of resources available either nationally or internationally, so it is important that the interfaces defined by these projects are supported within the infrastructure. This places a significant steer on the middleware chosen as the basis for the project, and it should be noted that this is true not only for computation but for data as well.
The other large impediment for a project of this type is the social interaction between departments, which may jealously guard their own resources, and the users from different groups who could make best use of them. This is especially true in a collegiate university, where possible resources are located within separate colleges as well as within the academic departments. This has therefore led to a large outreach effort, through talks and seminars given to a university-wide audience as well as by contacting all serial users of the OSC.
The design of each of the components of the campus grid will be discussed, showing their functional range as well as future plans to make further use of GGF standards.
2. Requirements
Before embarking on an exercise such as the construction of a campus-wide infrastructure, it is important that an initial set of minimum user requirements is considered. The most important requirement is that users' current methods of working must be affected as little as possible, i.e. they should be able to switch from working on their current systems to the campus grid with a seamless transition. The system design should be such that its configuration can be altered dynamically without interrupting users. This covers interruptions of service not only to the particular clusters that make up nodes on the grid but also to the central services; in that case the user may not be able to submit more tasks, but should be safe in the knowledge that the tasks they have already submitted will run uninterrupted. The final core requirement involves monitoring, since it is essential that once a user has submitted a job or stored a piece of data it is monitored until its lifetime has expired.
2.1 Data provision as well as computation
The provision of data services will become
increasingly important in the coming years. This
will be especially true for the move by the arts,
humanities and social sciences into e-Science, as
these subjects include studies that make extensive
use of data mining as a primary tool for research.
The data system must be able to take the following
factors into account:
• Steep increases in the volume of data as studies progress, including the new class of research data that is generated by computational grid work.
• Metadata to describe the properties of the data stored, the volume of which will be directly proportional to its quality.
• As the class of data stored changes from final post-analysis research data towards the raw data on which many different studies can be done, the need for replication and guaranteed quality of storage will increase.
Therefore a system is needed which allows a range of physical storage media to be added to a common core, presenting a uniform user interface. The actual storage must also be location independent, spread across as many different physical locations as possible.
3. The OxGrid System
The solution as finally designed is one where
individual users interact with all connected
resources through a central management system as
shown in Figure 1. This has been configured such
that as the usage of the system increases each
component can be upgraded so as to allow organic
growth of the available resources.
Figure 1: Schematic of the OxGrid system
3.1 Authorisation and Authentication
The projected users for the system can be split into
two distinct groups, those that want to use only
resources that exist within the university and those
that want access to external systems such as the
National Grid Service etc.
For those users requiring access to external systems, we are required to use standard UK e-Science digital certificates; this is a restriction brought about by the owners of those systems.
For those users that only access university
resources, a Kerberos Certificate Authority [3]
system connected to the central university
authentication system has been used. This has been
taken up by central computing services and will
therefore become a university wide service once
there is a significant enough user base.
3.2 Connected Systems
Each resource that is connected to the system has
to have a minimum software stack installed. The
middleware installed is the Virtual Data Toolkit
[4]. This includes the Globus 2.4 middleware [5]
with various patches that have been applied
through the developments of the European Data
Grid project [6].
3.3 Central Services
The central services of the campus grid may be
broken down as follows:
• Information Server
• Resource Broker
• Virtual Organisation Management
• Data Vault
These will all be described separately.
3.3.1 Information Server
The Information Server forms the basis of the campus grid, with all information about resources registered to it. It uses the Globus MDS 2.x system [7] to provide details of the type of each resource, including full system information, together with details of the installed scheduler and its associated queue system. Additional information is added using the GLUE schema [8].
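Because MDS 2.x exposes its information over LDAP, the contents of the information server can be inspected with any LDAP client. The following is a minimal sketch, assuming the Python ldap3 library, the conventional MDS port 2135 and base DN, and illustrative attribute names; it is not the query code used by the resource broker itself.

# Minimal sketch: query a Globus MDS 2.x GIIS over LDAP (assumes ldap3 is installed;
# the host name and attribute list are illustrative, not the OxGrid production values).
from ldap3 import Server, Connection, ALL

GIIS_HOST = "giis.oxgrid.example.ac.uk"   # hypothetical information server
BASE_DN = "Mds-Vo-name=local, o=grid"     # conventional MDS 2.x base DN

server = Server(GIIS_HOST, port=2135, get_info=ALL)
conn = Connection(server, auto_bind=True)  # GIIS queries are typically anonymous

# Ask for every host entry and a few attributes describing it.
conn.search(BASE_DN,
            "(objectClass=MdsHost)",
            attributes=["Mds-Host-hn", "Mds-Os-name", "Mds-Memory-Ram-sizeMB"])

for entry in conn.entries:
    print(entry.entry_dn)
    print(entry.entry_attributes_as_dict)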
3.3.2 Resource Broker
The Resource Broker (RB) is the core component
of the system and the one with which the users of
the system have the most interaction. Current
designs for a resource broker are either very
heavyweight with added functionality which is
unlikely to be needed by the users or have a
significant number of interdependencies which
could make the system difficult to maintain.
The basic required functionality of a resource broker is actually very simple and is listed below:
• Submit tasks to all connected resources,
• Automatically decide the most appropriate resource to distribute a task to,
• Depending on the requirements of the task, submit only to those resources which fulfil them,
• Depending on the user's list of registered systems, distribute tasks appropriately.
Using the Condor-G [9] system as a basis, an additional layer was built to interface the Condor Matchmaking system with MDS. This involved interactively querying the Grid Index Information Service to retrieve the information stored in the system-wide LDAP database, as well as extracting the user registration information and installed software from the Virtual Organisation Management system.
Other information, such as the systems to which a particular user is allowed to submit tasks, is also available from the VOM system and must therefore be queried whenever a task is submitted. The list of installed software is also retrieved at this point from the Virtual Organisation Manager database, as recorded when the resources were added.
The job advertisement generated for each resource is input into the Condor Matchmaking [10] system once every 5 minutes.
Figure 2: Resource Broker operation

3.3.2.1 The Generated Class Advertisement
The information passed into the resource class advertisement may be classified in three ways: that which would be included in a standard Condor machine class-ad, additional items that are needed for Condor-G grid operation, and items which we have added to give extra functionality. The third set is described here:
Requirements = (CurMatches < 20) && (TARGET.JobUniverse == 9)
This ensures that no more than a maximum number of submitted jobs can be matched to the resource at any one time, and that the resource will only accept Globus universe jobs.
CurMatches = 0
This is the number of currently matched jobs, as determined by the advertisement generator from MDS and Condor queue information every 5 minutes.
OpSys = "LINUX"
Arch = "INTEL"
Memory = 501
Information on the type of resource, as determined from the MDS information. It is important to note that a current limitation is that this describes the head node only and not the workers, so heterogeneous clusters cannot at present be used on the system.
MPI = False
INTEL_COMPILER=True
GCC3=True
Special capabilities of the resource need to be defined. In this case the cluster resource does not have MPI installed and so cannot accept parallel jobs.
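To illustrate the shape of these advertisements, the sketch below assembles a machine-style ClassAd for a resource from values of the kind held in MDS and the VOM, and hands it to condor_advertise. The attribute set, the gatekeeper_url name and the use of Python are illustrative assumptions rather than the actual OxGrid generator.

# Sketch only: build a Condor machine-style ClassAd for a grid resource and
# hand it to condor_advertise. Attribute names and values are illustrative.
import subprocess

def make_resource_ad(name, gatekeeper, cur_matches, opsys, arch, memory, software):
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        f'gatekeeper_url = "{gatekeeper}"',  # referenced by jobs via $$(gatekeeper_url)
        'Requirements = (CurMatches < 20) && (TARGET.JobUniverse == 9)',
        f'CurMatches = {cur_matches}',
        f'OpSys = "{opsys}"',
        f'Arch = "{arch}"',
        f'Memory = {memory}',
    ]
    # Capability flags taken from the VOM software list, e.g. MPI or licensed codes.
    lines += [f'{flag} = {str(value)}' for flag, value in software.items()]
    return "\n".join(lines) + "\n"

ad = make_resource_ad("cluster1.example.ox.ac.uk",
                      "cluster1.example.ox.ac.uk/jobmanager-pbs",
                      cur_matches=0, opsys="LINUX", arch="INTEL", memory=501,
                      software={"MPI": False, "GAUSSIAN03": True})

# Inject the ad into the collector (the generator does this every 5 minutes).
# With no file argument, condor_advertise reads the ClassAd from standard input.
subprocess.run(["condor_advertise", "UPDATE_STARTD_AD"],
               input=ad, text=True, check=True)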
3.3.2.2 Job Submission Script

As the standard user of the campus grid will normally only have submitted jobs to a simple scheduler, it is important that we can abstract users from the underlying grid system and resource broker.
The Condor-G system uses non-standard job description mechanisms, so a simpler and more efficient method has been implemented. Since most users were experienced with command-line executables that take arguments, this approach was preferred over designing a script-based submission system. It is intended, though, to alter this in version 2 so that users may also use the GGF JSDL [11] standard to describe their tasks.
The functionality of the system must be as follows:
• The user must be able to specify the name of the executable and whether it should be staged from the submit host or not,
• Any arguments that must be passed to this executable to run on the execution host,
• Any input files that are needed to run and so must be copied onto the execution host,
• Any output files that are generated and so must be copied back to the submission host,
• Any special requirements on the execution host, such as MPI, memory limits etc.,
• Optionally, so as to override the resource broker where necessary, the user should also be able to specify the gatekeeper URL of the resource they specifically want to run on. This is useful for testing etc.
It is also important that when a job is submitted the user does not get their task allocated by the resource broker onto a system to which they do not have access. The job submission script therefore queries the VOM system for the list of allowed systems for that user; this works by passing the DN to the VOM, retrieving a comma-separated list of systems and reformatting it into the form accepted by the resource broker.
job-submission-script -n 1 -e /usr/local/bin/rung03 -a test.com -i test.com -o test.log -r GAUSSIAN03 -g maxwalltime=10
This example runs a Gaussian job, which in the current configuration of the grid will, through the resource broker, only run on the Rutherford NGS node. This was a specific case developed to test capability matching.
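As a rough illustration of the translation that the wrapper performs, the sketch below turns arguments of the kind shown above into a Condor grid-universe submit description and restricts matching to the user's registered systems. The VOM lookup, attribute names and option handling are assumptions for illustration and not the real job submission script.

# Sketch: translate simple wrapper arguments into a Condor-G submit description.
# get_allowed_systems() stands in for the real VOM lookup described above.
def get_allowed_systems(dn):
    # hypothetical: would query the VOM and return its comma-separated list
    return ["ngs.rl.ac.uk", "cluster1.example.ox.ac.uk"]

def build_submit(dn, executable, args, infiles, outfiles, requirements):
    allowed = " || ".join(f'(Name == "{s}")' for s in get_allowed_systems(dn))
    return f"""universe      = grid
grid_resource = gt2 $$(gatekeeper_url)
executable    = {executable}
arguments     = {args}
transfer_input_files  = {",".join(infiles)}
transfer_output_files = {",".join(outfiles)}
requirements  = ({requirements}) && ({allowed})
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 1
"""

print(build_submit("/C=UK/O=eScience/OU=Oxford/L=OeRC/CN=example user",
                   "/usr/local/bin/rung03", "test.com",
                   ["test.com"], ["test.log"], "GAUSSIAN03 == True"))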
3.3.3 Virtual Organisation Manager
This is another example of a new solution being designed in-house because only over-complicated solutions are currently available. The functionality required is as follows:
• Add/remove a system to/from the list of available systems,
• List available systems,
• Add/remove a user to/from the list of users able to access general systems,
• List the users currently on the system,
• Add a user to the SRB system.
When removing users, though, it is important that their record is set as invalid rather than simply removed, so that system statistics are not disrupted.
So that attached resources can retrieve the list of stored distinguished names, we have also created an LDAP database from which the list can be pulled via the standard EDG MakeGridmap scripts distributed in VDT.
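A minimal sketch of what this gridmap generation amounts to is shown below; it assumes the DNs and local usernames can be read from the VOM LDAP directory (the attribute names are illustrative, not the real schema) and writes the standard grid-mapfile line format consumed by the Globus gatekeeper.

# Sketch: build grid-mapfile entries from the VOM LDAP directory.
# The host and LDAP attribute names below are assumptions, not the real VOM schema.
from ldap3 import Server, Connection

conn = Connection(Server("vom.oxgrid.example.ac.uk"), auto_bind=True)
conn.search("ou=users,o=oxgrid", "(objectClass=person)",
            attributes=["description", "uid"])   # description = certificate DN (assumed)

with open("grid-mapfile.new", "w") as fh:
    for entry in conn.entries:
        dn = str(entry.description)
        local_user = str(entry.uid)
        fh.write(f'"{dn}" {local_user}\n')        # standard grid-mapfile line format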
3.3.3.1 Storing the Data
The underlying functionality has been provided using a relational database. This has allowed the alteration of tables as and when additional functionality has become necessary. The decision was made to use the PostgreSQL relational database [12] due to its known performance and zero cost. It was also important at this stage to build in the extra functionality to support more than one virtual organisation. The design of the database is shown in Appendix 1.
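As an illustration of how the relational schema in Appendix 1 might be queried, the sketch below looks up the resources a given DN may use, together with the local username on each. It assumes the psycopg2 library and uses the table and column names of Appendix 1; the connection details and the exact join are illustrative.

# Sketch: list the systems a user may submit to, using the Appendix 1 schema.
# Connection parameters are placeholders; psycopg2 is assumed to be available.
import psycopg2

def allowed_resources(dn):
    conn = psycopg2.connect(dbname="vom", user="vomadmin", host="localhost")
    cur = conn.cursor()
    cur.execute("""
        SELECT s.resource_id, s.unix_user_name
          FROM vo_user u
          JOIN vo_user_resource_status s ON s.user_id = u.user_id
          JOIN vo_resource r             ON r.resource_id = s.resource_id
         WHERE u.dn = %s AND u.valid AND r.valid;
    """, (dn,))
    rows = cur.fetchall()
    conn.close()
    return rows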
When a user is added into the VOM system, this gives them access to the main computational resources through insertion of their DN into the LDAP and relational databases. It is also necessary that each user is added into the Storage Resource Broker [13] system for storage of their large output datasets.
3.3.3.2 Add, Remove and List Functions
Administration of the VOM system is through a secure web interface, with each available management function on a separate web page. The underlying mechanisms for the addition, removal and list functions are basically the same for both systems and users.
The information required for an individual user is:
• Name: The real name of the user.
• Distinguished Name (DN): This can either be a complete DN string as per a standard X.509 digital certificate or, if the Oxford Kerberos CA is used, just the user's Oxford username, in which case the DN is constructed automatically.
• Type: Within the VOM a user may be either an administrator or a user. This defines the level of control they have over the VOM system, i.e. whether they can alter its contents. In this way the addition of new system administrators into the system is automated.
• Which registered systems the user can use and their local username on each; this can be either a pool account or a real username.
An example of the interface is shown in Figure 3.

Figure 3: The interface to add a new user to the VOM
To register a system with the VOM the following information is needed:
• Name: Fully Qualified Domain Name.
• Type: Either a Cluster or a Central system; users will only be able to see Clusters.
• Administrator e-mail: for support queries.
• Installed scheduler: such as PBS, LSF, SGE or Condor.
• Maximum number of submitted jobs: gives the resource broker the value for 'CurMatches'.
• Installed software: a list of the installed licensed software, again passed on to the resource broker.
• Names of allowed users and their local usernames.
An example of the interface is shown in Figure 4.

Figure 4: The interface for system registration into the VOM
3.4 Resource Usage Service
Within a distributed system where different
components are owned by different organisations it
is becoming increasingly important that tasks run
on the system are accounted for properly.
3.4.1 Information Presented
As a base set of information the following should
be recorded:
• Start time
• End time
• Name of the resource the job ran on
• Local job ID
• Grid user identification
• Local user name
• Executable run
• Arguments passed to the executable
• Wall time
• CPU time
• Memory used
As well as these basic variables, an additional attribute has been set to account for the differing cost of a resource:
• Local resource cost
This is particularly useful for systems that have special software or hardware installed, and it can also be used to form the basis of a charging model for the system as a whole.
3.4.2 Recording Usage
There are two parts to the system: a client that can be installed onto each of the attached resources, and a server that records the information for presentation to system administrators and users. It was decided that it would be best to present the accounting information to the server on a task-by-task basis, so that the statistics are instantaneously correct should it be necessary to apply limits and quotas. The easiest way to achieve this is through a set of Perl library functions that attach to the standard job-managers distributed within VDT. These collate all of the information to be presented to the server and then call a 'cgi' script through a web interface on the server.
The server database is part of the VOM system
described in section 3.3.3.1. An extra table has
been added with each attribute corresponding to a
column and each recorded task is a new row.
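A minimal sketch of the client side of this exchange is given below: it gathers the per-task attributes listed in section 3.4.1 and posts them to the accounting 'cgi' script on the server. The URL, the field names and the use of Python (rather than the Perl routines actually hooked into the VDT job-managers) are assumptions for illustration only.

# Sketch of the per-task usage report sent from a resource to the accounting server.
# Endpoint and field names are illustrative; the real client is Perl attached to the
# VDT job-managers.
from urllib.parse import urlencode
from urllib.request import urlopen

record = {
    "resource": "cluster1.example.ox.ac.uk",
    "local_job_id": "12345.pbsserver",
    "grid_dn": "/C=UK/O=eScience/OU=Oxford/L=OeRC/CN=example user",
    "local_user": "oxgrid001",
    "executable": "/usr/local/bin/rung03",
    "arguments": "test.com",
    "start_time": "2006-07-01T09:00:00",
    "end_time": "2006-07-01T10:30:00",
    "wall_time": 5400,
    "cpu_time": 5100,
    "memory_mb": 420,
    "cost": 1.0,
}

# One HTTP call per completed task keeps the server-side statistics current.
with urlopen("https://accounting.oxgrid.example.ac.uk/cgi-bin/record_usage",
             data=urlencode(record).encode()) as resp:
    print(resp.status, resp.read().decode())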
3.4.3 Displaying Usage

There are three different methods of displaying the information that is stored about tasks run on OxGrid. The overall number of tasks per month is shown in Figure 5.

Figure 5: Total number of run tasks per month on all connected resources owned by Oxford, i.e. NOT including remote NGS core nodes or partners

This can then be split into totals per individual user and per individual connected system, as shown in Figure 6 and Figure 7.

Figure 6: Total number of submitted tasks per user

Figure 7: Total number of jobs as run on each connected system
3.5 Data Storage Vault
It was decided that the best method to create an interoperable data vault system would be to leverage work already undertaken within the UK e-Science community, and in particular by the NGS. The requirement is for a location-independent virtual filesystem, which can make use of the spare disk space that is inherent in modern systems. Initially, though, as a carrot to attract users, we have added a 1 TB RAID system onto which users' data can be stored. The storage system uses the Storage Resource Broker (SRB) from SDSC [13]. This has the added advantage of not only fulfilling the location-independence requirement but also of adding metadata to annotate stored data for improved data-mining capability.
The SRB system is also able to interface not only to plain text files stored within normal attached filesystems but also to relational databases. This will allow large data users within the university to make use of the system, as well as to install the interfaces necessary to attach their own databases with minimal additional work.
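From the user's point of view, interaction with the vault is through the standard SRB client tools; the sketch below simply wraps a few of the Scommands to stage a result file into a collection. The collection path is illustrative and a configured ~/.srb/.MdasEnv is assumed.

# Sketch: stage an output file into the SRB data vault using the standard Scommands.
# Assumes the SRB client tools are installed and ~/.srb/.MdasEnv is already configured.
import subprocess

def srb_put(local_file, collection):
    subprocess.run(["Sinit"], check=True)                    # start an SRB session
    subprocess.run(["Sput", local_file, collection], check=True)
    listing = subprocess.run(["Sls", collection],
                             check=True, capture_output=True, text=True)
    subprocess.run(["Sexit"], check=True)                    # close the session
    return listing.stdout

print(srb_put("test.log", "/ngs/home/example.oerc"))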
3.6 Attached Resources

When constructing the OxGrid system the greatest problem encountered was with the different resources that have been connected to it. These fell into the separate classes described below.

3.6.1 Teaching Lab Condor Pools

The largest single donation of systems into OxGrid has come from the Computing Services teaching lab systems. These are used during the day as Windows systems and have a dual-boot setup with a minimal Linux 2.4.x installation. The systems are rebooted into Linux at 2100 each night and then run as part of the pool until 0700, when they are restarted back into Windows. Problems have been encountered with the system imaging and control software used by OUCS and its Linux support; this has led to a significant reduction in the available capacity of this system until the problem is rectified. Since this system is also configured without a shared filesystem, several changes have been made to the standard Globus jobmanager scripts to ensure that Condor file transfer is used within the pool.

A second problem has been found recently with the discovery of a large number of systems within a department that use the Ubuntu Linux distribution, which is currently not supported by Condor. This has resulted in having to distribute by hand a set of base C libraries that solve issues with running the Condor installation.

3.6.2 Clustered Systems

Since there is significant experience within the OeRC with clustered systems, we have been asked on several occasions to assist with cluster upgrades before these resources could be added into the campus grid. This has illustrated the significant problems with the Beowulf approach to clustering, especially where very cheap hardware has been purchased, and has led on several occasions to the need to spend a significant amount of time installing operating systems, which in reality has little to do with the construction of a campus grid.

3.6.3 Sociological Issues

The biggest issue when approaching resource owners is always the initial reluctance to allow anyone but themselves to use resources they own. This is a general problem with academics all over the world and as such can only be rectified with careful consideration of their concerns and specific answers to the questions they have. This has resulted in a set of documentation that can be given to owners of clusters and Condor systems so that they can make informed choices on whether they want to donate resources or not.

The other issue that has had to be handled is communication with the staff responsible for departmental security. To address this we have produced a standard set of firewall requirements, which have generally been well received; by ensuring that all communication with departmental equipment goes through a single system for the submission of jobs, these concerns are kept manageable.

3.7 User Tools

In this section various tools are described that have been implemented to make user interaction with the campus grid easier.
3.7.1 oxgrid_certificate_import
One of the key problems within the UK e-Science community has always been that users have found interaction with their digital certificates difficult, particularly when dealing with the format required by the GSI infrastructure. It was therefore decided to produce an automated script. This is used to translate the certificate as it is retrieved from the UK e-Science CA, automatically save it in the correct location for GSI, and set the permissions and the passphrase used to verify the creation of proxy identities. This has been found to reduce the number of support calls about certificates, since these can cause problems which the average new user would be unable to diagnose. To ensure that the operation has completed successfully it also checks the creation of a proxy and prints its contents.
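The sketch below shows the kind of steps such a script performs: unpacking the browser-exported PKCS#12 file into the PEM pair expected by GSI, setting the permissions, and creating a test proxy. File locations follow the usual GSI conventions, but this is an illustrative reconstruction rather than the actual oxgrid_certificate_import script.

# Sketch of the certificate import steps: PKCS#12 -> PEM pair -> permissions -> test proxy.
# Follows the usual GSI layout (~/.globus/usercert.pem, userkey.pem); not the actual script.
import os, subprocess

def import_certificate(p12_file):
    globus_dir = os.path.expanduser("~/.globus")
    os.makedirs(globus_dir, mode=0o700, exist_ok=True)
    usercert = os.path.join(globus_dir, "usercert.pem")
    userkey = os.path.join(globus_dir, "userkey.pem")

    # Extract the certificate and private key exported by the browser
    # (openssl prompts for the import and key passphrases).
    subprocess.run(["openssl", "pkcs12", "-in", p12_file, "-clcerts", "-nokeys",
                    "-out", usercert], check=True)
    subprocess.run(["openssl", "pkcs12", "-in", p12_file, "-nocerts",
                    "-out", userkey], check=True)

    # GSI refuses keys that are readable by anyone else.
    os.chmod(usercert, 0o644)
    os.chmod(userkey, 0o400)

    # Confirm the import worked by creating and printing a short proxy.
    subprocess.run(["grid-proxy-init", "-valid", "1:00"], check=True)
    subprocess.run(["grid-proxy-info"], check=True)

import_certificate("certificate.p12")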
3.7.2 oxgrid_q and oxgrid_status
So that users can check their individually submitted tasks, we developed scripts that sit on top of the standard Condor commands for showing the job queue and the registered systems. By default both of these commands show only the jobs the user has submitted or the systems they are allowed to run jobs on, though both also have global arguments that allow all jobs and all registered systems to be viewed.
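A minimal sketch of this kind of wrapper is shown below: it restricts condor_q to the calling user by default and passes a flag through to show everything. The option name and behaviour are illustrative rather than those of the real oxgrid_q.

# Sketch of an oxgrid_q-style wrapper around condor_q (option name is illustrative).
import getpass, subprocess, sys

def oxgrid_q(show_all=False):
    cmd = ["condor_q"]
    if not show_all:
        cmd.append(getpass.getuser())   # restrict the listing to the caller's jobs
    subprocess.run(cmd, check=True)

oxgrid_q(show_all="--all" in sys.argv)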
3.7.3 oxgrid_cleanup
When a user has submitted many jobs and then discovers an error in their configuration, it is important in a large distributed system such as this that a tool exists for the easy removal not just of the underlying submitted tasks but also of the wrapper submitter script. This tool has therefore been created to remove the mother process as well as all of its submitted children.
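The sketch below captures the idea: remove the wrapper ('mother') job and then any jobs it spawned, identified here by a hypothetical ClassAd attribute recorded at submission time; the real oxgrid_cleanup may track its children differently.

# Sketch: remove a wrapper job and the grid jobs it submitted.
# "SubmitterClusterId" is a hypothetical attribute used to tag children at submit time.
import subprocess

def oxgrid_cleanup(mother_cluster_id):
    # Remove the mother (wrapper) job itself.
    subprocess.run(["condor_rm", str(mother_cluster_id)], check=True)
    # Remove every child job carrying the mother's id in its ClassAd.
    subprocess.run(["condor_rm", "-constraint",
                    f"SubmitterClusterId == {mother_cluster_id}"], check=True)

oxgrid_cleanup(1234)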
4. Users and Statistics
Currently there are 25 registered users of the system. They range from a set of students using the submission capabilities to reach the National Grid Service, to users who only want to use the local Oxford resources.
They have collectively run ~6300 tasks over the last six months using all of the available systems within the university (including the OSC), though not including the rest of the NGS.
The user base is currently dominated by those who had previously registered with the OSC or similar organisations. Through outreach efforts, though, this is being extended into the more data-intensive social sciences as their e-Science projects progress. Several whole projects have also registered their developers to make use of the large data vault capability, including the Integrative Biology Virtual Research Environment.
We have had several instances where we have asked for user input, and a sample of the responses is presented here.
“My work is the simulation of the quantum
dynamics of correlated electrons in a laser field.
OxGrid made serious computational power
easily available and was crucial for making the
simulating algorithm work.” Dr Dmitrii
Shalashilin (Theoretical Chemistry)
“The work I have done on OxGrid is on
molecular evolution of a large antigen gene
family in African trypanosomes. OeRC/OxGrid
has been key to my research and has allowed
me to complete within a few weeks calculations
which would have taken months to run on my
desktop.” Dr Jay Taylor (Statistics)
5. Costs
There are two ways that the campus grid can be valued: in comparison with a similarly sized cluster, or in terms of the increased throughput of the supercomputing system.
The 250 systems of the teaching lab cluster can be taken as the basis for costs. The electricity cost of leaving these systems turned on for 24 hours a day would be £7000, yet they produce 1.25M CPU hours of processing power, so the value produced by the resource is very high.
Since the introduction of the campus grid, the utilisation of the supercomputing centre for MPI tasks is considered to have increased by ~10%.
6. Conclusion
Since November 2005 the campus grid has connected systems from four different departments: Physics, Chemistry, Biochemistry and the University Computing Services. The resources located in the National Grid Service and the OSC have also been connected for registered users. User interaction with these physically dispersed resources is seamless when using the resource broker, and for those resources under the control of the OeRC accounting information has been saved for each task run. This shows that ~6300 jobs have been run on the Oxford-based components of the system (i.e. not including the wider NGS core nodes or partners).
An added advantage is that significant numbers of serial users from the Oxford Supercomputing Centre have moved to the campus grid, which has in turn increased the throughput of the OSC.
7. Acknowledgments
I would like to thank Prof. Paul Jeffreys and Dr. Anne Trefethen for their continued support in the startup and construction of the campus grid. I would also like to thank Dr. Jon Wakelin for his assistance in the design and implementation of some aspects of Version 1 of the underlying software, when the author and he were both staff members at the Centre for e-Research Bristol.
8. References
1. National Grid Service, http://www.ngs.ac.uk
2. Oxford Supercomputing Centre, http://www.osc.ox.ac.uk
3. Kerberos CA and kx509, http://www.citi.umich.edu/projects/kerb_pki/
4. Virtual Data Toolkit, http://www.cs.wisc.edu/vdt
5. Globus Toolkit, The Globus Alliance, http://www.globus.org
6. European Data Grid, http://eu-datagrid.web.cern.ch/eu-datagrid/
7. Czajkowski K., Fitzgerald S., Foster I. and Kesselman C., Grid Information Services for Distributed Resource Sharing, Proceedings of the Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Press, August 2001.
8. The Grid Laboratory Uniform Environment (GLUE) schema, http://www.hicb.org/glue/glue.htm
9. Frey J., Tannenbaum T., Livny M., Foster I. and Tuecke S., Condor-G: A Computation Management Agent for Multi-Institutional Grids, Cluster Computing, Volume 5, Issue 3, July 2002, pp. 237-246.
10. Raman R., Livny M. and Solomon M., Matchmaking: Distributed Resource Management for High Throughput Computing, Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC-7), 1998, p. 140.
11. Job Submission Description Language (JSDL) Specification, Version 1.0, GGF, https://forge.gridforum.org/projects/jsdlwg/document/draft-ggf-jsdl-spec/en/21
12. PostgreSQL, http://www.postgresql.org
13. Baru C., Moore R., Rajasekar A. and Wan M., The SDSC Storage Resource Broker, Proceedings of the 1998 Conference of the Centre for Advanced Studies on Collaborative Research (Toronto, Ontario, Canada, November 30 - December 3, 1998).
14. LHC Computing Grid Project, http://lcg.web.cern.ch/LCG/
Appendix 1
Table structure of the VOM database system.
Each table is listed with its columns; columns marked '(also unique)' carry a unique constraint.

VO_VIRTUALORGANISATION: VO_ID (also unique), NAME, DESCRIPTION

VO_USER: VO_ID, USER_ID (also unique), REAL_USER_NAME, DN, USER_TYPE, VALID

VO_RESOURCE: VO_ID, RESOURCE_ID (also unique), TYPE, JOBMANAGER, SUPPORT, VALID, MAXJOBS, SOFTWARE

VO_USER_RESOURCE_STATUS: VO_ID, USER_ID, RESOURCE_ID, UNIX_USER_NAME

VO_USER_RESOURCE_USAGE: VO_ID, JOB_NUMBER, USER_ID, JOB_ID, START_TIME, END_TIME, RESOURCE_ID, JOBMANAGER, EXECUTABLE, NUMNODES, CPU_TIME, WALL_TIME, MEMORY, VMEMORY, COST