High Throughput Computing Week
National e-Science Centre, Edinburgh, 27th – 30th November 2007
David Wallom
Introduction
The High-Throughput Computing (HTC) event hosted at NeSC spanned four days and was intended to
interest those who may benefit from HTC in their research or businesses and those who provide HTC for
their users. It covered the stages of recognising how to transform a task so that it can benefit from HTC,
through choosing technologies that deliver HTC, to providing cost-effective services that are convenient to
use.
With 54 registered participants, drawn from institutions already involved in e-Science and those not yet involved, as well as from institutions elsewhere in Europe, the breadth of participation was rightly judged a great success. We had speakers including Miron Livny of Condor and John Powers of Digipede representing software designed within the HTC space, as well as Jason Stowe of Cycle Computing and Akash Chopra of Barrie & Hibbert representing commercial suppliers and users of HTC. From the academic community we had presentations from several UK universities that have made institutional commitments to HTC installations, as well as a special presentation from Clemson University in the US, which has, through the installation of a Condor system, made a quantum leap in the availability of research computing resources to researchers within the university.
As each of the four days focused on a different aspect of High-Throughput Computing, delegates were able to register for individual days, which gave a different spread of participants on each day.
DAY 1: Example Solutions: presentations from users in academia and enterprise showing how HTC has transformed their work.
DAY 2: Technology comparison and training: two technology providers demonstrate how their systems would tackle the same problem.
DAY 3: Requirements gathering: researchers, applications developers and providers discuss their HTC requirements, including security, usability, energy efficiency and reliability.
DAY 4: The Future of HTC: users, service providers and technology providers discuss long-term roadmaps.
Day 1, Example Solutions provided by HTC
It was decided during planning for this meeting that the first day would allow a cross-over from the BELIEF workshop that had been held just before the HTC week, so that BELIEF participants could also be involved. The day was primarily intended to show, through academic and commercial participation, how HTC had been used in projects and how easy it is to port a new application.
The opening presentation was given by Miron Livny, describing HTC itself and how it can harness resources that would otherwise be wasted [1].
The talk started by considering what distributed computing is and how its true benefits can only be realised through the democratisation of access. To this end, the amount of compute power available needs to be considered using a different measure, which Miron coined the FLOPPY: the power of the system in FLOPS multiplied by 24 hours per day, 7 days a week, 52 weeks a year.
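To make the measure concrete, the short sketch below works through the arithmetic for a hypothetical pool; the machine count, per-machine performance and utilisation are invented for illustration and are not figures from the talk.

```python
# A rough sketch of the "FLOPPY"-style measure: floating-point operations
# actually delivered over a full year (24 hours x 7 days x 52 weeks) rather
# than peak FLOPS at an instant. All figures below are invented for
# illustration; they are not from the talk.

HOURS_PER_YEAR = 24 * 7 * 52  # 8736 hours

def flops_per_year(per_machine_gflops: float, n_machines: int, utilisation: float) -> float:
    """Floating-point operations delivered by a pool over one year."""
    sustained_flops = per_machine_gflops * 1e9 * n_machines * utilisation
    return sustained_flops * HOURS_PER_YEAR * 3600

# Hypothetical campus pool: 1000 desktops at 4 GFLOPS each, 60% of cycles harvested.
print(f"{flops_per_year(4.0, 1000, 0.6):.2e} floating-point operations per year")
```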
It is important, though, that we consider the production aspects of the systems that we build if this vision is to be realised. Condor has been under development for 20 years; as an example of its maturity, one commercial client has run billions of tasks on their system. Overall, though, the future of HTC is about people as much as the system: users, developers, administrators, accountants, operators, policy makers and educators; in short, everyone with a stake in successful output.
This was then followed by a presentation by one of the BELIEF participants, Prof. Antonio Mungioli, on "Grid Computing: A Teaching Discovery" [2]. The emphasis was on collaboration and co-operation. Other topics that were raised included the impact that IT makes on the environment.
Further presentations were then given by:
• Mark Calleja: The e-Minerals Project [3]
The e-Minerals project has been a long-term user of HTC through a collaboration between three UK universities, enabling the analysis of rock salt structures quickly and easily. They have built a parallel data grid system using SRB, which has led to changes in the job submission language and a custom tool to do the actual submissions. This has allowed bulk data upload and job submission to become a single-command operation, with all the output data also being described in XML. It has also led to the creation of a number of other tools providing simple functionality across the distributed resources. Overall, the eMinerals toolset empowers the user by integrating the compute, data and collaborative aspects of grid computing.
• Donna Lammie: Using the Condor system at Cardiff University [4]
As part of the Biophysics group, Donna performs detailed analysis of the outputs from experiments on protein structures using the COTS software packages DAMMIN and GASBOR. These need to be run many times to enable determination of a realistic protein shape. Using Condor has enabled runs that previously took more than 12 hours to be completed in less than an hour.
Support from the WeSC enabled development of a custom submission front end, greatly simplifying the process for the user and allowing quick and easy access to the large number of nodes on the Cardiff Condor system. This is an illustration of how, with appropriate local support, even users with little or no knowledge of HTC can get up and running easily and quickly.
• Barr von Oehsen: The HTC system at Clemson University [5]
Clemson University in the US has built up its computational resource from nothing since 2007. Since that time they have served over 1.9M CPU hours with relatively little explicit cost in staff or systems. Their system has 1085 Windows machines, with Condor reporting ~1700 slots; 845 are maintained by CCIT and 241 come from other campus departments. They are open to everyone in the Clemson University community. Applications installed include R, Maple, MATLAB and Maya. Having started with no users, the sheer number of users now working in fields such as industrial engineering, economics, bioengineering, biophysics and computational psychology has led to a nice problem: what to do with all the data. They are very interested in building CoLinux workers to allow more applications to use the system. They have also used this as an opportunity to join the Open Science Grid.
• Jason Stowe: Commercial use of Condor (Cycle Computing) [6]
Cycle Computing provide commercial access to Condor resources in an enterprise environment. As can be imagined, this leads to careful consideration of what the user interface to the system should be, since high standards are demanded. One example given was the Disney film "The Wild", which was rendered entirely on Condor. Other necessary features include easily accessible interfaces to low-level components such as Oracle and to higher-level business functions such as usage negotiation. He gave a detailed description of why they chose Condor, including most of the standard reasons such as lack of vendor lock-in. Their major business model, though, is for users that have large, loosely parallel, batch-submitted Monte Carlo simulations.
• Akash Chopra: Use of Digipede at Barrie & Hibbert [7]
A description was given of the use made of Digipede within B&H. Their main use is with an application for Economic Scenario Generation (ESG), a Monte Carlo simulator that calculates the distribution of variables into the future, enabling the modelling of risk for life insurers.
Since life insurer clients need to produce reports at the end of each quarter, usage of the ESG is bursty in nature, which means that having dedicated hardware for it is simply not economical. HTC also lends itself to the nature of these calculations by reducing sampling errors through the ability to run many "What if?" scenarios, redo mistakes, and so on.
The model definitions include stochastic "inner models" for each "outer model", e.g. liability values at times other than t=0; this is also known as "stochastic on stochastic" and is particularly common in the US market to satisfy regulators (see the sketch after this list).
Digipede was chosen since it is a .NET application, which avoids issues with the use of COM/DCOM; has good tutorials, examples and support; and provides several grid design patterns as standard. The operational model is that data must be sent to each agent on the grid once per job. There is a potentially large amount (gigabytes) of output data, and it became necessary to bypass the Digipede return mechanism for performance reasons.
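This nested structure is part of what makes the workload so well suited to HTC: each outer scenario carries its own independent batch of inner simulations, so outer scenarios can be farmed out as separate tasks. The sketch below is a deliberately toy illustration of that "stochastic on stochastic" shape; it is not B&H's ESG, the Digipede API, or any real economic model.

```python
# Minimal illustration of a nested ("stochastic on stochastic") Monte Carlo
# run. Each outer scenario would map naturally onto one HTC task; the model
# itself is a toy random walk, not an economic scenario generator.
import random
import statistics

def inner_valuation(outer_state: float, n_inner: int, rng: random.Random) -> float:
    """Estimate a liability value at t > 0 by simulating inner paths from outer_state."""
    payoffs = [max(outer_state + rng.gauss(0.0, 0.1), 0.0) for _ in range(n_inner)]
    return statistics.mean(payoffs)

def outer_scenario(seed: int, n_steps: int = 12, n_inner: int = 1000) -> float:
    """One outer path; in an HTC setting this function would be a single job."""
    rng = random.Random(seed)
    state = 1.0
    for _ in range(n_steps):
        state += rng.gauss(0.0, 0.05)
    return inner_valuation(state, n_inner, rng)

# Locally we just loop; on a grid each seed would be submitted as its own task.
results = [outer_scenario(seed) for seed in range(100)]
print(f"mean valuation across outer scenarios: {statistics.mean(results):.4f}")
```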
Day 2, HTC Technology comparison and training
Both researchers and providers need evidence on which to base their technical choices and ideas about how easy a technology is to use for a certain class of problems. The day was split in two, with presentations by John Powers [8] and Dan Ciruli from Digipede, after which there was a hands-on training session. This was then followed by a detailed technical question-and-answer session showing how, with the Digipede system, you are able to do real application-level integration of an HTC system within the Windows .NET framework. After the lunch break the Condor team, Todd Tannenbaum and Alain Roy, gave a presentation on the Condor system [9]. After this there was again a hands-on tutorial and question-and-answer session. The day was concluded by a short general question-and-answer session.
Day 3, HTC requirements gathering
The three communities of researchers using HTC, applications developers who exploit HTC, and providers who run HTC services gathered their requirements and identified those that most urgently need addressing. This discussion took place with HTC providers present, and the report it produces is intended to shape their road maps. The day concluded with an industrial evaluation of HTC experiences organised by Grid Computing Now!, followed by a panel on the cost-benefit analysis of using HTC in business-critical systems.
• Academic HTC Users
• Cardiff University: Condor as a Service, James Osborne [10]
A description of the Cardiff Condor environment was given, after which other considerations were discussed, including the green IT aspects. They have calculated that hibernation saves £60/computer/year, while using Condor on the system costs £30/computer, which works out significantly cheaper than a dedicated resource at £150/computer/year. Their Condor pool costs £50K/year for equipment, power and staff but provides a resource equivalent to a £500K computing system (a back-of-the-envelope sketch of this comparison appears after this list). They are also looking at the use of virtualisation, and a key issue for them is a wish to store a set of successful design patterns where they can be easily accessed.
• Cambridge University Campus Grid, Mark Calleja [11]
The Cambridge campus grid described is actually a set of inter-flocking Condor pools. These systems each belong to different departments that have agreed to work together, and each maintains its own user priority controls. Developments ongoing within the Cambridge e-Science Centre have included operating-system-level checkpointing using the Berkeley Lab Checkpoint/Restart (BLCR) kernel modules for Linux, which are able to restore resources that user-level checkpoint/restart cannot. There are still some issues within the whole system, principally the lack of a common shared file system. To get around this the Cambridge team are investigating the use of the Parrot user-level file system, which is one of the Condor suite of projects; there is an issue, though, as it can clash with BLCR.
Issues that they raised included the need for HTC to better protect resources on execute nodes, e.g. to limit local memory and hard-disk usage. They are very interested in virtualisation as a way of guaranteeing this.
• Reading University Campus Grid, Chris Chapman [12]
Reading are running their Condor pool on systems maintained by the central IT department. After initial trials using dual boot, a solution using virtualisation, namely coLinux, proved more suitable because the centrally managed machines could not be rebooted overnight for maintenance reasons; instead, a virtual instance of Linux is run under Windows. This solution was arrived at after extensive trials comparing VMware, Virtual PC and coLinux; coLinux suited their needs best (free, good performance) but requires a modified Linux kernel.
The systems operate on a private network, enabling easy separation between the host systems and the Linux Condor worker. Projects that are using the Condor system include:
• TRACK - feature detection, tracking and statistical analysis for meteorological and oceanographic data.
o Processing around 2000 jobs a day.
• Computing Mutual Information - face verification using Gabor wavelets.
o Condor cut processing from 105 days to 20 hours.
• Bayes Phylogenies - inferring phylogenetic trees using Bayesian Markov chain Monte Carlo (MCMC) or Metropolis-coupled Markov chain Monte Carlo (MCMCMC) methods.
• An implementation of MPI on Condor.
• Analysing chess moves.
• Discussion on issues of importance to the academic community
The overall topics that were labelled as important included:
• Working patterns
• Virtualisation
• Political cookbook
• Resource protection
• Data access
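The back-of-the-envelope sketch referred to in the Cardiff item above simply scales the quoted per-computer figures; the pool size used is an assumed round number for illustration, not a figure from the talk.

```python
# Back-of-the-envelope comparison using the per-computer figures quoted in
# the Cardiff talk. The pool size is an assumed round number for
# illustration only; it is not a figure from the talk.
HIBERNATE_SAVING_PER_YEAR = 60   # £ saved per computer per year by hibernation
CONDOR_COST_PER_COMPUTER = 30    # £ per computer to run Condor on existing desktops
DEDICATED_COST_PER_YEAR = 150    # £ per computer per year for a dedicated resource
ASSUMED_POOL_SIZE = 1000         # hypothetical number of harvested desktops

print(f"Condor on existing desktops: £{CONDOR_COST_PER_COMPUTER * ASSUMED_POOL_SIZE:,}")
print(f"Equivalent dedicated resource per year: £{DEDICATED_COST_PER_YEAR * ASSUMED_POOL_SIZE:,}")
print(f"Saving per computer versus dedicated kit: £{DEDICATED_COST_PER_YEAR - CONDOR_COST_PER_COMPUTER}")
```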
• Commercial HTC Users
• Cycle Computing, Jason Stowe [13]
Cycle Computing provides Condor resources as a professional service to any sector willing to pay for the utilisation. They have therefore had to develop several additional tools to provide up-to-date business information to users, as well as a more user-friendly interface. This includes Condor accounting groups, which are used by the company to provide policy and access control. The overall package provided includes compute and automated data management as well as consulting and training.
Working on Condor itself, they have found that small alterations to the configuration can improve negotiation performance by a factor of 20 and scheduling performance by a factor of 10. Important issues they raise include analysis, reporting and audit of Condor configuration changes and performance. Reliability is also key.
• The Digipede Network, Dan Ciruli (Digipede) [14]
Dan introduced the Digipede Network and illustrated how the use of SOA is desirable within the enterprise community. Grids have made it possible to build scalable collections of services, balancing CPU use for computation against local use on commodity hardware. In a three-tier model, this scales the application layer; network and database scaling are handled by other appropriate methods. To enable utilisation of the system by users with little or no experience of HTC, they have tools for both job design and submission, which look very much like any other Windows applications with which users will be familiar.
Currently there are a number of different large users of the system, examples of which are the Novartis Foundation and Pacific events (who previously produced PDFs for reports from a database; PDF generation didn't scale, so they used Digipede). Other users include two projects using grids for preprocessing, along the lines of Google Earth. Key issues raised include usability!
• Requirements from Barrie & Hibbert, Akash Chopra
The most important point of note is that, as a service provider, B&H is completely customer-driven. They therefore need flexibility from vendors, both in terms of rapid expansion and in usage in general. As a provider they also need to be able to control scheduling configuration from within applications: the scenario is that B&H is selling a wrapped version of Digipede and may want to expose some scheduling options to their customers without those customers becoming Digipede experts. The big issue that was raised again was data management and its need for flexibility.
Enterprise requirements for HTC
It was noted that flexibility and simplicity are both important and often in conflict. This affects usability from the point of view of the two different types of user: the end users, who see submit pages and the current status of their jobs, and the IT department, who want to control the configuration.
• Simplicity for end user
• Ease of integration (end use app into HTC app, IT manager deployment)
• Scalability (to maintain QoS)
• Reliability
• Defined competitive advantage
• Stable, easy to use, interfaces
• Flexibility
• Support (possibly 24/7)
Digipede are focusing hard on ease of integration, by concentrating on .NET, integrating with Dev. Studio, etc. This has led to an explicit decision not to address multi-platform support, VOs, etc. A similar tool is being developed in Java by a company called GridGain.
On the differences between academia and enterprise: enterprise users have a more defined idea of what they term the Grid, whilst academia are more interested in crossing administrative boundaries and virtual organisations. The issue of standards was also raised; it should be noted that the vendors are not too interested until users demand them, as they do not want to operate to the lowest common denominator, and there was also some feeling that the area is too immature.
Day 4, The Future of HTC
Users, service providers and technology providers discussed their roadmaps. Service providers looked at common policy issues. Technology providers were encouraged to show how their roadmaps address the prioritised researcher and service-provider requirements. Service providers were encouraged to share their road maps for rolling out new services. Researchers were expected to compare these plans with their developing exploitations of HTC. Topics developed during the week and integrated during these discussions included:
1. Data handling - good practice: continuity between desktop & HTC, metadata and provenance, copying data, global file systems.
2. Monitoring, security and accounting.
3. Policy issues and harmonisation.
4. Campus Grids document.
5. Environmental issues - "green computing".
6. Interoperability and portability - GridSAM and SAGA.
The day was opened by presentations on new and exciting ways that the Condor HTC product has been
adopted by various organisations, raising the profile of HTC.
• Condor Usage on Blue Gene, Todd Tannenbaum [15]
Recently there has been a move, working with IBM, to port the Condor application onto the Blue Gene system. This way they are able to share HTC and HPC on the same resource, with the significant benefit that the system can be kept full using HTC jobs where necessary via the backfill functionality. Also, Blue Gene has low power consumption, which makes for greener computing. There have been several significant challenges to getting Condor working on Blue Gene:
1) Linux on the compute node is a very lightweight version, designed to run a single application. IBM made some changes so that each launcher thread loops (effectively rebooting after each launch).
2) In HPC mode, a single failure of a node in a partition stopped the whole partition. IBM changed this behaviour for HPC mode partitions.
Currently there is a prototype Condor running on Blue Gene in HTC mode, with the Condor master running on the Blue Gene I/O node instead of the compute node. The mid-term goal is to have Condor support HPC workloads as well, with longer-term plans focusing on I/O: HTC mode changes the I/O patterns compared to HPC mode, and Blue Gene currently optimises MPI-style I/O. Application case studies currently being looked at include financial risk calculations and molecular drug docking.
• The Open Science Grid, Alain Roy [16]
The OSG started off as the US equivalent of GridPP. It currently has approximately 70 sites and runs 130K jobs/day, of which 86% succeed. The system now supports a number of different virtual organisations, from the standard HEP VOs to others such as nanoHUB, LIGO, molecular dynamics and maths.
An example application working on the DZero experiment at Fermilab was given; this involved transferring 90TB of input data each time (easier than pre-staging, apparently). Overall the system uses the VDT (Condor, Globus, etc.) and performs no software development beyond integration.
Users can make use of the VDT compilation facilities to build their own applications. This facility is a suite of Condor systems to which build jobs can be submitted; a very similar thing is operated by the OMII.
• Red Hat and Condor, Alain Roy [17]
Red Hat has joined with the Condor team to introduce the Condor system as a supported and fully installed part of their enterprise-ready operating systems. This is a full partnership with two-way development and code exchanges.
Red Hat is hiring 12 engineers in Madison to work alongside the Condor team. Their key development objectives are tighter kernel integration, creating and packaging Condor components for Linux distributions, and transfer of skills in virtualisation and messaging. The Condor team already have a list of requests for kernel changes which will be going into newer releases directly. Condor will remain cross-platform (though working on Linux may have advantages). Red Hat will provide technical support and engage with community groups.
Conclusion
Overall there were a number of themes that came out during the meeting that spanned both the academic and enterprise communities. It was noted that this is an artificial separation in reality, since both want reliable, easy-to-use and efficient HTC! A key point of note is the need for the business models that work within the organisations represented here to be disseminated, so that others may learn and benefit from their experiences.
The other key issue is data handling. With the proliferation of systems over the next few years, in terms of available cores and so on, the volume of data that these HTC systems are able to produce can only multiply significantly. These systems must start to consider seriously how they internally handle and externally present data, so that users are not left with the headache of handling it.
Presentations
1. http://www.nesc.ac.uk/talks/831/1.1-What%20is%20HTC-Livny.ppt
2. http://www.nesc.ac.uk/talks/831/1.2-Mungioli%20_ASRM-Grid%20Computing%20re%20formated.ppt
3. http://www.nesc.ac.uk/talks/831/1.3-eMinerals-Mark_Calleja.ppt
4. http://www.nesc.ac.uk/talks/831/1.4-Cardiff%20-%20Donna%20Lammie.ppt
5. http://www.nesc.ac.uk/talks/831/1.5-Clemson-Barr_von_Oehsen.zip (Caution: this is a 28MB zip file with demo videos etc.)
6. http://www.nesc.ac.uk/talks/831/1.6-Enterprise_UseCases-stowe.ppt
7. http://www.nesc.ac.uk/talks/831/1.7-Barrie-Hibbet_Akash.ppt
8. http://www.nesc.ac.uk/talks/831/2.1-Age%20of%20Computational%20Abundance-Powers.ppt
9. http://www.nesc.ac.uk/talks/831/2.4-tannenbaum_condor_htc_week_2007.ppt
10. http://www.nesc.ac.uk/talks/831/3.1-071129%20-%20HTC_Osborne.ppt
11. http://www.nesc.ac.uk/talks/831/3.2-CamGrid.ppt
12. http://www.nesc.ac.uk/talks/831/3.3-Reading-CampusGrid_Chapman.ppt
13. http://www.nesc.ac.uk/talks/831/3.5-EnterpriseRequirements-stowe.ppt
14. http://www.nesc.ac.uk/talks/831/3.6-HTC%20Week%20Digipede%20session-Curilli.ppt
15. http://www.nesc.ac.uk/talks/831/4.2-BlueGene_htc_week-Tannenbaum.ppt
16. http://www.nesc.ac.uk/talks/831/4.1-OSGForHTCWeek2007-Roy.ppt
17. http://www.nesc.ac.uk/talks/831/4.3-redhat-htc-week-Tannenbaum.ppt
Acknowledgements
The organisers would like to thank the e-SciNet project, Grid Computing Now! and the National e-Science Centre for their generous support.