High Throughput Computing Week
National e-Science Centre, Edinburgh, 27th – 30th November 2007
David Wallom
Introduction
The High-Throughput Computing (HTC) event hosted at NeSC spanned four days and was intended to
interest those who may benefit from HTC in their research or businesses and those who provide HTC for
their users. It covered the stages of recognising how to transform a task so that it can benefit from HTC,
through choosing technologies that deliver HTC, to providing cost-effective services that are convenient to
use.
With 54 registered participants, drawn from institutions already involved in e-Science and those not yet involved, as well as from institutions elsewhere in Europe, the breadth of participation was rightly judged a great success. We had speakers including Miron Livny of Condor and John Powers of Digipede representing software designed within the HTC space, as well as Jason Stowe of Cycle Computing and Akash Chopra of Barrie & Hibbert representing commercial suppliers and users of HTC. From the academic community we had presentations from several UK universities that have made institutional commitments to HTC installations, as well as a special presentation from Clemson University in the US, which has, through the installation of a Condor system, made a quantum leap in the availability of research computing resources to researchers within the university.
As each of the four days focused on a different aspect of High-Throughput Computing, delegates were able to register for individual days, which gave a different spread of participants on each day.
DAY 1: Example Solutions: presentations from users in academia and enterprise showing how HTC has transformed their work.
DAY 2: Technology comparison and training: two technology providers demonstrate how their systems would tackle the same problem.
DAY 3: Requirements gathering: researchers, applications developers and providers discuss their HTC requirements, including security, usability, energy efficiency and reliability.
DAY 4: The Future of HTC: users, service providers and technology providers discuss long-term roadmaps.
Day 1, Example Solutions provided by HTC
It was decided during planning for this meeting that the first day would allow a cross-over from the BELIEF workshop that had been held just before the HTC week, so that BELIEF participants could also be involved. The day was primarily intended to show, through academic and commercial participation, how HTC had been used in projects and how easy it is to port a new application.
The opening presentation was given by Miron Livny, describing HTC itself and how it can harness resources that would otherwise be wasted [1].
The talk started by considering what distributed computing is and how its true benefits can only be realised through the democratisation of access. To this end, the amount of compute power available needs to be considered using a different measure, which Miron coined the FLOPPY: the power of the system in FLOPS multiplied by 24 hours per day, 7 days a week, 52 weeks a year.
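To make the measure concrete, the short sketch below works through the arithmetic for a hypothetical pool; the machine count, per-machine performance and utilisation are invented for illustration and are not figures from the talk.

```python
# A rough sketch of the "FLOPPY"-style measure: floating-point operations
# actually delivered over a full year (24 hours x 7 days x 52 weeks) rather
# than peak FLOPS at an instant. All figures below are invented for
# illustration; they are not from the talk.

HOURS_PER_YEAR = 24 * 7 * 52  # 8736 hours

def flops_per_year(per_machine_gflops: float, n_machines: int, utilisation: float) -> float:
    """Floating-point operations delivered by a pool over one year."""
    sustained_flops = per_machine_gflops * 1e9 * n_machines * utilisation
    return sustained_flops * HOURS_PER_YEAR * 3600

# Hypothetical campus pool: 1000 desktops at 4 GFLOPS each, 60% of cycles harvested.
print(f"{flops_per_year(4.0, 1000, 0.6):.2e} floating-point operations per year")
```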
It is important, though, that we consider the production aspects of the systems that we build if this vision is to be realised. Condor has been under development for 20 years; as an example of its maturity, one commercial client has run billions of tasks on their system. Overall, though, the future of HTC is about people as much as the system: users, developers, administrators, accountants, operators, policy makers and educators; in short, everyone with a stake in successful output.
This was then followed by a presentation by one of the BELIEF participants, Prof. Antonio Mungioli, on "Grid Computing: A Teaching Discovery" [2]. The emphasis was on collaboration and co-operation. Other topics that were raised included the impact that IT makes on the environment.
Further presentations were then given by:
• Mark Calleja: The e-Minerals Project [3]
The e-Minerals project has been a long-term user of HTC through a collaboration between three UK universities, enabling the analysis of rock salt structures quickly and easily. They have built a parallel data grid system using SRB, which has led to changes in the job submission language and a custom tool to do the actual submissions. This has allowed bulk data upload and job submission to become a single-command operation, with all the output data also being described in XML. It has also led to the creation of a number of other tools providing simple functionality across the distributed resources. Overall, the eMinerals toolset empowers the user by integrating the compute, data and collaborative aspects of grid computing.
• Donna Lammie: Using the Condor system at Cardiff University [4]
As part of the Biophysics group, Donna performs detailed analysis of the outputs from experiments on protein structures using the COTS software packages DAMMIN and GASBOR. These need to be run many times to enable determination of a realistic protein shape. Using Condor has enabled runs that previously took more than 12 hours to be completed in less than an hour.
Support from the WeSC enabled development of a custom submission front end, greatly simplifying the process for the user and allowing quick and easy access to the large number of nodes on the Cardiff Condor system. This is an illustration of how, with appropriate local support, even users with little or no knowledge of HTC can get up and running easily and quickly.
• Barr von Oehsen: The HTC system at Clemson University [5]
Clemson University in the US has built up its computational resource from nothing since 2007. Since that time they have served over 1.9M CPU hours with relatively little explicit cost in staff or systems. Their system has 1085 Windows machines, with Condor reporting ~1700 slots; 845 are maintained by CCIT and 241 come from other campus departments. They are open to everyone in the Clemson University community. Applications installed include R, Maple, MATLAB and Maya. Having started with no users, the sheer number of users now working in fields such as industrial engineering, economics, bioengineering, biophysics and computational psychology has led to a nice problem: what to do with all the data. They are very interested in building CoLinux workers to allow more applications to use the system. They have also used this as an opportunity to join the Open Science Grid.
• Jason Stowe: Commercial use of Condor (Cycle Computing) [6]
Cycle Computing provide commercial access to Condor resources in an enterprise environment. As can be imagined, this leads to careful consideration of what the user interface to the system should be, since high standards are demanded. One example given was the Disney film "The Wild", which was rendered entirely on Condor. Other necessary features include easily accessible interfaces to low-level components such as Oracle and to higher-level business functions such as usage negotiation. He gave a detailed description of why they chose Condor, including most of the standard reasons such as lack of vendor lock-in. Their major business model, though, is for users that have large, loosely parallel, batch-submitted Monte Carlo simulations.
• Akash Chopra: Use of Digipede at Barrie & Hibbert [7]
A description was given of the use made of Digipede within B&H. Their main use is with an application for Economic Scenario Generation (ESG), a Monte Carlo simulator that calculates the distribution of variables into the future, enabling the modelling of risk for life insurers.
Since life insurer clients need to produce reports at the end of each quarter, usage of the ESG is bursty in nature, which means that having dedicated hardware for it is simply not economical. HTC also lends itself to the nature of these calculations by reducing sampling errors through the ability to run many "What if?" scenarios, redo mistakes, and so on.
The model definitions include stochastic "inner models" for each "outer model", e.g. liability values at times other than t=0; this is also known as "stochastic on stochastic" and is particularly common in the US market to satisfy regulators (see the sketch after this list).
Digipede was chosen since it is a .NET application, which avoids issues with the use of COM/DCOM; has good tutorials, examples and support; and provides several grid design patterns as standard. The operational model is that data must be sent to each agent on the grid once per job. There is a potentially large amount (gigabytes) of output data, and it became necessary to bypass the Digipede return mechanism for performance reasons.
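This nested structure is part of what makes the workload so well suited to HTC: each outer scenario carries its own independent batch of inner simulations, so outer scenarios can be farmed out as separate tasks. The sketch below is a deliberately toy illustration of that "stochastic on stochastic" shape; it is not B&H's ESG, the Digipede API, or any real economic model.

```python
# Minimal illustration of a nested ("stochastic on stochastic") Monte Carlo
# run. Each outer scenario would map naturally onto one HTC task; the model
# itself is a toy random walk, not an economic scenario generator.
import random
import statistics

def inner_valuation(outer_state: float, n_inner: int, rng: random.Random) -> float:
    """Estimate a liability value at t > 0 by simulating inner paths from outer_state."""
    payoffs = [max(outer_state + rng.gauss(0.0, 0.1), 0.0) for _ in range(n_inner)]
    return statistics.mean(payoffs)

def outer_scenario(seed: int, n_steps: int = 12, n_inner: int = 1000) -> float:
    """One outer path; in an HTC setting this function would be a single job."""
    rng = random.Random(seed)
    state = 1.0
    for _ in range(n_steps):
        state += rng.gauss(0.0, 0.05)
    return inner_valuation(state, n_inner, rng)

# Locally we just loop; on a grid each seed would be submitted as its own task.
results = [outer_scenario(seed) for seed in range(100)]
print(f"mean valuation across outer scenarios: {statistics.mean(results):.4f}")
```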
Day 2, HTC Technology comparison and training
Both researchers and providers need evidence on which to base their technical choices and ideas about how easy a technology is to use for a certain class of problems. The day was split in two, with presentations by John Powers [8] and Dan Ciruli from Digipede, after which there was a hands-on training session. This was then followed by a detailed technical question-and-answer session showing how, with the Digipede system, you are able to do real application-level integration of an HTC system within the Windows .NET framework. After the lunch break the Condor team, Todd Tannenbaum and Alain Roy, gave a presentation on the Condor system [9]. After this there was again a hands-on tutorial and question-and-answer session. The day was concluded by a short general question-and-answer session.
Day 3, HTC requirements gathering
The three communities of researchers using HTC, applications developers who exploit HTC, and providers who run HTC services gathered their requirements and identified those that most urgently need addressing. This discussion took place with HTC providers present, and the report it produces is intended to shape their road maps. The day concluded with an industrial evaluation of HTC experiences organised by Grid Computing Now!, followed by a panel on the cost-benefit analysis of using HTC in business-critical systems.
• Academic HTC Users
• Cardiff University: Condor as a Service, James Osborne [10]
A description of the Cardiff Condor environment was given, after which other considerations were discussed, including the green IT aspects. They have calculated that hibernation saves £60/computer/year, while using Condor on the system costs £30/computer, which works out significantly cheaper than a dedicated resource at £150/computer/year. Their Condor pool costs £50K/year for equipment, power and staff but provides a resource equivalent to a £500K computing system (a back-of-the-envelope sketch of this comparison appears after this list). They are also looking at the use of virtualisation, and a key issue for them is a wish to store a set of successful design patterns where they can be easily accessed.
• Cambridge University Campus Grid, Mark Calleja [11]
The Cambridge campus grid described is actually a set of inter-flocking Condor pools. These systems each belong to different departments that have agreed to work together, and each maintains its own user priority controls. Developments ongoing within the Cambridge e-Science Centre have included operating-system-level checkpointing using the Berkeley Lab Checkpoint/Restart (BLCR) kernel modules for Linux, which are able to restore resources that user-level checkpoint/restart cannot. There are still some issues within the whole system, principally the lack of a common shared file system. To get around this the Cambridge team are investigating the use of the Parrot user-level file system, which is one of the Condor suite of projects; there is an issue, though, as it can clash with BLCR.
Issues that they raised included the need for HTC to better protect resources on execute nodes, e.g. to limit local memory and hard-disk usage. They are very interested in virtualisation as a way of guaranteeing this.
• Reading University Campus Grid, Chris Chapman [12]
Reading are running their Condor pool on systems maintained by the central IT department. After initial trials using dual boot, a solution using virtualisation, namely coLinux, proved more suitable because the centrally managed machines could not be rebooted overnight for maintenance reasons; instead, a virtual instance of Linux is run under Windows. This solution was arrived at after extensive trials comparing VMware, Virtual PC and coLinux; coLinux suited their needs best (free, good performance) but requires a modified Linux kernel.
The systems operate on a private network, enabling easy separation between the host systems and the Linux Condor worker. Projects that are using the Condor system include:
• TRACK - feature detection, tracking and statistical analysis for meteorological and oceanographic data.
o Processing around 2000 jobs a day.
• Computing Mutual Information - face verification using Gabor wavelets.
o Condor cut processing from 105 days to 20 hours.
• Bayes Phylogenies - inferring phylogenetic trees using Bayesian Markov chain Monte Carlo (MCMC) or Metropolis-coupled Markov chain Monte Carlo (MCMCMC) methods.
• An implementation of MPI on Condor.
• Analysing chess moves.
• Discussion on issues of importance to the academic community
The overall topics that were labelled as important included:
• Working patterns
• Virtualisation
• Political cookbook
• Resource protection
• Data access
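The back-of-the-envelope sketch referred to in the Cardiff item above simply scales the quoted per-computer figures; the pool size used is an assumed round number for illustration, not a figure from the talk.

```python
# Back-of-the-envelope comparison using the per-computer figures quoted in
# the Cardiff talk. The pool size is an assumed round number for
# illustration only; it is not a figure from the talk.
HIBERNATE_SAVING_PER_YEAR = 60   # £ saved per computer per year by hibernation
CONDOR_COST_PER_COMPUTER = 30    # £ per computer to run Condor on existing desktops
DEDICATED_COST_PER_YEAR = 150    # £ per computer per year for a dedicated resource
ASSUMED_POOL_SIZE = 1000         # hypothetical number of harvested desktops

print(f"Condor on existing desktops: £{CONDOR_COST_PER_COMPUTER * ASSUMED_POOL_SIZE:,}")
print(f"Equivalent dedicated resource per year: £{DEDICATED_COST_PER_YEAR * ASSUMED_POOL_SIZE:,}")
print(f"Saving per computer versus dedicated kit: £{DEDICATED_COST_PER_YEAR - CONDOR_COST_PER_COMPUTER}")
```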
• Commercial HTC Users
• Cycle Computing, Jason Stowe [13]
Cycle Computing provides Condor resources as a professional service to any sector willing to pay for the utilisation. They have therefore had to develop several additional tools to provide up-to-date business information to users, as well as a more user-friendly interface. This includes Condor accounting groups, which are used by the company to provide policy and access control. The overall package provided includes compute and automated data management as well as consulting and training.
Working on Condor itself, they have found that small alterations to the configuration can improve negotiation performance by a factor of 20 and scheduling performance by a factor of 10. Important issues they raise include analysis, reporting and audit of Condor configuration changes and performance. Reliability is also key.
• The Digipede Network, Dan Ciruli (Digipede) [14]
Dan introduced the Digipede Network and illustrated how the use of SOA is desirable within the enterprise community. Grids have made it possible to build scalable collections of services, balancing CPU use for computation against local use on commodity hardware. In a three-tier model, this scales the application layer; network and database scaling are handled by other appropriate methods. To enable utilisation of the system by users with little or no experience of HTC, they have tools for both job design and submission, which look very much like any other Windows applications with which users will be familiar.
Currently there are a number of different large users of the system, examples of which are the Novartis Foundation and Pacific events (who previously produced PDFs for reports from a database; PDF generation didn't scale, so they used Digipede). Other users include two projects using grids for preprocessing, along the lines of Google Earth. Key issues raised include usability!
• Requirements from Barrie & Hibbert, Akash Chopra
The most important point of note is that, as a service provider, B&H is completely customer-driven. They therefore need flexibility from vendors, both in terms of rapid expansion and in usage in general. As a provider they also need to be able to control scheduling configuration from within applications: the scenario is that B&H is selling a wrapped version of Digipede and may want to expose some scheduling options to their customers without those customers becoming Digipede experts. The big issue that was raised again was data management and its need for flexibility.
Enterprise requirements for HTC
It was noted that flexibility and simplicity are both important and often in conflict. This affects usability from the point of view of the two different types of user: the end users, who see submit pages and the current status of their jobs, and the IT department, who want to control the configuration.
• Simplicity for end user
• Ease of integration (end use app into HTC app, IT manager deployment)
• Scalability (to maintain QoS)
• Reliability
• Defined competitive advantage
• Stable, easy to use, interfaces
• Flexibility
• Support (possibly 24/7)
Digipede are focusing hard on ease of integration, by concentrating on .NET, integrating with Dev. Studio, etc. This has led to an explicit decision not to address multi-platform support, VOs, etc. A similar tool is being developed in Java by a company called GridGain.
On the differences between academia and enterprise: enterprise users have a more defined idea of what they term the Grid, whilst academia are more interested in crossing administrative boundaries and virtual organisations. The issue of standards was also raised; it should be noted that the vendors are not too interested until users demand them, as they do not want to operate to the lowest common denominator, and there was also some feeling that the area is too immature.
Day 4, The Future of HTC
Users, service providers and technology providers discussed their roadmaps. Service providers looked at common policy issues. Technology providers were encouraged to show how their roadmaps address the prioritised researcher and service-provider requirements. Service providers were encouraged to share their road maps for rolling out new services. Researchers were expected to compare these plans with their developing exploitations of HTC. Topics developed during the week and integrated during these discussions included:
1. Data handling - good practice: continuity between desktop & HTC, metadata and provenance, copying data, global file systems.
2. Monitoring, security and accounting.
3. Policy issues and harmonisation.
4. Campus Grids document.
5. Environmental issues - "green computing".
6. Interoperability and portability - GridSAM and SAGA.
The day was opened by presentations on new and exciting ways that the Condor HTC product has been
adopted by various organisations, raising the profile of HTC.
• Condor Usage on Blue Gene, Todd Tannenbaum [15]
Recently there has been a move, working with IBM, to port the Condor application onto the Blue Gene system. This way they are able to share HTC and HPC on the same resource, with the significant benefit that the system can be kept full using HTC jobs where necessary via the backfill functionality. Also, Blue Gene has low power consumption, which makes for greener computing. There have been several significant challenges to getting Condor working on Blue Gene:
1) Linux on the compute node is a very lightweight version, designed to run a single application. IBM made some changes so that each launcher thread loops (effectively rebooting after each launch).
2) In HPC mode, a single failure of a node in a partition stopped the whole partition. IBM changed this behaviour for HPC mode partitions.
Currently there is a prototype Condor running on Blue Gene in HTC mode, with the Condor master running on the Blue Gene I/O node instead of the compute node. The mid-term goal is to have Condor support HPC workloads as well, with longer-term plans focusing on I/O: HTC mode changes the I/O patterns compared to HPC mode, and Blue Gene currently optimises MPI-style I/O. Application case studies currently being looked at include financial risk calculations and molecular drug docking.
• The Open Science Grid, Alain Roy [16]
The OSG started off as the US equivalent of GridPP. It currently has approximately 70 sites and runs 130K jobs/day, of which 86% succeed. The system now supports a number of different virtual organisations, from the standard HEP VOs to others such as nanoHUB, LIGO, molecular dynamics and maths.
An example application working on the DZero experiment at Fermilab was given; this involved transferring 90TB of input data each time (easier than pre-staging, apparently). Overall the system uses the VDT (Condor, Globus, etc.) and performs no software development beyond integration.
Users can make use of the VDT compilation facilities to build their own applications. This facility is a suite of Condor systems to which build jobs can be submitted; a very similar thing is operated by the OMII.
• Red Hat and Condor, Alain Roy [17]
Red Hat has joined with the Condor team to introduce the Condor system as a supported and fully installed part of their enterprise-ready operating systems. This is a full partnership with two-way development and code exchanges.
Red Hat is hiring 12 engineers in Madison to work alongside the Condor team. Their key development objectives are tighter kernel integration, creating and packaging Condor components for Linux distributions, and transfer of skills in virtualisation and messaging. The Condor team already have a list of requests for kernel changes which will be going into newer releases directly. Condor will remain cross-platform (though working on Linux may have advantages). Red Hat will provide technical support and engage with community groups.
Conclusion
Overall there were a number of themes that came out during the meeting that spanned both the academic and enterprise communities. It was noted that this is an artificial separation in reality, since both want reliable, easy-to-use and efficient HTC! A key point of note is the need for the business models that work within the organisations represented here to be disseminated, so that others may learn and benefit from their experiences.
The other key issue is data handling. With the proliferation of systems over the next few years, in terms of available cores and so on, the volume of data that these HTC systems are able to produce can only multiply significantly. These systems must start to consider seriously how they internally handle and externally present data, so that users are not left with the headache of handling it.
Presentations
1. http://www.nesc.ac.uk/talks/831/1.1-What%20is%20HTC-Livny.ppt
2. http://www.nesc.ac.uk/talks/831/1.2-Mungioli%20_ASRM-Grid%20Computing%20re%20formated.ppt
3. http://www.nesc.ac.uk/talks/831/1.3-eMinerals-Mark_Calleja.ppt
4. http://www.nesc.ac.uk/talks/831/1.4-Cardiff%20-%20Donna%20Lammie.ppt
5. http://www.nesc.ac.uk/talks/831/1.5-Clemson-Barr_von_Oehsen.zip (Caution: this is a 28MB zip file with demo videos etc.)
6. http://www.nesc.ac.uk/talks/831/1.6-Enterprise_UseCases-stowe.ppt
7. http://www.nesc.ac.uk/talks/831/1.7-Barrie-Hibbet_Akash.ppt
8. http://www.nesc.ac.uk/talks/831/2.1-Age%20of%20Computational%20Abundance-Powers.ppt
9. http://www.nesc.ac.uk/talks/831/2.4-tannenbaum_condor_htc_week_2007.ppt
10. http://www.nesc.ac.uk/talks/831/3.1-071129%20-%20HTC_Osborne.ppt
11. http://www.nesc.ac.uk/talks/831/3.2-CamGrid.ppt
12. http://www.nesc.ac.uk/talks/831/3.3-Reading-CampusGrid_Chapman.ppt
13. http://www.nesc.ac.uk/talks/831/3.5-EnterpriseRequirements-stowe.ppt
14. http://www.nesc.ac.uk/talks/831/3.6-HTC%20Week%20Digipede%20session-Curilli.ppt
15. http://www.nesc.ac.uk/talks/831/4.2-BlueGene_htc_week-Tannenbaum.ppt
16. http://www.nesc.ac.uk/talks/831/4.1-OSGForHTCWeek2007-Roy.ppt
17. http://www.nesc.ac.uk/talks/831/4.3-redhat-htc-week-Tannenbaum.ppt
Acknowledgements
The organisers would like to thank the e-SciNet project, Grid Computing Now! and the National e-Science Centre for their generous support.