Dr. David Wallom
Use of Condor in our Campus Grid and the University
September 2004

Outline
• The University of Bristol Grid (UoBGrid).
• The UoBGrid Resource Broker.
• Users & their environment.
• Problems encountered.
• Other Condor use within Bristol.
• Summary.

The UoBGrid
• Planned for ~1000+ CPUs from 1.2 to 3.2 GHz, arranged in 7 clusters & 3+ Condor pools located in 4 different departments.
• Core services run on individual servers, e.g. Resource Broker & MDS.

The UoBGrid System Layout

The UoBGrid, now
• Currently 270 CPUs in 3 clusters and 1 Windows Condor pool.
• Central services run on 2 beige boxes in my office.
• The Windows Condor pool covers only a single student open-access area.
• Currently only two departments (Physics, Chemistry) are fully engaged, though more are on their way.
• The remaining large clusters are still on legacy versions of their operating systems; a University-wide upgrade programme has started.

Middleware
• Virtual Data Toolkit.
  – Chosen for stability.
  – Platform-independent installation method.
  – Widely used in other European production grid systems.
• Contains the standard Globus Toolkit version 2.4 with several enhancements.
• Also:
  – GSI-enhanced OpenSSH.
  – MyProxy client & server.
• Has a defined support structure.

Condor-G Resource Broker
• Uses the Condor-G matchmaking mechanism with Grid resources.
• Set to run as soon as a job appears.
• Custom script determines resource status & priority.
• Integrates the Condor resource description mechanism with the Globus Monitoring and Discovery Service (MDS).
• (A hedged example of a submit description that such resource ads could match is sketched after the job-monitoring slide below.)

Resource Broker Operation

Information Passed into the Resource ClassAd

MyType = "Machine"
TargetType = "Job"

The Name and gatekeeper URL depend on the resource's system name and installed scheduler, as systems may easily have more than one jobmanager installed.
Name = "grendel.chm.bris.ac.uk-pbs"
gatekeeper_url = "grendel.chm.bris.ac.uk/jobmanager-pbs"

Require the Globus universe and cap the number of jobs matched to a particular resource.
Requirements = (CurMatches < 5) && (TARGET.JobUniverse == 9)
WantAdRevaluate = True

Time the ClassAd was constructed.
UpdateSequenceNumber = 1097580300

Currently hard-coded in the ClassAd.
CurMatches = 0

System information retrieved from Globus MDS, for the head node only, not the workers.
OpSys = "LINUX"
Arch = "INTEL"
Memory = 501

Installed software is defined in a resource broker file for each resource.
INTEL_COMPILER = True
GCC3 = True

Possible extensions to Resource Information
• Resource state information (LoadAvg, Claimed, etc.):
  – How is this defined for a cluster? Perhaps Condor-G could introduce new states such as "percentage full".
• Number of CPUs and free disk space:
  – How do you define these for a cluster? Is the number of CPUs that of a single worker or of the whole system? The same question applies to disk space.
• Cluster performance (MIPS, KFlops):
  – This is not commonly measured for small clusters, so it would need to be hard-wired in, but it could be very useful for ranking resources.

Results of condor_status

Load Management
• Only defines the raw numbers of jobs running, idle & held (for whatever reason).
• Has little measure of the relative performance of nodes within the grid; currently based on:
  – Head node processor type & memory.
  – MDS value of nodeCount for the jobmanager (this is not always the same as the real number of worker nodes).
• Currently submits only to a single queue on each resource.

What is currently running and how do I find out?
• Simple interface to condor_q.
• Planning to use the Condor Job Monitor once installed, because of scalability issues.
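As a concrete illustration of the matchmaking mechanism on the Resource Broker slides above, the following is a minimal sketch of a Globus-universe submit description that a resource ad like the one shown earlier could match. The universe, $$(gatekeeper_url) substitution, requirements and queue lines follow the usual Condor-G matchmaking pattern; the executable, file names and the exact requirements expression are illustrative assumptions, not taken from the UoBGrid configuration.

    # Sketch only: a Condor-G (Globus universe) job description.
    # The matched resource ad supplies gatekeeper_url, which $$() substitutes here.
    universe        = globus
    globusscheduler = $$(gatekeeper_url)

    # Hypothetical executable and I/O files.
    executable      = charge_sim
    output          = charge_sim.out
    error           = charge_sim.err
    log             = charge_sim.log

    # Only match Linux/Intel head nodes that advertise the Intel compiler,
    # mirroring the attributes published in the resource broker's ads.
    requirements    = (OpSys == "LINUX") && (Arch == "INTEL") && (INTEL_COMPILER =?= True)

    queue

Such a job is held by Condor-G until the negotiator matches it against one of the advertised resource ads, at which point it is submitted through the selected gatekeeper URL.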
Display of jobs currently running

Issues with Condor-G
• The following is a list of small issues we have:
  – How do you define some resource attributes for clusters?
  – When using condor_q -globus, the actual hostname the job was matched to is not displayed.
  – No job exit codes.
• Job exit codes will become more important as the number of users and problems increases.
• Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.

The Users
• BaBar:
  – One resource is the Bristol BaBar farm, so Monte Carlo event production runs in parallel with UoBGrid usage.
• GENIE:
  – Installing software onto each connected system by agreement with the owners.
• LHCb:
  – Windows-compiled Pythia event generation.
• Earth Sciences:
  – River simulation.
• Myself…
  – Undergraduate-written charge distribution simulation code.

Usage
• Current record:
  – ~10,000 individual jobs in a week;
  – ~2,500 in one day.

Windows Condor through Globus
• Install a Linux machine as a Condor master only.
• Configure this to flock to the Windows Condor pool.
• Install a Globus gatekeeper.
• Edit the jobmanager.pm file so that the architecture for submitted jobs is always WINNT51 (matching all the workers in the pool).
• Appears in the Condor-G resource list as a WINNT51 resource.
• (A hedged sketch of the flocking configuration is given at the end of these notes.)

Windows Condor pools available through a Globus interface from a flocked Linux pool
• There are currently three separate Windows Condor pools within three departments, approximately 200 CPUs in total.
• Planning to allow student teaching resources in as many departments as possible to have the software installed.
• This will allow a significant increase in University processing power at little extra cost.
• When a department gives the OK, its pool will be added to the flocking list on the single Linux submission machine.
• The main difficulty encountered with this setup is the lack of a Microsoft Installer (MSI) file.
  – This affects the ability to use the group-policy method of software delivery and installation, which directly affects how some computer officers view installing it.

Evaluation of Condor against United Devices GridMP
• The Computational Chemistry group has significant links with an industrial partner who is currently using U.D. GridMP.
• It was suggested that the CC group also use GridMP, though after initial contact this looked to be very costly.
• The e-Science group suggested that Condor would be a better system for them to use.
• Agreement from UD to do a published function & usage comparison between Condor & GridMP.
• Due to start this autumn.
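For reference, below is a minimal sketch of the flocking half of the Windows-through-Globus arrangement described above, assuming one Linux submit/gateway host and one Windows pool central manager. The hostnames are placeholders rather than the real Bristol machines, and the jobmanager.pm architecture change is not shown.

    # Sketch only; hostnames are placeholders.
    # On the Linux Condor-master / Globus gatekeeper machine (condor_config):
    FLOCK_TO   = winpool-cm.example.bris.ac.uk

    # On the central manager of the Windows pool (condor_config):
    FLOCK_FROM = linux-gateway.example.bris.ac.uk
    HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), linux-gateway.example.bris.ac.uk

With something like this in place, Globus-universe jobs arriving at the gatekeeper are handed to the local Condor scheduler, which flocks them out to the Windows pool's WINNT51 workers.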