Scheduling and Workload Management in the European DataGrid Project
David Colling, Imperial College London

Contents:
• How the EDG works from a job submission point of view
• What is the current state of the EDG
• What will be coming in the (near) future

The EDG has focused on three areas: Particle Physics, Earth Observation and Bioinformatics.
The focus has been on high-throughput applications such as Monte Carlo simulation and data analysis.
Currently there is no support for MPI, interactive jobs, etc. (although much more functionality is to come in release 2).

The middleware has been built on the “standard tools”:
• Globus Toolkit 2
• CondorG
There is a reasonably large amount of new code in the current release, and much more in future releases.

[Architecture diagram. Labels: User Interface, Sandboxes, MDS, Replica Catalogue, Resource Broker, RSL, Information Index, Job Submission Service, CondorG, Logging and Bookkeeping, and Sites with a Compute Element, a Storage Element and worker nodes (WN).]

Currently 4 Resource Brokers:
• 2 at CERN
• 1 at CNAF
• 1 at Imperial College

The user describes the requirements of the job in the Job Description Language (JDL), specifying the data required and other preferences.
JDL is based on the Condor ClassAd.

Executable = "WP1testF";
StdOutput = "sim.out";
StdError = "sim.err";
InputSandbox = {"/home/datamat/sim.exe", "/home/datamat/DATA/*"};
OutputSandbox = {"sim.err","sim.err","testD.out"};
Rank = other.TotalCPUs * other.AverageSI00;
Requirements = other.LRMSType == "PBS" \
    && (other.OpSys == "Linux RH 6.1" || other.OpSys == "Linux RH 6.2") && \
    self.Rank > 10 && other.FreeCPUs > 1;
RetryCount = 2;
Arguments = "file1";
InputData = "LF:test10099-1001";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
OutputSE = "grid001.cnaf.infn.it";

Current functionality:
Currently similar to a primitive batch system, allowing job submission:

[collngdj@gw26 grid]$ dg-job-submit -c UI.cfg dg-submit.jdl

Connecting to host gm03.hep.ph.ic.ac.uk, port 7771
Logging to host gm03.hep.ph.ic.ac.uk, port 15830
************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Resource Broker.
Use dg-job-status command to check job current status.
Your job identifier (dg_jobId) is:
- https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
************************************************************

dg-job-status

************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job : https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
--dg_JobId = https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
Status = Ready
Last Update Time (UTC) = Thu Feb 6 15:19:23 2003
Job Destination = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Status Reason = job accepted
Job Owner = /O=Grid/O=UKHEP/OU=hep.ph.ic.ac.uk/CN=Dr D J Colling
Status Enter Time (UTC) = Thu Feb 6 15:17:53 2003
Location = JobSubmissionService
************************************************************

Retrieve the output:

[collngdj@gw26 grid]$ dg-job-get-output -c UI.cfg https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771

************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
have been successfully retrieved and stored in the directory:
/tmp/151738209955581
************************************************************

There are a few other commands, such as dg-job-cancel, but not many…

Deployment in the UK:
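
Aside on the job identifier: the dg_jobId printed by the commands above is an ordinary URL, so it can be taken apart with standard tools. The sketch below is only an illustration; `split_job_id` is a hypothetical helper, not part of the EDG tools, and it does not claim to know what each field officially means. It simply exposes two things visible in the output shown earlier: the last path segment matches the directory name used by dg-job-get-output (/tmp/151738209955581), and the part after `?` matches the host and port reported as "Connecting to host gm03.hep.ph.ic.ac.uk, port 7771".

```python
from urllib.parse import urlsplit


def split_job_id(job_id: str) -> dict:
    """Split a dg_jobId URL into its visible parts (hypothetical helper)."""
    parts = urlsplit(job_id)
    # Path looks like /<something>/<unique string>; drop empty segments.
    segments = [s for s in parts.path.split("/") if s]
    # The query part looks like <host>:<port>.
    query_host, _, query_port = parts.query.partition(":")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path_segments": segments,
        "query_host": query_host,
        "query_port": int(query_port) if query_port else None,
    }


job_id = (
    "https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/"
    "151738209955581?gm03.hep.ph.ic.ac.uk:7771"
)
info = split_job_id(job_id)
print(info["path_segments"][-1])              # unique part, as in /tmp/<...>
print(info["query_host"], info["query_port"])  # host and port after the '?'
```

Keeping the parts in a plain dictionary avoids implying any official field names; a script that only needs the output directory name can use the last path segment directly.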
EDG Application testbed:
• More than 1000 CPUs
• 5 Terabytes of storage
• EDG software installed at more than 40 sites

Is it used?
[Plot; axis label: Nb. of evts]
Part of a real Monte Carlo production.
Each event takes ~6 CPU minutes.

Is it stable?
Well … getting there.
Over long tests, ~80% of jobs now run without problems.
Many bugs/limitations have been found in core grid middleware such as Globus and CondorG.
Experiments have successfully submitted “job storms” of hundreds of jobs.

Future functionality:
• Interactive jobs
• MPI jobs
• Accounting
• Checkpointing and job partitioning
• Advance reservation
• Dependent jobs
The order of implementation is decided by the user community.

Also, changes are coming up in information services and, importantly, in data/replication management.

Conclusions:
• The EDG has a working grid
• It is being used for real science
• It is becoming more stable
• There have been scaling problems, often in core products
• Currently limited functionality
• Greater functionality in the near future
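
Supplementary sketch: how the Rank and Requirements expressions in the JDL example earlier in the talk might be read. This is a minimal illustration, not the Resource Broker's actual implementation. It assumes, as in Condor matchmaking, that `other` refers to a candidate resource, that `self.Rank` is the job's own Rank evaluated against that resource, and that a larger Rank is preferred. The resource names and attribute values are invented.

```python
# Hypothetical resource descriptions. Attribute names follow the JDL
# example; the names and values are invented for illustration.
resources = [
    {"Name": "ce-a", "LRMSType": "PBS", "OpSys": "Linux RH 6.2",
     "TotalCPUs": 20, "FreeCPUs": 4, "AverageSI00": 400},
    {"Name": "ce-b", "LRMSType": "LSF", "OpSys": "Linux RH 6.2",
     "TotalCPUs": 100, "FreeCPUs": 30, "AverageSI00": 450},
    {"Name": "ce-c", "LRMSType": "PBS", "OpSys": "Linux RH 6.1",
     "TotalCPUs": 50, "FreeCPUs": 1, "AverageSI00": 420},
    {"Name": "ce-d", "LRMSType": "PBS", "OpSys": "Linux RH 6.1",
     "TotalCPUs": 64, "FreeCPUs": 10, "AverageSI00": 380},
]


def rank(other: dict) -> int:
    # Rank = other.TotalCPUs * other.AverageSI00;
    return other["TotalCPUs"] * other["AverageSI00"]


def requirements(other: dict) -> bool:
    # Requirements = other.LRMSType == "PBS"
    #   && (other.OpSys == "Linux RH 6.1" || other.OpSys == "Linux RH 6.2")
    #   && self.Rank > 10 && other.FreeCPUs > 1;
    return (
        other["LRMSType"] == "PBS"
        and other["OpSys"] in ("Linux RH 6.1", "Linux RH 6.2")
        and rank(other) > 10
        and other["FreeCPUs"] > 1
    )


def match(candidates: list) -> list:
    """Keep resources that satisfy Requirements, highest Rank first."""
    suitable = [r for r in candidates if requirements(r)]
    return sorted(suitable, key=rank, reverse=True)


for resource in match(resources):
    print(resource["Name"], rank(resource))
```

With these invented values, ce-b is rejected because it is not a PBS system and ce-c because it has only one free CPU; the two remaining resources are ordered by Rank.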