Scheduling and Workload Management in the European DataGrid Project

Contents:
• How the EDG works from a job submission point of view
• What is the current state of the EDG
• What will be coming in the (near) future
David Colling, Imperial College
London
The EDG has been focused on three areas:
Particle Physics, Earth Observation and Bioinformatics
The focus has been on high-throughput applications
such as Monte Carlo simulation and data analysis.
Currently no support for MPI, interactive jobs, etc.
(although much more functionality is to come in release 2)
Middleware has been built on the “standard tools”
Globus Toolkit 2
CondorG
Reasonably large amount of new code in current
release … much more in future releases
[Architecture diagram. Labelled components: User Interface; Resource Broker (information index, Job Submission Service, CondorG, Logging and bookkeeping); MDS; Replica Catalogue; several Sites, each with a Compute Element, a Storage Element and worker nodes (WN). Other labels on the diagram: Sandboxes, RSL.]
Currently 4 Resource Brokers:
2 at CERN
1 at CNAF
1 at Imperial College
The user describes the job's requirements in the Job
Description Language (JDL),
specifying the data required and other preferences.
JDL is based on the Condor ClassAd language.
Executable = "WP1testF";
StdOutput = "sim.out";
StdError = "sim.err";
InputSandbox = {"/home/datamat/sim.exe", "/home/datamat/DATA/*"};
OutputSandbox = {"sim.out", "sim.err", "testD.out"};
Rank = other.TotalCPUs * other.AverageSI00;
Requirements = other.LRMSType == "PBS" \
    && (other.OpSys == "Linux RH 6.1" || other.OpSys == "Linux RH 6.2") \
    && self.Rank > 10 && other.FreeCPUs > 1;
RetryCount = 2;
Arguments = "file1";
InputData = "LF:test10099-1001";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
DataAccessProtocol = "gridftp";
OutputSE = "grid001.cnaf.infn.it";
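To make the roles of Rank and Requirements in the JDL above concrete, here is a minimal Python sketch of ClassAd-style matchmaking: resources that fail the requirements are discarded and the survivors are ordered by rank. It is an illustration only, not the Resource Broker's actual code. The Resource class, the match function, the host names and all the numbers are hypothetical, and self.Rank is assumed to mean the value of the Rank expression for the candidate resource.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical description of one Compute Element ("other" in the JDL)."""
    name: str
    LRMSType: str
    OpSys: str
    FreeCPUs: int
    TotalCPUs: int
    AverageSI00: float


def rank(other):
    # Rank = other.TotalCPUs * other.AverageSI00
    return other.TotalCPUs * other.AverageSI00


def requirements(other):
    # Mirrors the Requirements expression of the JDL example.
    # Assumption: self.Rank is the Rank value computed for this resource.
    return (
        other.LRMSType == "PBS"
        and other.OpSys in ("Linux RH 6.1", "Linux RH 6.2")
        and rank(other) > 10
        and other.FreeCPUs > 1
    )


def match(resources):
    """Keep resources that satisfy the requirements, best rank first."""
    candidates = [r for r in resources if requirements(r)]
    return sorted(candidates, key=rank, reverse=True)


# Made-up resources, for illustration only.
resources = [
    Resource("ce-a.example.org", "PBS", "Linux RH 6.2", 4, 20, 400.0),
    Resource("ce-b.example.org", "LSF", "Linux RH 6.2", 8, 50, 500.0),
    Resource("ce-c.example.org", "PBS", "Linux RH 6.1", 1, 10, 300.0),
    Resource("ce-d.example.org", "PBS", "Linux RH 6.1", 16, 64, 350.0),
]

matched_names = [r.name for r in match(resources)]
print(matched_names)
```

In this made-up data set, ce-b is rejected because it does not run PBS and ce-c because it has only one free CPU; the two remaining resources are ordered by their rank value.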
Current functionality:
Currently similar to a primitive batch system, allowing
job submission
[collngdj@gw26 grid]$ dg-job-submit -c UI.cfg dg-submit.jdl
Connecting to host gm03.hep.ph.ic.ac.uk, port 7771
Logging to host gm03.hep.ph.ic.ac.uk, port 15830
******************************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Resource Broker.
Use dg-job-status command to check job current status. Your job identifier (dg_jobId) is:
- https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
******************************************************************************************
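Since every later command takes the job identifier printed above, a script that drives submission needs to pull it out of the text. The following Python sketch shows one way this could be done, assuming the identifier always appears on its own line after a leading "- " as in the output shown here. The function name extract_job_id is hypothetical, and the sample text is copied from the slide rather than produced by running the command.

```python
import re
from urllib.parse import urlsplit

# Sample text copied from the submission output shown above.
SAMPLE_OUTPUT = """\
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Resource Broker.
Use dg-job-status command to check job current status. Your job identifier (dg_jobId) is:
- https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
"""


def extract_job_id(text):
    """Return the first '- https://...' line as a job identifier, or None."""
    found = re.search(r"^\s*-\s*(https://\S+)\s*$", text, re.MULTILINE)
    return found.group(1) if found else None


job_id = extract_job_id(SAMPLE_OUTPUT)

# The identifier is an ordinary URL, so the standard library can split it.
parts = urlsplit(job_id)
print(job_id)
print(parts.hostname, parts.port, parts.path, parts.query)
```

The sketch only splits the identifier syntactically; it makes no claim about what each part means to the middleware.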
dg-job-status
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job :
https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
--dg_JobId              = https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
Status                  = Ready
Last Update Time (UTC)  = Thu Feb 6 15:19:23 2003
Job Destination         = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Status Reason           = job accepted
Job Owner               = /O=Grid/O=UKHEP/OU=hep.ph.ic.ac.uk/CN=Dr D J Colling
Status Enter Time (UTC) = Thu Feb 6 15:17:53 2003
Location                = JobSubmissionService
*************************************************************
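A script that polls a job would typically turn a status report like the one above into a lookup table. The Python sketch below assumes the report is laid out as one "key = value" pair per line; the function name parse_status is hypothetical and the sample text reuses values from the slide.

```python
# Sample report using field values from the status output shown above,
# laid out as one "key = value" pair per line (an assumption).
SAMPLE_STATUS = """\
BOOKKEEPING INFORMATION:
Status = Ready
Last Update Time (UTC) = Thu Feb 6 15:19:23 2003
Job Destination = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Status Reason = job accepted
Job Owner = /O=Grid/O=UKHEP/OU=hep.ph.ic.ac.uk/CN=Dr D J Colling
Location = JobSubmissionService
"""


def parse_status(text):
    """Collect 'key = value' lines into a dict, ignoring everything else."""
    fields = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values such as the certificate
        # subject in Job Owner keep their own '=' characters.
        key, sep, value = line.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


fields = parse_status(SAMPLE_STATUS)
print(fields["Status"], "at", fields["Job Destination"])
```

Splitting on the first "=" only is the one design choice that matters here, because the Job Owner value itself contains "=" signs.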
Retrieve the output
[collngdj@gw26 grid]$ dg-job-get-output -c UI.cfg
https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
**************************************************************************************
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
- https://gm03.hep.ph.ic.ac.uk:7846/155.198.216.137/151738209955581?gm03.hep.ph.ic.ac.uk:7771
have been successfully retrieved and stored in the directory:
/tmp/151738209955581
*************************************************************************************
There are a few other commands, such as dg-job-cancel,
but not many…
Deployment in the UK:
EDG Application testbed:
More than 1000 CPUs
5 terabytes of storage
EDG software installed at
more than 40 sites
Is it used?
Part of a real Monte Carlo production
Each event takes ~6 CPU minutes
[Plot: number of events versus time]
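As a back-of-envelope illustration of what ~6 CPU minutes per event implies, the Python sketch below converts a number of continuously busy CPUs into an upper bound on events per day. The CPU counts are hypothetical and the estimate ignores scheduling overhead, data transfer and failed jobs.

```python
CPU_MINUTES_PER_EVENT = 6  # approximate figure quoted on the slide


def events_per_day(n_cpus, cpu_minutes_per_event=CPU_MINUTES_PER_EVENT):
    """Upper bound on daily events if n_cpus are busy around the clock."""
    minutes_per_day = 24 * 60
    return n_cpus * minutes_per_day / cpu_minutes_per_event


per_cpu = events_per_day(1)
per_hundred = events_per_day(100)  # 100 CPUs is a made-up example size
print(per_cpu, per_hundred)
```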
Is it stable?
Well … getting there.
Over long tests ~80% of jobs now run without
problems.
Have found many bugs/limitations in core grid
middleware like Globus and CondorG.
Experiments have submitted “job storms” of hundreds
of jobs successfully
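The ~80% figure can be combined with a retry setting such as RetryCount = 2 in a rough calculation. The Python sketch below assumes that RetryCount = 2 allows up to two resubmissions (three attempts in total) and that attempts fail independently. Both are simplifying assumptions made for illustration, not statements about how the middleware behaves, and real failures are often correlated.

```python
def success_probability(p_single, retry_count):
    """Chance that at least one of (retry_count + 1) independent attempts succeeds."""
    attempts = retry_count + 1
    return 1.0 - (1.0 - p_single) ** attempts


p_no_retry = success_probability(0.8, 0)
p_two_retries = success_probability(0.8, 2)
print(f"{p_no_retry:.3f} {p_two_retries:.3f}")
```

Under these assumptions the chance of eventual success is 1 - 0.2^3 = 0.992, compared with 0.8 for a single attempt.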
Future functionality:
Interactive jobs, MPI jobs, accounting,
checkpointing and job partitioning, advance
reservation, dependent jobs
The order of implementation is decided by the
user community
Changes are also coming up in information services
and, importantly, in data/replication management
Conclusions:
• The EDG has a working grid
• It is being used for real science
• It is becoming more stable
• Have had scaling problems, often in core products
• Currently limited functionality
• Greater functionality in near future