http://www.grid-support.ac.uk
http://www.ngs.ac.uk

Middleware emerging onto the NGS: Resource Broker
Mike Mineter
mjm@nesc.ac.uk

http://www.nesc.ac.uk/
http://www.pparc.ac.uk/
http://www.eu-egee.org/

Outline
• NGS middleware: toolkits inviting development of higher-level services
  – By projects, e.g. RealityGrid and BRIDGES
  – For deployment as NGS services
• What is a Resource Broker?
• Where does it come from?
  – LCG-2 (= EGEE-0)
  – Providing production service for LCG-2
  – Being configured for the NGS
• Current LCG-2 activity

Resource broker
• On the current NGS we have
  – GRAM to submit jobs
  – Information service to tell us what queues are busy
• The RB takes the work out of deciding where to run a job
• First step: the LCG-2 RB is being added to the NGS
(LCG = Large Hadron Collider Computing Grid)

Current production middleware: LCG-2
• Application level services: user interfaces, applications
• “Collective” services (EU DataGrid): app monitoring system, user access, data management, workload management
• “Basic” services (VDT: Condor, Globus, GLUE): information system, information schema, data transfer, security
• System software: operating system (RedHat Linux), file system (NFS, …), local scheduler (PBS, Condor, LSF, …)
• Hardware: computing cluster, network resources, data storage (HPSS, CASTOR, …)

Major components
[Figure: “User interface”, Resource Broker, Information Service, Replica Catalogue, Logging & Book-keeping, Author. & Authen., Computing Element, Storage Element. Labelled flows: input “sandbox”, output “sandbox”, DataSets info, Job Submit Event, Job Query, Job Status, Publish.]

[Figure: the RB node (Network Server, Workload Manager, Job Controller) together with the UI, Replica Location Server, Information Service, Computing Element and Storage Element. The CE and SE publish their characteristics and status.]

Job status: submitted
• UI: allows users to access the functionalities of the Workload Management System (WMS) via command line, GUI, C++ and Java APIs
Job Description Language (JDL) to specify job characteristics and requirements

edg-job-submit myjob.jdl

myjob.jdl:
    JobType = "Normal";
    Executable = "$(CMS)/exe/sum.exe";
    InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
    OutputSandbox = {"sim.err", "test.out", "sim.log"};
    Requirements = other.GlueHostOperatingSystemName == "linux" &&
                   other.GlueHostOperatingSystemRelease == "Red Hat 7.3" &&
                   other.GlueCEPolicyMaxCPUTime > 10000;
    Rank = other.GlueCEStateFreeCPUs;

Job status: submitted

Job submission
• NS (Network Server): network daemon responsible for accepting incoming requests
  – The job and its input sandbox files pass from the UI to RB storage
  – Job status: submitted, waiting
• WM (Workload Manager): acts to satisfy the request
• MatchMaker/Broker: where must this job be executed?
• Matchmaker: responsible for finding the “best” CE for a job
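
As a rough illustration of what the Requirements and Rank expressions in myjob.jdl ask the matchmaker to do, the sketch below filters a list of CEs and orders the survivors. It is not the RB's code: the `ComputingElement` class, the CE names and the attribute values are all made up for the example, and only the attribute meanings are taken from the JDL above.

```python
from dataclasses import dataclass


@dataclass
class ComputingElement:
    """Illustrative subset of the attributes a CE might publish."""
    name: str           # hypothetical CE identifier
    os_name: str        # cf. GlueHostOperatingSystemName
    os_release: str     # cf. GlueHostOperatingSystemRelease
    max_cpu_time: int   # cf. GlueCEPolicyMaxCPUTime
    free_cpus: int      # cf. GlueCEStateFreeCPUs


def requirements(ce):
    """Mirror of the Requirements expression in myjob.jdl."""
    return (ce.os_name == "linux"
            and ce.os_release == "Red Hat 7.3"
            and ce.max_cpu_time > 10000)


def rank(ce):
    """Mirror of the Rank expression: more free CPUs ranks higher."""
    return ce.free_cpus


def match(ces):
    """Return the CEs that satisfy the requirements, best rank first."""
    candidates = [ce for ce in ces if requirements(ce)]
    return sorted(candidates, key=rank, reverse=True)


# Made-up CE records standing in for what the information service publishes.
ces = [
    ComputingElement("ce-a", "linux", "Red Hat 7.3", 20000, 4),
    ComputingElement("ce-b", "linux", "Red Hat 7.3", 5000, 64),
    ComputingElement("ce-c", "linux", "Red Hat 7.3", 86400, 12),
    ComputingElement("ce-d", "linux", "Red Hat 9", 86400, 100),
]
ranked = match(ces)
print([ce.name for ce in ranked])  # ['ce-c', 'ce-a']
```

Here ce-b is rejected for its CPU time limit and ce-d for its OS release, so the busiest-looking constraint (Requirements) is applied before Rank ever matters.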
Job submission
• MatchMaker/Broker asks:
  – Where (on which SEs) are the needed data? (Replica Location Server)
  – What is the status of the Grid? (Information Service)
• CE choice
  – Job status: submitted, waiting
• Job Adapter: responsible for the final “touches” to the job before performing submission (e.g. creation of wrapper script, PFN, etc.)
• Job Controller: responsible for the actual job management operations (done via CondorG)
  – Job status: submitted, waiting, ready
• The job is passed via CondorG to the Computing Element
  – Job status: submitted, waiting, ready, scheduled

“Compute element” – reminder!
• Grid gate node
  – Globus gatekeeper: receives the job request
  – gridmapfile
  – Logging
  – Info system (I.S.)
• Local resource management system: Condor / PBS / LSF master
• Homogeneous set of worker nodes
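
The gridmapfile on the grid gate node is what lets the gatekeeper map an authenticated certificate subject (DN) to a local account. A minimal sketch of reading such a file follows, assuming the usual layout of a double-quoted DN followed by one account name or a comma-separated list. The function name and the example entries are hypothetical.

```python
import shlex


def parse_gridmap(text):
    """Map each certificate DN to its list of local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # honours the double quotes around the DN
        if len(fields) != 2:
            continue  # this sketch silently ignores malformed entries
        dn, accounts = fields
        mapping[dn] = accounts.split(",")
    return mapping


# Hypothetical entries, for illustration only.
EXAMPLE = '''
# DN                                            local account(s)
"/C=UK/O=eScience/OU=Example/CN=jane doe" ngs0001
"/C=UK/O=eScience/OU=Example/CN=john smith" ngs0002,pool01
'''

gridmap = parse_gridmap(EXAMPLE)
print(gridmap["/C=UK/O=eScience/OU=Example/CN=jane doe"])  # ['ngs0001']
```

A DN that is absent from the mapping has no local account, which is why a job from an unregistered user is refused at the gatekeeper rather than at the batch system.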
Job submission
• Job status: running
  – Input sandbox files reach the Computing Element
  – The job makes “Grid enabled” data transfers/accesses to the Storage Element
• Job status: done
  – Output sandbox files are returned from the Computing Element to RB storage
• edg-job-get-output <dg-job-id>
• Job status: cleared
  – Output sandbox files are delivered to the UI

Job monitoring
edg-job-status <dg-job-id>
edg-job-get-logging-info <dg-job-id>
• LB (Logging & Bookkeeping): receives and stores job events; processes corresponding job status
• LM (Log Monitor): parses CondorG log file (where CondorG logs info about jobs) and notifies LB

LCG-2 and NGS
• LCG-2 replica management:
  – Logical file names, mapped by catalogue to multiple physical files
• Storage element
  – Corresponds to NGS data node (approx.)
• Compute element
  – A batch queue, PBS or Condor for example
• Information service
  – Same middleware and GLUE schema are used

More about the RB
• Developed by the European DataGrid project (EDG), then “hardened” by LCG, and now one of the sources for the EGEE middleware (next talk)
• Uses components of Condor: matchmaker and Condor-G
• Try the GENIUS portal on GILDA
  – GILDA is a dissemination grid running the LCG-2 middleware
  – Demo site: https://grid-demo.ct.infn.it/
• And look at
  – http://lcg.web.cern.ch/LCG/
  – http://www.hep.ph.ic.ac.uk/escience/projects/demo/index.html

Implications for the NGS
• Are being worked out!
• Integration with NGS core nodes in progress
• “UI” requirements??
  – LCG user interface + OGSA-DAI + SRB client
  – Lighter-weight alternatives?
    • To packaging?
    • For client software

Summary
• The resource broker receives a job description in JDL
• It chooses a batch queue for job submission
• It is an example of the higher-level services that will be deployed for the NGS, built upon the current toolkits
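
The job states that label the submission walkthrough (submitted, waiting, ready, scheduled, running, done, cleared) can be read as a simple linear lifecycle. The sketch below models only that successful path as shown on the slides; failure or cancellation states are left out, and the helper names are invented for illustration rather than taken from the Logging & Bookkeeping service.

```python
from enum import Enum


class JobStatus(Enum):
    SUBMITTED = "submitted"
    WAITING = "waiting"
    READY = "ready"
    SCHEDULED = "scheduled"
    RUNNING = "running"
    DONE = "done"
    CLEARED = "cleared"


# Enum members iterate in definition order, which is the order on the slides.
LIFECYCLE = list(JobStatus)


def next_status(status):
    """Return the state that follows `status`, or None after the last one."""
    i = LIFECYCLE.index(status)
    return LIFECYCLE[i + 1] if i + 1 < len(LIFECYCLE) else None


def status_from_events(events):
    """Return the furthest state reached by any logged event.

    A very loose stand-in for turning a stream of job events into a single
    job status; events may arrive out of order.
    """
    current = None
    for event in events:
        if current is None or LIFECYCLE.index(event) > LIFECYCLE.index(current):
            current = event
    return current


print(next_status(JobStatus.READY).value)  # scheduled
print(status_from_events(
    [JobStatus.SUBMITTED, JobStatus.READY, JobStatus.WAITING]).value)  # ready
```

Taking the furthest state reached, rather than the most recent event, keeps the reported status from moving backwards when events are delivered late.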