Condor at Brookhaven
Xin Zhao, Antonio Chan
Brookhaven National Lab
CondorWeek 2009, Tuesday, April 21

Outline
• RACF background
• RACF Condor batch system
• USATLAS grid job submission using Condor-G

RACF
• Brookhaven National Laboratory (BNL) is a multi-disciplinary DOE lab.
• The RHIC and ATLAS Computing Facility (RACF) provides computing support for BNL activities in HEP, NP, Astrophysics, etc.
  – RHIC Tier0
  – USATLAS Tier1
• Large installation
  – 7000+ CPUs, 5+ PB of storage, 6 robotic silos with a capacity of 49,000+ tapes
• Storage and computing to grow by a factor of ~5 by 2012.

New Data Center rising
The new data center will increase floor space by a factor of ~2 in summer 2009.

BNL Condor Batch System
• Introduced in 2003 to replace LSF.
• Steep learning curve – much help from Condor staff.
• Extremely successful implementation.
• Complex use of job slots (formerly VMs) to determine job priority (queues), eviction, suspension and back-filling policies.

Condor Queues
• Originally designed for vertical scalability
  – Complex queue priority configuration per core
  – Manageable on older hardware with fewer cores
• Changed to horizontal scalability in 2008
  – More and more multi-core hardware now
  – Simplified queue priority configuration per core
  – Reduced administrative overhead

Condor Policy for ATLAS (old)
ATLAS Condor configuration (old)
Condor Policy @ BNL
ATLAS Condor configuration (new)
Condor Queue Usage

Job Slot Occupancy (RACF)
• Left-hand plot is for 01/2007 to 06/2007.
• Right-hand plot is for 06/2007 to 05/2008.
• Occupancy remained at 94% between the two periods.

Job Statistics (2008)
• Condor usage by RHIC experiments increased by 50% (in number of jobs) and by 41% (in CPU time) since 2007.
• PHENIX executed ~50% of its jobs in the general queue.
• General queue jobs amounted to 37% of all RHIC Condor jobs during this period.
• General queue efficiency increased from 87% to 94% since 2007.

Near-Term Plans
• Continue integration of Condor with Xen virtual systems.
• OS upgrade to 64-bit SL5.x – any issues with Condor?
• Condor upgrade from 6.8.5 to the stable 7.2.x series.
• Short on manpower – open Condor admin position at BNL. If interested, please talk to Tony Chan.

Condor-G Grid Job Submission
• BNL, as the USATLAS Tier1, provides support for the ATLAS PanDA production system.

PanDA Job Flow
• One critical service is maintaining PanDA autopilot submission using Condor-G
  – Very large number (~15,000) of concurrent pilot jobs submitted as a single user
  – Need to sustain a very high submission rate
• The autopilot attempts to always keep a set number of pending jobs in every queue of every remote USATLAS production site
  – Three Condor-G submit hosts in production
    • Quad-core Intel Xeon E5430 @ 2.66 GHz, 16 GB memory and two 750 GB SATA drives (mirrored)

Weekly OSG Gratia Job Count Report for USATLAS VO
• We work closely with the Condor team to tune Condor-G for better performance. Many improvements have been suggested and implemented by the Condor team.

New Features and Tuning of Condor-G Submission (not a complete list)
• The gridmanager publishes resource classads to the collector, so users can easily query the grid job submission status for all remote resources.

  $> condor_status -grid
  Name                  Job Limit  Running  Submit Limit  In Progress
  gt2 atlas.bu.edu:211       2500      376           200            0
  gt2 gridgk04.racf.bn       2500        1           200            0
  gt2 heroatlas.fas.ha       2500      100           200            0
  gt2 osgserv01.slac.s       2500      611           200            0
  gt2 osgx0.hep.uiuc.e       2500        5           200            0
  gt2 tier2-01.ochep.o       2500      191           200            0
  gt2 uct2-grid6.mwt2.       2500     1153           200            0
  gt2 uct3-edge7.uchic       2500        0           200            0

• Nonessential jobs
  – Condor assumes every job is important; it carefully holds and retries them
    • A pile-up of held jobs often clogs Condor-G and prevents it from submitting new jobs
  – A new job attribute, Nonessential, is introduced
    • Nonessential jobs will be aborted instead of being put on hold
  – Well suited for "pilot" jobs (see the submit-file sketch below)
    • Pilots are job sandboxes, not the real job payload; the pilots themselves are not as essential as real jobs
    • The job payload connects to the PanDA server through its own channel; the PanDA server knows its status and can abort it directly if needed
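A minimal Condor-G submit description sketch for such a nonessential pilot job is shown below; the gatekeeper contact, proxy path, and file names are hypothetical illustrations, not the actual PanDA autopilot configuration.

  # pilot.submit -- illustrative sketch only; the gatekeeper contact,
  # proxy path and file names are hypothetical, not production values
  universe       = grid
  # pre-WS GRAM (gt2) resource: "<gatekeeper host>/<jobmanager>"
  grid_resource  = gt2 gatekeeper.example.org/jobmanager-condor
  executable     = pilot.sh
  output         = pilot.$(Cluster).$(Process).out
  error          = pilot.$(Cluster).$(Process).err
  log            = pilot.log
  x509userproxy  = /tmp/x509up_u12345
  # mark the pilot as nonessential: on errors the gridmanager aborts it
  # instead of putting it on hold, so held pilots cannot pile up
  +Nonessential  = True
  queue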
• GRID_MONITOR_DISABLE_TIME
  – New configurable Condor-G parameter
    • Controls how long Condor-G waits, after a grid monitor failure, before submitting a new grid monitor job
  – The old default value of 60 minutes is too long
    • New job submission quite often pauses during the wait time, so submission cannot be sustained at a high rate
  – The new value is 5 minutes
    • Much better submission rate seen in production
  – The Condor-G developers plan to trace the underlying grid monitor failures in the Globus context
• Separate throttles for limiting jobmanagers based on their role
  – Job submission won't compete with job stage-out/removal
• Globus bug fix
  – The GRAM client (inside the GAHP) stops receiving connections from remote jobmanagers for job status updates
  – We ran a cron job to periodically kill the GAHP server to clear up the connection issue, at the cost of a slower job submission rate
  – The new Condor-G binary is compiled against newer Globus libraries; so far so good, but we need more time to verify
• Some best practices in Condor-G submission
  – Reduce the frequency of voms-proxy renewal on the submit host
    • Condor-G aggressively pushes out new proxies to all jobs
    • Frequent renewal of the voms-proxy on the submit hosts slows down job submission
  – Avoid hard-killing jobs (-forcex) from the client side
    • Reduces job debris on the remote gatekeepers
    • On the other hand, on the remote gatekeepers, we need to clean up debris more aggressively

Near-Term Plans
Continue the good collaboration with the Condor team for better performance of Condor/Condor-G in our production environment.
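For reference, a condor_config sketch for a Condor-G submit host in the spirit of the tuning discussed above; the numeric values are illustrative assumptions, not the actual RACF production settings.

  # condor_config tuning sketch for a Condor-G submit host
  # (illustrative values, not the RACF production configuration)

  # Wait 5 minutes (value in seconds) after a grid monitor failure before
  # starting a new grid monitor job, instead of the old 60-minute default
  GRID_MONITOR_DISABLE_TIME = 300

  # Per-resource limits corresponding to the "Job Limit" and
  # "Submit Limit" columns of the condor_status -grid output above
  GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 2500
  GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 200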