Review of WLCG Tier-2 Workshop
Duncan Rand
Royal Holloway, University of London
Brunel University

...from the perspective of a Tier-2 system manager
- Workshop: 3 days - lectures from experiments
- Tutorial: 2 days - parallel programme
- Lots of talks with lots of detail!
- General overview - refer to original slides for details
- Oriented towards ATLAS (RHUL) and CMS (Brunel)

What did I expect?
- An overview of the future
  - the big picture
  - more details about the experiments
  - data flows and rates
  - how were they going to use the Tier-2 sites?
  - what did they expect from us?
- Perhaps a tour of the LHC or an experiment

What do the experiments have in common?
- Large volume of data to analyse (we knew that)
- Need to distribute data to CPUs, keep track of it, analyse it and upload results
- However, also need to run lots of Monte Carlo (MC) jobs
  - common to all particle physics experiments
  - large fraction of all jobs run (ATLAS: 1/3; CMS: 1/2)
  - submitted from a central server - 'production'
  - explains mysterious 'prd' users, e.g. lhcbprd running on our Tier-2 now

What do they do in Monte Carlo production?
- Start with small dataset (KB) with initial conditions describing experiment
- Model experiment from collision to analysis
- Model proton-proton interactions, detector physics etc.
- CPU intensive; about 10 kSI2k hours
- Upload larger dataset to Tier-1 at the end
- Relatively low network demands; steady data flow from Tier-2 to Tier-1 of about 50 Mbit/s (varies for each experiment)

Data Management
- Data is immediately transferred from Tier-0 to Tier-1s for backup
- RAW data is first calibrated and reconstructed to give Event Summary Data (ESD) and Analysis Object Data (AOD) suitable for analysis
- AOD datasets transferred to Tier-2s for analysis - 'bursty' depending on user needs, ~300 Mbit/s (varies for each experiment)
- Tier-1s will provide reliable storage of data
- Tier-2s act more like a dynamic cache
- Tier-1s handle a greater or lesser share of essential services such as file catalogues, FTS etc.
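To put the quoted rates in storage terms, here is a minimal sketch that converts a sustained rate into a daily volume. The function name is hypothetical, decimal units (1 GB = 10^9 bytes) are an assumption, and the AOD figure is treated as if the burst were sustained for a full day, which the slides do not claim.

```python
def daily_volume_gb(rate_mbit_s: float) -> float:
    """Data moved in 24 hours at a sustained rate, in decimal gigabytes."""
    bits_per_day = rate_mbit_s * 1e6 * 86_400  # 86,400 seconds in a day
    return bits_per_day / 8 / 1e9              # bits -> bytes -> GB


# Rates quoted on the slides (both vary by experiment).
mc_upload_gb = daily_volume_gb(50)      # steady Tier-2 -> Tier-1 MC upload
aod_download_gb = daily_volume_gb(300)  # AOD burst, assumed sustained all day

print(f"MC upload:    {mc_upload_gb:.0f} GB/day")
print(f"AOD download: {aod_download_gb:.0f} GB/day")
```

On these assumptions the steady MC upload is roughly half a terabyte per day, while a day-long AOD burst would be several terabytes.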

Computing
- Experiments have developed complex software tools to:
  - handle all this data transfer and keep track of datasets (CMS: PhEDEx; ATLAS: DDM)
  - handle submission of MC production (CMS: ProdManager/ProdAgent)
  - direct jobs to where the datasets are
  - enable a physicist in the office to carry out 'chaotic user analysis' (CMS: CRAB); the term doesn't describe their mode of work, more the lack of central submission of jobs
- These tools place greater or lesser demands on a site

ALICE
- Not highly relevant to UK as only supported by Birmingham at Tier-2 level
- Distinction between Tier-1 and Tier-2 is by Quality of Service
- Require extra VO box installed at a site; unlikely to use non-ALICE Tier-2s opportunistically?
- Developing 'Parallel ROOT Facility' (PROOF) clusters at Tier-2s for faster interactive data analysis

LHCb
- Not going to use Tier-2s for analysis of data - concentrate analysis at Tier-1
- Only going to run Monte Carlo jobs at Tier-2s
- Simplifies data transfer requirements at Tier-2 level
- So, easiest for a Tier-2 to support
- Low networking demands: 40 Mbit/s aggregated over all Tier-2s
- UKI-LT2-Brunel (100 Mbit/s) recently in top 10 providers for LHCb Monte Carlo

ATLAS
- Tier-2s provide 40% of total computing and storage requirements
- Hierarchical structure between Tier-1s and Tier-2s
  - a Tier-1 provides services (FTS, LFC) to a group of Tier-2s
  - no extra services required at Tier-2 level
- Tier-2s will carry out MC simulations; results sent back to Tier-1s for storage and further distribution and processing - steady 30 Mbit/s from site
- AOD (Analysis Object Data) will be distributed to Tier-2s for analysis: 160 Mbit/s to site
- SC4: how long to analyse 150 TB of data, equivalent to 1 year's running of the LHC?
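As a rough feel for the scale behind the SC4 question, the sketch below estimates how long it would take just to deliver 150 TB at the rates quoted above. This is an illustration only: the slides do not say that the whole sample goes to a single site, the function name is hypothetical, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
def transfer_days(volume_tb: float, rate_mbit_s: float) -> float:
    """Days needed to move volume_tb (decimal TB) at a sustained rate."""
    seconds = volume_tb * 1e12 * 8 / (rate_mbit_s * 1e6)  # bits / (bits per second)
    return seconds / 86_400


# 150 TB is the SC4 sample size; 160 Mbit/s is the ATLAS AOD rate to a site;
# 100 Mbit/s is the UKI-LT2-Brunel figure. Sending everything to one site
# is an assumption made purely for illustration.
days_at_aod_rate = transfer_days(150, 160)
days_at_brunel_rate = transfer_days(150, 100)

print(f"{days_at_aod_rate:.0f} days at 160 Mbit/s")
print(f"{days_at_brunel_rate:.0f} days at 100 Mbit/s")
```

Under these assumptions delivery alone would take about 87 days at 160 Mbit/s and about 139 days at 100 Mbit/s, which suggests why the analysis load has to be spread across many sites.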

CMS
- CPU intensive processing mostly carried out at Tier-2s
- Tier-2s run 50% MC and 50% analysis jobs
- MC production jobs handled by central queue called 'ProductionManager'
  - submits and tracks jobs, and registers output in CMS databases
  - jobs handed to ProductionAgents for processing
- MC job output does not go from WN to Tier-1 directly
  - data is stored locally and small files are merged together by new jobs (heavy I/O)
  - large file (~TB) returned to Tier-1

CMS
- Importance of good LAN bandwidth from WNs to SE to do this merging of files
- Use 'CRAB' (CMS Remote Analysis Builder) at a UI to analyse data
  - user specifies dataset
  - CRAB 'discovers' data, prepares job and submits it
- 'Surviving the first years': until the detector is understood, AODs not that useful; will rely heavily on raw data, so large networking demands

CMS: requirements of Tier-2 site
- Division of labour
  - CMS look after global issues
  - Tier-2 look after local issues to keep site running
- What is required:
  - a good batch farm with reliable storage
  - good LAN and WAN networking
  - install PhEDEx, LFC and Squid cache (calibration data)
  - pass Site Functional Tests
- A good Tier-2 is 'active, responsive, attentive, proactive'

Support and operations afternoon
- Discovered that WLCG = EGEE + OSG, i.e. we are now working more closely with the US Open Science Grid
- OSG not too relevant for the average Tier-2 sys-admin in UK

UKI ROC meeting
- Small room, face-to-face meeting - lots of discussion
- Grumbles about GGUS tickets and time taken to close a solved ticket
  - Close it yourself: add 'status=solved' to first line of reply
- Highlighted for me the somewhat one-directional flow of information in the workshop itself
- Would have been good for Tier-2s to have been able to present at the workshop

Middleware tutorials
- Popular - lots of discussion
- Understandable given that Tier-2 system admins are more interested in middleware than experimental computing models
- Good to be able to hear roadmap for LFC, DPM, FTS, SFTs etc.
  from middleware developers and ask questions

Tier-2 interaction
- Didn't appear to be much interaction between Tier-2s
  - Lack of name badges?
- Missed chance to find out how others do things
- Michel Jouvin from GRIF (Paris) gave a summary of his survey on Tier-2s
  - large variation between resources at Tier-2s
  - 1 to 8 sites per Tier-2; 1 to 13 FTE!
  - Difference between distributed vs. federated Tier-2s?
- Post-workshop survey excellent idea

Providing a Service
- We are the users and customers of the middleware
- Tier-2 providing a service for experiments
  ➢ CMS: 'Your customers will be remote users'
- Tier-2s need to generate a customer service mentality
- Need good communication paths to ensure this works well
  - CMS have VRVS integration meetings and email list - sounds promising
  - Not very clear how other experiments will communicate proactively

Summary
- Learnt a lot about how the experiments intend to use Tier-2s
- Pretty clear about what they need from Tier-2 sites
- Could have been more feedback from Tier-2s
- Could have been more interaction between Tier-2s
- Tier-2s are critical to success of LHC: service mentality
- Communication between experiments and Tier-2s unclear
- The LHC juggernaut is changing up a gear!