Monte Carlo Production
An efficient and performant Monte Carlo (MC) production system is crucial for delivering the large data samples of fully simulated and reconstructed events required for detector performance studies and physics analysis in CMS. The Worldwide LHC Computing Grid (WLCG), which is composed mainly of the LHC Computing Grid (LCG) and the Open Science Grid (OSG) in the US, makes a large amount of distributed computing, storage, and network resources available for data processing. While the LCG and OSG provide the basic services for distributed computing, the reliability and stability of both the grid services and the sites connected to them remain key issues, given the complexity of the services and the heterogeneity of resources in both grid flavors. A robust, scalable, automated, and easy-to-maintain MC production system that can make efficient use of the resources of both grids is therefore mandatory.
Current Production System - ProdAgent
The former CMS MC production system, developed prior to 2006, had been used for several years and had shown a number of limitations, such as the lack of proper automation, monitoring, and error handling, all of which are necessary for processing in a distributed environment with inherent instability and unreliability. The large central database in the old system was a single point of failure. As a consequence, the production system was not efficient and robust enough, did not scale beyond running about a thousand jobs in parallel, and was very manpower-intensive. There was therefore the need for a new production system to automatically handle job preparation, submission, tracking, resubmission, and data registration into the various data management databases (the data bookkeeping service, DBS, and the data location and transfer system, PhEDEx). In addition, the new CMS event data model (EDM) and software framework (CMSSW) made it possible to simplify the rather complex processing and publication workflow of the old system. In particular, the lack of support for multi-step processing in a single job and the need to generate metadata from the event data in order to make the data available for further processing or analysis were two sources of large overhead. DISUN contributed to the redesign of the new MC production system at both the design and implementation level, and the new production software that evolved as a result was called “ProdAgent”.
The design of the new system (ProdAgent) aimed at automation, ease of maintenance, scalability, the avoidance of single points of failure, and support for multiple grid systems. In addition, ProdAgent integrated the production system with the new CMS event data model, data management system, and data processing framework. The complete new production framework was a combination of several components: a) the production request management system (ProdRequest), b) the actual production software (ProdAgent), and c) the ProdManager, which was responsible for distributing the production workflows among all the ProdAgent instances. The design was based on a pull approach, i.e. the participating ProdAgents pull requests from the ProdManager system for processing when resources become available.
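As a rough illustration of this pull model, the sketch below shows the kind of polling loop an agent could run. It is a hypothetical example: the helper functions (free_slots, fetch_allocation, create_and_submit) and the polling interval are assumptions, not the actual ProdAgent/ProdManager interface.

    import time

    POLL_INTERVAL = 300  # seconds between polls of the ProdManager (illustrative value)

    def free_slots():
        """Return the number of currently idle batch slots (site-specific query; stub here)."""
        return 0

    def fetch_allocation(n_slots):
        """Ask the ProdManager for up to n_slots worth of work (hypothetical interface)."""
        return []  # list of workflow allocations

    def create_and_submit(allocation):
        """Turn an allocation into processing jobs and hand them to the batch system."""
        pass

    # The agent only asks for work when it has free resources to run it, so the
    # central ProdManager never has to push jobs onto a busy or unreachable agent.
    while True:
        slots = free_slots()
        if slots > 0:
            for allocation in fetch_allocation(slots):
                create_and_submit(allocation)
        time.sleep(POLL_INTERVAL)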
ProdAgent was built as a set of loosely coupled components that cooperate to carry out production workflows. The components are Python daemons that communicate through a MySQL database using an asynchronous publish/subscribe model, and their states are persistently recorded in the database. The work is split into these atomic components, each of which encapsulates a specific functionality.
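As an illustration of this publish/subscribe pattern, the sketch below shows how a component-style daemon could exchange messages through a shared database table. It is a minimal, hypothetical example: the ms_message table, the helper functions, and the connection parameters are assumptions, not the actual ProdAgent schema or API.

    import MySQLdb

    def publish(db, event, payload, destination):
        """Insert a message row; the destination component picks it up asynchronously."""
        cur = db.cursor()
        cur.execute(
            "INSERT INTO ms_message (dest, event, payload) VALUES (%s, %s, %s)",
            (destination, event, payload),
        )
        db.commit()

    def consume(db, component):
        """Fetch and remove all pending messages addressed to this component."""
        cur = db.cursor()
        cur.execute("SELECT id, event, payload FROM ms_message WHERE dest = %s", (component,))
        rows = cur.fetchall()
        for row_id, _, _ in rows:
            cur.execute("DELETE FROM ms_message WHERE id = %s", (row_id,))
        db.commit()
        return [(event, payload) for _, event, payload in rows]

    # Example: a JobCreator-like daemon reacting to new work and notifying a submitter.
    db = MySQLdb.connect(host="localhost", user="prodagent", passwd="secret", db="prodagent")
    for event, payload in consume(db, "JobCreator"):
        if event == "NewWorkflow":
            publish(db, "SubmitJob", payload, "JobSubmitter")

Because the messages and component states live in the database, a component can be restarted without losing its pending work, which is what makes the loose coupling between the daemons practical.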
DISUN contributed significantly to the design, implementation, testing, and debugging of several of these components, including request management (queue, feedback, and retrieval), job creation (processing jobs and merge jobs), job submission, job tracking, and data management, including data movement with PhEDEx and the data catalogs. Based on the initial and subsequent operational experience over the past 4 years, the following contributions were made periodically by DISUN to improve the performance, scalability, and robustness of ProdAgent, the Condor batch system, job routing to the production sites, and monitoring:

• Addition of error handler, trigger module, and job cleanup components to ProdAgent for the respective tasks.
• Automatic monitoring of the health of all components and management of their states.
• Implementation of a better optimized and more efficient database algorithm for the MergeSensor component, with significantly faster database operations (5-7x speedup).
• Parallelization of the job creation/submission task (i.e. bulk mode operation), which enhanced the job creation/submission speed by 10x over the serial creation/submission mode.
• Implementation of an algorithm to manage multi-tier workflow processing (i.e. simulation, reconstruction, skimming, etc.) in a single step, improving production efficiency and reducing the overall latency in data delivery (a sketch of this chained processing is given after this list).
• Implementation of an optimized algorithm that improved the performance of the ResourceMonitor component for the Condor batch system.
• Significant enhancement of the scalability of the Condor batch software, i.e. the ability of Condor to handle several tens of thousands of jobs at any given time without being too CPU-hungry in the process.
• High-performance dynamic routing of production jobs to OSG sites using the “JobRouter” (a.k.a. the Condor schedd-on-the-side), based on demand, workflow-to-site mapping, site availability, and resource capabilities, with “blackhole node” throttling capability.
• Extension of MC production to CMS T3 sites and opportunistic (non-CMS) sites in the OSG, with extensive testing and debugging.
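As a rough illustration of the multi-tier (chained) processing mentioned above, the sketch below runs the simulation, reconstruction, and skimming steps back-to-back inside a single job, with each step reading the previous step's output from local disk. The step names, configuration files, and the inputFiles= command-line option (which assumes a configuration written to parse it) are placeholders, not the actual ProdAgent job specification.

    import subprocess
    import sys

    # Ordered chain of processing tiers; each entry is (step name, cmsRun config, output file).
    # The configuration file names are placeholders.
    STEPS = [
        ("GEN-SIM", "gen_sim_cfg.py", "gen_sim.root"),
        ("RECO",    "reco_cfg.py",    "reco.root"),
        ("SKIM",    "skim_cfg.py",    "skim.root"),
    ]

    input_file = None
    for name, config, output_file in STEPS:
        cmd = ["cmsRun", config]
        if input_file is not None:
            # Feed the previous step's output to this step; in a chained job the
            # intermediate file stays on the worker node, so it never has to be
            # staged out and re-read between tiers.
            cmd.append("inputFiles=file:%s" % input_file)
        print("Running step %s" % name)
        if subprocess.call(cmd) != 0:
            sys.exit("Step %s failed" % name)
        input_file = output_file

Avoiding the stage-out and re-read of the large intermediate datasets between tiers is what improves the production efficiency and reduces the overall latency in data delivery referred to above.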
Production Performance on the OSG:
With constant improvements to the ProdAgent software, the enhanced scalability of the Condor batch system (including improved job routing), and the improved reliability of the grid middleware and operational stability at the sites over the last 4 years, the performance of MC production on the OSG has shown consistent and significant improvement during this period. The volume of MC production has also steadily increased from year to year. This was possible due to a combination of a) the availability of increased computing resources at the sites over the years, b) a clear and time-tested procedure for the quick, on-demand installation and configuration of new production servers to expand production activities to additional computing resources when available, and c) the increased operational expertise of the operators and their ability to quickly detect and resolve production issues.
The number of simultaneously running production jobs in the OSG has gone up to 15000 (including the resources at the FNAL T1 and the 7 CMS T2s) in 2011, compared to the 2000 job slots that we started with at the beginning of 2006. With a well-designed configuration of a production cluster at UW (comprising several ProdAgent and MySQL database servers), coupled with a robust dynamic Condor job routing mechanism, each single production server has demonstrated the ability to handle 15K parallel running jobs without any significant performance issues. The increased production efficiency and quick error-recovery cycle have enabled the MC production team in the US/OSG to consistently produce more than 50% of the total yearly MC statistics, with the rest coming from the LCG production sites/teams. Around 25% of all production workflows (out of ~10000 total workflows handled by all production teams) have been produced in the OSG over the last 3.5 years. This is because most of the large workflows were handled by the OSG team, owing to the large number of batch slots available at the US T2 (and FNAL) sites, while most of the smaller workflows were handled by the LCG teams.
The reliability of MC production at the CMS T2s and several T3s in the OSG has steadily increased over the years. The performance of the grid middleware has also greatly improved and stabilized, thanks to significant improvements in the operations strategy, monitoring, and the timely detection of issues by the site admins. The biggest share of the time spent running production to date still goes into the various operational steps in ProdAgent that need manual intervention by the operators, and into addressing the various production issues promptly. Effective and prompt communication with the site admins when dealing with site-related job failures has resulted in improved production quality and consistently reduced wastage of resources over the years.
CMS Data Operations (DataOps)
CMS has been routinely using most of the T2s (and the 7 T1 sites when available) for
MC production over the past years. These resources are divided geographically into
several groups of sites (regions attached to a T1 center), each of which is handled by a
given production team. Up to 6 production teams (1 team for OSG sites and 5 teams for
the LCG sites) have been routinely contributing to the global MC production in CMS.
When necessary, a given production team is empowered to quickly install and configure multiple ProdAgent instances, allowing parallel and faster submission of a large number of jobs on demand.
The standalone MC production group became a subgroup of the newly formed Data Operations group (DataOps) that took shape in early 2007. The MC production task within the DataOps group has since then been managed by two L3-level managers. Due to the excellent contribution, productivity, performance, and demonstrated capability of the OSG production team based at UW Madison, Ajit Mohapatra (UW) was appointed as one of the L3 managers/coordinators in 2008 and has been serving in that position for the last 3.5 years. His responsibilities include the supervision of the 6 production teams (along with the 2nd coordinator), the coordination of the global CMS MC production, timely delivery, and all other relevant DataOps activities.
Under the leadership of Ajit Mohapatra, every effort has been made to maximize the utilization of all available computing resources at the sites in a consistent manner over time. As a result, the usage of total batch slots in the OSG and LCG for MC production has gone up to 30K over the years (including the T1/T2 and some T3 sites), as shown in Fig. 1. The number of events produced in the US/OSG has increased systematically over the past 6 years, as shown in Fig. 2. The significant jump in statistics in 2008 (with respect to 2007) is primarily due to the implementation of the bulk job submission mechanism and the chained processing of multi-tier workflows in ProdAgent, along with other operational improvements. This has enabled the DataOps group to produce and deliver the MC samples to the CMS user community in a timely manner for the past 3.5 years, and it continues to do so today, while CMS has been actively taking collision data for the past 1.5 years.
Future MC Production System – WMAgent
The ProdAgent system has performed very well for the past 3.5 years within its design goals and limitations. It is now being replaced by the new CMS workload management system, called WMAgent, which is currently used for MC and data processing at the T1 sites and is in the final phase of deployment and evaluation for MC production. WMAgent provides more features than ProdAgent was capable of, is built on years of operational experience with ProdAgent, and is expected to significantly address the bottlenecks inherent to the ProdAgent system and to improve operational productivity.
Figure 1: Snapshot of simultaneously running MC production jobs at the OSG and LCG sites (mostly T1s and T2s, and a few T3s) during May 2011.
Figure 2: MC Production Statistics in the OSG for the past 6 years.