ATLAS Monte Carlo Production System

Chun L. Tan
University of Birmingham, UK

The ATLAS Monte Carlo production system tool aims to give the end-user physicist the ability to perform on-demand Monte Carlo production for his or her specific needs. The aim is to shield the physicist from the underlying complexity and to present a consistent, flexible and intuitive interface. The tool will harness the power and capacity of grid computing, allowing the physicist to tap into a worldwide pool of computing resources never before available to an individual physicist. The first prototype, AtCom [1] (short for ATLAS Commander), was developed in partnership with CERN in October 2002. An invaluable production manager's tool, AtCom automates tedious and often repetitive tasks: defining and submitting large numbers of jobs, monitoring job execution status, scanning log files for errors, updating the ATLAS bookkeeping databases, cleaning up and finally resubmitting if a job failed. Future versions of the tool will be more end-user oriented and will be based on the final Grid interface, GANGA [2]. Apart from providing access to Grid and legacy computing resources, GANGA, with its Python bus architecture, will potentially allow the Monte Carlo production system tool to interface with other complementary software, e.g. Atlantis [3] (visualisation package), JOE [4] (Athena job options editor) and DIAL [5], as they become available.

1. INTRODUCTION

The production of simulated or Monte Carlo events is a crucial component of every particle physics experiment. In ATLAS, this tedious, error-prone and often highly repetitive task is undertaken by a handful of expert production managers. Typically, the average physicist needs to request Monte Carlo events in advance and is unable to implement ideas or experiment with minor variations in job parameters within an acceptable turnaround time. This turnaround delay and the overall complexity of Monte Carlo production prevent the physicist from focusing on the physics and testing ideas in an efficient and straightforward manner.

The first prototype of the ATLAS Monte Carlo production system, AtCom (short for ATLAS Commander), was developed in partnership with CERN and has been in active use in the ongoing ATLAS data challenges. Its basic features include job creation, submission, monitoring, validation and the updating of bookkeeping databases. However, its design was geared towards the expert user (i.e. the production manager) and is not suitable for the casual physicist; this was due to the pressing need for an automated tool for use in the ATLAS data challenges in the autumn of 2002. AtCom assists the production manager by automating tedious and often repetitive tasks: defining and submitting large numbers of jobs, monitoring job execution status, scanning log files for errors, updating the ATLAS bookkeeping databases, cleaning up and finally resubmitting if a job failed.

AtCom has a modular design with three distinct components: a generic job management component, a database interface component and a computing systems interface component. The computing systems interface component is implemented with dynamically loaded plug-ins for batch systems and Grids. Today, many flavours of batch systems are in use at different ATLAS institutes around the world, and the Grid is poised to simplify the situation in the future. We are, however, currently in a transitional stage where different flavours of Grids and numerous legacy batch systems are used, and increasingly Grid and non-Grid computing systems are deployed together at a single site. AtCom therefore allows jobs to be defined in a generic, system-independent way. The current implementation is broadly based on the virtual data [6] approach, where a job definition is linked by reference to a transformation definition, which consists of a script/executable, its execution environment in the form of 'used' packages, and a signature enumerating the formal parameters and their types. AtCom is implemented in Java and should, in principle, run on any Java-enabled platform. It has been routinely used for production on both Linux and Windows platforms.
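To make the virtual data approach concrete, the sketch below illustrates, in Java, how a transformation definition and a job definition might be represented. The class and field names are illustrative assumptions made for this paper, not AtCom's actual data model.

// Illustrative sketch only: a job definition refers to a transformation by
// name and version and supplies concrete values for its formal parameters.
import java.util.List;
import java.util.Map;

/** A transformation: a script/executable plus its environment and formal parameters. */
class TransformationDefinition {
    String name;                    // e.g. "dc1.simulation" (hypothetical name)
    String version;                 // a particular version of the transformation
    String executable;              // the script/executable to run
    List<String> usedPackages;      // execution environment in the form of 'used' packages
    Map<String, String> signature;  // formal parameter name -> type
}

/** A job (partition) definition: a reference to a transformation plus actual values. */
class JobDefinition {
    String transformationName;      // linked by reference to a transformation ...
    String transformationVersion;   // ... of a particular version
    Map<String, String> parameters; // actual values for the formal parameters
}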
2. ARCHITECTURE

2.1 AtCom

Figure 1: The AtCom architecture, showing the AtCom core connected to the bookkeeping databases through the AMIMgt and MagdaMgt modules and to the computing systems through plug-ins (LSFComputingSystem, PBSComputingSystem, EDGComputingSystem and NGComputingSystem).

Figure 1 shows the top-level architecture of AtCom. At the heart lies the AtCom core application, which includes the GUI and implements the logic of defining, submitting and monitoring jobs. On the left of the diagram are the AMIMgt and MagdaMgt modules that allow AtCom to interface with the ATLAS bookkeeping databases, AMI [7] (ATLAS Meta-data Interface) and Magda [8] (Manager for Grid-based Data) respectively. On the right of the diagram are the plug-ins that allow AtCom to interface with different computing systems. These plug-ins implement a common interface that defines methods and signatures for the common operations: submitting a job, getting the status of a job, killing a job and getting the current output (stdout and stderr) of a job.
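As a rough illustration, the following sketch shows what such a plug-in contract could look like in Java. The names and signatures are assumptions (JobDefinition is the illustrative class sketched in section 1); the real AtCom interface may differ.

/** Common operations that every computing-system plug-in provides (sketch). */
interface ComputingSystem {

    /** Submit a defined job; returns the computing-system-dependent job ID.
     *  (JobDefinition is the illustrative class sketched in section 1.) */
    String submit(JobDefinition job) throws ComputingSystemException;

    /** Current status as reported by the underlying system,
     *  e.g. "running", "wait", "done", "error" or "failed". */
    String getStatus(String jobId) throws ComputingSystemException;

    /** Kill a queued or running job. */
    void kill(String jobId) throws ComputingSystemException;

    /** Retrieve the job's current output (stdout and stderr). */
    JobOutput getCurrentOutput(String jobId) throws ComputingSystemException;
}

/** Simple holder for the two output streams of a job. */
class JobOutput {
    String stdout;
    String stderr;
}

/** Wrapper for errors raised by a concrete plug-in. */
class ComputingSystemException extends Exception {
    ComputingSystemException(String message) { super(message); }
}

Concrete plug-ins such as LSFComputingSystem, PBSComputingSystem, EDGComputingSystem and NGComputingSystem (see figure 1) would each implement this interface and be loaded dynamically by the AtCom core.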
2.2 MC production tool

The MC production tool differs from its prototype in two areas. Job submission and monitoring will no longer be performed natively but will instead use GANGA's job submission and monitoring facilities. Bookkeeping functions will also be provided through GANGA. By interfacing with GANGA, various other existing tools can be harnessed, e.g. the Athena/Gaudi job options editor (JOE), PyROOT [9], Atlantis, etc. Figure 2 shows the services GANGA will provide to the MC production tool.

Figure 2: The MC production tool architecture, showing the ATLAS Monte Carlo production tool on top of the services provided by GANGA (job definition, submission and monitoring, bookkeeping, Gaudi/Athena, etc.) and complementary tools such as JOE, PyROOT and DIAL.

3. ATCOM FUNCTIONALITY

Essentially, AtCom supports three classes of operations: job definition, job submission and job monitoring. These correspond to the three main panels of the GUI.

3.1 Job definition

In ATLAS, collections of similar data are termed datasets; partitions are subsections of these datasets.

Figure 3: Job definition panel.

From the job definition panel (figure 3), the user selects the dataset for which partitions are to be defined, by means of an SQL query composer (not shown here, but see figure 4 for a similar screenshot). The user defines which fields of the dataset are to be visible and specifies the selection criteria. From the search results, the user then selects a single dataset and chooses a particular version of the associated transformation. AtCom then displays a list of the relevant parameter values to be entered for the partitions concerned (see figure 6), and the parameters can be previewed before the partitions are created. Of the parameters available to the user, the partition's LFN, the signature parameters, the output file mapping and the final destination for stdout and stderr are compulsory, regardless of the type of transformation used.

3.2 Job submission

The second AtCom panel allows the user to submit any defined partition to any configured computing system. Once again, the process begins with an SQL composer allowing the user to retrieve a set of partitions (see figure 7). The composer here is more sophisticated than the one provided at the job definition stage, as it allows attributes of both datasets and partitions to be selected and displayed. Given a set of retrieved partitions, the user can select an arbitrary subset and choose a target computing system for submission. The jobs are submitted and automatically transferred to the next panel for monitoring. A detailed account of what happens when a job is submitted is given in [11].

Figure 4: SQL composer (job submission).

3.3 Job monitoring

Figure 5: Job monitoring panel.

The monitoring panel (figure 5) displays the job name, ID, status, computing system, host and AMI status of each job being monitored. The job name is the name given to the job by the plug-in upon submission; usually it is simply the LFN of the partition. The job ID is the computing-system-dependent token that identifies the job. The status is the status as reported by the computing system; common status values include 'running', 'wait', 'done', 'error' and 'failed', but they vary with the type of computing system. The AMI status is the status of the partition as stored in the permanent production log database.

The panel allows the user to check the status of all monitored jobs on demand or to poll automatically at regular intervals. When a job completes, the corresponding extract script parses stdout and stderr in search of errors. If all is well, the output files are registered in the replica catalog (Magda), the log files are copied to their final destination and the partition's AMI status is set to 'validated'. If the job failed, the output as defined in the partition's output mapping is deleted and its status is set to 'failed'. If the outcome is 'undecided', the user can arbitrate and update the status manually; the decision dialog (figure 9) displays the information needed for the user to perform this status arbitration.

Figure 9: Decision dialog.
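The validation flow just described can be summarised in the following Java sketch. The class and method names (JobValidator, MagdaCatalog, AmiBookkeeping) are hypothetical stand-ins invented for illustration; they are not AtCom's actual APIs.

import java.util.List;

/** Possible outcomes of validating a completed job. */
enum ValidationResult { VALIDATED, FAILED, UNDECIDED }

/** Sketch of the post-completion bookkeeping actions described in section 3.3. */
class JobValidator {

    /** Decide the outcome from the verdict of the extract script. */
    ValidationResult decide(boolean errorsFound, boolean verdictAmbiguous) {
        if (verdictAmbiguous) {
            return ValidationResult.UNDECIDED;  // left for manual arbitration in the decision dialog
        }
        return errorsFound ? ValidationResult.FAILED : ValidationResult.VALIDATED;
    }

    /** Apply the corresponding bookkeeping actions. */
    void apply(ValidationResult result, List<String> outputFiles,
               MagdaCatalog magda, AmiBookkeeping ami, String partitionLfn) {
        switch (result) {
            case VALIDATED:
                for (String file : outputFiles) {
                    magda.register(file);       // register output files in the replica catalog
                }
                // (copying of log files to their final destination omitted for brevity)
                ami.setStatus(partitionLfn, "validated");
                break;
            case FAILED:
                for (String file : outputFiles) {
                    magda.delete(file);         // delete output as defined in the output mapping
                }
                ami.setStatus(partitionLfn, "failed");
                break;
            case UNDECIDED:
                // the user arbitrates and updates the status manually
                break;
        }
    }
}

/** Hypothetical stand-ins for the Magda replica catalog and the AMI production log. */
interface MagdaCatalog { void register(String lfn); void delete(String lfn); }
interface AmiBookkeeping { void setStatus(String partitionLfn, String status); }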
4. CONCLUSIONS AND OUTLOOK

AtCom is a tool designed for production managers. It is a convenient automation tool that interfaces with the ATLAS bookkeeping databases and various computing systems. However, AtCom, in its current form, is neither suitable nor robust enough to be used by the casual physicist with limited production experience. Moreover, it is not known how well AtCom will scale when monitoring thousands of jobs; its inability to be fully automated makes it unsuitable in such cases, where GRAT [10]-like systems will be a better choice.

The next few months will be crucial for deciding the future of the production strategy for ATLAS. A uniform production framework will become reality: ATLAS datasets, partitions and transformations will be stored in a single logical database. The non-automated production mode will gradually be phased out because of the high risk of human error, and highly automated production tools will take care of most, if not all, of the production at all sites. The production model will be extended to take into account productions on the scale of physics groups all the way down to the scale of a single physicist, and complementary tool suites (e.g. GANGA, DIAL) targeting just such audiences will be integrated and deployed. The Monte Carlo production system tool will be designed to take all of this into account and to build upon the experience gained with AtCom.

References

[1] http://atlas-project-atcom.web.cern.ch
[2] http://ganga.web.cern.ch/ganga
[3] http://atlantis.web.cern.ch/atlantis/
[4] http://ganga.web.cern.ch/ganga/user/JOE/JOE-UserGuide.html
[5] http://www.usatlas.bnl.gov/~dladams
[6] http://www-unix.griphyn.org/chimera/
[7] http://larbookkeeping.in2p3.fr
[8] http://www.atlasgrid.bnl.gov/magda/info
[9] http://seal.web.cern.ch/seal/snapshot/work-packages/scripting/index.html
[10] http://heppc1.uta.edu/atlas/software/user-tools.html
[11] http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/atcom_chep03.pdf