ATLAS Monte Carlo Production System

Chun L. Tan
University of Birmingham, UK
The ATLAS Monte Carlo production system tool seeks to give the end-user physicist the ability to perform on-demand Monte Carlo production for his/her specific needs. The aim is to shield the physicist from the underlying complexity and to present him/her with a consistent, flexible and intuitive interface. The tool will harness the power and capacity of grid computing, allowing the physicist to tap into a worldwide pool of computing resources never before available to an individual physicist.

The first prototype, AtCom[1] (short for ATLAS Commander), was developed in partnership with CERN in October 2002. An invaluable production manager's tool, AtCom automates tedious and often repetitious tasks such as defining and submitting large numbers of jobs, monitoring job execution status, scanning log files for errors, updating the ATLAS bookkeeping databases, cleaning up and finally resubmitting if a job failed.

Future versions of the tool will be more end-user oriented. They will also be based on the final grid interface, GANGA[2]. Apart from providing access to Grid and legacy computing resources, GANGA, with its Python bus architecture, will potentially allow the Monte Carlo production system tool to interface with other complementary software, e.g. Atlantis[3] (visualisation package), JOE[4] (Athena job options editor) and DIAL[5], as they become available.
1. INTRODUCTION
The production of simulated or Monte Carlo events is a crucial component of every particle physics experiment. In ATLAS, this tedious, error-prone and often highly repetitive task is undertaken by a handful of expert production managers. Typically, the ordinary physicist needs to request Monte Carlo events in advance and is unable to implement ideas and experiment with minor variations in job parameters within an acceptable turnaround time. This unsatisfactory turnaround delay and the overall complexity of Monte Carlo production prevent the physicist from focusing on the physics and testing his/her ideas in an efficient and straightforward manner.
The first prototype of the ATLAS Monte Carlo production system, AtCom (short for ATLAS Commander), was developed in partnership with CERN and has been in active use in the ongoing ATLAS data challenges. Its basic features include job creation, submission, monitoring, validation and the updating of bookkeeping databases. However, its design was geared towards the expert user (i.e. the production manager) and is not suitable for the casual physicist; this was due to the pressing need for an automated tool for use in the ATLAS data challenges in the autumn of 2002.
AtCom assists the production manager by automating tedious and often repetitious tasks like defining and submitting large numbers of jobs, monitoring job execution status, scanning log files for errors, updating the ATLAS bookkeeping databases, cleaning up and finally resubmitting if a job failed. With its modular design, AtCom has three distinct components: the generic basic job management component, the database interface component and the computing systems interface component. The computing systems interface component is implemented with dynamically loaded plug-ins for batch systems and Grids. Today, many flavours of batch system are in use at different ATLAS institutes around the world, and the Grid is poised to simplify the situation in the future. We are, however, currently in a transitional stage where different flavours of Grid and numerous legacy batch systems are used. Increasingly, Grid and non-Grid computing systems are being deployed together at a single site.
AtCom allows jobs to be defined in a generic, i.e. non-system-specific, way. The current implementation is broadly based on the virtual data[6] approach, where a job definition is linked by reference to a transformation definition, which consists of a script/executable, its execution environment in the form of 'used' packages, and a signature enumerating the formal parameters and their types.
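To make the virtual data approach more concrete, the following minimal sketch shows one way a job definition could be linked by reference to a transformation definition carrying a script, its 'used' packages and a typed parameter signature. All class, field and parameter names below are hypothetical illustrations, not AtCom's actual internals.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch only: class and field names are assumptions, not AtCom's real types.
    class TransformationDefinition {
        String script;                  // script/executable to be run
        String[] usedPackages;          // execution environment: the 'used' packages
        Map<String, String> signature;  // formal parameter name -> type

        TransformationDefinition(String script, String[] usedPackages,
                                 Map<String, String> signature) {
            this.script = script;
            this.usedPackages = usedPackages;
            this.signature = signature;
        }
    }

    class JobDefinition {
        TransformationDefinition transformation;  // linked by reference (virtual data approach)
        Map<String, String> actualParameters;     // actual values for the formal parameters

        JobDefinition(TransformationDefinition transformation,
                      Map<String, String> actualParameters) {
            this.transformation = transformation;
            this.actualParameters = actualParameters;
        }
    }

    class VirtualDataSketch {
        public static void main(String[] args) {
            // Hypothetical transformation with a two-parameter signature.
            Map<String, String> signature = new LinkedHashMap<>();
            signature.put("nEvents", "int");
            signature.put("randomSeed", "int");
            TransformationDefinition transform = new TransformationDefinition(
                    "simulate.sh", new String[] {"SomeSimulationPackage"}, signature);

            // A job (partition) definition supplies actual values for the formal parameters.
            Map<String, String> values = new LinkedHashMap<>();
            values.put("nEvents", "500");
            values.put("randomSeed", "12345");
            JobDefinition job = new JobDefinition(transform, values);
            System.out.println("Job runs " + job.transformation.script
                    + " with " + job.actualParameters);
        }
    }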
AtCom is implemented in Java and should,
in principle, run on any Java-enabled
platform. It has been routinely used for
production on both Linux and Windows
platforms.
2. ARCHITECTURE

2.1 AtCom

Figure 1: The AtCom architecture (the AtCom core, the AMIMgt and MagdaMgt bookkeeping modules, and the LSFComputingSystem, PBSComputingSystem, EDGComputingSystem and NGComputingSystem plug-ins).
Figure 1 above shows the top-level architecture of AtCom. At its heart lies the AtCom core application, which includes the GUI and implements the logic of defining, submitting and monitoring jobs.

On the left of the diagram are the AMIMgt and MagdaMgt modules that allow AtCom to interface with the ATLAS bookkeeping databases, AMI[7] (ATLAS Metadata Interface) and Magda[8] (Manager for Grid-based Data) respectively.

On the right of the diagram are several plug-ins that allow AtCom to interface with different computing systems. These plug-ins implement a common interface that defines methods and signatures for common operations: submitting a job, getting job status, killing a job and getting the current output (stdout and stderr) of a job.
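These common operations map naturally onto a small Java interface that each computing-system plug-in (LSF, PBS, EDG, NorduGrid) would implement. The sketch below is an assumption about what such an interface could look like, reusing the hypothetical JobDefinition class sketched in the introduction; it is not AtCom's actual API.

    // Hypothetical plug-in interface; method names and signatures are assumptions, not AtCom's API.
    interface ComputingSystem {
        /** Submit a generically defined job; returns the computing-system-dependent job ID. */
        String submit(JobDefinition job) throws Exception;

        /** Ask the computing system for the current status of a job ("running", "done", ...). */
        String getStatus(String jobId) throws Exception;

        /** Kill a queued or running job. */
        void kill(String jobId) throws Exception;

        /** Fetch the current stdout (index 0) and stderr (index 1) of a job. */
        String[] getCurrentOutput(String jobId) throws Exception;
    }

Dynamic loading of a concrete plug-in such as LSFComputingSystem or EDGComputingSystem then amounts to instantiating the implementing class by name, e.g. (ComputingSystem) Class.forName(pluginClassName).getDeclaredConstructor().newInstance().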
2.2 MC production tool

The MC production tool differs from its prototype in two areas. Job submission and monitoring will no longer be performed natively but will instead utilise GANGA's job submission and monitoring facilities. Bookkeeping functions will also be provided through GANGA. By interfacing with GANGA, various other existing tools can be harnessed, e.g. the Athena/Gaudi job options editor (JOE), PyROOT[9], Atlantis, etc. Figure 2 shows the services GANGA will provide to the MC production tool.

Figure 2: The MC production tool architecture (the ATLAS Monte Carlo production tool sits on top of GANGA, which provides Gaudi/Athena job definition, submission and monitoring, bookkeeping and other services).
3. ATCOM FUNCTIONALITY
Essentially, AtCom supports three classes
of operations: job definition, job submission
and job monitoring. These correspond to the
three main panels of the GUI.
3.1 Job definition

In ATLAS, collections of similar data are termed datasets; partitions are subsections of these datasets.

Figure 3: Job definition panel.

From the job definition panel (figure 3), the user can select the dataset he/she wants to define partitions for, by means of an SQL query composer (not shown here, but see figure 4 for a similar screenshot). The user defines the fields of the dataset to be visible and specifies the selection criteria. From the search results, the user can then select a single dataset and choose a particular version of the associated transformation. AtCom displays a list of relevant parameter values to be entered for the partitions concerned (see figure 6). Parameters can be previewed before the partition is created.

Of the parameters available to the user, the partition's LFN, the signature parameters, the output file mapping and the final destination for stdout and stderr are compulsory, regardless of the type of transformation used.
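As an illustration of the compulsory parameters listed above, a partition definition could be captured as in the sketch below; the class and field names are hypothetical and only mirror the list in the text.

    import java.util.Map;

    // Hypothetical container for the compulsory per-partition parameters; not AtCom's real class.
    class PartitionDefinition {
        String lfn;                           // the partition's logical file name (LFN)
        Map<String, String> signatureValues;  // values for the transformation's formal parameters
        Map<String, String> outputMapping;    // logical output name -> physical destination
        String stdoutDestination;             // final destination for stdout
        String stderrDestination;             // final destination for stderr
    }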
3.2 Job submission

The second AtCom panel allows the user to submit any defined partition to any configured computing system. Once again, the process begins with an SQL composer allowing the user to retrieve a set of partitions (see figure 7). The composer here is more sophisticated than the one provided at the job definition stage, as it allows attributes of both datasets and partitions to be selected and displayed.

Figure 4: SQL composer (job submission)

Given a set of retrieved partitions, the user can select an arbitrary subset and a target computing system for submission. The jobs are submitted and automatically transferred to the next panel for monitoring. What happens when a job is submitted is presented in detail in the full AtCom paper[11].

3.3 Job monitoring

Figure 5: Job monitoring panel.
The monitoring panel (figure 5) displays
the job name, ID, status, computing system,
host and AMI status of each job being
monitored.
Job name is the name given to the job by
the plug-in upon submission. Usually, it is
simply the LFN of the partition. Job ID is the computing-system-dependent token that identifies the job. Status is the status as
reported by the computing system. Common
status values include ‘running’, ‘wait’,
‘done’, ‘error’ and ‘failed’ but vary with the
type of computing system. AMI status is the
status of the partition as stored in the
permanent production log database. The
panel allows the user to check the status of
all monitored jobs on demand or poll
automatically at regular intervals.
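A minimal sketch of how the on-demand check and the periodic polling could be layered on the hypothetical ComputingSystem plug-in interface sketched in section 2.1 (all names and the threading choice are assumptions, not AtCom's implementation):

    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Minimal polling sketch built on the hypothetical ComputingSystem interface above.
    class JobMonitor {
        private final ComputingSystem system;
        private final Map<String, String> statusByJobId; // job ID -> last known status

        JobMonitor(ComputingSystem system, Map<String, String> statusByJobId) {
            this.system = system;
            this.statusByJobId = statusByJobId;
        }

        /** Refresh the status of every monitored job once (the "on demand" check). */
        void refreshAll() {
            for (String jobId : statusByJobId.keySet()) {
                try {
                    statusByJobId.put(jobId, system.getStatus(jobId));
                } catch (Exception e) {
                    statusByJobId.put(jobId, "unknown");
                }
            }
        }

        /** Poll automatically at a regular interval, as the monitoring panel does. */
        ScheduledExecutorService pollEvery(long seconds) {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.scheduleAtFixedRate(this::refreshAll, 0, seconds, TimeUnit.SECONDS);
            return timer;
        }
    }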
When a job completes, the corresponding extract script parses stdout and stderr in search of errors. If all is well, the output files are registered in the replica catalogue (Magda), the log files are copied to their final destination and the partition's AMI status is set to 'validated'.
If the job fails, the output as defined in the
partition’s output mapping is deleted and its
status is set to ‘failed’. However, if the job is
‘undecided’, the user can arbitrate and update
the status manually. The decision dialog
(figure 9) displays the information needed for
the user to perform status arbitration.
Figure 9: Decision dialog.
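The validation and arbitration workflow described above, condensed into a hedged Java sketch; every helper method stands in for the real extract script, Magda registration or AMI update and is purely hypothetical:

    // Hedged sketch of post-job validation; every helper here is hypothetical.
    class JobValidator {

        /** Decide the fate of a completed job, mirroring the workflow described above. */
        String validate(String stdout, String stderr) {
            if (containsErrors(stdout) || containsErrors(stderr)) {
                deleteMappedOutputs();          // remove outputs listed in the partition's output mapping
                return "failed";                // AMI status set to 'failed'
            }
            if (!outputsLookComplete()) {
                return "undecided";             // left for the user to arbitrate via the decision dialog
            }
            registerOutputsInMagda();           // register output files in the replica catalogue
            copyLogFilesToFinalDestination();   // stdout/stderr to their configured destination
            return "validated";                 // AMI status set to 'validated'
        }

        // Placeholder helpers, standing in for the real extract script and catalogue calls.
        boolean containsErrors(String log)    { return log != null && log.contains("ERROR"); }
        boolean outputsLookComplete()         { return true; }
        void deleteMappedOutputs()            { }
        void registerOutputsInMagda()         { }
        void copyLogFilesToFinalDestination() { }
    }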
4. CONCLUSIONS AND OUTLOOK
AtCom is a tool designed for production managers. It is a convenient automation tool that interfaces with the ATLAS bookkeeping databases and various computing systems. However, AtCom, in its current form, is neither suitable nor robust enough to be used by the casual physicist with limited production experience. Moreover, it is not known how well AtCom will scale when monitoring thousands of jobs. AtCom's inability to be fully automated makes it unsuitable in such cases, where GRAT[10]-like systems will be a better choice.
The next few months will be crucial for deciding the future production strategy for ATLAS. The uniform production framework will become reality. ATLAS datasets, partitions and transformations will be stored in a single logical database. The non-automated production mode will gradually be phased out because of the high risk of human error. Highly automated production tools will take care of most, if not all, of the production at all sites. The production model will be extended to take into account productions on the scale of physics groups all the way down to the scale of a single physicist. Complementary tool suites (e.g. GANGA, DIAL) targeting just such audiences will be integrated and deployed. The Monte Carlo production system tool will be designed to take all of this into account and build upon the experience gained with AtCom.
References
[1] http://atlas-project-atcom.web.cern.ch
[2] http://ganga.web.cern.ch/ganga
[3] http://atlantis.web.cern.ch/atlantis/
[4] http://ganga.web.cern.ch/ganga/user/JOE/JOE-UserGuide.html
[5] http://www.usatlas.bnl.gov/~dladams
[6] http://www-unix.griphyn.org/chimera/
[7] http://larbookkeeping.in2p3.fr
[8] http://www.atlasgrid.bnl.gov/magda/info
[9] http://seal.web.cern.ch/seal/snapshot/work-packages/scripting/index.html
[10] http://heppc1.uta.edu/atlas/software/user-tools.html
[11] http://atlas-project-atcom.web.cern.ch/atlas-project-atcom/atcom_chep03.pdf