QCDGrid: A grid resource for Quantum Chromodynamics

James Perry, Andrew Jackson, Lorna Smith, Stephen Booth
EPCC, The University of Edinburgh, James Clerk Maxwell Building, King's Buildings, Edinburgh, EH9 3JZ, UK

August 15, 2003

Abstract

Quantum Chromodynamics (QCD) is an application area that requires access to large supercomputing resources and generates huge amounts of raw data. UKQCD currently stores and requires access to around five terabytes of data, a figure that is expected to grow dramatically as the collaboration's purpose-built HPC system, QCDOC, comes on line in 2004. This data is stored on QCDGrid, a data grid currently composed of six storage elements at four separate UK sites: Edinburgh, Liverpool, Swansea and RAL.

1 Introduction

Fundamental physics research has always relied upon the latest in state-of-the-art computer hardware, even to the point of demanding purpose-built supercomputing resources. Now, modern lattice quantum chromodynamics (QCD) research demands not only the best hardware available, but the best Grid software as well. The terabytes of raw physical data created in this field, and the complex metadata used to describe it, together form a significant challenge to current Grid design and implementation.

Quantum Chromodynamics (QCD) is an application area that requires access to large supercomputing resources and generates huge amounts of raw data. UKQCD, a group of geographically dispersed theoretical QCD scientists in the UK, currently stores and requires access to around five terabytes of data, a figure that is expected to grow dramatically as the collaboration's purpose-built HPC system, QCDOC, comes on line in 2004.

The aim of the QCDGrid project is to satisfy this demand, providing a multi-terabyte storage system over at least four UK sites, based on commodity hardware and open-source software. QCDGrid is part of the GridPP project, a collaboration of particle physicists and computing scientists from the UK and CERN who are building a Grid for particle physics.

2 The Data Grid

UKQCD's data is stored on QCDGrid, a data grid currently composed of six storage elements at four separate UK sites: Edinburgh, Liverpool, Swansea and RAL. The aim of the data grid is to distribute the data across the sites:

• Robustly: each file must be replicated at at least two sites;
• Efficiently: where possible, files should be stored close to where they are needed most often;
• Transparently: end users should not need to be concerned with how the data grid is implemented.

2.1 Hardware and Software

The hardware consists of a set of RedHat Linux PCs with large RAID arrays of hard discs. This provides a relatively cheap option with built-in redundancy.

The QCDGrid software builds on the Globus toolkit. This toolkit is used for basic grid operations such as data transfer, security and remote job execution. It also uses the Globus replica catalogue to maintain a directory of the whole grid, listing where each file is currently stored. Custom-written QCDGrid software is built on Globus to implement the various QCDGrid client tools and the control thread (see later). The European Data Grid (EDG) software is used for virtual organisation management and security. Figure 1 shows the basic structure of the data grid and how the different software packages interact.

Figure 1: Schematic representation of QCDGrid, showing how the different software packages interact.

2.2 Data Replication

The data is of great value, not only in terms of its intrinsic scientific worth, but also in terms of the cost of the CPU cycles required to create or replace it. Therefore, data security and recovery are of the utmost importance. To this end, each site stores its data using RAID technology, ensuring that all the data can be recovered if any one of the hard discs at that site fails. Furthermore, the data is replicated across the sites that form the QCDGrid, so that even if an entire site is lost, all the data can still be recovered.

The system has a central control thread, running on one of the storage elements, which constantly scans the grid, making sure that all the files are stored in at least two suitable locations. Hence, when a new file is added to any storage node, it is rapidly replicated across the grid onto two or more geographically separate sites.
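As a rough illustration of this replication policy, one pass of such a scan might look like the minimal Python sketch below. This is an illustration only, not the actual QCDGrid code: the catalogue and grid objects and their methods are hypothetical stand-ins for the real internals.

MIN_REPLICAS = 2

def scan_grid(catalogue, grid):
    """One pass over the replica catalogue, adding replicas where needed.

    Sketch only: 'catalogue' and 'grid' are assumed to expose the methods
    used here; the real QCDGrid control thread is implemented differently.
    """
    for logical_file in catalogue.list_files():
        sites = catalogue.replica_sites(logical_file)   # sites holding a copy
        if not sites:
            continue                                    # nothing to copy from
        while len(sites) < MIN_REPLICAS:
            # Prefer a suitable site that is geographically separate from
            # the sites already holding the file.
            target = grid.choose_site(exclude=sites)
            grid.replicate(logical_file, source=sites[0], destination=target)
            catalogue.register_replica(logical_file, target)
            sites.append(target)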
2.3 Fault Tolerance

The control thread also scans the grid to ensure that all the storage elements are working. When a storage element is lost from the system unexpectedly, the data grid software e-mails the system administrator and automatically begins to replicate the files that were held there onto the other storage nodes. Nodes can be temporarily disabled if they have to be shut down or rebooted, to prevent the grid moving data around unnecessarily.

A secondary node constantly monitors the central node, backing up the replica catalogue and configuration files. The grid can still be accessed (albeit read-only) if the central node goes down.

2.4 File Access

The software has been designed to allow users to access files easily and efficiently. For example, it generally takes longer to transfer a file from Swansea to Edinburgh than it would to transfer it from another machine at Edinburgh. Therefore, when a user requests a file, the software will automatically return a copy of the replica of that file which is nearest to the user. Additionally, a user can register an interest in having a particular file stored on a particular storage element, such as the one located physically closest to them. The grid software will then take this request into account when deciding where to store the file.

2.5 Use Cases: Adding and Retrieving a File

A file may be added to the data grid using the put command. When a user issues the put command, the software chooses a suitable storage element and copies the file to its 'new' directory (see Figure 2). On its next scan, the control thread finds the new file and moves it to its actual home, registering it with the replica catalogue. Finally, on a subsequent scan, the control thread finds that there is only one copy of the file and makes another one at a suitable site, also registering it with the replica catalogue.

Figure 2: Schematic representation of adding a file to the data grid.

When a user issues the get command on a client machine, the software looks up the replica catalogue to find the nearest copy of the file (see Figure 3). The file is then transferred from that copy. If the file transfer fails, the software looks up the replica catalogue again to find the next nearest copy, and tries to transfer that instead.

Figure 3: Schematic representation of retrieving a file from the data grid.
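To make these two use cases concrete, the following minimal Python sketch shows what the client side of put and get might look like. It is an illustration under stated assumptions, not the actual QCDGrid tools: the grid and catalogue objects, their methods and the gsiftp paths are hypothetical; the only real external command used is the Globus Toolkit's globus-url-copy GridFTP client.

import subprocess

def gridftp_copy(src_url, dest_url):
    # Globus Toolkit ships a command-line GridFTP client.
    subprocess.run(["globus-url-copy", src_url, dest_url], check=True)

def qcdgrid_put(grid, local_path, logical_name):
    # Choose a suitable storage element and drop the file into its 'new'
    # directory; the control thread moves, registers and replicates it later.
    se_host = grid.choose_storage_element(logical_name)      # hypothetical
    gridftp_copy(f"file://{local_path}",                      # local_path absolute
                 f"gsiftp://{se_host}/new/{logical_name}")

def qcdgrid_get(catalogue, logical_name, local_path):
    # Try replicas in order of increasing distance from this client;
    # fall back to the next nearest copy if a transfer fails.
    for se_host in catalogue.nearest_replicas(logical_name):  # hypothetical
        try:
            gridftp_copy(f"gsiftp://{se_host}/{logical_name}",
                         f"file://{local_path}")
            return
        except subprocess.CalledProcessError:
            continue
    raise IOError(f"no reachable replica of {logical_name}")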
3 The MetaData Catalogue

In addition to storing the raw physical data, the project aims to provide an efficient and simple mechanism for accessing and retrieving this data. This is achieved by generating metadata: structured data which describes the characteristics of the raw data. The metadata is in the form of XML documents and is stored in an XML Database Server (XDS). The XML database used is eXist, an open-source database that can be searched using the XPath query language. The XML files are also submitted to the data grid, to ensure that there is a backup copy of the metadata. Hence the metadata catalogue can be reconstructed from the data grid if necessary.

UKQCD's metadata contains information about how each configuration was generated and from which physical parameters. The collaboration has developed an XML schema which defines the structure and content of this metadata in an extensible and scientifically meaningful manner. The schema can be applied to various different data types, and is likely to form the basis of the standard schema for describing QCD metadata being developed by the International Lattice DataGrid (ILDG) collaboration. ILDG is a collaboration of scientists involved in lattice QCD from all over the world (UK, Japan, USA, France, Germany, Australia and other countries), who are working on standards to allow national data grids to interoperate, for easier data sharing.

Data submitted to the grid must be accompanied by a valid metadata file. This can be enforced by checking it against the schema. A submission tool (graphical or command line) takes care of sending the data and metadata to the right places (see Figure 4).

Figure 4: Schematic representation of data being added to the data grid and metadata catalogue.
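As a small illustration of the kind of XPath query the catalogue supports, the Python sketch below runs a query over a locally cached metadata document, such as one of the backup XML files held on the data grid. The element names (ensemble, beta, name) are invented for the example and are not UKQCD's actual schema; in the deployed system the query would instead be sent to the eXist server.

import xml.etree.ElementTree as ET

def find_ensembles(metadata_file, beta_value):
    """Return the names of ensembles generated with the given beta value."""
    tree = ET.parse(metadata_file)
    # ElementTree supports a limited XPath subset, including child predicates.
    matches = tree.getroot().findall(f".//ensemble[beta='{beta_value}']")
    return [e.findtext("name") for e in matches]

# e.g. print(find_ensembles("metadata.xml", "5.20"))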
4 MetaData and Data Grid Browser

The system also consists of a set of graphical and command-line tools by which researchers may store, query and retrieve the data held on the grid. The browser was originally developed by OGSA-DAI and has been extended to suit QCDGrid requirements. It is written in Java and provides a user-friendly interface to the XML database. The browser is also integrated with the lower-level data grid software through the Java Native Interface, so data can be fetched from the grid easily through the GUI. A simple interface for data/metadata submission and grid administration is currently under development. Figure 5 shows a schematic of the relationship between the browser and the data grid and metadata catalogue.

Figure 5: The QCDGrid browser.

5 Job Submission

Current work on the project is focussed on job submission: allowing data generation and analysis jobs to be submitted to grid machines. The aim is to allow QCD scientists to submit jobs to a range of computational resources across the country, with data being added to and retrieved from the data grid in a seamless manner.

As with the data grid software, the Globus toolkit is being used for low-level access to grid resources. The European Data Grid software provides virtual organisation management and security. QCDGrid job submission software is being built on these components, providing the interface and features for QCD users.

The aim is to provide a job submission system which:

• is integrated with the existing data grid;
• can run across a diverse range of machines, from normal (Linux) PCs to supercomputers such as QCDOC;
• can provide real-time job status monitoring.

Resource brokering is not essential, as QCD users usually know in advance on which machine a job should run. However, to make the software as generic and usable as possible, resource brokering is desirable. A user-friendly GUI or web portal is also desirable, if time permits.

Currently, jobs can be submitted to grid resources using a command-line tool (on a test grid system). Input files can be fetched automatically from the data grid, and job output and input can be streamed to and from the user's console, allowing jobs to be monitored and even interactive jobs to be run on grid resources (which may be useful for debugging). Finally, all output files generated by the job are automatically returned to the user's local machine.

6 Conclusions

QCDGrid is a data grid currently composed of six storage elements at four separate UK sites: Edinburgh, Liverpool, Swansea and RAL. It distributes data across the sites in a robust, efficient and transparent manner. Current work is focussed on developing a job submission system that allows QCD scientists to submit jobs to a range of computational resources across the country, with data being added to and retrieved from the data grid in a seamless manner.

7 Further Information

For further details on QCDGrid see:
http://www.epcc.ed.ac.uk/computing/research_activities/grid/qcdgrid/
and:
http://www.gridpp.ac.uk