EU DataGRID testbed management and support at CERN
Speaker: Emanuele Leonardi (EDG Testbed Manager – WP6), Emanuele.Leonardi@cern.ch
DataGrid is a project funded by the European Union
CHEP 2003 – 24-28 March 2003

Talk Outline
- Introduction to EDG Services
- EDG Testbeds at CERN
- EDG Testbed Operation Activities: installation and configuration, service administration, resource management
- Conclusions
Authors: Emanuele Leonardi, Markus Schulz - CERN

EDG Services (1)
- Authentication: Grid Security Infrastructure (GSI) based on PKI (openSSL); proxy renewal service
- Authorization: global EDG user directory + per-VO user directories (LDAP); very coarse grained
- Resource Access: GLOBUS gatekeeper with EDG extensions; interface to standard batch systems (PBS, LSF); GSI-enabled FTP server for data replication and access and for job sandbox transportation

EDG Services (2)
- Installation: LCFG(ng)
- Storage Management: file replication service (GDMP), at CERN interfaced to CASTOR (MSS); Replica Catalog (LDAP), one RC per VO
- Information Services: hierarchical GRID Information Service (GIS) structure (LDAP); central Metacomputing Directory Service (MDS) (a query sketch follows the service overview below)

EDG Services (3)
- Resource Management: Resource Broker, interfaced to the GIS; jobmanager and job submission, based on CondorG; Logging and Bookkeeping
- Monitoring: to be deployed in EDG 2
- Accounting: basic facility foreseen for EDG 2

EDG Services (4)
- Services are interdependent
- Services are composite and heterogeneous: based on lower-level services (e.g. CondorG); many different database flavors in use (MySQL, Postgres, ...)
- Services are mapped to logical machines: each physical node runs one or more services, e.g. a Computing Element (CE) runs the GLOBUS gatekeeper, an FTP server, a batch submission system, ...
- Services impose constraints on the testbed configuration: shared filesystems are needed within the batch system to provide a common /home area and a homogeneous security configuration; some services need special resources on the node (extra RAM)
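The Information Services listed in EDG Services (2) are plain LDAP: each site publishes into its local GIS and the central MDS aggregates the tree, so any LDAP client can browse the published resources. As a rough illustration only (the host name below is hypothetical; the port and base DN follow the usual Globus MDS conventions and are not values taken from this talk), a query with the python-ldap package might look like:

    # Minimal sketch: browse an MDS/GIS information tree over LDAP.
    # Host name is hypothetical; 2135 and the base DN follow common MDS conventions.
    import ldap

    MDS_URI = "ldap://giis.example.org:2135"
    BASE_DN = "Mds-Vo-name=local, o=Grid"

    conn = ldap.initialize(MDS_URI)
    conn.simple_bind_s()  # MDS is normally queried anonymously

    # Dump every entry below the base with its published attributes.
    for dn, attrs in conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(objectClass=*)"):
        print(dn)
        for name, values in attrs.items():
            print("    %s: %s" % (name, values))
    conn.unbind_s()

The Resource Broker performs its matchmaking against this kind of published information.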
EDG Testbeds at CERN
Production (Application) Testbed: 40 nodes, EDG v.1.4.7
- Few updates (security fixes) but frequent restarts of services (daily)
- Data production tests by application groups (LHC experiments, etc.)
- Demonstrations and tutorials (every few weeks)
- Number of nodes varied greatly (user requests, stress tests, availability)
Development Testbed: 9 nodes, EDG v.1.4.7
- In the past used to integrate and test new releases: many changes per day, very unstable, service restarts, traceability problems
- Now used to test small changes before installation on the Production TB
Integration Testbed: 18 nodes
- EDG porting to RH7.3 + GLOBUS 2.2.4, then EDG 2.0 integration
Many minor testbeds (developers, unit testing, service integration)

EDG Testbeds at CERN: Infrastructure
- 5 NFS servers with 2.5 TByte of mirrored disk: user directories; shared /home area for the batch system; storage managed by EDG and visible from the batch system
- NIS server to manage users (not only CERN users)
- LCFG servers for installation and configuration
- Certification Authority: provides CERN users with X509 user certificates and CERN with host and service certificates; hierarchical system of Registration Authorities mapped to the experiments
- Linux RH 6.2 (now moving to RH 7.3)

The early days: before v.1.1.2 - the Continuous Release Period
- No procedures
- CERN testbeds have seen all versions
- Trial and error (on the CERN testbeds first)
- Services very unreliable: many debugging sessions with developers
- Hard work by all project participants resulted in a version that was used for key demonstrations in March 2002

Release Procedures
- New RPMs are delivered to the Integration Team; configuration changed in CVS
- Installed on the Dev TB at CERN (highest rate of changes): basic tests
- Core sites install the version on their Dev TBs: distributed tests
- Software is deployed on the Application TB: final (large-scale) tests
- Applications start using the Application Testbed
- Other sites install, get certified by the ITeam, and then join
Over time this process has evolved into a quite strict release procedure
Application software is installed on UI/WN on demand and outside the release process

More History
Release timeline:
1.1.2 - 27 Feb 2002
1.1.3 - 02 Apr 2002
1.1.4 - 04 Apr 2002
1.2.a1 - 11 Apr 2002
1.2.b1 - 31 May 2002
1.2.0 - 12 Aug 2002
1.2.1 - 04 Sep 2002
1.2.2 - 09 Sep 2002
1.2.3 - 25 Oct 2002
1.3.0 - 08 Nov 2002
1.3.1 - 19 Nov 2002
1.3.2 - 20 Nov 2002
1.3.3 - 21 Nov 2002
1.3.4 - 25 Nov 2002
1.4.0 - 06 Dec 2002
1.4.1 - 07 Jan 2003
1.4.2 - 09 Jan 2003
1.4.3 - 14 Jan 2003
1.4.4 - 18 Jan 2003
1.4.5 - 26 Feb 2003
1.4.6 - 04 Mar 2003 - security fix (sendmail)
1.4.7 - 08 Mar 2003 - security fix (file)
Annotations along the timeline:
- Successes: matchmaking/job management; basic data management. Known problems: high-rate submissions; long FTP transfers
- Known problems: GASS cache coherency; race conditions in the gatekeeper; unstable MDS. Limitations: resource exhaustion; size of logical collections. Intense use by applications!
- Successes: improved MDS stability; FTP transfers OK. Known problems: interactions with the RC
- Milestones: ATLAS phase 1 start; CMS stress test Nov. 30 - Dec. 20; applications: CMS, ATLAS, LHCb, ALICE

Operations: Node Installation (1)
Basic installation tool: LCFG (Local ConFiGuration system) by the University of Edinburgh (http://www.lcfg.org), extended by EDG WP4 (fabric management group)
LCFG is most effective if:
- LCFG objects are provided to configure all services
- Machine configurations are compatible with LCFG constraints (e.g. only the 4 primary partitions are supported)
Main drawbacks of LCFG:
- No verification of the installation/update process: home-made tools do the checking, a lot of manual work
- Wants total control of the machine (not suitable for installing EDG on an already running system)
- Does not handle user accounts (password changes): LCFG handles root and system accounts, NIS the standard users
- Not suitable for rapid developer updates

Operations: Node Installation (2)
- PXE-based initiation of the installation process: a floppy used to be needed to start the installation; with PXE the whole process goes through the network
- Serial-line-controlled reset of nodes: the reset button is connected to a relay system controlled from a server via a serial line (ref. Andras Horvath's talk)
- Serial line console monitoring: all serial lines are connected to a central server via a multi-port serial card (ref. Andras Horvath's talk)
- Visits to the Computer Center drastically reduced

Grid Middleware
- We are still learning how to deploy production grid middleware services
- Many services are fragile (daily restarts); very complex fault patterns (every release creates new ones)
- The site model misses some resource management aspects
- Several services do not scale far enough: max 512 jobs per RB, max ~1000 files in the RC
- Memory leaks, file leaks, port leaks, i-node leaks, ...
- Some components are resource hungry: storage, scratch space, log files (if any)
- Route from working binary to deployable RPM not always reliable: an autobuild system is now in place
What grid services should NOT assume (see the sketch below):
- Service node characteristics: CPU = infinite SPECInt, RAM = infinite MB
- Disk storage characteristics: a single disk partition of infinite GB (transparently growable)
- Network connectivity: infinite bandwidth with 0 RTT to anywhere in the world, up 24/7
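That last list maps directly onto defensive coding. As a minimal sketch of the point (not code from EDG; the path and thresholds are invented for illustration), a service can check the partition it actually writes to instead of assuming a transparently growable one:

    # Illustrative only: check free space and free i-nodes on a given partition
    # before accepting more work, rather than assuming unlimited resources.
    import os

    def enough_resources(path, min_free_mb=500, min_free_inodes=10000):
        """True if `path` still has the requested free space and i-nodes."""
        st = os.statvfs(path)
        free_mb = st.f_bavail * st.f_frsize / (1024.0 * 1024.0)
        return free_mb >= min_free_mb and st.f_favail >= min_free_inodes

    SCRATCH = "/var/spool/edg-scratch"   # hypothetical scratch area
    if not enough_resources(SCRATCH):
        raise RuntimeError("refusing new work: %s is low on space or i-nodes" % SCRATCH)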
gass_cache
- A disk area (gass_cache) has to be shared between the gatekeeper node (CE) and all the worker nodes (scaling problem here?)
- Each job creates a large number (>>100) of tiny files in this area; if the job ends in an unclean way, these files are not deleted
- No easy way to tell which file belongs to whom, and no GLOBUS/EDG tool to handle this case
- i-node usage is huge: at least once per week the whole batch system has to be stopped and the gass_cache area cleaned (~2 hours, given the number of i-nodes; a clean-up sketch follows)
- Random fault pattern: the system stops working for apparently totally uncorrelated reasons, and the shared area appears empty; all the jobs running at the time are lost
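With no GLOBUS/EDG tool for the job, the clean-up amounts to something like the following sketch (the cache path and age threshold are assumptions for illustration; in practice the whole batch system was stopped first):

    # Illustrative clean-up pass: remove gass_cache files untouched for N days.
    # Path and threshold are assumptions; run only with the batch system stopped.
    import os, time

    CACHE_DIR = "/shared/gass_cache"   # hypothetical shared cache area (CE + WNs)
    MAX_AGE_DAYS = 7

    cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(CACHE_DIR):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    os.remove(path)
                    removed += 1
            except OSError:
                pass                   # a file may vanish while we scan
    print("removed %d stale cache files" % removed)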
Storage management (1)
Available:
- GSI-enabled FTP server (SE)
- replication-on-demand service (GDMP)
- basic Replica Catalogue (max O(1000) files managed coherently)
- user commands to copy and register files
- basic interface to tape storage (CASTOR, HPSS)
Unavailable:
- an integrated approach to overall space management
- disk space management and monitoring tools

Storage Management (2)
Constraint from the Replica Catalog: PFN = <local storage area>/LFN (PFN = physical filename, LFN = logical filename)
Consequences:
- the whole storage area must consist of a single big partition, OR all SEs must have exactly the same disk partition structure
- different partitioning of the storage area on different SEs can make file replication impossible: e.g. an LFN might sit on a partition with free space on one SE and on a full partition on another SE (see the sketch after this slide)
- needs A LOT OF PLANNING if only small partitions are available
- at CERN we use disk servers exporting 100 GB partitions
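Because the PFN is fixed once the LFN is known, whether a replica fits depends on the partition that path happens to land on, not on the total free space of the SE. A small illustrative sketch of that check (paths and sizes are invented; this is not an EDG tool):

    # Illustrative only: with PFN = <storage area>/LFN, the target partition is
    # dictated by the LFN, so replication fails if that partition is full even
    # when the SE as a whole still has space.
    import os

    def replica_fits(storage_area, lfn, size_bytes):
        """True if the partition that would hold <storage_area>/<lfn> has room."""
        target_dir = os.path.dirname(os.path.join(storage_area, lfn))
        while not os.path.exists(target_dir):      # walk up to an existing directory
            target_dir = os.path.dirname(target_dir)
        st = os.statvfs(target_dir)
        return st.f_bavail * st.f_frsize >= size_bytes

    # e.g. a 2 GB file aimed at one of the 100 GB partitions of a disk server
    if not replica_fits("/storage/cms", "run1234/hits_001.root", 2 * 1024 ** 3):
        print("replication would fail here even if other partitions have space")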
Resource Broker
- The Resource Broker is the central intelligence of the GRID; it interacts with most of the other services in a complex way
- Failure of the RB stops user job submission; running multiple instances of the RB at different sites has helped
- A lot of effort was put into fixing problems, but it is still the most sensitive spot in the GRID
- The RBs need to be stopped regularly and all databases cleaned: the problem is a database corruption caused by a non-thread-safe library; fixes are already available and will be deployed with EDG 2; all jobs being managed by the RB at the time are lost

Conclusions
- EDG testbeds have been in operation for almost two years; they provided continuous feedback to the developers
- LHC experiments and other project partners were able to taste the flavor of a realistic GRID environment
- EDG has gone further than any other multi-science grid project in deploying a large-scale working production system
- Lots of limitations and inadequacies were found, many of which have been addressed
- Keeping a testbed site running is a lot of work; testbed managers were over-stressed by the mismatch between some users' expectations and the reality of the status of grids today
Thanks to the EU and to our national funding agencies for their support of this work