Grid Support and Operations
John Gordon
CCLRC eScience Centre
GridPP9 - Edinburgh
What is support?
• Not well defined
• ...or rather, defined differently in many places
• End users, sysadmins, deployers, developers
  – all need support
• Some examples
Grid Support Centre
• 14 named staff at Rutherford, Daresbury, Manchester and Edinburgh.
• Operates the UK e-Science Certification Authority.
  – http://ca.grid-support.ac.uk
• Provides a helpdesk for ‘first point of call’ queries.
• Website advertising the services provided.
  – http://www.grid-support.ac.uk
• Provides technical training and evaluations of middleware.
• Supports the Level-2 Grid project.
  – National Information Server for the Core programme.
  – Publishing of site monitoring information in XML.
• Core support for the OGSA-DAI project.
European Grid Support Centre
• Collaboration between CCLRC, CERN and KTH Sweden, each providing 1 FTE
• Point of trusted reliability between major projects and middleware producers.
• Communicates directly with Globus Alliance staff to ensure European issues are addressed, having assisted with the release.
• Website up and running, though currently a skeleton of the final content.
• Attended the EDG meeting in Barcelona to publicise, and GGF-8 to guide the User Services Research Group work.
Global Grid User Support – GGUS
The Model
• Started 1st October at GridKa, Forschungszentrum Karlsruhe (Germany)
• Already supports 41 user groups at GridKa
• Website: http://www.ggus.org
• E-mail: support@ggus.org
Information flow
[Diagram: service requests flow from the Grid User to ESUS and GGUS; GGUS interacts with ESUS and with the GOC, which handles local operations; data flows back along the same paths.]
• First line of support: experiment-specific problems will be solved by ESUS (with Savannah) or sent to GGUS using an agreed interface.
• Grid-related problems will be solved by GGUS or sent to the GOC using the GGUS system.
GGUS: Global Grid User Support
ESUS: Experiment Specific User Support
GOC: Grid Operations Centre
GridPP TB-Support
Support Team
• Built from sysadmins: 4 funded by GridPP to work on EDG WP6; the rest are the usual site sysadmins.
Methods
• Email list, phone meetings, personal visits, job submission monitoring
• RB, VO, RC for UK use, to support non-EDG use
• Planned to verify EDG releases, but they have been too infrequent to test procedures
Rollout
• Experience from RAL in EDG dev testbeds, and IC and Bristol in CMS testbeds
• >10 sites have been part of the EDG app testbed at one time
• 3 in LCG1
Savannah
EGEE Operations
• Resource Centres – all sites
• Regional Operations Centres (ROC)
– At least one per region!
– RAL in UK/Ireland
• Core Infrastructure Centres (CIC)
– CERN, RAL, CNAF, CC-IN2P3
Others
• Tier1 Support
  – Role to support UK Tier2s in LCG
  – Deployment role in GridPP2
• Tier2 Specialist Posts
  – Support for various middleware areas
• Middleware Developers
Where do you go for support?
• Users go to experiment support
• Experiment support diagnoses and forwards as necessary to Grid user
support or middleware or operations or applications
• Resource Centres look to their Regional Operations Centre (Tier2s to
their Tier1)
• ROCs will also push problems to their RCs.
• But we know that users will go to their local sysadmin or direct to their
Tier2 or Tier1 too.
– And some sysadmins will go to their favourite experiment expert
– And Tier1s will go direct to middleware experts.
• In short, chaos.
• Strategy for now is to have a UK Plan that is self-contained and can
deliver support in the UK when and where required.
– Interface this to the various outside bodies
– Don’t duplicate for the sake of it, but be ready to.
• Or be prepared to roll our work into wider provision when it is proven.
Grid Operations Centre
What is Operations?
• RAL leading development of LCG GOC
The Vision
• GOC Processes and Activities
– Coordinating Grid Operations
– Defining Service Level Parameters
– Monitoring Service Performance Levels
– First-Level Fault Analysis
– Interacting with Local Support Groups
– Coordinating Security Activities
– Operations Development
• Recent developments:
GOC - Monitoring
• Who is involved?
  – 3.0 FTE (Trevor Daniels, Dave Kant, Matt Thorpe, Jason Leake)
• What are we doing?
  – Monitor Grid services, manage site information, accounting
• Developed tools to configure/integrate monitoring, to make the job easier:
  – GPPMon and Nagios: both tedious to configure
  – Mapcentre: not practical “by hand” with large numbers of nodes
• Example: Mapcentre – 30 sites ~ 500 lines in the config file
• Example: Nagios – 30 sites, 12 individual config files with dependencies
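The configuration generators above can be sketched as follows: a minimal tool that renders Nagios host definitions from site rows. The site data and the exact set of emitted directives are illustrative; the real GOC tools read the rows from the MySQL database and also emit service and dependency definitions.

```python
# Sketch: generate Nagios host definitions from site rows, as the GOC tools
# do from the GOC database. SITES is a hypothetical stand-in for rows read
# from MySQL.

SITES = [
    {"site": "RAL", "host": "lcgce01.gridpp.rl.ac.uk", "group": "CE"},
    {"site": "FZK", "host": "gridkap01.fzk.de",        "group": "CE"},
]

HOST_TEMPLATE = """define host {{
    host_name  {host}
    alias      {site} {group}
    address    {host}
}}
"""

def nagios_hosts(sites):
    """Render one Nagios 'define host' block per registered resource."""
    return "\n".join(HOST_TEMPLATE.format(**s) for s in sites)

print(nagios_hosts(SITES))
```

With a generator like this, the 12 interdependent config files for 30 sites reduce to re-running the tool whenever the database changes.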
GOC - Database
• Develop/maintain a database to hold site information
• Site Information (contact lists, resources, site
information, URLs)
• Secure access through GridSite (X509 certificates) via
PHP web interface
• RC managers should maintain their own pages as part of
the site certification process.
• Monitoring scripts read information in database and run
a set of customised tools to monitor the infrastructure.
• To be included in the monitoring a site must register its
resources (CE,SE,RB,RC,RLS,MDS,RGMA,BDII,..)
• The BDII can be queried to check that the GOC database is up-to-date.
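That cross-check can be sketched as a simple set comparison. The host lists below are made-up stand-ins for the GOC database contents and the resources published in the BDII (which in practice would come from an LDAP query).

```python
# Sketch: cross-check the GOC database against the BDII, flagging resources
# that appear in one but not the other. Both lists are hypothetical.

def compare_resources(goc_db, bdii):
    """Return (missing_from_goc, stale_in_goc) as sorted host lists."""
    goc, published = set(goc_db), set(bdii)
    return sorted(published - goc), sorted(goc - published)

goc_db = ["lcgce01.gridpp.rl.ac.uk", "gridkap01.fzk.de"]
bdii   = ["lcgce01.gridpp.rl.ac.uk", "ce01.example.cern.ch"]

missing, stale = compare_resources(goc_db, bdii)
print("Published in BDII but not registered in GOC DB:", missing)
print("Registered in GOC DB but no longer published:", stale)
```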
GOC Monitoring Today
[Diagram: a remote EDG UI queries the GOC database (GridSite + MySQL) to build a list of resources, submits monitoring jobs to the EDG, LCG-1 and LCG-2 resources via their respective UIs, and publishes the results on the web.]
New GPPMon Features
• Download host certificates daily and monitor lifetimes for CEs and SEs, for LCG and EDG
• Reliability of service shown using RRDTool graphs of Globus and RB statistics
• Moving toward LCG-1, LCG-2 and EDG monitoring
[Screenshot, Tuesday 3/2/04 14:10: only RAL and FZK (gridkap01.fzk.de) have updated their LCG-2 information in the GOC database.]
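The daily lifetime check can be sketched with the standard library. The notAfter string below is a made-up example in the format OpenSSL prints, and the fixed "now" is only there to make the example deterministic; the real monitor reads the timestamp from the downloaded certificate.

```python
# Sketch: compute a host certificate's remaining lifetime, as the daily
# GPPMon check does for CEs and SEs. The notAfter value is illustrative.
import ssl
from datetime import datetime, timezone

def days_remaining(not_after, now=None):
    """Days until the certificate's notAfter timestamp (negative if expired)."""
    expiry = ssl.cert_time_to_seconds(not_after)  # parses "Mon DD HH:MM:SS YYYY GMT"
    now_ts = (now or datetime.now(timezone.utc)).timestamp()
    return (expiry - now_ts) / 86400.0

now = datetime(2004, 2, 3, tzinfo=timezone.utc)
print(round(days_remaining("Mar 4 00:00:00 2004 GMT", now)))
```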
Nagios
• Customised plugins for monitoring
• Focus on service behaviour and data consistency:
  – Do RBs find resources?
  – Do site GIISes publish the correct hostname?
  – Is the site running the latest stable software release?
  – Does the Gatekeeper authenticate?
  – Are the host certificates valid?
  – Are essential services running?
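One of these customised plugins can be sketched using the standard Nagios plugin convention: exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, with one line of status text on stdout. The check shown (remaining host-certificate lifetime) and its thresholds are illustrative.

```python
# Sketch of a customised Nagios plugin: map a measurement to the standard
# Nagios statuses. Thresholds (14 / 3 days) are hypothetical.

OK, WARNING, CRITICAL = 0, 1, 2

def check_cert(days_left, warn=14, crit=3):
    """Return (exit_code, status_line) for a certificate-lifetime check."""
    if days_left < crit:
        return CRITICAL, f"CRITICAL - certificate expires in {days_left} days"
    if days_left < warn:
        return WARNING, f"WARNING - certificate expires in {days_left} days"
    return OK, f"OK - certificate valid for {days_left} days"

status, line = check_cert(days_left=10)
print(line)  # a real plugin would print this line and sys.exit(status)
```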
Nagios Screen Shots (LCG-1)
[Screenshots: service summary for Gatekeeper nodes; host and service summary tables for BDII nodes.]
GOC Configuration
• Example: manage a Grid-wide database
  – provides access to site information via trusted certificate
  – scripts to automatically configure Nagios from the GOC database
  – plugins to monitor services for Nagios
  – creates configuration files for Mapcentre
[Diagram: the GOC server (GridSite + MySQL, http://goc.grid-support.ac.uk) provides secure database management over HTTPS/X.509; Resource Centres maintain their resources and site information (ce, se, rb, bdii, …) for EDG, LCG-1, LCG-2, …; the GOC monitors the RCs.]
What’s in the Database?
• People: who do we notify when there are problems?
• Node information (hostname, IP address, group)
• Scheduled downtimes: advance warning of site maintenance resulting in reduced service availability
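A natural consumer of the scheduled-downtime records is alert suppression: a failure during a declared maintenance window need not page anyone. The record layout and timestamps below are illustrative, not the real table schema.

```python
# Sketch: suppress alerts that fall inside a scheduled downtime window
# recorded in the GOC database. DOWNTIMES is a hypothetical row set.
from datetime import datetime

DOWNTIMES = [
    {"site": "RAL", "start": datetime(2004, 2, 3, 8), "end": datetime(2004, 2, 3, 18)},
]

def in_scheduled_downtime(site, when, downtimes=DOWNTIMES):
    """True if a failure at `when` falls inside a declared maintenance window."""
    return any(d["site"] == site and d["start"] <= when <= d["end"]
               for d in downtimes)

print(in_scheduled_downtime("RAL", datetime(2004, 2, 3, 12)))  # during the window
print(in_scheduled_downtime("RAL", datetime(2004, 2, 4, 12)))  # after it
```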
LCG Accounting Overview
1. PBS log processed daily on site CE to extract required data, filter acts
as R-GMA DBProducer -> PbsRecords table
2. Gatekeeper log processed daily on site CE to extract required data, filter
acts as R-GMA DBProducer -> GkRecords table
3. Site GIIS interrogated daily on site CE to obtain SpecInt and SpecFloat
values for CE, acts as DBProducer -> SpecRecords table, one dated
record per day
4. These three tables joined daily on MON to produce LcgRecords table.
As each record is produced program acts as StreamProducer to send
the entries to the LcgRecords table on the GOC site.
5. Site now has table containing its own accounting data; GOC has
aggregated table over whole of LCG.
6. Interactive and regular reports produced by site or at GOC site as
required.
Note: This is an improved design over that presented at the Jan GDB. The
SOAP transport has been replaced by R-GMA.
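Step 4 above (the daily join on the MON box) can be sketched in miniature: combine the PBS, gatekeeper and SpecInt records into normalised accounting rows, keyed on the local batch job id. The field names, sample rows and KSI2k normalisation are illustrative, not the real R-GMA schema.

```python
# Sketch of the daily join producing LcgRecords from the three source tables.
# All rows and field names are hypothetical stand-ins.

pbs_records = [  # from the PBS log filter (PbsRecords)
    {"jobid": "1234.ce01", "user": "atlas001", "cpu_secs": 7200},
]
gk_records = [  # from the gatekeeper log filter (GkRecords): jobid -> user DN
    {"jobid": "1234.ce01", "dn": "/C=UK/O=eScience/CN=some user"},
]
spec_records = {"ce01": {"specint": 400}}  # daily SpecRecords, one per CE

def build_lcg_records(pbs, gk, spec):
    """Join the three tables into normalised LcgRecords rows."""
    dn_by_job = {r["jobid"]: r["dn"] for r in gk}
    out = []
    for r in pbs:
        ce = r["jobid"].split(".", 1)[1]  # CE name taken from the local job id
        out.append({
            "jobid": r["jobid"],
            "dn": dn_by_job.get(r["jobid"]),
            # normalise raw CPU seconds by the CE's SpecInt rating
            "cpu_ksi2k_secs": r["cpu_secs"] * spec[ce]["specint"] / 1000.0,
        })
    return out

records = build_lcg_records(pbs_records, gk_records, spec_records)
print(records[0])
```

In the deployed system each record is streamed to the GOC as it is produced, so the site keeps its own table while the GOC accumulates the LCG-wide aggregate.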
LCG Accounting Flow
[Diagram: at each LCG site, filters on the CE process the PBS log, the gatekeeper log and the site GIIS output; the joined records flow via the site MON box to the MON box at the GOC site, which feeds the central accounting DB and the reports.]
Progress
• Status on 3 Feb 2004:
– The code which will run on the CE to parse and process the PBS
and Gatekeeper logs is written. The PbsRecords and
GkRecords tables are created and are being populated.
– The code to join these two tables and publish the new joined
table (LcgRecords) is also written and working.
– Work is in progress to write the archiver at the GOC to receive
the aggregated LcgRecords table – 2 days work.
• To do:
– Write the code to interrogate the site GIIS to extract the CPU
power values and populate these fields in the tables – 2 days
work
– Integration testing and debugging – 5 days
– Packaging for deployment – 3 days
– Write the report generators – 30 days (estimate – not yet
designed)
Accounting Issues
1. There is no R-GMA infrastructure LCG-wide, so most sites are not able to
install and run the accounting suite at present. It is expected that R-GMA
and the MON boxes will be rolled out in LCG2 soon after the storage
problems are resolved. Until this happens the complete batch and
gatekeeper logs will have to be copied to the GOC site for processing.
2. The VO associated with a user’s DN is not available in the batch or
gatekeeper logs. It will be assumed that the group ID used to execute user
jobs, which is available, is the same as the VO name. This needs to be
acknowledged as an LCG requirement.
3. The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs, so it cannot appear in the accounting reports. The RB Events Database contains it, but that database is not accessible, nor is it designed to be easily processed.
4. At present the logs provide no means of distinguishing sub-clusters of a CE
which have nodes of differing processing power. Changes to the
information logged by the batch system will be required before such
heterogeneous sites can be accounted properly. At present it is believed all
sites are homogeneous.
Future Direction Towards EGEE
• Distribute tools to help the ROCs monitor their RCs (database + monitoring packages)
• Distribute tools to help the CICs monitor core services – Grid-wide monitoring
• Ideas on how this would work:
  – CIC monitoring tools query ROC databases
  – Select core services
  – Run a standard set of checks on those services
  – Display information / notifications …
UK Deployment, Support and Operations
Deployment Team
• Proposal for a UK-wide team to provide and run a UK-wide Grid
• The GridPP view; there are alternative views for other stakeholders
[Organisation chart, with funding from Core UK, JISC, GridPP and EGEE:
 – Production Manager
 – Grid Support Centre (5 FTE, RAL): Helpdesk, Security Officer
 – Core Grid Coordinator (1 FTE)
 – Middleware Specialist Support (6 FTE): Data and Storage Management (Glasgow, Bristol, Edinburgh); Network Monitoring; VO Management and Services (North, 0.5 FTE); Workload Management Services (London); Network Management (London, 0.5 FTE)
 – Deployment Team (8 FTE): 2 Tier1 Deployment; 4 Tier2 UK Coordinators (LondonGrid, NorthGrid, ScotGrid, SouthGrid); 1 Tier2 Coordinator (Ireland); Applications Expert; Network Support
 – Grid Operations Centre: Manager, Operations (2), Technical Writer]
Resource Centres
• Tier1: Rutherford Appleton Laboratory
• Tier2 centres are distributed over many sites:
  – LondonGrid: IC, QMUL, RHUL, UCL, Brunel
  – NorthGrid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
  – ScotGrid: Durham, Edinburgh, Glasgow
  – SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL-PPD
• Sites which have signed up to LCG and deployed software (RAL, IC, Cambridge) expect to join EGEE (PM1)
Tier-2 Centre Resources (Projected 2004)
Projected resources available in September 2004 to be applied to large-scale production Grid deployment.
[Figure: map of sites; the total CPU at each institute is proportional to the size of the green circles, and the disk storage at each site to the height of the grey vertical bars.]

Tier 2      Number of CPUs   Total CPU [KSI2000]   Total Disk [TB]   Total Tape [TB]
London            2454              1996                  99                20
NorthGrid         2718              2801                 209               332
SouthGrid          918               930                  67                 8
ScotGrid           368               318                  79                 0
Total             6458              6045                 455               360
Roles (1)
• Production Manager
Overall Manager to oversee operations and report to
other groups (ROC Coordinator, OMC …)
• Core Grid Coordinator
Bring UK non-Particle Physics projects (applications and
resources) into EGEE
Roles (2)
• Deployment Team
Consists of about 7 people to spearhead the rollout and
certification of Grid software to the Resource Centres
(Tier1 & Tier2)
• Grid Operations Centre
Similar role to the proposed CIC in EGEE.
Monitor health of services and provide toolkits
Operate Core Grid Services
Database of RCs managed by RC site administrators
Roles (3)
• Middleware Specialist Support
Body of experts to provide specialist support to
Resource Centres in key areas: security, data
management, network, VO management and workflow
management.
• Grid Support Centre
Helpdesk facility, CA
Broker requests to middleware specialists
Team UK
• A large team in the UK (GridPP, EU, and other)
• GridPP Production Manager should orchestrate
this team to deliver a production grid for GridPP
– But interwork with as many other UK grids
and projects as possible
• Meet our EGEE ROC and CIC deliverables for
support and operations
• A big challenge