Hardware Reliability at RAL Tier-1

Gareth Smith
16th September 2011
Staffing
The following staff have left since GridPP26:
• James Thorne (Fabric - Disk servers)
• Matt Hodges (Grid Services Team leader)
• Derek Ross (Grid Services Team)
• Richard Hellier (Grid Services & Castor Teams)
We thank them for their work whilst with the Tier1 team.
Some Changes
• CVMFS in use for Atlas & LHCb:
– The Atlas (NFS) software server used to give significant problems.
– Some CVMFS teething issues, but overall much better!
• Virtualisation:
– Starting to bear fruit. Uses Hyper-V.
– Numerous test systems.
– Production systems that do not require particular resilience.
• Quattor:
– Large gains already made.
CernVM-FS Service Architecture
[Diagram: Atlas and LHCb install boxes feed the Stratum 0 web server in CERN PH; this is replicated to the Stratum 1 servers cvmfs-public@cern, cvmfs-ral@cern and cvmfs-bnl@cern; batch nodes at a site fetch through a local squid.]
• Replication to Stratum 1 by hourly cron (for now).
• Stratum 0 moving to IT by end of year.
• BNL almost in production.
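The slides say only that replication runs from an hourly cron; the mechanism itself is not shown. Below is a minimal sketch of that kind of job in Python, assuming a hypothetical rsync mirror of the Stratum 0 web tree; the hosts and paths are illustrative placeholders, not the actual CERN or RAL configuration.

```python
#!/usr/bin/env python
"""Hypothetical hourly Stratum 1 replication job (sketch only).

Assumes the Stratum 0 repository tree is reachable via rsync and that this
script is invoked from cron, e.g.:
    0 * * * *  /usr/local/bin/replicate_cvmfs.py
The source and destination below are illustrative placeholders.
"""
import subprocess
import sys

STRATUM0 = "rsync://cvmfs-stratum0.example.org/cvmfs/"  # placeholder source
LOCAL_COPY = "/srv/cvmfs/"                               # placeholder replica path

def replicate():
    # Mirror the repository tree; --delete keeps the replica consistent
    # with the Stratum 0 (files removed upstream disappear here too).
    result = subprocess.run(
        ["rsync", "-a", "--delete", STRATUM0, LOCAL_COPY],
        capture_output=True, text=True)
    if result.returncode != 0:
        sys.stderr.write("replication failed: %s\n" % result.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(replicate())
```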
CernVM-FS Service At RAL
[Diagram: webfs.gridpp.rl.ac.uk, a web server backed by iSCSI storage, holds the replica of the Stratum 0 at CERN. The replica is presented as cvmfs.gridpp.rl.ac.uk, a virtual host on two reverse-proxy squids (lcgsquid05, lcgsquid06) with their own caches, accelerating webfs in the background. RAL batch nodes, and squids at other sites, fetch from these squids.]
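As a quick illustration of the accelerator arrangement (not taken from the slides), the sketch below fetches the same repository file through the cvmfs.gridpp.rl.ac.uk virtual host and directly from webfs and checks that the two agree. The repository name and manifest path are assumptions for illustration only.

```python
"""Sketch: confirm the reverse-proxy squids serve the same content as webfs.

The repository name and manifest path are assumptions for illustration;
they are not taken from the slides.
"""
import urllib.request

REPLICA = "http://cvmfs.gridpp.rl.ac.uk"   # virtual host on the squids
ORIGIN = "http://webfs.gridpp.rl.ac.uk"    # web server behind the squids
MANIFEST = "/cvmfs/atlas.cern.ch/.cvmfspublished"  # assumed example path

def fetch(base):
    with urllib.request.urlopen(base + MANIFEST, timeout=30) as resp:
        return resp.read()

if __name__ == "__main__":
    print("replica matches origin:", fetch(REPLICA) == fetch(ORIGIN))
```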
Database Infrastructure
We are making significant changes to the Oracle database infrastructure.
Why?
• Old servers are out of maintenance
• Move from 32bit to 64bit databases
• Performance improvements
• Standby systems
• Simplified architecture
Database Disk Arrays - Now
[Diagram: Oracle RAC nodes connected to the disk arrays over a Fibre Channel SAN; power supplies on UPS.]
Database Disk Arrays - Future
[Diagram: Oracle RAC nodes and disk arrays on the Fibre Channel SAN as now, with Data Guard maintaining standby systems; power supplies on UPS.]
Castor
Changes since last GridPP Meeting:
• Castor upgrade to 2.1.10 (March).
• Castor version 2.1.10-1 (July) - needed for the higher capacity "T10KC" tapes.
• Updated garbage collection algorithm to "LRU" rather than the default, which is based on file size (July). A sketch of the difference follows below.
• (Moved 'logrotate' to 1pm rather than 4am.)
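The garbage-collection change above swaps the criterion used to pick which files to evict from a full disk pool. Below is a minimal sketch of the two orderings, assuming each candidate file is described by a (path, size, last-access-time) record and that the size-based default evicts largest files first; it is illustrative only, not the Castor implementation.

```python
"""Sketch of size-based vs LRU garbage-collection victim selection.

Illustrative only - not the Castor implementation. Each candidate file is
assumed to be described by (path, size_in_bytes, last_access_epoch), and the
"largest first" ordering for the old default is an assumption.
"""
from typing import List, Tuple

Candidate = Tuple[str, int, float]

def victims_by_size(files: List[Candidate], bytes_needed: int) -> List[str]:
    """Size-based policy (assumed largest first) until enough space is freed."""
    chosen, freed = [], 0
    for path, size, _ in sorted(files, key=lambda f: f[1], reverse=True):
        if freed >= bytes_needed:
            break
        chosen.append(path)
        freed += size
    return chosen

def victims_lru(files: List[Candidate], bytes_needed: int) -> List[str]:
    """LRU policy: evict the least recently used files first."""
    chosen, freed = [], 0
    for path, size, _ in sorted(files, key=lambda f: f[2]):
        if freed >= bytes_needed:
            break
        chosen.append(path)
        freed += size
    return chosen
```

Under LRU a large file that was read or written recently is no longer the first candidate for eviction, which is the point of the change.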
Castor Issues
• Load-related issues on small/full service classes (e.g. AtlasScratchDisk; LHCbRawRDst):
– Load can become concentrated on one or two disk servers.
– Exacerbated by an uneven distribution of disk server sizes.
• Solutions:
– Add more capacity; clean up.
– Changes to tape migration policies.
– Re-organisation of service classes.
Procurement
All existing bulk capacity orders are in production or being deployed.
Problems with the 'SL08' generation have been overcome.
Tenders under way for disk and tape.
Disk:
• Anticipate 2.66 PB usable space.
– Vendor 24-day proving test using our tests, then
– re-install and 7 days of acceptance tests by us.
CPU:
• Anticipate 12k HEPSpec06.
– 14-day proving tests by vendor, then
– 14-day acceptance tests by us.
Evaluation based on 5-year Total Cost of Ownership.
Disk Server Outages by Cause (2011)
[Chart: disk server outages by cause - memory, disk controller, multi-disk fails, OS, Castor, other software, other, config (puppet), unknown.]
Disk Server Outages by Service Class (2011)
[Chart: disk server outages by service class; vertical scale 0.00 to 1.20.]
Disk Drive Failure – Year 2011
[Chart: disk drive failures per month, January to August 2011; vertical scale 0 to 50.]
Double Disk Failures (2011)
[Chart: double disk failures per month in 2011; vertical scale 0 to 6.]
In the process of updating the firmware on the particular batch of disk controllers.
Disk Server Issues - Responses
New possibilities with Castor 2.1.9 or later:
• 'Draining' ('passive' and 'active').
• Read-only servers.
• Checksumming (2.1.9) - easier to validate files, plus regular checks of files written.
All of these are used regularly when responding to disk server problems. A sketch of such a checksum check follows below.
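Castor records a checksum for each file as it is written (the slides note checksumming arrived with 2.1.9). Below is a minimal sketch of the kind of regular disk-file verification described, assuming adler32 checksums and a hypothetical catalogue of (path, expected checksum) pairs; it is illustrative only, not the production tooling.

```python
"""Sketch of a disk-file checksum verification pass (illustrative only).

Assumes a hypothetical input list of (path, expected_adler32_hex) pairs,
e.g. exported from the name server; this is not the production tool.
"""
import random
import zlib

def adler32_of(path, chunk_size=8 * 1024 * 1024):
    """Stream the file and return its adler32 as a lower-case hex string."""
    value = 1  # adler32 seed
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)

def verify(catalogue, sample_fraction=0.05):
    """Check a random sample of files against their recorded checksums."""
    sample = random.sample(catalogue, max(1, int(len(catalogue) * sample_fraction)))
    bad = []
    for path, expected in sample:
        if adler32_of(path) != expected.lower():
            bad.append(path)
    return bad
```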
Data Loss Incidents
Summary of losses since GridPP26.
Total of 12 incidents logged:
• 1 - due to a disk server failure (loss of 8 files for CMS).
• 1 - due to a bad tape (loss of 3 files for LHCb).
• 1 - files in the Castor Nameserver but with no location (9 LHCb files).
• 9 - cases of corrupt files. In most cases the files were old (and pre-date Castor checksumming).
Checksumming is in place for tape and disk files. Daily and random checks are made on disk files.
T10000 Tapes

Type   Capacity   In Use   Total Capacity
A      0.5TB      5570     2.2PB
B      1TB        2170     1.9PB (CMS)
C      5TB        -        -

We have 320 T10KC tapes (capacity ~1.5PByte) already purchased.
The plan is to move data (VOs) using the 'A' tapes onto the 'C' tapes, leapfrogging the 'Bs'.
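As a quick check of the quoted figure (an illustration, not from the slides): 320 tapes at a nominal 5 TB each is 1.6 PB in decimal units, or about 1.4 PiB in binary units, which is consistent with the "~1.5 PByte" above allowing for rounding and choice of units.

```python
# Quick arithmetic behind the "~1.5 PByte" figure (illustrative only).
tapes = 320
nominal_capacity_tb = 5  # T10KC nominal capacity, decimal terabytes

total_bytes = tapes * nominal_capacity_tb * 10**12
print("decimal petabytes:", total_bytes / 10**15)  # 1.6
print("binary pebibytes: ", total_bytes / 2**50)   # ~1.42
```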
T10000C Issues
• Failure of 6 out of 10 tapes (see the sketch after this list for how unlikely this is by chance).
– Current A/B failure rate is roughly 1 in 1000.
– After writing part of a tape, an error was reported.
• Concerns are three-fold:
– A high rate of write errors causes disruption.
– If tapes could not be filled, our capacity would be reduced.
– We were not 100% confident that data would be secure.
• Updated firmware in the drives.
– 100 tapes now successfully written without problem.
• In contact with Oracle.
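To show how far outside the normal A/B failure rate this sits (a worked illustration, not from the slides), the sketch below computes the chance of six or more failures in ten tapes if each tape failed independently at roughly 1 in 1000.

```python
# Worked illustration: probability of >= 6 failures in 10 tapes if each
# tape failed independently at the quoted A/B rate of ~1 in 1000.
from math import comb

n, p = 10, 1e-3
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6, n + 1))
print(prob)  # ~2e-16: effectively impossible by chance, so the failures
             # point to a systematic problem (here, the drive firmware).
```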
Long-Term Operational Issues
• Building R89 (noisy power vs EMC units); electrical discharge in the 11kV supply.
– Noisy electrical current: fixed by use of isolating transformers in appropriate places.
– Await final information on resolution of the 11kV discharge.
• Asymmetric data transfer rates in/out of the RAL Tier1.
– Many possible causes: load; FTS settings; disk server settings; TCP/IP tuning; network (LAN & WAN) performance.
– Have modified FTS settings with some success.
– Looking at Tier1 to UK Tier2 transfers within GridPP.
Long-Term Operational Issues (continued)
• BDII issues (the site BDII stops updating).
– Monitoring & re-starters in place (a sketch of such a re-starter follows this list).
• Packet loss on the RAL network link.
– Particular effect on LFC updates (reported by Atlas).
– Some evidence of being load related.
• Some 'hangs' within the Castor JobManager.
– Difficult to trace; very intermittent.
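A minimal sketch of the kind of re-starter mentioned above, assuming a hypothetical freshness stamp and a hypothetical service name; the actual RAL monitoring is not described in the slides.

```python
"""Sketch of a BDII re-starter (illustrative only).

Assumes a hypothetical update-stamp file written by the BDII and a
hypothetical init-script name 'bdii-site'; neither is from the slides.
"""
import subprocess
import time

def bdii_is_fresh(stamp_file="/var/run/bdii/last_update", max_age=1800):
    """Treat the BDII as stale if its (hypothetical) update stamp is too old."""
    try:
        with open(stamp_file) as handle:
            last_update = float(handle.read().strip())
    except (OSError, ValueError):
        return False
    return (time.time() - last_update) < max_age

def restart_if_stale():
    if not bdii_is_fresh():
        # Restart via the init script (hypothetical service name).
        subprocess.call(["service", "bdii-site", "restart"])

if __name__ == "__main__":
    restart_if_stale()
```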
Other Hardware Issues
The following were reported via the Tier1-Experiments Liaison Meeting:
• (Isolating transformers in disk array power feeds - final resolution of a long-standing problem.)
• Network switch stack failure.
• Network (transceiver) failure in a switch in the link to the tape system.
• CMS LSF machine was also turned off by mistake, resulting in a short outage for srm-cms.
• T10KC tapes.
• 11kV feed into the building - electrical discharge.
• Failure of the Site Access Router.
A couple of final comments
Disk server issues are the main area of effort for hardware reliability / stability.
...but do not forget the network.
Hardware that has performed reliably in the past may throw up a systematic problem.
Additional Slide: Procedures
Post Mortems:
• One Post Mortem during the 6 months since GridPP26.
• 10th May 2011: LFC Outage After Database Update - a 1-hour outage following an At Risk, caused by a configuration error in Oracle ACL lists.
Disaster Management:
• Triggered once (for the T10KC tape problem).