2011-10_HEPiX_Vancouver_-_RAL_Site_Report

RAL Site Report
HEPiX 20th Anniversary
Fall 2011, Vancouver
24-28 October
Martin Bly, STFC-RAL
Overview
• General
• Hardware
• Storage
• Networking
• …
02/05/2011
RAL Site Report - HEPiX Spring 2011
General
• New CEO for STFC
– John Womersley takes over from Keith Mason on 1st November
– To 31st March 2015
• Staffing @ Tier1
– 5 staff posts open due to staff moving
– Replacements agreed despite restrictions
– Recruitments underway
• Power
– ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
– Isolated to the join between two bus segments (bus-coupler)
– Loose bolt in bus bar identified and tightened up – fixed
Hardware changes
• Summary of previous report:
– 13 x Dell R610 tape servers (10GbE) for T10KC drives
– 14 x T10KC tape drives
– Arista 7124S 24-port 10GbE switch + twinax copper interconnects
– 5 x Avaya 5650 switches + various 10/100/1000 switches
• New since May
– Various Dell R510s as small data servers for the Facilities Data Service,
which provides interfaces into Castor for RAL site facilities and others.
– 68 x 40TB 4U servers ordered for capacity storage – two suppliers
• 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
• Note that disks may be hard to get 
– Tender for 15,000 HEP-SPEC: evaluation completed, result just announced
• To come
– 40GbE/10GbE and 10GbE/1GbE switches, management switches, more
tape servers, T10KC tape drives and tapes, iSCSI arrays, ...
• Gone: 22 x 10TB servers - 2005 generation
• To go: 86 x 6TB servers – 2006 generation
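The 2.66PB total quoted for the capacity storage order can be checked with a little arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the use of a 1024 divisor is an inference from the quoted figure, not something the slides state.

```python
def total_capacity(servers: int, tb_per_server: int) -> dict:
    """Return the aggregate capacity of a batch of identical servers.

    Both PB conventions are reported because the slide does not say
    which one it uses (assumption: the quoted 2.66PB divides by 1024).
    """
    total_tb = servers * tb_per_server
    return {
        "TB": total_tb,
        "PB_div_1000": total_tb / 1000,  # decimal convention
        "PB_div_1024": total_tb / 1024,  # binary-style convention
    }


if __name__ == "__main__":
    order = total_capacity(servers=68, tb_per_server=40)
    for unit, value in order.items():
        print(f"{unit}: {value:.2f}")
```

With 68 servers of 40TB each the raw total is 2720TB, which is 2.72PB when divided by 1000 and about 2.66PB when divided by 1024, so the slide's figure appears to use the latter.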
Storage Issues
• Issue with some 3ware controllers ejecting perfectly healthy WD drives
– Due to firmware not recognising and handling a failure mode on newer WD
drives of the same model
– Firmware update has fixed this, rollout completed
• Issue with Adaptec controllers and StorageManager software
– SM reports many SMART errors when drives are healthy
• reports unhealthy ones too
– Firmware update has fixed this, rolling out shortly
• Problem with T10KC drives
– Early production batch issue
– Firmware fix
– No recurrence
• Production storage now uses the most recent sets of hardware, with
older (smaller capacity) hardware kept as ‘spinning reserve’
Castor Status
• Castor manages disk and tape storage
– 18 million files (as of Oct 2011)
• Recent news:
– Moved to T10KC tape media in production in September (Atlas, LHCb)
– New (non-Tier1) production instance for Diamond synchrotron
• Part of a new complete Facilities Data Service which provides
transparent data aggregation (StorageD), a metadata service (ICAT), and
web (TopCAT) and FUSE frontends to access data
• Coming up (Jan-Mar):
– Move to new database hardware and a more resilient architecture
(using DataGuard) over next 6 months
– Major upgrade of CASTOR with a new optimized scheduler and new
tape functionality – better for small files
– New service ‘head nodes’ in test: Dell R410 and Transtec
Networking
• WAN
– UK NREN JANET now has a 100Gb/s backbone.
– Funding for the next upgrade of the NREN, SuperJANET6, has recently been
approved
• Site
– Sporadic packet loss in site core networking (few %)
• Still present to a very small degree – access to LFC intermittently
drops for remote users (T2s). May be load related.
• Asymmetric data transfer rates in/out of Tier1
– Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning,
network (LAN & WAN) performance
– Have modified FTS settings with some success
– Looking at Tier1-UK Tier2 transfers
• LAN
– Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
– Three subnets in use for Tier1
– Lots of packet discards into stacks, investigating...
• Developments
– Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement
linked at multiple 40Gb/s with storage connectivity at 10Gb/s.
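To give a feel for why a few percent of packet loss matters for transfer rates, here is a minimal sketch using the well-known Mathis et al. approximation for steady-state single-stream TCP throughput, rate ≈ (MSS/RTT) × sqrt(3/2) / sqrt(p). The MSS, RTT and loss values are assumed purely for illustration and are not measurements from RAL; the function name is hypothetical.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    """Approximate upper bound on single-stream TCP throughput in bits/s.

    Uses the Mathis et al. approximation:
        rate ~ (MSS / RTT) * sqrt(3/2) / sqrt(p)
    where p is the packet loss probability.
    """
    if not 0 < loss <= 1:
        raise ValueError("loss must be in (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss)


if __name__ == "__main__":
    # Assumed illustrative values: 1460-byte MSS, 10 ms round-trip time.
    for loss in (0.0001, 0.001, 0.01, 0.03):
        rate = mathis_throughput_bps(1460, 0.010, loss)
        print(f"loss {loss:.2%}: ~{rate / 1e6:.1f} Mbit/s per stream")
```

Under these assumptions a single stream at 1% loss is limited to roughly 14 Mbit/s, far below a 10Gb/s link, which is one reason loss and TCP tuning both appear in the list of possible causes.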
Databases
• Small but significant Oracle installation
– Castor, 3D, LFC, FTS
• Castor database server hardware to be replaced
– Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
– New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend arrays
– Different ASM architecture – single volumes rather than paired
– Dataguard from Production RAC to Standby RAC for resilience
– Standby RACs in different building
– Backups off the Standby set
• LFC/FTS
– Standby set to be added to the existing setup, with Dataguard and backups
as per Castor; ASM volume architecture changes to single-volume data
• 3D
– ASM volume architecture changes
Virtualisation
• Evaluated MS Hyper-V as the services virtualisation platform
– Beginning to roll out local-storage virtualisation for services that
don’t need fast failover
• Struggled for a long time with iSCSI storage arrays (and poor
support)
– New iSCSI arrays ordered
– To support fast-failover etc
• Cloud project
– Department initiative looking at cloud use
• Talk by Ian Collier
Projects
• Quattor
– Batch and Storage systems under Quattor management
• ~6200 cores, 700+ systems (batch), 500+ systems (storage)
• Significant time saving
– Significant rollout on Grid services node types
• CernVM-FS
– Major deployment at RAL to cope with software distribution issues
– More news in talk by Ian Collier later this week
Questions?