RAL Site Report
HEPiX 20th Anniversary Fall 2011, Vancouver, 24-28 October
Martin Bly, STFC-RAL

Overview
• General
• Hardware
• Storage
• Networking
• …

02/05/2011 RAL Site Report - HEPiX Spring 2011

General
• New CEO for STFC
  – John Womersley takes over from Keith Mason on 1st November
  – To 31st March 2015
• Staffing @ Tier1
  – 5 staff posts open due to staff moving
  – Replacements agreed despite restrictions
  – Recruitment underway
• Power
  – ‘Partial Discharge’ (arcing) detected in 11kV bus in transformer room
  – Isolated to the join between two bus segments (bus-coupler)
  – Loose bolt in bus bar identified and tightened – fixed

Hardware changes
• Summary of previous report:
  – 13 x Dell R610 tape servers (10GbE) for T10KC drives
  – 14 x T10KC tape drives
  – Arista 7124S 24-port 10GbE switch + twinax copper interconnects
  – 5 x Avaya 5650 switches + various 10/100/1000 switches
• New since May
  – Various Dell R510s as small data servers for the Facilities Data Service, which provides interfaces into Castor for RAL site facilities and others
  – 68 x 40TB 4U servers ordered for capacity storage – two suppliers
    • 10GbE, 2TB HDD, single CPU, 24GB RAM, 2.66PB total
    • Note that disks may be hard to get
  – Tender for 15,000 HEP-SPEC: evaluation completed, result just announced
• To come
  – 40GbE/10GbE and 10GbE/1GbE switches, management switches, more tape servers, T10KC tape drives and tapes, iSCSI arrays, ...
• Gone: 22 x 10TB servers – 2005 generation
• To go: 86 x 6TB servers – 2006 generation

Storage Issues
• Issue with some 3ware controllers throwing perfectly healthy WD drives
  – Due to firmware not recognising and handling a failure mode on newer WD drives of the same model
  – Firmware update has fixed this, rollout completed
• Issue with Adaptec controllers and StorageManager software
  – StorageManager reports many SMART errors on healthy drives
    • It reports unhealthy ones too
  – Firmware update has fixed this, rolling out shortly
• Problem with T10KC drives
  – Early production batch issue
  – Firmware fix
  – No recurrence
• Production storage now uses the most recent sets of hardware, with older (smaller-capacity) hardware kept as ‘spinning reserve’

Castor Status
• Castor manages disk and tape storage
  – 18 million files (at Oct 2011)
• Recent news:
  – Moved to T10KC tape media in production in September (Atlas, LHCb)
  – New (non-Tier1) production instance for Diamond synchrotron
    • Part of a new complete Facilities Data Service, which provides transparent data aggregation (StorageD), a metadata service (ICAT), and web (TopCAT) and FUSE frontends to access data
• Coming up (Jan-Mar):
  – Move to new database hardware and a more resilient architecture (using Data Guard) over next 6 months
  – Major upgrade of CASTOR with a new optimized scheduler and new tape functionality – better for small files
  – New service ‘head nodes’ in test: Dell R410 and Transtec

Networking
• WAN
  – UK NREN JANET now has a 100Gb/s backbone
  – Funding for the next upgrade of the NREN, SuperJANet6, has recently been approved
• Site
  – Sporadic packet loss in site core networking (a few %)
    • Still present to a very small degree
  – Intermittent problems with access to LFC dropping for remote users (T2s). May be load related.
• Asymmetric data transfer rates in/out of Tier1
  – Many possible causes: load, FTS settings, disk server settings, TCP/IP tuning, network (LAN & WAN) performance
  – Have modified FTS settings with some success
  – Looking at Tier1-UK Tier2 transfers
• LAN
  – Another failed 10GbE XFP transceiver, and a death in service of a Nortel 5510
  – Three subnets in use for Tier1
  – Lots of packet discards into stacks, investigating...
• Developments
  – Looking to provide large bandwidth in Tier1 core with ‘mesh-type’ arrangement linked at multiple 40Gb/s, with storage connectivity at 10Gb/s

Databases
• Small but significant Oracle installation
  – Castor, 3D, LFC, FTS
• Castor database server hardware to be replaced
  – Old: 2 x 5-node (32bit) RACs, EMC AX4 arrays
  – New: 2 pairs of 3-node (64bit) RACs, EMC AX4 + Infortrend arrays
  – Different ASM architecture – single volumes rather than paired
  – Data Guard from Production RAC to Standby RAC for resilience
  – Standby RACs in different building
  – Backups off the Standby set
• LFC/FTS
  – Standby set to be added to the existing setup, Data Guard and backup as per Castor, single volume data, ASM volume architecture changes
• 3D
  – ASM volume architecture changes

Virtualisation
• Evaluated MS Hyper-V as services virtualisation platform
  – Beginning to roll out local-storage virtualisation for services that don’t need fast failover
• Struggled for a long time with iSCSI storage arrays (and poor support)
  – New iSCSI arrays ordered
  – To support fast failover etc.
• Cloud project
  – Department initiative looking at cloud use
    • Talk by Ian Collier

Projects
• Quattor
  – Batch and Storage systems under Quattor management
    • ~6200 cores, 700+ systems (batch), 500+ systems (storage)
    • Significant time saving
  – Significant rollout on Grid services node types
• CernVM-FS
  – Major deployment at RAL to cope with software distribution issues
  – More news in talk by Ian Collier later this week

Questions?
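
Note on the capacity figure quoted under Hardware changes (68 x 40TB servers, 2.66PB total): the sketch below checks the arithmetic. It is an illustration only; the function name and the choice of conversion factor are assumptions, not something stated in the slides. The quoted 2.66PB is reproduced if 1PB is taken as 1024TB, whereas a decimal conversion (1PB = 1000TB) gives 2.72PB.

```python
def total_capacity_pb(servers: int, tb_per_server: float, tb_per_pb: float) -> float:
    """Total capacity in PB for a batch of identical servers.

    tb_per_pb is the conversion factor: 1024 (binary-style) or 1000 (decimal).
    Hypothetical helper for illustration only.
    """
    return servers * tb_per_server / tb_per_pb


# Figures taken from the Hardware changes slide: 68 servers of 40TB each.
binary_pb = total_capacity_pb(68, 40, 1024)   # assumes 1PB = 1024TB
decimal_pb = total_capacity_pb(68, 40, 1000)  # assumes 1PB = 1000TB

print(f"1PB = 1024TB: {binary_pb:.2f} PB")
print(f"1PB = 1000TB: {decimal_pb:.2f} PB")
```

The raw total is 68 x 40 = 2720TB in either case; only the TB-to-PB conversion differs.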