Hosting Large-scale e-Infrastructure Resources
Mark Leese, mark.leese@stfc.ac.uk

Contents
• Speed-dating introduction to STFC
• Idyllic life, pre-e-Infrastructure
• Sample STFC-hosted e-Infrastructure projects
• RAL network re-design
• Other issues to consider

STFC
• One of seven publicly funded UK Research Councils
• Formed from the 2007 merger of CCLRC and PPARC
• STFC does a lot, including…
– awarding research, project and PhD grants
– providing access to international science facilities through its funded membership of bodies like CERN
– sharing its expertise in areas such as materials and space science with academic and industrial communities
• …but it is mainly recognised for hosting large-scale scientific facilities, including High Performance Computing (HPC) resources

Harwell Oxford Campus
• STFC is a major shareholder in the Diamond Light Source: an electron beam is accelerated to near light speed within a ring, and the resulting light (X-ray, UV or IR) interacts with the samples being studied
• ISIS: a 'super-microscope' employing neutron beams to study materials at the atomic level
• STFC's Rutherford Appleton Laboratory (RAL) is part of the Harwell Oxford Science and Innovation Campus, together with UKAEA and a commercial campus management company
• Co-locates hi-tech start-ups and multi-national organisations alongside established scientific and technical expertise
• A similar arrangement exists at Daresbury in Cheshire
• Both sit within George Osborne Enterprise Zones: reduced business rates, and government support for the roll-out of super-fast broadband

Previous Experiences

Large Hadron Collider
[Diagram: the 16.5-mile LHC ring with its four detectors: CMS, ALICE, LHCb and ATLAS]
• LHC at CERN
• Search for the elementary but (as yet) hypothetical Higgs boson particle
• Two proton (hadron) beams
• Four experiments (particle detectors)
• The detector electronics generate data during collisions

LHC and Tier-1
• After initial processing, the four experiments generated 13 PetaBytes of data in 2010 (more than 15 million GB, or 3.3 million single-layer DVDs)
• In the last 12 months, the Tier-1 received ≈ 6 PB from CERN and the other Tier-1s
• GridPP contributes the equivalent of 20,000 PCs

UK Tier-1 at RAL
[Diagram: "normal" data enters the RAL site from the ISP (primary and backup Janet) through the Site Access Router and firewall to internal distribution Router A; LHC data (PetaBytes!) from Tier-0 and the other Tier-1s arrives over the CERN LHC OPN (Optical Private Network), a 10 Gbps lightpath with backup, terminating on a separate UKLight router that also carries Tier-1 to Tier-2 (university) traffic]
• Individual Tier-1 hosts route data to Router A or the UKLight router as appropriate; the config is pushed out with the Quattor grid/cluster management tool
• Access Control Lists of IP addresses on the Site Access Router, the UKLight router and/or the hosts replace firewall security
• As Tier-2 (university) network capabilities increase, so must RAL's (10 → 20 → 30 Gbps)

LOFAR
• LOw Frequency ARray: the world's largest and most sensitive radio telescope
• Thousands of simple dipole antennas across 38 European arrays
• The first UK array opened at Chilbolton, September 2010
• 7 PetaBytes a year of raw data generated (more than 1.5 million DVDs)
• Data transmitted in real time to an IBM BlueGene/P supercomputer at the University of Groningen, where it is processed and combined in software to produce images of the radio sky
• Carried over a 10 Gbps Janet lightpath: Janet → GÉANT → SURFnet
• A big leap from FedEx'ing data tapes or drives: the 2011 RCUK e-IAG report noted that "Southampton and UCL make specific reference ... quicker to courier 1TB of data on a portable drive"
• Funded by LOFAR-UK
• cf. the LHC: centralised rather than distributed processing
• Expected to pioneer the approach for other projects, e.g. the Square Kilometre Array
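Those headline figures are worth a quick sanity check. The sketch below (Python) shows why a dedicated lightpath comfortably absorbs LOFAR's raw data rate, and why couriering drives can still win on slower shared paths. The 7 PB/year and 10 Gbps figures come from the slides above; the 24-hour courier time and the example end-to-end rates are illustrative assumptions.

```python
# Back-of-envelope check on the LOFAR numbers above. The 7 PB/year and
# 10 Gbps figures come from the slides; the courier time and the example
# end-to-end rates are illustrative assumptions.

PB = 10**15                     # decimal petabyte, in bytes
YEAR_S = 365 * 24 * 3600

lofar_bps = 7 * PB * 8 / YEAR_S
print(f"LOFAR sustained rate: {lofar_bps / 1e9:.1f} Gbps")          # ~1.8 Gbps
print(f"10 Gbps lightpath utilisation: {lofar_bps / 10e9:.0%}")     # ~18%

# The RCUK e-IAG courier comparison: 1 TB on a portable drive, assuming
# a 24-hour door-to-door courier (assumption).
COURIER_H = 24
tb_bits = 10**12 * 8
for mbps in (30, 100, 1000):    # example end-to-end network rates
    hours = tb_bits / (mbps * 1e6) / 3600
    winner = "courier wins" if hours > COURIER_H else "network wins"
    print(f"1 TB at {mbps:>4} Mbps: {hours:5.1f} h -> {winner}")
```

At under a fifth of the lightpath's capacity, the sustained rate leaves comfortable headroom, and the crossover in the courier comparison is exactly why "FedEx bandwidth" persisted for so long on slower shared paths.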
Sample STFC e-Infrastructure Projects

ICE-CSE
• International Centre of Excellence for Computational Science and Engineering
• Was to have been called the Hartree Centre; now DFSC
• Based at STFC's Daresbury Laboratory, Cheshire, in partnership with IBM
• Mission: to provide HPC resources and develop software
• Daresbury previously hosted HPCx, the big academic HPC service before HECToR
• IBM BlueGene/Q supercomputer: 114,688 processor cores, 1.4 Petaflops peak performance
• Partner IBM's tests were the first time a Petaflop application (one thousand trillion calculations per second) had been run in the UK
• 13th in this year's TOP500 worldwide list; the rest of Europe appears five times in the Top 10
• DiRAC and HECToR (Edinburgh) sit 20th and 32nd

ICE-CSE network
• The Daresbury network has been upgraded to support up to 8 × 10 Gbps lightpaths to the current regional Janet deliverer, Net North West (NNW), in Liverpool and Manchester
• Same optical fibres, different colours of light:
1. 10G Janet IP service (primary)
2. 10G Janet IP service (secondary)
3. 10G DEISA (consortium of European supercomputers)
4. 10G HECToR (Edinburgh)
5. 10G ISIC (STFC-RAL)
• More are expected as part of the IBM-STFC collaboration
• Feasible because NNW rents its own dark (unlit) fibre network: NNW 'simply' changes the optical equipment on each end of the dark fibre
• A key aim is for the machine and expertise to be available to commercial companies. How? Over Janet?
• A Strategic Vision for UK e-Infrastructure estimates that 1,400 companies could make use of HPC, with 300 quite likely to do so, so even if some instead go for the commercial "cloud" option…

JASMIN & CEMS
• Joint Analysis System Meeting Infrastructure Needs
• JASMIN and CEMS are funded by BIS, through NERC and through UKSA and ISIC respectively
• A big compute and storage cluster for the climate and earth system modelling community, with 4.6 PetaBytes of fast disc storage
[Diagram: JASMIN talks internally to other STFC resources; to its satellite systems (150 TB, and 150 TB compute + 500 TB); and to the Netherlands, the Met Office and Edinburgh over UKLight]

CEMS in the ISIC
• Climate and Environmental Monitoring from Space: essentially JASMIN for commercial users
• Promotes the use of 'space' data and technology within new market sectors
• Four consortia have already won funding from the publicly funded 'Space for Growth' competition (run by UKSA, TSB and SEEDA)
• Hosted in the International Space Innovation Centre (ISIC), a 'not-for-profit' formed by industry, academia and government, part of the UK's Space Innovation and Growth Strategy to grow the sector's turnover
• ISIC is an STFC 'Partner Organisation' in terms of the Janet Eligibility Policy, so:
– Janet-BCE (Business and Community Engagement) provides network access related to academic work and ISIC partners
– a commercial ISP provides network access related to commercial customers
• As the industrial collaboration agenda is pushed, this needs to be controlled, and applicable elsewhere in STFC
[Diagram: Janet and Janet-BCE traffic arrives via Janet, commercial traffic via BT; 10 Gbps fibres connect the RAL infrastructure (JASMIN router) and the ISIC switch/CEMS router, carrying separate Janet-BCE and commercial-customer VLANs; no CEMS traffic is permitted across the JASMIN link]
• JASMIN and CEMS are connected at 10 Gbps…
• …but there is no Janet access for CEMS via JASMIN
• Keeping Janet-'permitted' traffic on a separate BCE VLAN allows tighter control
• Customers will access CEMS on different IP addresses depending on who they are (academia, partners, commercial), and this could be enforced (a sketch follows below)
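To make that last point concrete, here is a minimal sketch of what the IP-based separation might look like, assuming enforcement by address-based ACLs in the style already used for the Tier-1. The three customer classes come from the slide above; all addresses (RFC 5737 documentation ranges) and the class-to-prefix mapping are illustrative assumptions, not RAL's actual configuration.

```python
import ipaddress

# Which CEMS service address each customer class uses, and which source
# prefixes belong to each class. All addresses are illustrative (RFC 5737
# documentation ranges), not RAL's real numbering.
CEMS_SERVICE = {
    "academic":   ipaddress.ip_address("192.0.2.10"),   # reached via Janet
    "partner":    ipaddress.ip_address("192.0.2.20"),   # via Janet-BCE VLAN
    "commercial": ipaddress.ip_address("192.0.2.30"),   # via commercial ISP VLAN
}
CUSTOMER_PREFIXES = {
    "academic":   [ipaddress.ip_network("198.51.100.0/24")],
    "partner":    [ipaddress.ip_network("203.0.113.0/25")],
    "commercial": [ipaddress.ip_network("203.0.113.128/25")],
}

def permitted(src: str, dst: str) -> bool:
    """ACL-style check: a source may only reach the CEMS address assigned
    to its own customer class; everything else is denied."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for cls, nets in CUSTOMER_PREFIXES.items():
        if any(s in n for n in nets):
            return d == CEMS_SERVICE[cls]
    return False  # unknown sources are denied outright

# A commercial customer reaching the commercial-facing address: allowed.
assert permitted("203.0.113.200", "192.0.2.30")
# The same customer trying the Janet-facing address: denied.
assert not permitted("203.0.113.200", "192.0.2.10")
```

A table like this could equally be rendered into router ACL lines and pushed out by a tool such as Quattor, as described for the Tier-1 earlier.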
RAL Network Re-Design & Other Issues

RAL Network Re-Design
[Diagram: the current layout — Janet and the CERN LHC OPN arrive from the outside world at the RAL PoP; "normal" data passes through the Site Access Router and firewall to internal distribution Router A (serving ISIS, JASMIN, admin and the rest of the site), while LHC data reaches the Tier-1 via the UKLight router]
Two main aims:
1. Resilience: reduce serial paths and single points of failure.
2. Scalability and flexibility: remove the need for special cases, making the addition of bandwidth and of 'clouds' (e.g. the Tier-1, or tenants) a repeatable process with known costs.

External Connectivity, Site Access & Distribution
[Diagram: the target design — external connections (primary and backup Janet, the CERN LHC OPN and a commercial ISP) land on a pair of RAL PoP routers, Rtr 1 and Rtr 2; campus access and distribution switches Sw 1 and Sw 2 feed internal site distribution (Router A and the rest of the RAL site), projects/facilities/departments, tenants and the Tier-1; site visitors sit behind a virtual firewall on the campus router, and an implicit trust relationship means some clouds bypass the firewall]
• Rtr 1 & 2 and Sw 1 & 2: 48 × 1/10 GbE ports (SFP+) on the front, 4 × 40 GbE ports (QSFP+) on the back
• Lots of 10 Gigs means:
– clouds and new providers can be readily added
– bandwidth can be readily added to existing clouds
– clouds can be dual-connected

RAL Site Resilience
[Diagram: site map (scale 500 ft / 100 m) showing diversely routed connections, primary towards Reading and backup towards London]

User Education
• The belief that you can plug a node or cluster into "the network" and immediately start firing lots of data all over the world is a fallacy
• Over-provisioning is not a complete solution
• Having invested £millions elsewhere, most network problems that do arise are within the last mile: the campus network, individual devices, and applications
• On the end systems, look at the:
– Network Interface Card
– hard disc
– TCP configuration
– cabling (poor cabling causes real problems)
– use of parallel TCP streams: does your application use them?
– protocols your application uses for data transfer (GridFTP, HTTP…)
• Know what to do on your end systems, and know what questions to ask of others (the sketch below shows why the TCP items matter)
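A short illustration of why the TCP items on that list matter so much on long paths. The sketch uses the standard window/RTT throughput bound and the Mathis et al. loss model; the RTT, window sizes and loss rate are illustrative assumptions, not measurements from RAL.

```python
# Why end-system TCP configuration matters on long fat networks.
# A single TCP stream can never exceed window/RTT, and the Mathis et al.
# steady-state model caps it further under packet loss:
#     rate <= (MSS / RTT) * C / sqrt(p),  with C ~ 1.22
# All figures below (RTT, windows, loss) are illustrative assumptions.

from math import sqrt

def window_limit_mbps(window_bytes, rtt_s):
    # Hard ceiling for one stream: one window per round trip.
    return window_bytes * 8 / rtt_s / 1e6

def mathis_limit_mbps(mss_bytes, rtt_s, loss_rate, c=1.22):
    # Mathis et al. bound for one loss-limited stream.
    return (mss_bytes * 8 / rtt_s) * c / sqrt(loss_rate) / 1e6

RTT = 0.150   # ~150 ms UK <-> US west coast (illustrative)

# Un-tuned host: a 64 KB default window caps one stream at a few Mbps,
# however fat the pipe.
print(f"64 KB window: {window_limit_mbps(64 * 1024, RTT):6.1f} Mbps")   # ~3.5

# Tuned host: 16 MB buffers lift the ceiling to ~900 Mbps.
print(f"16 MB window: {window_limit_mbps(16 * 2**20, RTT):6.1f} Mbps")  # ~894.8

# Even 0.01% packet loss caps a single tuned stream far below that...
one = mathis_limit_mbps(1460, RTT, 1e-4)
print(f"0.01% loss:   {one:6.1f} Mbps")                                 # ~9.5

# ...which is why GridFTP-style tools open parallel streams: each stream
# has its own window and loss recovery, so the aggregate scales roughly
# with the number of streams.
print(f"8 streams:    {8 * one:6.1f} Mbps")                             # ~76.0
```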
User Support
• A 2010 example: CMIP5, with RAL Space sharing environmental data with Lawrence Livermore (US west coast) and DKRZ (Germany):
– ESNet, California → GÉANT, London: 800 Mbps
– ESNet, California → RAL Space: 30 Mbps
– RAL Space → DKRZ, Germany: 40 Mbps
• So RAL is the problem, right? Not necessarily…
– DKRZ, Germany → RAL Space: up to 700 Mbps
• Resolving it involved six distinct parties: RAL Space, STFC Networking, Janet, DANTE, ESNet and LLNL
• Difficult, although the experience probably fed into the aforementioned JASMIN
• Tildesley's A Strategic Vision for UK e-Infrastructure talks of "the additional effort to provide the skills and training needed for advice and guidance on matching end-systems to high-capacity networks"

I'll do anything for a free lunch
• Access Control and Identity Management:
– During the DTI's e-Science programme, access to resources was often controlled using personal X.509 certificates
– Is that scalable? Will you run, or pay for, a PKI?
– Resource providers may want to try Moonshot: an extension of eduroam technology in which users of e-Infrastructure resources are authenticated with user credentials held by their employer
• Will the Janet Brokerage be applicable to HPC e-Infrastructure resources?

Conclusions
From the STFC networking perspective:
• Adding bandwidth should be a repeatable process with known costs
• Networking is now a core utility, just like electricity: plan for resilience on many levels
• Plan for commercial interaction
• In all the excitement, don't forget security
• e-Infrastructure funding is paying for capital investments; be aware of the recurrent costs