Outsourcing University Services
Future of National Computer Grid Services in the UK
Dr Rhys Newman, University of Oxford
NeSC, 22nd Feb 2007

My Background…
Academic researching Computer Grid Technology (more on this later)
I work in the physics department, although I am a software engineer by trade
Spent 6 months project-leading a small-scale computer room build for Oxford Physics
Spent 2 further years on the committee overseeing a computer room build for the expansion of the Oxford Supercomputer
Spent most of last year campaigning for outsourcing CPU provision in the face of the above experiences
Am a director of a University spin-out company which aims to bring grid computing technology to market
This experience has meant I have looked in detail at the economic argument for grid computing (at least in connection with CPU usage), and compared it in particular with outsourcing CPU time and with building your own computer facility
I therefore feel well qualified to comment on the issue of “Outsourcing your CPU”

Current Status in the UK
[Chart: UK Supercomputing Performance. Percentage of top 500 power (“Relative Power”) in the UK by year, 2000 to 2006]
The UK maintains about 7% of the top 500 supercomputer power (6% in 2006), even though the total CPU power has increased by 35x since 2000.
To get into the top 500 you’ll need about 1000 processor cores at 2.4 GHz or better
A cluster similar to Cambridge’s recent supercomputer, but equivalent to the #1 supercomputer in GFlops:
  18 000 Dual Core Xeon machines
  Cost over £30 million to buy (computers only at £2500 each retail)
  Consume over 15MW and cost £8 million in electricity to run per year
  Would provide approximately 1 billion GHzHrs per year
  Would need a machine room the size of a football pitch

What do Academic Users want?
More computing!!!
Typically x86 Linux based
[Chart: Percentage x86 Architecture]
Power processors coming in artificially in 2006 due to Blue Gene upgrade.
Important for cost
Infiniband appearing before 10GBit Ethernet?
[Chart: Interconnect Family. Percentage Standard Ethernet and Myrinet by year, 2000 to 2006]
Clusters with raw CPU “grunt” rather than special hardware or interconnect.

Correlation of GFlops and GHzHrs
Clock speed (GHzHrs) correlates to performance within machine families:
[Charts: SpecInt2000 vs GHz in 2004, 2005 and 2006; series 1x1, 1x2, 2x1, 2x2]

To GHzHr or not to GHzHr?
Despite the flaws in this measurement of power, observe the following prices on Dell.co.uk
[Chart: Cost (£ per GHzHr), axis 0 to 0.02, for the following machines]
  Dual Quad Core Xeon 5355 2.66GHz (8 cores)
  Dual Dual Core Xeon 5130 2GHz (4 cores)
  Dual Dual Core Xeon 5050 3GHz (4 cores)
  Quad Core Xeon 5355 2.66GHz (4 cores)
  Dual Core Xeon 5050 3GHz (2 cores)
  Dual Dual Core AMD Opteron 2.8GHz (4 cores)
  Dual Core AMD Opteron 2.8GHz (2 cores)
  Dual Core AMD Opteron 2.4GHz (2 cores)
  Dual Core AMD Opteron 2GHz (2 cores)
The average is 1.36p per GHzHr (2.53p if you use Hire Purchase).
These values were 1.15p per GHzHr 6 months ago (1.39p on HP)
This suggests a de facto acceptance of GHzHrs as a basis of price

How to get the most GHzHrs for £
1. Buy your own computers, build your room and run a computing facility
2. Rent computers and hosting from external provider
3. Use grid computing to extract value from existing machines

Option 1: Build your own Facility
Advantages
  You get good PR when it opens
  You can get exactly the equipment you want….well almost
Disadvantages
  The project risks of building and commissioning such facilities are surprisingly large
  No flexibility – run at 100% all the time or waste the investment
  All the responsibility: uptime, hardware failures, hardware refresh, being everything to everyone!
  Real cost

The costs of a computer room
Cost of computer hardware is 1.0p per GHzHr
However “Bare Bones” facility calculation for a 1000 Dual CPU node cluster shows
  £1.3 million running costs per year (£500k in electricity alone)
  GHzHr rate 1.27p to 1.43p
  £4.6 to £6.3 million startup costs
  This build has no UPS or other “high availability” features
This should be the cost Universities should be able to pass on to internal users
However inherent inefficiencies in the internal process mean this rapidly becomes more than 5p per GHzHr – if the university can find the initial capital in the first place!
Anecdotally one institution believes it is possible to charge 10p per GHzHr and expect:
  Their academics to pay it
  The research councils to accept the charge on the FEC project sheet

The chain of events…..
More and more research areas need substantial computing resources – more than any department can contemplate
The University steps in to provide a central computing facility – more efficient on the surface
You now need a computer room built to modern spec (as you’ll need 1000 CPUs)
Almost always requires a substantial building project
Building projects are notoriously late and over budget (typical fully costed build multiplies initial quotes by 2)
Every department is now at the mercy of the progress of this central project, which becomes politically more risky to control as costs overrun and deadlines are missed
Computing facility comes online and has to recover much larger costs than anticipated from departments (and their research projects) – almost inevitable now we have FEC
A late project has had an academic opportunity cost which is difficult to quantify (but almost certainly has cost research grants), and the overpricing needed has made it unattractive to use
University steps in to force use (User charge bumped up with general fund support)
Nobody is happy, costs are high and research has suffered
Even worse if certain “high spec” users of computing have persuaded the University to spend extra money on hardware which they need, as this results in central funds sponsoring a particular group’s work at the expense of other uses

Option 2: Outsource CPU Resources
Advantages
  You can get the best value from a load of suppliers (competition is fierce)
  You can often get your resources online in less than a week
  You have no building project risks, and the hardware maintenance, infrastructure resilience and operational hassle are no longer yours
  You are flexible – grow and shrink as necessary
Disadvantages
  You don’t get as large a choice of hardware

Price Comparison for GHzHrs
Some prices from the web for dedicated hosting
5p was the norm 6 months ago
Worthy mention: www.VCompute.com, 5p/GHzHr but in conventional cluster arrangement with 8GB RAM on each node, happy to supply over 10000 nodes
Reasons for Variation
  Different RAM
  Pentium/XEON/AMD
  Bandwidth restrictions
  HDD size
  Additional services
[Chart: Pence per GHzHr (axis 0 to 6) for 16 hosting offers]

Option 3: Use Existing Resources Better
Grid Technology can enable the thousands of machines in an institution to be utilised much more effectively
This is a real resource which is going to waste: estimated £100 billion globally per annum.
Office machines can be used to soak up the more conventional computing tasks, leaving only specialist tasks for special machines
Specialist machines can be smaller and come back into the departmental remit – where they belong!

Potential locked away….
How many computers in Oxford?
  Oxford University has 5000 staff
  50000 registered IP addresses
  Suggests 10000 modern machines available
How many in the UK academic sector?
  168 institutions employing 160000 staff
  Assume 200000 “decent” machines (2 GHz or better) available and connected to the LAN
  3.5 Billion GHzHrs total, 2.6 Billion outside office hours
Total incremental cost of this is dominated by the extra electricity: £50000 per year per institution
  Equivalent to 0.3p per GHzHr
For a UK-wide cost of £8m/year, we could have 2 Darwin machines
  Equivalent to #13 in the top 500

Grid Technology: Nereus
Any proposed technology which attempts to exploit these idle machines must
  Support Windows primarily (90% of all computers run Windows not Linux)
  Not require admin privileges to run
  Be bulletproof to protect users and owners from each other (and limit support calls)
  Be simple and easy to install
My particular interest and project: Nereus
  In development for 2 years, currently in beta
  Testing phase set to begin within weeks on many thousands of machines (ironically not in academia and not in the UK!)
  Solves the above issues in a way not addressed by any current grid middleware

Recommendations
Do not…
  Build any more computer rooms at an institution level
  Waste money on large special hardware
  Wait any longer to catch up the rest of the world in computing resources
Do…..
  Outsource computing resources to specialist providers
  Soak up the existing resources in institutions using grid technology
  Let special projects buy their special hardware for their own use as before

Finally a Request for a “composite” National Grid Service:
We can build an academic grid using Nereus which pools the idle time of all UK institutions – a resource of global capability
Can the NGS supply conventional clusters and also manage a desktop grid deployment, to ensure the right users get the best resources per £?
Can anyone suggest a means to fund the desktop grid part in the UK? A small outlay will have massive benefits
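
Appendix: checking the desktop grid arithmetic
The desktop grid figures quoted in this talk (200000 machines, 3.5 billion GHzHrs, about 0.3p per GHzHr) can be reproduced with a few lines of Python. This is only a rough sketch: the function names are illustrative, each machine is assumed to deliver exactly 2 GHz around the clock, and the 2.6 billion out-of-hours figure is taken from the slide as given rather than derived.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def ghz_hours_per_year(machines, ghz_per_machine, hours=HOURS_PER_YEAR):
    """Total GHzHrs delivered by a pool of machines running for `hours`."""
    return machines * ghz_per_machine * hours


def pence_per_ghz_hour(annual_cost_pounds, ghz_hours):
    """Convert an annual cost in pounds into pence per GHzHr."""
    return annual_cost_pounds * 100 / ghz_hours


# Figures from the slides; 2.0 GHz per machine is an assumption ("2 GHz or better").
uk_machines = 200_000
total_ghz_hours = ghz_hours_per_year(uk_machines, 2.0)

# Extra electricity: 168 institutions at 50000 pounds per year each.
annual_cost_pounds = 168 * 50_000

# Out-of-hours capacity as quoted on the slide, not derived here.
out_of_hours_ghz_hours = 2.6e9
rate = pence_per_ghz_hour(annual_cost_pounds, out_of_hours_ghz_hours)

print(f"Total capacity: {total_ghz_hours / 1e9:.2f} billion GHzHrs per year")
print(f"UK-wide electricity cost: {annual_cost_pounds / 1e6:.1f} million pounds per year")
print(f"Out-of-hours rate: {rate:.2f}p per GHzHr")
```

Under these assumptions the total comes out at about 3.5 billion GHzHrs per year, the UK-wide cost at about £8.4 million, and the out-of-hours rate at roughly 0.32p per GHzHr, in line with the rounded figures on the slides.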