HPCx: an Overview
Dr Arthur Trew
Director, EPCC

what is HPCx?
• HPCx is the latest in a series of HPC services for UK academia
  – £52.9M from UK Research Councils
  – £1M from IBM for HPC R&D at EPCC
  – £600k from IBM for Life Sciences outreach
• UoE HPCx Ltd runs the contract
  – wholly-owned subsidiary of the University
  – CCLRC, EPCC and IBM are subcontractors
• EPSRC’s objectives for the procurement were
  – “to deliver the optimum service resulting in world-leading science”
  – “address the problems in scaling codes to capability levels (512+)”
• … so, the challenges we face are to
  – support change from capacity to capability
  – develop more scalable codes
• science support is the key to success

IBM Team Talent Meeting, 10 February 2004

the story so far
• Phase 1 service started on 9 December 2002
• The first year was extremely successful
  – CPU utilisation grown to >80%
  – 25+ user groups, ~350 users

[Figure: Utilisation of Capability Region, by month, Jan-03 to Dec-03 (0% to 100%)]
[Figure: CPU Usage by Job Size, in AUs (0 to 2,000,000), by month, Jan-03 to Dec-03; job sizes 8, 16, 32, 64, 128, 256, 512, 1024, >1024]

Metric                             TSL    FSL    Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec    Ave
Technology serviceability (%)      80     99.2   99.6   98.0   96.8   99.9   99.8   100.0  99.5   98.5   99.6   98.2   99.9   100.0  99.1
Number of AV FTEs                  7.5    10     12.6   11.7   12.6   13.5   11.6   12.9   12.4   9.8    10.9   11.2   13.0   10.8   11.9
Number of training days per month  30/12  40/12  10/1   17/2   17/3   24/4   33/5   33/6   33/7   33/8   35/9   40/10  49/11  50/12  50/12
Queries resolved <3 days (%)       85     97     98.7   98.7   97.8   100.0  100.0  100.0  100.0  98.5   100.0  100.0  100.0  100.0  99.5
A&M serviceability (%)             80     100    99.4   99.6   99.9   100.0  99.9   99.5   100.0  99.9   99.9   98.8   99.9   99.7   99.7

(column alignment not recoverable for the following rows; values in source order)
Technology MTBF (hours): 200 300 293 183 81 732 732 366 183 418 209 1464 300
Number of A&M FTEs: 3.7 5 5.7 5 8.2 7.1 7.9 5.4 5.4 5.6 6.7 5.1 6.7 7.9 6.5 5.4 6.5

• … but the Colony switch did have poor performance and reliability

looking forward
• the Phase 1 → Phase 2 upgrade is the most risky part of the project
  – new hardware, Colony → Federation
  – new software, PSSP → CSM
• EPSRC has funded a small Phase 2 development machine
  – so, the support teams are more prepared
  – but switch performance is (currently) poor
  – … and unlikely to satisfy EPSRC
• termination is unlikely, but the relationship with EPSRC could be less cordial post-Phase 2