HPCx: an Overview
Dr Arthur Trew
Director, EPCC

what is HPCx?
• HPCx is the latest in a series of HPC services for UK academia
– £52.9M from UK Research Councils
– £1M from IBM for HPC R&D at EPCC
– £600k from IBM for Life Sciences outreach
• UoE HPCx Ltd runs the contract
– wholly-owned subsidiary of the University
– CCLRC, EPCC and IBM are subcontractors
• EPSRC’s objectives for the procurement were
– “to deliver the optimum service resulting in world-leading science”
– “address the problems in scaling codes to capability levels (512+)”
• … so, the challenges we face are to
– support change from capacity to capability
– develop more scalable codes
• science support is the key to success
IBM Team Talent Meeting
10 February 2004
the story so far
• Phase 1 service started on 9 December 2002
• The first year was extremely successful
– CPU utilisation has grown to >80%
– 25+ user groups, ~350 users
[Figure: Utilisation of Capability Region (%), Jan-03 to Dec-03]
[Figure: CPU Usage by Job Size (AUs), Jan-03 to Dec-03; job sizes 8, 16, 32, 64, 128, 256, 512, 1024, >1024]
Metric | TSL | FSL | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Ave
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Technology serviceability (%) | 80 | 99.2 | 99.6 | 98.0 | 96.8 | 99.9 | 99.8 | 100.0 | 99.5 | 98.5 | 99.6 | 98.2 | 99.9 | 100.0 | 99.1
Technology MTBF (hours) | 200 | 300 | 293 | 183 | 81 | 732 | 732 | | 366 | 183 | 418 | 209 | 1464 | | 300
Number of AV FTEs | 7.5 | 10 | 12.6 | 11.7 | 12.6 | 13.5 | 11.6 | 12.9 | 12.4 | 9.8 | 10.9 | 11.2 | 13.0 | 10.8 | 11.9
Number of training days per month | 30/12 | 40/12 | 10/1 | 17/2 | 17/3 | 24/4 | 33/5 | 33/6 | 33/7 | 33/8 | 35/9 | 40/10 | 49/11 | 50/12 | 50/12
Queries resolved <3 days (%) | 85 | 97 | 98.7 | 98.7 | 97.8 | 100.0 | 100.0 | 100.0 | 100.0 | 98.5 | 100.0 | 100.0 | 100.0 | 100.0 | 99.5
Number of A&M FTEs | 3.75 | 5.75 | 8.2 | 7.1 | 7.9 | 5.4 | 5.4 | 5.6 | 6.7 | 5.1 | 6.7 | 7.9 | 6.5 | 5.4 | 6.5
A&M serviceability (%) | 80 | 100 | 99.4 | 99.6 | 99.9 | 100.0 | 99.9 | 99.5 | 100.0 | 99.9 | 99.9 | 98.8 | 99.9 | 99.7 | 99.7
• … but the Colony switch did have poor performance and reliability
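The slide does not say how the Ave column of the 2003 metrics is derived. A minimal sketch, assuming it is a plain unweighted mean of the twelve monthly figures, checks that assumption for three of the percentage rows. The monthly values and reported averages are copied from the metrics; the function name `consistent_with_mean` and the 0.1 tolerance are illustrative choices, not part of the original material.

```python
from statistics import mean

# Monthly figures (Jan-Dec 2003) and the reported "Ave" value for each row.
# Assumption: "Ave" is an unweighted mean of the months; the slides do not state this.
rows = {
    "Technology serviceability (%)": (
        [99.6, 98.0, 96.8, 99.9, 99.8, 100.0, 99.5, 98.5, 99.6, 98.2, 99.9, 100.0],
        99.1,
    ),
    "Queries resolved <3 days (%)": (
        [98.7, 98.7, 97.8, 100.0, 100.0, 100.0, 100.0, 98.5, 100.0, 100.0, 100.0, 100.0],
        99.5,
    ),
    "A&M serviceability (%)": (
        [99.4, 99.6, 99.9, 100.0, 99.9, 99.5, 100.0, 99.9, 99.9, 98.8, 99.9, 99.7],
        99.7,
    ),
}


def consistent_with_mean(monthly, reported, tol=0.1):
    """Return True if the reported average is within `tol` of the simple mean."""
    return abs(mean(monthly) - reported) <= tol


for name, (monthly, reported) in rows.items():
    status = "consistent" if consistent_with_mean(monthly, reported) else "differs"
    print(f"{name}: mean {mean(monthly):.2f}, reported {reported} ({status})")
```

A tolerance is used rather than exact rounding because the reported figures carry only one decimal place, and the slides do not say whether they were rounded or truncated.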
looking forward
• the Phase 1 → Phase 2 upgrade is the most risky part of the project
– new hardware, Colony → Federation
– new software, PSSP → CSM
• EPSRC has funded a small Phase 2 development machine
– so, the support teams are better prepared
– but switch performance is (currently) poor
– … and unlikely to satisfy EPSRC
• termination is unlikely, but the relationship with EPSRC could be less cordial post-Phase 2