Maui High Performance Computing Center
Open System Support
An AFRL, MHPCC and UH Collaboration
December 18, 2007
Mike McCraney
MHPCC Operations Director
Agenda
• MHPCC Background and History
• Open System Description
• Scheduled and Unscheduled Maintenance
• Application Process
• Additional Information Required
• Summary and Q/A
An AFRL Center
• An Air Force Research Laboratory Center
• Operational since 1993
• Managed by the University of Hawaii
  – Subcontractor partners: SAIC / Boeing
• A DoD High Performance Computing Modernization Program (HPCMP) Distributed Center
• Task Order Contract – maximum estimated ordering value = $181,000,000
  – Performance dependent – 10 years
  – 4-year base period with two 3-year term awards
A DoD HPCMP Distributed Center
The High Performance Computing Modernization Program reports to the Director, Defense Research and Engineering through the DUSD (Science and Technology); its resources are organized into Major Shared Resource Centers and Distributed Centers.
• Major Shared Resource Centers
  – Aeronautical Systems Center (ASC)
  – Army Research Laboratory (ARL)
  – Engineer Research and Development Center (ERDC)
  – Naval Oceanographic Office (NAVO)
• Allocated Distributed Centers
  – Army High Performance Computing Research Center (AHPCRC)
  – Arctic Region Supercomputing Center (ARSC)
  – Maui High Performance Computing Center (MHPCC)
  – Space and Missile Defense Command (SMDC)
• Dedicated Distributed Centers
  – ATC, AFWA, AEDC, AFRL/IF, Eglin, FNMOC, JFCOM/J9
  – NAWC-AD, NAWC-CD, NUWC, RTTC, SIMAF, SSCSD, WSMR
MHPCC HPC History
• 1994 – IBM P2SC Typhoon installed
• 1996–2000 – IBM P2SC
• 2000 – IBM P3 Tempest installed
• 2001 – IBM Netfinity Huinalu installed
• 2002 – IBM P2SC Typhoon retired
• 2002 – IBM P4 Tempest installed
• 2004 – LNXi Evolocity II Koa installed
• 2005 – Cray XD1 Hoku installed
• 2006 – IBM P3 Tempest retired
• 2007 – Dell PowerEdge Jaws installed
• 2007 – IBM P4 Tempest reassigned
[Chart: “MHPCC HPC Growth” – processors, memory, disk (TB), and TFlops, 1996 through 2007.]
Hurricane Configuration Summary
Current Hurricane configuration:
• Eight 32-processor / 32 GB IBM p690 POWER4 “nodes”
• Jobs may be scheduled across nodes for a total of 288p
• Shared-memory jobs can span up to 32p and 32 GB
• 10 TB of shared disk available to all nodes
• LoadLeveler scheduling (job file sketch below)
• One job per node – 32p chunks – can support only 8 simultaneous jobs
• Issues:
  – Old technology, reaching end of life, upgradability issues
  – Cost prohibitive – constant power consumption, ~$400,000 annual power cost
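Because Hurricane runs one job per 32-processor node under LoadLeveler, a full-node request would look roughly like the command file below. This is a minimal sketch only: the limits, file names, and executable are illustrative placeholders, not the actual MHPCC settings.

```bash
#!/bin/sh
# Minimal LoadLeveler command file for a full-node (32-processor) Hurricane job.
# The limits and the executable are placeholders, not actual MHPCC settings.
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 32
# @ wall_clock_limit = 01:00:00
# @ output = run.$(jobid).out
# @ error = run.$(jobid).err
# @ queue

# Launch 32 tasks on the node (poe is the usual launcher on IBM POWER systems).
poe ./my_mpi_app
```

The file would be submitted with llsubmit and monitored with llq.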
Dell Configuration Summary
Proposed Shark configuration:
• 40 “nodes”, each with 4 processors / 8 GB – Intel 3.0 GHz dual-core Woodcrest processors
• Jobs may be scheduled across nodes for a total of 160p
• Shared-memory jobs can span up to 8p and 16 GB
• 10 TB of shared disk available to all nodes
• LSF scheduler (submission sketch below)
• One job per node – 8p chunks – can support up to 40 simultaneous jobs
Features/Issues:
• Shared use as open system and TDS (test and development system)
• Much lower power cost – Intel power management
• System already maintained and in use
• System covered 24x7: UPS, generator
• Possible short-notice downtime
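Shark is scheduled through LSF rather than LoadLeveler, so a node-sized (8p) chunk would be requested roughly as in the script below. This is a sketch under stated assumptions only: the job name, limits, and executable are placeholders, and the MPI launch command may differ on the actual system.

```bash
#!/bin/bash
# Minimal LSF submission script for one node-sized (8-processor) chunk on Shark.
# Job name, limits, and the executable are placeholders, not MHPCC settings.
#BSUB -J shark_test
#BSUB -n 8
#BSUB -R "span[ptile=8]"
#BSUB -W 01:00
#BSUB -o run.%J.out
#BSUB -e run.%J.err

# Launch across the 8 granted slots; older MVAPICH builds may use mpirun_rsh
# or an LSF-provided wrapper instead of plain mpirun.
mpirun -np 8 ./my_mpi_app
```

The script would be submitted with bsub < job.lsf and monitored with bjobs.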
Jaws Architecture
• Head node for system administration
  – “Build” nodes
  – Runs the parallel tools (pdsh, pdcp, etc. – usage sketch below)
• SSH communications between nodes
  – Localized InfiniBand network
  – Private Ethernet
• Dell Remote Access Controllers
  – Private Ethernet
  – Remote power on/off
  – Temperature reporting
  – Operability status
  – Alarms
  – 10 blades per chassis
• CFS Lustre filesystem
  – Shared access
  – High performance
  – Using the InfiniBand fabric
[Diagram: Cisco 6500 core with 10 Gig-E uplinks (Gig-E nodes, 40 nodes per uplink); head node; 3 user/webtop interactive nodes (12 cores); 1280 batch simulation-engine nodes (5120 cores) on a Cisco InfiniBand (copper) fabric; 24 Lustre I/O nodes and 1 MDS connected over Fibre Channel to 200 TB of DDN storage; DREN network connectivity.]
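The head node drives the compute nodes with the parallel tools named above. A brief usage sketch follows; the node names are illustrative, not the actual Jaws hostnames.

```bash
# Run a command on a block of compute nodes and push a file out to them.
# The host range n[001-010] is illustrative, not the real Jaws naming scheme.
pdsh -w n[001-010] uptime            # run 'uptime' on ten nodes in parallel
pdcp -w n[001-010] app.conf /tmp/    # copy app.conf to /tmp on the same nodes
```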
Shark Software
• Systems software
  – Red Hat Enterprise Linux v4 (2.6.9 kernel)
  – InfiniBand: Cisco software stack
  – MVAPICH (MPICH 1.2.7 over the IB library) – build/run sketch below
  – GNU 3.4.6 C/C++/Fortran
  – Intel 9.1 C/C++/Fortran
  – Platform LSF HPC 6.2
  – Platform Rocks
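Putting the stack together, a user would typically compile against the MVAPICH wrappers (which sit on top of the GNU or Intel compilers) and submit through LSF. A minimal sketch, assuming the wrappers are already on the default path; the source files and job parameters are placeholders, not MHPCC-supplied examples.

```bash
# Build MPI codes with the MVAPICH compiler wrappers (placeholder sources).
mpicc  -O2 hello_mpi.c -o hello_mpi
mpif90 -O2 solver.f90  -o solver

# Submit an 8-way run through LSF and check its status.
bsub -n 8 -o hello.%J.out mpirun -np 8 ./hello_mpi
bjobs
```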
Maintenance Schedule
• Current
  – 2:00pm – 4:00pm
  – 2nd and 4th Thursday (as necessary)
  – Check the website (mhpcc.hpc.mil) for maintenance notices
• New proposed schedule
  – 8:00am – 5:00pm
  – 2nd and 4th Wednesdays (as necessary)
  – Check the website for maintenance notices
• Maintenance is taken only on the scheduled systems
• Check on Mondays before submitting jobs
Account Applications and Documentation
• Contact the helpdesk or the website for application information
• Documentation needed:
  – Account names, systems, special requirements
  – Project title, nature of work, accessibility of code
  – Nationality of applicant
  – Collaborative relevance with AFRL
• New requirements
  – “Case file” information
  – For use in AFRL research collaboration
  – Future AFRL applicability
  – Intellectual property shared with AFRL
• Annual account renewals
  – September 30 is the final day of the fiscal year
Summary
• Anticipated migration to Shark
• Should be more productive and able to support a wider range of jobs
• Cutting-edge technology
• Cost savings relative to Hurricane (~$400,000 annually)
• Stay tuned for the timeline – likely end of January or early February
Mahalo