Lessons from Experience and Future Directions
Arch Davis (GS*69)
Davis Systems Engineering
• 1. (not parallel) Fastest possible serial
  – a. Make it complex
  – b. Limits
• 2. Old superscalar, vector Crays, etc.
• 3. Silicon Graphics shared memory (<64 CPUs)
• 4. Intel shared memory: 2-32 processor servers
• 5. Distributed memory: “Beowulf” clusters
• 6. Biggest distributed memory: NEC SX-6 “Earth Simulator”
• 1. Processor type and speed
• 2. Single or dual
• 3. Type of memory
• 4. Disk topology
• 5. Interconnection technology
• 6. Physical packaging
• 7. Reliability
• But to be reliable, more must be watched:
  – Power supplies
  – Fans
  – Motherboard components
  – Packaging layout
  – Heat dissipation
  – Power quality
• To be cost-effective, configure carefully:
  – Easy to overspecify and cost >2x what is necessary
  – Don’t overdo the connections; they cost a lot.
– The old woman swallowed a fly. Be careful your budget doesn’t die.
• A. Pentium 4: inexpensive, if not leading-edge speed
• B. Xeon: essentially a dual-processor P4; two share a motherboard
• C. AMD Opteron: 64-bit; needed for >2 GB memory
• D. (future) Intel 64-bit: will be AMD-compatible!
• E. IBM 970 (G5): true 64-bit design; the one Apple is using
• F. Intel Itanium (“Ititanic”): 64-bit, long instruction word
• 1. Disk per board
• 2. Diskless + RAID
• Always a desire for far more speed than is possible
• Latency is ultimately an issue of light speed (a simple model follows this list)
• Existing options:
  – 1. Ethernet, including switched Gigabit
    • Very robust; by Dave Boggs (EECS ’72)
    • Affordable, even at Gigabit speeds
  – 2. InfiniBand, switched
  – 3. Proprietary: Myrinet, Quadrics, Dolphin
    • Various topologies, including 2- and 3-D meshes
    • Remote DMA may be the transfer method
    • Assumes a noise-free channel; may have CRC
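To make the latency point concrete, the usual first-order model for point-to-point message time is (an illustrative addition, not from the original slides):

    T(n) \approx \alpha + n / \beta

where n is the message size, \alpha the per-message latency (switch hops and software overhead, ultimately bounded below by the speed of light over the cable length), and \beta the link bandwidth. Small messages are latency-bound, so a faster processor does not help them; only large messages see the full bandwidth of the interconnect.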
• It’s not “rocket science,” but it takes care.
• A few equations now and then never hurt when you are doing heat transfer design (a worked example follows this list).
• How convenient is it to service?
• How compact is the cluster?
• What about the little things? Lights & buttons?
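As one example of such an equation (an illustrative sketch, not from the original slides), the basic forced-air cooling balance ties heat load to required airflow:

    P = \dot{m} \, c_p \, \Delta T

where P is the dissipated power, \dot{m} the air mass flow, c_p \approx 1005 J/(kg K) for air, and \Delta T the allowable temperature rise of the cooling air. Removing 2 kW with a 10 °C rise, for instance, requires roughly 0.2 kg/s of air, about 600 m^3/h (roughly 350 CFM), which drives the fan selection and packaging layout.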
• “Take care of yourself, you never know how long you will live.”
• Quality is designed-in, not an accident.
• Many factors affect reliability.
• Truism: “All PCs are the same. Buy the cheapest and save.”
• Mil-spec spirit can be followed without gold plate.
• Many components and procedures affect the result.
• Early philosophy: triage of failing modules
• Later philosophy: entire-cluster uptime
• Consequence of long uptime: user confidence, greatly accelerated research
Benchmarks
• Not synthetic: 100 timesteps of the Terra code (John R. Baumgardner, LANL)
• Computational fluid dynamics application
• Navier-Stokes equations with infinite Prandtl number
• 3-D spherical-shell multigrid solver
• Global elliptic problem with 174,000 elements, inverted and solved at each timestep
• Results are with the Portland Group pgf90 Fortran compiler (-fastsse option) and with Intel release 8 Fortran:

Machine       Design        Intel      Portland
baseline      P4 2.0        319 s      342 s
lowpower      P4M 1.6       362 s      358 s
Router2       Xeon 2.4      264 s      264 s
epiphany      Xeon 2.2      172 s      160 s
pntium28      P4 2.8/800    305 s      312 s
opteron146    AMD 2.0       209 s      164 s
Cray          NEC SX-6      ~50 s
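As context for why each timestep involves a global solve (an explanatory addition, not from the original slide): at infinite Prandtl number the inertial terms drop out, so the momentum balance is a Stokes problem that couples the entire spherical shell at every instant. In a standard Boussinesq-type statement (the exact formulation in Terra may differ):

    \nabla \cdot \mathbf{u} = 0, \qquad
    -\nabla p + \nabla \cdot \left[ \eta \left( \nabla \mathbf{u} + (\nabla \mathbf{u})^{\mathsf{T}} \right) \right] + \rho \mathbf{g} = 0

with the buoyancy term \rho \mathbf{g} driven by the evolving temperature field. Only the energy equation carries a time derivative, and the global elliptic (Stokes) solve is what the multigrid solver and the benchmark times are dominated by.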
• Usually Linux, with MPI for communication (a minimal sketch follows this list)
• Could be Windows, but few clusters are
• Compilers optimize
• Management and monitoring software
• Scheduling software
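As a minimal sketch of what “Linux with MPI” means to the user (an illustrative example, not code from the talk), every process in a cluster job learns its rank and the total process count:

    ! Minimal MPI program: each process reports its place in the job.
    ! Illustrative sketch; program and variable names are arbitrary.
    program hello_cluster
      implicit none
      include 'mpif.h'
      integer :: ierr, rank, nprocs

      call MPI_Init(ierr)                              ! join the parallel job
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)   ! which process am I?
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr) ! how many are we?

      print *, 'Process ', rank, ' of ', nprocs, ' is alive'

      call MPI_Finalize(ierr)                          ! leave the parallel job
    end program hello_cluster

It would typically be built with one of the Fortran compilers listed below through an MPI wrapper (for example mpif90, which can sit on top of pgf90 or Intel Fortran) and launched across the nodes with mpirun; the exact wrapper and launcher names depend on which MPI distribution is installed.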
Portland Group compiler suite (Linux and Windows; Pentium 4, Xeon, Athlon, Opteron; 32-bit/64-bit):

Compiler   Language                                      Command
PGF77      FORTRAN 77                                    pgf77
PGF90      Fortran 90                                    pgf90
PGHPF      High Performance Fortran                      pghpf
PGCC       ANSI and K&R C                                pgcc
PGC++      ANSI C++ with cfront compatibility features   pgCC
PGDBG      Source code debugger                          pgdbg
PGPROF     Source code performance profiler              pgprof
Workstation Clusters
A turn-key package for configuration of an HPC cluster from a group of networked Linux workstations or dedicated blades
• Always go Beowulf if you can.
• Work on source code to minimize communication (see the sketch after this list).
• Compilers may never be smart enough to automatically parallelize or second-guess the programmer or the investigator.
• Components will get faster, but interconnects will always lag processors.
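One common source-level change of that kind (an illustrative sketch, not taken from the Terra code): replace many small point-to-point messages with one aggregated message, so the interconnect latency is paid once per exchange rather than once per value.

    ! Illustrative sketch: send one large buffer instead of many tiny messages.
    ! Run with at least two MPI processes (e.g., mpirun -np 2).
    program aggregate_comm
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      double precision :: buf(n)
      integer :: ierr, rank, status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      if (rank == 0) then
         buf = 1.0d0
         ! Looping over MPI_Send one element at a time would pay the
         ! per-message latency n times; one send of the whole array pays it once.
         call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, &
                       status, ierr)
      end if

      call MPI_Finalize(ierr)
    end program aggregate_comm

In the latency-bandwidth model above, n one-element messages cost roughly n \alpha + n/\beta, while the aggregated message costs \alpha + n/\beta; on commodity Ethernet, where \alpha is large, that difference can dominate the run time.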
• No existing boards are made for clustering.
• Better management firmware is needed.
• Blade designs may be proprietary.
• They may require common components to operate at all.
• Hard disks need more affordable reliability.
• Large, affordable Ethernet switches are needed.
• Think of clusters as “personal supercomputers.”
• They are simplest if used as a departmental or small-group resource.
• Clusters too large may cost too much:
  – Overconfigured
  – Massive interconnect switches
  – Users can only exploit so many processors at once
  – Multiple runs may beat one massively parallel run.
– Think “lean and mean.”
• 1. Test these machines with your code.
• 2. Get a consultation on configuration.
• Peter Bunge sends his greetings, in anticipation of a Deutsche Geowulf: 256 processors… and many more clusters here and there.