March 22, 2000 Dr. Thomas Sterling, Caltech 1
Presentation to the American Physical Society:
Networking Options for Beowulf
Clusters
Dr. Thomas Sterling
California Institute of Technology and
NASA Jet Propulsion Laboratory
March 22, 2000
Points of Inflection in the History of Computing
• Heroic Era (1950)
– technology: vacuum tubes, mercury delay lines, pulse transformers
– architecture: accumulator based
– model: von Neumann, sequential instruction execution
– examples: Whirlwind, EDSAC
• Mainframe (1960)
– technology: transistors, core memory, disk drives
– architecture: register bank based
– model: reentrant concurrent processes
– examples: IBM 7042, 7090, PDP-1
• Scientific Computer (1970)
– technology: earliest SSI logic gate modules
– model: parallel processing
– examples: CDC 6600, Goodyear STARAN
Points of Inflection in the History of Computing
• Supercomputers (1980)
– technology: ECL, semiconductor integration, RAM
– architecture: pipelined
– model: vector
– example: Cray-1
• Massively Parallel Processing (1990)
– technology: VLSI, microprocessor
– architecture: MIMD
– model: Communicating Sequential Processes, message passing
• ? (2000)
Punctuated Equilibrium: nonlinear dynamics drive to point of inflection
• Drastic reduction in vendor support for HPC
• Component technology for PCs matches workstation capability
• PC hosted software environments achieve sophistication and robustness of mainframe O/S
• Low cost network hardware and software enable balanced PC clusters
• MPPs establish low level of expectation
• Cross-platform parallel programming model
BEOWULF-CLASS SYSTEMS
• Cluster of PCs
– Intel x86
– DEC Alpha
– Mac Power PC
• Pure M²COTS
• Unix-like O/S with source
– Linux, BSD, Solaris
• Message passing programming model
– PVM, MPI, BSP, homebrew remedies
• Single user environments
• Large science and engineering applications
Rank  Manufacturer     Computer                          Rmax    # Proc  Rpeak   Installation Site
33    Sun              HPC 4500 Cluster                  272.1   720     483.84  Sun Burlington
34    Compaq           AlphaServer SC                    271.4   512     512     Compaq Computer Corporation Littleton
44    Self-made        CPlant Cluster                    232.6   580     580     Sandia National Laboratories Albuquerque
143   Sun              HPC 10000 400 MHz Cluster         68.77   110     88      KT Freetel Seoul
169   Compaq           Alphleet Cluster                  61.3    140     140     Institute of Physical and Chemical Res. (RIKEN) Wako
265   Self-made        Avalon Cluster                    48.6    140     149.4   Los Alamos National Laboratory/CNLS Los Alamos
351   Fujitsu-Siemens  hpcLine Cluster                   41.45   192     86.4    Universitaet Paderborn - PC2 Paderborn
384   Sun              HPC 10000 333 MHz Cluster         39.87   78      46.8    Dutchtone
397   SGI              ORIGIN 2000 250 MHz - EthCluster  39.4    128     64      The Sabre Group Ft Worth
399   Sun              HPC 10000 400 MHz Cluster         39.03   64      51.2    Computer Manufacturer
400   Sun              HPC 10000 400 MHz Cluster         39.03   64      51.2    Semiconductor Company
420   SGI              ORIGIN 2000 300 MHz - EthCluster  37.31   128     76.8    Industrial Light & Magic
421   SGI              ORIGIN 2000 250 MHz - EthCluster  37.31   144     72      Government
422   SGI              ORIGIN 2000 250 MHz - EthCluster  37.31   128     64      America On Line (AOL)
423   SGI              ORIGIN 2000 250 MHz - EthCluster  37.31   128     64      Industrial Light & Magic
424   SGI              ORIGIN 2000 250 MHz - EthCluster  37.31   128     64      NASA/Ames Research Center/NAS Mountain View
443   Sun              HPC 10000 333 MHz Cluster         35.17   70      42      Gedas N.A. (VW)
445   SGI              ORIGIN 2000 250 MHz - EthCluster  34.47   112     56      Government
454   Self-made        Parnass2 Cluster                  34.23   128     57.6    University Bonn - Dep. of Applied Mathematics Bonn
(Rmax and Rpeak in Gflop/s)
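As a reading aid for the listing above, the ratio Rmax/Rpeak shows how much of its theoretical peak each cluster sustains. The sketch below is illustrative only: it assumes that each computer pairs with the Rmax and Rpeak values in the same list position, and the names `clusters` and `efficiency` are not from the talk.

```python
# (Rmax, Rpeak) for three of the listed clusters.
# Assumption: each cluster pairs with the values at the same list position.
clusters = {
    "CPlant Cluster": (232.6, 580.0),
    "Avalon Cluster": (48.6, 149.4),
    "Parnass2 Cluster": (34.23, 57.6),
}


def efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved (Rmax / Rpeak)."""
    return rmax / rpeak


for name, (rmax, rpeak) in clusters.items():
    print(f"{name}: {efficiency(rmax, rpeak):.0%} of peak")
```

The same function can be applied to any other row once its Rmax and Rpeak are known.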
Beowulf-class Systems
A New Paradigm for the Business of Computing
• Brings high-end computing to a broad range of problems
– new markets
• Order of magnitude Price-Performance advantage
• Commodity enabled
– no long development lead times
• Low vulnerability to vendor-specific decisions
– companies are ephemeral; Beowulfs are forever
• Rapid response technology tracking
• Just-in-place user-driven configuration
– requirement responsive
• Industry-wide, non-proprietary software environment
Have to Run Big Problems on Big Machines?
• It's work, not peak flops
• A user’s throughput over application cycle
• Big machines yield little slices
– due to time and space sharing
• But data set memory requirements
– wide range of data set needs, three orders of magnitude
– latency tolerant algorithms enable out-of-core computation
• What is the Beowulf breakpoint for price-performance?
Throughput Turbochargers
• Recurring costs approx. 10% of those for MPPs
• Rapid response to technology advances
• Just-in-place configuration; reconfigurable
• High reliability
• Easily maintained through low cost replacement
• Consistent portable programming model
– Unix, C, Fortran, Message passing
• Applicable to wide range of problems and algorithms
• Double machine room throughput at a tenth the cost
• Provides super-linear speedup
Beowulf Project - A Brief History
• Started in late 1993
• NASA Goddard Space Flight Center
– NASA JPL, Caltech, academic and industrial collaborators
• Sponsored by NASA HPCC Program
• Applications: single user science station
– data intensive
– low cost
• General focus:
– single user (dedicated) science and engineering applications
– out of core computation
– system scalability
– Ethernet drivers for Linux
Beowulf System at JPL (Hyglac)
• 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
• Connected using 100Base-T network, through a 16-way crossbar switch.
• Theoretical peak performance: 3.2 GFlop/s.
• Achieved sustained performance: 1.26 GFlop/s.
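The two Hyglac figures can be turned into a per-node peak and a sustained efficiency. This is a small arithmetic sketch using only the numbers on the slide; the variable names are illustrative.

```python
nodes = 16               # Pentium Pro PCs in Hyglac
peak_gflops = 3.2        # theoretical peak from the slide
sustained_gflops = 1.26  # achieved sustained performance from the slide

peak_per_node = peak_gflops / nodes          # Gflop/s per node
efficiency = sustained_gflops / peak_gflops  # fraction of peak delivered

print(f"peak per node: {peak_per_node:.2f} Gflop/s")
print(f"sustained efficiency: {efficiency:.1%}")
```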
A 10 Gflops Beowulf
Center for Advanced Computing Research
California Institute of Technology
172 Intel Pentium Pro microprocessors
1st printing: May, 1999
2nd printing: Aug. 1999
MIT Press
Beowulf at Work
Beowulf Scalability
Electro-dynamic FDTD Code
                        T3D (shmem)  T3D (MPI)    Hyglac (MPI, Good Load Balance)  Hyglac (MPI, Poor Load Balance)
Interior Computation    1.8 (1.3*)   1.8 (1.3*)   1.1                              1.1
Interior Communication  0.007        0.08         3.8                              3.8
Boundary Computation    0.19         0.19         0.14                             0.42
Boundary Communication  0.04         1.5          50.1                             0.0
Total                   2.0 (1.5*)   3.5 (3.0*)   55.1                             5.5
(* using assembler kernel)
All timing data is in CPU seconds/simulated time step, for a global grid size of 282 × 362 × 102, distributed on 16 processors.
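The timing breakdown can be summarized as the share of each time step spent communicating. The sketch below assumes the column assignment inferred from the order of the values on the slide and uses the non-assembler figures; `timings` and `communication_fraction` are illustrative names, and the computed sums need not match the slide's rounded Total row exactly.

```python
# Seconds per simulated time step, in the order: interior computation,
# interior communication, boundary computation, boundary communication.
# Assumption: values are assigned to columns by their order on the slide.
timings = {
    "T3D (shmem)": (1.8, 0.007, 0.19, 0.04),
    "T3D (MPI)": (1.8, 0.08, 0.19, 1.5),
    "Hyglac (MPI, good load balance)": (1.1, 3.8, 0.14, 50.1),
    "Hyglac (MPI, poor load balance)": (1.1, 3.8, 0.42, 0.0),
}


def communication_fraction(row):
    """Share of the per-step time spent in the two communication phases."""
    _interior_comp, interior_comm, _boundary_comp, boundary_comm = row
    return (interior_comm + boundary_comm) / sum(row)


for name, row in timings.items():
    print(f"{name}: total {sum(row):.2f} s, "
          f"{communication_fraction(row):.0%} communication")
```

On these figures, the well-balanced Hyglac run spends almost all of each step in boundary communication. The poorly balanced run spends about three times as long on boundary computation but avoids that communication entirely.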
Network Topology Scaling
[Plot: latencies (µs), axis 0 to 350; series: TCP Latency, UDP Latency]
Routed Network - Random Pattern
System Area Network Technologies
• Fast Ethernet
– LAN, 100 Mbps, 100 usec
• Gigabit Ethernet
– LAN/SAN, 1000 Mbps, 50 usec
• ATM
– WAN/LAN, 155/622 Mbps
• Myrinet
– SAN, 1250 Mbps, 20 usec
• Giganet
– SAN/VIA, 1000 Mbps, 5 usec
• Servernet II
– SAN/VIA, 1000 Mbps, 10 usec
• SCI
– SAN, 8000 Mbps, 5 usec
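To see how bandwidth and latency trade off, a common first-order model estimates message time as latency plus message size divided by bandwidth. The sketch below applies that model to the listed technologies. The model itself is not from the talk. It also assumes that every bandwidth figure is in Mbps (including the Giganet, Servernet II and SCI entries) and omits ATM, for which no latency is listed. The function and variable names are illustrative.

```python
# (bandwidth in Mbps, latency in microseconds) from the list above.
# Assumptions: all bandwidths read as Mbps; ATM omitted (no latency given).
networks = {
    "Fast Ethernet": (100, 100),
    "Gigabit Ethernet": (1000, 50),
    "Myrinet": (1250, 20),
    "Giganet": (1000, 5),
    "Servernet II": (1000, 10),
    "SCI": (8000, 5),
}


def transfer_time_us(message_bytes, bandwidth_mbps, latency_us):
    """Latency plus serialization time; 1 Mbps moves one bit per microsecond."""
    return latency_us + (message_bytes * 8) / bandwidth_mbps


def half_power_bytes(bandwidth_mbps, latency_us):
    """Message size at which serialization time equals latency, i.e. the
    size at which half of the nominal bandwidth is delivered."""
    return latency_us * bandwidth_mbps / 8


for name, (bandwidth, latency) in networks.items():
    print(f"{name}: 1 KiB in {transfer_time_us(1024, bandwidth, latency):.1f} us, "
          f"half-power size {half_power_bytes(bandwidth, latency):.0f} bytes")
```

Under this model, messages much smaller than the half-power size are dominated by latency, so a low-latency SAN can beat a nominally similar-bandwidth LAN on fine-grained communication.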
3Com CoreBuilder 9400 Switch and Gigabit Ethernet NIC
Lucent Cajun M770 Multifunction Switch
M2LM-SW16 16-Port Myrinet Switch with 8 SAN ports and 8 LAN ports
Dolphin Modular SCI Switch for System Area Networks
Giganet High Performance Host Adapters
Giganet High Performance Cluster Switch
The Beowulf Delta: looking forward
• 6 years
• Clock rate: X 4
• flops (per chip): X 50 (2-4 proc/chip, 4-8 way ILP/proc)
• #processors: 32
• Networking: X 32 (32 - 64 Gbps)
• Memory: X 10 (4 Gbytes)
• Disk: X 100
• price-performance: X 50
• system performance: 50 Tflops
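The six-year multipliers above can be converted into equivalent yearly growth rates. The sketch assumes constant compound growth over the six years, which the slide does not state; `factors` and `annual_rate` are illustrative names.

```python
years = 6  # projection horizon from the slide

# Multipliers over the horizon, as listed on the slide.
factors = {
    "clock rate": 4,
    "flops per chip": 50,
    "networking": 32,
    "memory": 10,
    "disk": 100,
    "price-performance": 50,
}


def annual_rate(total_factor, span_years):
    """Constant yearly multiplier that compounds to total_factor."""
    return total_factor ** (1.0 / span_years)


for name, factor in factors.items():
    print(f"{name}: x{annual_rate(factor, years):.2f} per year")
```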
Million $$ Teraflops Beowulf?
• Today: $3M per peak Tflops
• Before 2002: $1M per peak Tflops
• Performance efficiency is serious challenge
• System integration
– does vendor support of massive parallelism have to mean massive markup?
• System administration, boring but necessary
• Maintenance without vendors; how?
– New kind of vendors for support
• Heterogeneity will become major aspect