Building a Beowulf:
My Perspective and Experience
Ron Choy
Lab. for Computer Science
MIT
Outline
• History/Introduction
• Hardware aspects
• Software aspects
• Our class Beowulf
The Beginning
• Thomas Sterling and Donald Becker, CESDIS, Goddard Space Flight Center, Greenbelt, MD
• Summer 1994: built an experimental cluster
• Called their cluster Beowulf
The First Beowulf
• 16 x 486DX4, 100MHz processors
• 16MB of RAM each, 256MB in total
• Channel-bonded Ethernet (2 x 10Mbps)
• Not that different from our Beowulf
The First Beowulf (2) [photo]
Current Beowulfs
• Faster processors, faster interconnect, but the idea remains the same
• Cluster database: http://clusters.top500.org/db/Query.php3
• Top cluster: 1.433 TFLOPS peak
Current Beowulfs (2) [photo]
What is a Beowulf?
• Massively parallel computer built out of COTS (commodity off-the-shelf) components
• Runs a free operating system (not Wolfpack/MSCS, Microsoft's cluster offering)
• Connected by a high-speed interconnect
• Compute nodes are dedicated (not a Network of Workstations)
Why Beowulf?
• It’s cheap!
• Our Beowulf, 18 processors, 9GB RAM: $15,000
• A Sun Enterprise 250 Server, 2 processors, 2GB RAM: $16,000
• Everything in a Beowulf is open source and open standard – easier to manage/upgrade
Essential Components of a Beowulf
• Processors
• Memory
• Interconnect
• Software
Processors
• Major vendors: AMD, Intel
• AMD: Athlon MP
• Intel: Pentium 4
Comparisons
• Athlon MPs have more FPUs (3) and a higher peak FLOP rate
• The P4 with the highest clock rate (2.2GHz) beats the Athlon MP with the highest clock rate (1.733GHz) in real FLOP rate
• Athlon MPs deliver more real FLOPs per dollar, hence they are more popular
Comparisons (2)
• The P4 supports the SSE2 instruction set, which performs SIMD operations on double precision data (2 x 64-bit) – see the sketch below
• The Athlon MP supports only SSE, which handles single precision data (4 x 32-bit)
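To make the difference concrete, here is a minimal sketch of SSE2 packed-double arithmetic using compiler intrinsics. This is a hypothetical illustration, not code from the lecture; it assumes gcc with the -msse2 flag:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    /* One 128-bit register holds 2 doubles, so one SSE2 add = 2 FLOPs */
    __m128d a = _mm_set_pd(1.0, 2.0);
    __m128d b = _mm_set_pd(3.0, 4.0);
    __m128d c = _mm_add_pd(a, b);   /* two double adds in one instruction */

    double out[2];
    _mm_storeu_pd(out, c);          /* out = {6.0, 4.0} */
    printf("%f %f\n", out[0], out[1]);
    return 0;
}

On the Athlon MP, the equivalent packed trick (SSE) works only on 32-bit floats, which is of little use for double-precision numerical work.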
Memory
• DDR RAM (double data rate) – used mainly by Athlons, though P4s can use it as well
• RDRAM (Rambus DRAM) – used by P4s
Memory Bandwidth
• Good summary: http://www6.tomshardware.com/mainboard/02q1/020311/sis645dx-03.html
• DDR beats RDRAM in bandwidth, and is also cheaper (a crude way to measure bandwidth yourself is sketched below)
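For a quick feel for memory bandwidth, a crude copy test like the following sketch gives a ballpark number. This is a hypothetical illustration, not the methodology of the article above:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)   /* 16M doubles = 128MB per array, far larger than cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    long i;
    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.0; }

    clock_t t0 = clock();
    for (i = 0; i < N; i++) b[i] = a[i];   /* one read + one write per element */
    clock_t t1 = clock();

    double sec = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("check: %f\n", b[N - 1]);       /* keep the copy from being optimized away */
    printf("~%.0f MB/s\n", 2.0 * N * sizeof(double) / sec / 1e6);
    return 0;
}

Real benchmarks such as STREAM are far more careful about timing and compiler optimizations, but even this crude loop exposes differences between memory subsystems.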
Interconnect
• The most important component
• Factors to consider (a simple test that measures the first two is sketched after this list):
  – Bandwidth
  – Latency
  – Price
  – Software support
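As a concrete way to measure latency (and, with larger messages, bandwidth), here is a minimal MPI ping-pong sketch. This is a hypothetical illustration, not code from the lecture; real tools such as NetPIPE do this far more carefully:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, reps = 1000;
    char buf = 0;                  /* 1-byte message isolates latency */
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {           /* rank 0 sends, then waits for the echo */
            MPI_Send(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {    /* rank 1 echoes everything back */
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)                 /* half the average round trip = one-way latency */
        printf("one-way latency: ~%g us\n", (t1 - t0) / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}

Run it with two processes placed on two different nodes (e.g. mpirun -np 2) so the messages actually cross the interconnect; swapping the 1-byte buffer for a multi-megabyte one estimates bandwidth instead.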
Ethernet
• Relatively inexpensive, reasonably fast, and very popular
• Developed by Bob Metcalfe and D.R. Boggs at Xerox PARC
• Comes in a variety of flavors (10Mbps, 100Mbps, 1Gbps)
Pictures of Ethernet Devices
Myrinet
• Developed by Myricom
• “OS bypass”: the network card talks directly to host processes
• Proprietary, but very popular because of its low latency and high bandwidth
• Usually used in high-end clusters
Myrinet pictures
Comparison

            Fast Ethernet    Gigabit Ethernet   Myrinet
Latency     ~120µs           ~120µs             ~7µs
Bandwidth   ~100Mbps peak    ~1Gbps peak        ~1.98Gbps real
Cost Comparison
• To equip our Beowulf with:
– Fast Ethernet: ~$1700
– Gigabit Ethernet: ~$5600
– Myrinet: ~$17,300
How to choose?
• Depends on your application!
• Requires really low latency (e.g. QCD)? Myrinet
• Requires high bandwidth and can live with higher latency (e.g. ScaLAPACK)? Gigabit Ethernet
• Embarrassingly parallel? Anything
What would you gain from a fast interconnect?
• Our cluster: single Fast Ethernet (100Mbps)
  – 36.8 GFLOPS peak, HPL: ~12 GFLOPS
  – 32.6% efficiency (arithmetic worked out below)
• GALAXY: Gigabit Ethernet
  – 20 GFLOPS peak, HPL: ~7 GFLOPS
  – 35% efficiency *old, slow TCP/IP stack!*
• HELICS: Myrinet 2000
  – 1.4 TFLOPS peak, HPL: ~864 GFLOPS
  – 61.7% efficiency
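These efficiency figures are simply measured HPL performance divided by theoretical peak. For our cluster:

    efficiency = HPL / peak = 12 / 36.8 ≈ 0.326 → 32.6%

whereas HELICS gets 864 / 1400 ≈ 0.617. In other words, the Myrinet machine loses only ~38% of its peak to communication and other overheads, versus ~67% for our Fast Ethernet setup.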
My experience with hardware
• How long did it take me to assemble the 9 machines? 8 hours, nonstop
Real issue 1 - space
• Getting a Beowulf is great, but do you have the space to put it?
• Space is often at a premium, and a Beowulf is not as dense as traditional supercomputers
• Rackmount? Extra cost! e.g. cabinet ~$1500, case for one node ~$400
Real issue 2 – heat management
• The nodes, with all their high-powered processors and network cards, run hot
• This is especially true for Athlons – they can reach 60°C
• If not properly managed, the heat can cause crashes or even hardware damage!
• Heatsinks/fans – remember to install them facing the right direction
Real issue 3 - power
• Do you have enough power in your room?
• UPS? Surge protection?
• You don’t want a thunderstorm to fry your Beowulf!
• In our case we have a managed machine room – lucky
Real issue 4 - noise
• Beowulfs are loud. Really loud.
• You don’t want it on your desktop.
Bad idea [photo]
Real issue 5 - cables
• Color scheme your cables!
Software
• We’ll concentrate on the cluster management core
• Three choices:
  – Vanilla Linux/FreeBSD
  – Free cluster management software (a very patched-up Linux)
  – Commercial cluster management software (a very, very patched-up Linux)
The issues
• Beowulfs can get very large (100s of nodes)
• Compute nodes should set themselves up automatically
• Software updates must be automated across all the nodes
• Software coherency is an issue
Vanilla Linux
• Most customizable, easiest to make changes to
• Easiest to patch
• Harder for someone else to inherit the cluster – a real issue
• You need to know a lot about Linux to set it up properly
Free cluster management software
• Oscar: http://oscar.sourceforge.net/
• Rocks: http://rocks.npaci.edu
• MOSIX: http://www.mosix.org/
• Each is a (usually patched) Linux that comes with software for cluster management
• Dramatically reduces the time needed to get things up and running
• Open source, but if something breaks, you have one more piece of software to hack
Commercial cluster management
• Scyld: www.scyld.com – founded by Donald Becker, a father of Beowulf
• Sells a heavily patched Linux distribution for clustering; a free version is available, but it is old
• Based on bProc, which is similar to MOSIX
My experience/opinions
• I chose Rocks because I needed the Beowulf up fast, and it was the first cluster management software I came across
• It was a breeze to set up
• But now the pain begins … a severe lack of documentation
• I will reinstall everything after the semester is over
Software (cont’d)
• Note that I skipped a lot of details: e.g. file system choice (NFS? PVFS?), MPI choice (MPICH? LAM?), libraries to install …
• I could talk forever about Beowulfs, but it won’t fit in one lecture
Recipe we used for our Beowulf
• Ingredients: $15,000, 3 x 6-packs of Coke, 1 grad student
1. Web surf for 1 week; try to focus on the Beowulf sites; decide on hardware
2. Spend 2 days filling in various forms for purchasing and obtaining “competitive quotes”
3. Wait 5 days for the hardware to arrive; meanwhile, web surf some more and enjoy the last few days of free time in a while
Recipe (cont’d)
4. Lock grad student, hardware (not money), and Coke in an office. Ignore the screams. The hardware should be ready after 8 hours.
Office of the future [photo]
Recipe (cont’d 2)
5. Move grad student and hardware to their final destination. By this time the grad student will be emotionally attached to the hardware. This is normal. Have the grad student set up the software. This will take 2 weeks.
Our Beowulf [photo]
Things I would have done differently
• No Rocks; try Oscar or maybe vanilla Linux
• Color scheme the cables!
• Try a diskless setup (saves on cost)
• Get a rackmount
Design a $30000 Beowulf
• One node (2 processors, 1GB RAM) costs $1400, with 4.6 GFLOPS peak
• Should we get:
  – 16 nodes with Fast Ethernet, or
  – 8 nodes with Myrinet?
Design (cont’d)
• 16 nodes with Fast Ethernet (arithmetic for both options is worked out below):
  – 73.6 GFLOPS peak
  – 23.99 GFLOPS real (using the efficiency of our cluster)
  – 16GB of RAM
• 8 nodes with Myrinet:
  – 36.8 GFLOPS peak
  – 22.7 GFLOPS real (using the efficiency of HELICS)
  – 8GB of RAM
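The real-performance numbers follow directly from the per-node peak and the measured efficiencies quoted earlier (an estimate, since efficiency varies with cluster size and workload):

    16 nodes: 16 × 4.6 = 73.6 GFLOPS peak;  73.6 × 0.326 ≈ 23.99 GFLOPS real
     8 nodes:  8 × 4.6 = 36.8 GFLOPS peak;  36.8 × 0.617 ≈ 22.7 GFLOPS real

Budget check: 16 nodes cost 16 × $1400 = $22,400, leaving plenty for cheap Fast Ethernet; 8 nodes cost $11,200, leaving ~$18,800 for Myrinet, roughly in line with the ~$17,300 quote earlier.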
Design (cont’d 2)
• The first choice is good if you work on linear algebra applications and require lots of memory
• The second choice is more general purpose