William Kerney
4/29/00
Clusters – Separating Myth from Fiction
I. Introduction
Over the last few years, clusters of commodity PCs have become ever more prevalent.
Since the early ‘90s computer scientists have been predicting the demise of Big Iron –
that is, the custom-built supercomputers of the past such as the Cray X-MP or the CM-* – due to
workstations’ superior price/performance. Big Iron machines were able to stay viable
for a long time because they could perform computations that were infeasible on even
the fastest of personal computers. In the last few years, though, clusters of personal
computers have nominally caught up to supercomputers in raw CPU power and
interconnect speed, putting three self-made clusters in the top 500 list of supercomputers. [1]
This has led to a lot of excitement in the field of clustered computing, and to inflated
expectations as to what clusters can achieve. In this paper, we will survey three clustering
systems, compare some common clusters with a modern supercomputer, and then discuss
some of the myths that have sprung up about clusters in recent years.
II. The NOW Project
One of the most famous research efforts in clustered computing was the NOW project at
UC Berkeley, which ran from 1994 to 1998. “A Case for NOW” [2] by Culler et al. is an
authoritative statement of why clusters are a “good idea”: they offer lower costs and
greater performance, and can even serve double duty as a general computer lab for students.

[1] http://www.netlib.org/benchmark/top500/top500.list.html
[2] http://now.cs.berkeley.edu/Case/case.html
The NOW cluster physically looked like any other undergraduate computer lab: it had (in
1998) 64 UltraSPARC I boxes with 64MB of main memory each, all of which could be
logged into individually. For all intents and purposes they looked like individual
workstations that could submit jobs to an abstract global pool of computational cycles. This
global pool is provided by way of GLUnix, a distributed operating system that sits atop
Solaris and provides load balancing, input redirection, job control and co-scheduling of
applications that need to be run at the same time. GLUnix load balances by watching the
CPU usage of all the nodes in the cluster; if a user sits down at one workstation and starts
performing heavy computations, the OS will notice and migrate any global jobs to a less
loaded node. In other words, GLUnix is transparent – it appears to a user that he has full
access to his workstation’s CPU at all times with a batch submission system to access
spare cycles on all the machines across the lab. The user does not decide which nodes to
run on – he simply uses the resources of the whole lab.
David Culler and the other developers of NOW also pioneered one of the most important
ideas to come out of clustered computing – Active Messages. Active Messages were
devised to compensate for the slower networks that workstations typically use – usually
10BaseT or 100BaseT, which get nowhere near the performance of high-performance
custom hardware like hypercubes or CrayLinks. [3] In the NOW cluster, when an Active
Message arrives, the NIC writes the data directly into the application’s memory. Since the
application no longer has to poll the NIC or copy data out of the NIC’s buffer, the overall
end-to-end latency decreases by 50% for medium-sized (~1KB) messages and falls from 4ms
to 12us for short (one-packet) messages, a reduction of more than two orders of magnitude.
A network with Active Messages has a lower half-power point – the message size that
achieves half the maximum bandwidth – than a network using TCP, since Active Messages
have much lower latency, especially for short messages. A network with AM hits the
half-power point at 176 bytes, as compared with 1352 bytes for TCP. [4] When 95% of
packets are smaller than 192 bytes and the mean packet size is 382 bytes (according to a study
performed on the Berkeley network), Active Messages are far superior to TCP.
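To make the half-power point concrete, here is the back-of-the-envelope algebra under a
simple fixed-overhead-plus-bandwidth cost model (a deliberate simplification of the
LogP-style models used in the NOW work):

\[
T(n) = T_0 + \frac{n}{B_{\max}}, \qquad
B(n) = \frac{n}{T(n)}, \qquad
B(n_{1/2}) = \frac{B_{\max}}{2}
\;\Longrightarrow\;
n_{1/2} = T_0\,B_{\max}.
\]

Under this model the half-power point scales directly with the per-message overhead T_0,
which is why a layer like Active Messages, whose main effect is to slash that overhead,
pulls the half-power point down from 1352 bytes to 176 bytes.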
The downside to Active Messages is that programs must be rewritten to take advantage of
the interface; by default, programs poll the network with a select(3C) call and do not set
up regions of memory for the NIC to write into. The conversion from TCP sockets is not
straightforward, since the application has to set up handlers that are called back when a
message arrives for the process. The NOW group worked around this by implementing
Fast Sockets [5], which presents the same API as UNIX sockets but has an Active Messages
implementation beneath.
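To illustrate why a rewrite is needed, the sketch below contrasts the two receive models in
C. The select()/recv() half uses ordinary POSIX calls; the am_* names in the second half
are hypothetical stand-ins for an Active Messages-style interface, not the actual Berkeley
AM or Fast Sockets API.

/*
 * Sketch of the programming-model difference described above.
 * The am_* names are HYPOTHETICAL illustrations, not the real
 * Berkeley Active Messages or Fast Sockets API.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

/* --- Conventional sockets: the application polls and copies. --- */
void socket_receive_loop(int sock)
{
    char buf[2048];
    fd_set fds;

    for (;;) {
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        select(sock + 1, &fds, NULL, NULL, NULL);    /* block until data      */
        ssize_t n = recv(sock, buf, sizeof buf, 0);  /* extra copy out of NIC */
        if (n <= 0)
            break;
        /* ... process buf ... */
    }
}

/* --- Active Message style: the messaging layer calls a handler on arrival. --- */
typedef void (*am_handler_t)(void *payload, size_t len);

void on_update(void *payload, size_t len)
{
    /* By the time this runs, the NIC has already deposited the payload
       into user memory; there is no select() and no second copy. */
    (void)len;
    printf("got update: %s\n", (char *)payload);
}

/*
 * Registration and sending would look roughly like this (hypothetical calls):
 *
 *     am_register_handler(UPDATE_MSG, on_update);
 *     am_request(peer_node, UPDATE_MSG, "hello", 6);
 */

The structural change – pushing work into handlers instead of a receive loop – is exactly
what Fast Sockets hides again behind the familiar sockets interface.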
The results that came out of the NOW project were quite promising. It broke the world
record for the Datamation disk-to-disk sorting benchmark [6] in 1997, demonstrating that a
large number of cheap workstation drives can have a higher aggregate bandwidth than a
smaller number of high-performance drives in a centralized server. Also, the NOW
project showed that for a fairly broad class of problems the cluster was scalable and could
challenge the performance of traditional supercomputers with inexpensive components.
Their Active Messaging system, by lowering message delay, mitigated the slowdown
caused by running on a cheap interconnect. [7]

[3] http://www.sgi.com/origin/images/hypercube.pdf
[4] file://ftp.cs.berkeley.edu:/ucb/CASTLE/Active_Messages/hotipaper.ps
[5] http://www.usenix.org/publications/library/proceedings/ana97/full_papers/rodrigues/rodrigues.ps
[6] http://now.cs.berkeley.edu/NowSort/nowSort.ps
[7] http://www.cs.berkeley.edu/~rmartin/logp.ps
III. HPVM
HPVM, or High-Performance Virtual Machine, was a project by Andrew Chien et al. at
the University of Illinois at Urbana-Champaign (1997-present) that built in part on the
successes of the NOW project. [8] Their goal was similar to PVM’s, in that they wanted to
present an abstract layer that looked like a generic supercomputer to its users, but was
actually composed of heterogeneous machines beneath.
The important difference between HPVM and both PVM and NOW is that where PVM and
NOW each use their own custom API to access the parallel processing capabilities of their
system, requiring programmers to spend a moderate amount of effort porting their code,
HPVM presents four different APIs which mimic common supercomputing interfaces.
So, for example, if a programmer already has a program written using SHMEM – the
one-sided memory transfer API used by Crays – then he will be able to quickly port his
program to HPVM. The interfaces implemented by HPVM are MPI, SHMEM, global
arrays (similar to shared memory but allowing multi-dimensional arrays) and FM
(described below). [9]
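As an illustration of how small the porting burden can be on the SHMEM path, here is a
minimal one-sided put written against the classic Cray/SGI SHMEM calls. The header name
and initialization routine vary between SHMEM implementations, so treat those details as
assumptions rather than HPVM specifics.

/* Minimal one-sided SHMEM transfer in the classic Cray/SGI style.
   Header and initialization names vary by platform; the details here
   are illustrative, not HPVM-specific. */
#include <stdio.h>
#include <mpp/shmem.h>

static long src[8];   /* symmetric (remotely accessible) data objects */
static long dst[8];

int main(void)
{
    start_pes(0);                 /* join the parallel job */
    int me   = _my_pe();
    int npes = _num_pes();

    for (int i = 0; i < 8; i++)
        src[i] = me * 100 + i;

    /* PE 0 writes its array directly into PE 1's memory.  PE 1 posts no
       matching receive - that is the "one-sided" part. */
    if (me == 0 && npes > 1)
        shmem_long_put(dst, src, 8, 1);

    shmem_barrier_all();          /* make the put globally visible */

    if (me == 1)
        printf("PE 1 received %ld ... %ld from PE 0\n", dst[0], dst[7]);
    return 0;
}

A program already structured this way for a T3D or T3E should port to HPVM’s SHMEM
interface with little more than a recompile, which is precisely the point of offering the
familiar APIs.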
The layer beneath HPVM’s multiple APIs is a messaging layer called Fast Messages. FM
was developed in 1995 as an extension of Active Messages. [10] Since then, AM has
continued to evolve as well, so the two projects have diverged slightly over the years, though
both have independently implemented new features such as supporting more than one active
process per node. The improvements FM made over AM include the following [11]:
1) FM allows the user to send messages larger than will fit in main memory; AM does not.
2) AM automatically sends a reply to every request in order to detect packet loss. FM
implements a more sophisticated reliable delivery protocol and guarantees correct order
in the delivery of messages.
3) AM requires the user to specify the remote memory address the message will get
written into; FM only requires that a handler be specified for the message.
In keeping with HPVM’s goal of providing an abstract supercomputer, it theoretically
allows its interface to sit above any combination of hardware that a system admin can
throw together. In other words, it would allow an administrator to put 10 Linux boxes, 20
NT workstations, and a Cray T3D into a virtual supercomputer that could run MPI, FM, or
SHMEM programs quickly (via the FM underlying it all).
[8] http://www-csag.ucsd.edu/papers/hpvm-siam97.ps
[9] http://www-csag.ucsd.edu/projects/hpvm/doc/hpvmdoc_7.html#SEC7
[10] http://www-csag.ucsd.edu/papers/myrinet-fm-sc95.ps
[11] http://www-csag.ucsd.edu/papers/fm-pdt.ps
In reality, Chien’s group only implemented the first version of HPVM on NT and Linux
boxes, and their latest version only does NT clustering. A future release might add
support for more platforms.
IV. Beowulf
Beowulf has been the big name in clustering recently. Every member of the high-tech
press has run at least one story on Beowulf: Slashdot [12], ZDNet [13], Wired [14], CNN [15],
and others. One of the more interesting things to note about Beowulf clusters is that there is
no such thing as a “definitive Beowulf cluster.” Various managers have labeled their
projects “Beowulf-style” (like the Stone Soupercomputer [16]), while others will say that a
true Beowulf cluster is one that mimics the original cluster at NASA. [17] Yet others
claim that any group of boxes running an open source operating system is a “Beowulf.”
The definition we will use here is: any cluster of workstations that runs Linux with the
packages available from the official Beowulf website.

[12] http://slashdot.org/articles/older/00000817.shtml
[13] http://www.zdnet.com/zdnn/stories/news/0,4586,2341316,00.html
[14] http://www.wired.com/news/technology/0,1282,14450,00.html
[15] http://www.cnn.com/2000/TECH/computing/04/13/cheap.super.idg/index.html
[16] http://stonesoup.esd.ornl.gov/
[17] http://cesdis.gsfc.nasa.gov/linux/beowulf/beowulf.html
The various packages include:
1) Ethernet bonding – this allows multiple Ethernet channels to be logically joined into
one higher-bandwidth connection. In other words, if a machine had two 100Mb/s
connections to a hub, it would be able to transmit data over the network at 200Mb/s,
assuming that all other factors are negligible.
2) PVM or MPI – these standard toolkits are what allow HPC programs to actually be run
on the cluster. Unless the user has an application so coarse-grained that it can be run
merely with remote shells, he will want to have PVM, MPI, or an equivalent installed.
(A minimal MPI example appears after this list.)
3) Global PID space – This patch ensures that any given process ID is in use on only one of
the Linux boxes in the cluster. Thus, two nodes can always agree on what “Process 15” is;
this helps promote the illusion of the cluster being one large machine instead of a number
of smaller ones. As a side effect, the Global PID space patch allows processes to be run
on remote machines.
4) Virtual Shared Memory – This also contributes to the illusion of the Beowulf cluster
being one large machine. Even though each machine has no hardware notion of a
remote memory address the way an Origin 2000 does, with this kernel patch a process can
use pages of memory that physically live on a remote machine. When a process touches
memory that is not in local RAM, it triggers a page fault, which invokes a handler that
fetches the page from the remote machine. (A user-level sketch of this mechanism appears
after this list.)
5) Modified standard utilities – utilities like ps and top have been altered to give process
information over all the nodes in the cluster instead of just the local machine. This can be
thought of as a transparent way of dealing with things typically handled by a
supercomputer’s batch queue system. Where a user on the Origin 2000 would do a bps to
examine the state of the processes in the queues, a Beowulf user would simply do a ps
and look at the state of both local and remote jobs at the same time. It is up to a user’s
tastes to determine which way is preferable.
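As a concrete illustration of item 2, here is a minimal MPI program of the sort a Beowulf
user would launch across the nodes. It uses only core MPI-1 calls and is not specific to any
cluster discussed in this paper.

/* Minimal MPI job: every rank reports in, rank 0 sums the ranks.
   Uses only core MPI-1 calls; nothing here is Beowulf-specific. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("hello from rank %d of %d\n", rank, size);

    /* Combine a value from every node; collective communication like
       this is exactly what a cheap interconnect makes expensive. */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}

It would typically be started with something like "mpirun -np 16 ./a.out", though the exact
launcher depends on the MPI implementation installed on the cluster.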
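To make the page-fault mechanism of item 4 concrete, the following user-level sketch
mimics it with mprotect() and a SIGSEGV handler. The real Beowulf patch does the
equivalent work inside the kernel; fetch_page_from_remote() is a hypothetical stand-in
for the network request to the page's home node.

/* User-level sketch of distributed shared memory paging: touching a
   protected page faults, and the handler "fetches" the page before the
   access is retried.  Illustrative only; the Beowulf patch is in-kernel. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (16 * 4096)

static char *region;
static long  page_size;

/* Hypothetical stand-in for asking the page's home node for its contents. */
static void fetch_page_from_remote(void *page)
{
    memset(page, 0x42, (size_t)page_size);  /* pretend this arrived over the network */
}

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));

    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* make page accessible */
    fetch_page_from_remote(page);                       /* fill it with data    */
    /* On return, the faulting instruction is retried and now succeeds. */
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Remote" pages start out locally inaccessible. */
    region = mmap(NULL, REGION_SIZE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("first byte = 0x%02x\n", (unsigned char)region[0]);  /* faults, then works */
    return 0;
}

A page is fetched once, on first touch, and is served at local-memory speed thereafter –
which is also why virtual shared memory only performs well when remote accesses are rare.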
A good case study of Beowulf is the Avalon project [18] at Los Alamos National
Laboratory. They put together a 70-CPU Alpha cluster for $150,000 in 1998. In terms of
peak performance, it scored twice as high as a multi-million-dollar Cray with 256 nodes.
Peak rate, though, is a misleading performance metric: people point to the high
GFLOPS number and ignore the fact that those benchmarks did not take communication into
account. This leads to claims like the ones the authors make – that do-it-yourself
supercomputing will make vendor-supplied supercomputers obsolete since their
price/performance ratio is so poor.
Interestingly enough, in the two years since that paper was published, the top 500 list of
supercomputers is still overwhelmingly dominated by vendors. In fact, there are only
three self-made systems on the list, with the Avalon cluster (number 265) being one of
them.
Why is that the case?
Although it posts a great peak performance – three times greater than the Origin 2000 –
a Beowulf cluster like Avalon does not work as well in the real world. Real applications
communicate heavily, and a Fast Ethernet switch cannot match the speed of the
custom Origin interconnect. Even though Avalon used the same number of 533MHz
21164 Alphas as the Origin 2000 had 195MHz R10000s, the NAS Parallel Benchmark
(Class B) rated the O2K at twice the performance. A 533MHz 21164 scores 27.9 on
SPECint95 [19] while the 195MHz R10000 only gets 10.4. [20] This means that, thanks to
the custom hardware on the O2K, it was able to get roughly six times the computing power
out of its processors. Although the authors claim a win because their system was 20 times
cheaper than the Origin, the comparison can just as easily be read the other way: it justifies
the cost of an Origin by saying, “If you want to make your system run six times faster, you
can pay extra for some custom hardware.” And given the moderate success of the Origin
2000 line, users seem to be agreeing with this philosophy.

[18] http://cnls.lanl.gov/avalon/
[19] http://www.spec.org/osg/cpu95/results/res98q3/cpu95-980914-03070.html
[20] http://www.spec.org/osg/cpu95/results/res98q1/cpu95-980206-02411.html
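The rough arithmetic behind that “six times” figure, taking the twofold NAS result and the
SPECint95 scores cited above at face value:

\[
\frac{27.9}{10.4} \approx 2.7
\quad\text{(per-CPU advantage of the 533\,MHz 21164 over the 195\,MHz R10000)},
\qquad
2 \times 2.7 \approx 5.4 \approx 6.
\]

That is, the Origin ran the benchmark twice as fast using processors that each have roughly
2.7 times less integer throughput, for a factor of about five to six in delivered performance
per unit of raw CPU.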
One important thing to note about Beowulf clusters is that they are different from a NOW
– instead of being a computer lab where students can sit down and use any of the
workstations individually, a Beowulf is a dedicated supercomputer with one point of
entry. (This is actually something that the GRID book is wrong about – pages 440-441
say that NOWs are dedicated. But the cited papers for NOW repeatedly state that they
have the ability to migrate jobs away from workstations being used interactively.)
Both NOWs and Beowulfs are made of machines with independent local memory
spaces, but they go about presenting a global machine in different ways. A Beowulf uses
kernel patches to pretend to be a multi-CPU machine with a single address space,
whereas the NOW project uses GLUnix, a layer above the system kernel that loosely glues
machines together by allowing jobs such as MPI invocations to be scheduled and moved
between nodes.
V. Myth
As the Avalon paper demonstrated, there are a lot of inflated expectations of what
clusters can accomplish. Scanning the forums on Slashdot [21], one can easily see that a
negative attitude toward vendor-supplied supercomputers prevails.
Quotes like “Everything can be done with a Beowulf cluster!” and “Supercomputers are
dead” are quite common. This reflects a naiveté on the part of the technical public as a
whole. There are two refutations of beliefs such as these:
1) The difference between buying a supercomputer and building a cluster is the
difference between repairing a broken window yourself and having a professional do
it for you. A Beowulf cluster is a do-it-yourself supercomputer. It is a lot
cheaper than paying professionals like IBM or Cray to do it for you – but as a trade-off,
the system will be less reliable because it was assembled by amateurs. The
Avalon paper tried to refute this by noting over 100 days of uptime, but reading the
paper carefully, one can see that only 80% of their jobs completed
successfully. Why did 20% fail? The authors did not know.
Holly Dail mentioned that the people who built the Legion cluster at the University of
Virginia suffered problems from having insufficient air conditioning in their machine
room. A significant fraction of the cost of a supercomputer is in building the chassis, and
the chassis is designed to properly ventilate multiple CPUs running heavy loads. Sure, the
Virginia group got a supercomputer for less than a “real” one costs, but they paid the
difference in hardware problems.
[21] http://www.slashdot.org/search.pl (search for “Beowulf”)
Businesses need high availability. 40% of IT managers interviewed by ZDNet [13] said that
the reason they were staying with mainframes and not moving to clusters of PCs is
that large, expensive computers come with more stringent uptime guarantees. IBM, for
example, makes a system with a guaranteed 99.999% uptime – which means that the
system will be down for only about five minutes in an entire year. Businesses can’t afford
to rely on systems like ASCI Blue, which is basically 256 quad Pentium Pro boxes glued
together with a custom interconnect. ASCI Blue has never been successfully rebooted.
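For reference, the downtime arithmetic behind a five-nines guarantee (the fifty-minute
figure sometimes quoted corresponds to roughly four nines, not five):

\[
10^{-5} \times 365 \times 24 \times 60 \approx 5.3 \ \text{minutes per year},
\qquad
10^{-4} \times 365 \times 24 \times 60 \approx 53 \ \text{minutes per year}.
\]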
A large part of the cost of vendor-supplied machines is for testing. As a researcher, you
might not care if you have to restart your simulation a few times, but a manager in charge
of a mission-critical project definitely wants to know that his system has been verified to
work. Do-it-yourself projects just can’t provide this kind of guarantee. That’s why
whenever a business needs repairs done on the building, they hire a contractor instead of
having their employees do it for less.
2) Vendors are already doing it. It is a truism right now that “Commercial, Off The
Shelf (COTS) technology should be used whenever possible.” People use this to justify
not buying custom-built supercomputers. The real irony is that the companies that build
these supercomputers are not dumb, and do use COTS technology whenever they can –
with the notable exception of Tera/Cray, who believe in speed at any price. The only
time most vendors build custom hardware is when they feel that the resulting performance
gain will justify the added cost.
For example, Blue Horizon, the world’s third most powerful computer, is built using
components from IBM workstations: its CPUs, memory, and operating system are all
recycled from IBM’s lower-end systems. The only significant parts that are custom are the
high performance file system (which holds 4TB and can write data in parallel very
quickly), the chassis (which promotes reliability as discussed above), the SP switch
(which is being used for backwards compatibility), the monitoring software (the likes of
which cannot be found on Beowulf clusters) and the memory crossbar, which replaces the
bus-based memory system found on most machines these days. Replacing the bus with
a crossbar greatly increases memory bandwidth and eliminates a bottleneck found in
many SMP programs: when multiple CPUs try to hit a bus-based memory at once, only one
at a time can be served, causing severe system slowdown. Blue Horizon was sold to the
Supercomputer Center for $20,000,000, which works out to roughly $20,000 a processor –
an outrageously expensive price. But the fact that the center was willing to pay for it is
testimony enough that the custom hardware gives it a real advantage over systems
built entirely with COTS products.
VI. Conclusion
Clustered computing is a very active field these days, with a number of good
advancements coming out of it, such as Active Messages, Fast Messages, NOW, HPVM,
Beowulf, etc. By building systems using powerful commodity processors, connecting
them with high-speed commodity networks using Active Messages and linking
everything together with a free operating system like Linux, one can create a machine that
looks, acts and feels like a supercomputer except for the price tag. However, alongside
the reduced price comes a greater risk of failure, a lack of technical support when things
break (NCSA has a full service contract with SGI, for example), and the possibility that
COTS products won’t do as well as custom-built ones.
A few people have created a distinction between two different kinds of Beowulf clusters.
The first, Type I Beowulf, is built entirely with parts found at any computer store:
standard Intel processors, 100BaseT Ethernet and PC100 RAM. These machines are the
easiest and cheapest to buy, but are also the slowest due to the inefficiencies common in
standard hardware. The so-called Type II Beowulf is an upgrade to Type I Beowulfs –
they add more RAM than can be commonly found in PCs, they replace the 100BaseT
with some more exotic networking like Myrinet, and they upgrade the OS to use Active
Messages. In other words, they replace some of the COTS components with custom ones
to achieve greater speed.
I hold the view that traditional supercomputers are the logical extension of this
process, a “Type III Beowulf,” if you will. Blue Horizon, for example, can be thought of
as 256 IBM RS/6000 workstations that have been upgraded with a custom chassis and
memory crossbar instead of a bus. Just like Type II Beowulfs, they replace some of the
COTS components with custom ones to achieve greater speed. There’s no reason to call
for the death of supercomputers at the hands of clusters; in some sense, the vendors have
done that already.