Building a Parallel Computer from Cheap PCs: SMILE Cluster Experiences
Putchong Uthayopas, Thara Angskun, Jullawadee Maneesilp
Parallel Research Group
Computer and Network Systems Research Laboratory,
Department of Computer Engineering, Faculty of Engineering,
Kasetsart University, Bangkok, Thailand, 10900.
Phone: 9428555 Ext. 1416 Fax: 5796245
E-mail: {pu,b38tra,b38jdm}@nontri.ku.ac.th
http://smile.cpe.ku.ac.th
ABSTRACT
This paper presents our experiences in constructing a parallel computer from a cluster
of cheap PCs. This parallel computer can be programmed using the PVM and MPI standard
message-passing interfaces. Applications developed on this machine are portable to most
commercial supercomputers, such as the IBM SP system and the SGI Power Challenge. The steps of
system integration and the applications of this system are presented. We found that the key factor
in building this kind of system is the integration of suitable hardware and software components.
This technology is important in providing affordable supercomputing power for the research and
academic communities in Thailand.
1 Introduction and Background
Advances in microprocessor technology enable Intel-based personal computers to
deliver performance comparable to a single node of a supercomputer. In fact, the most
powerful supercomputer currently in use, at Sandia National Laboratory [1,2], is built from a
collection of 200 MHz Pentium Pro microprocessors, a processor that is even a generation
behind the one now used in PCs (the 300 MHz Pentium II). Also, the availability of low-cost, fast
interconnection networks and interfaces such as 100 Mbps Fast Ethernet makes it cheap and
easy to combine these powerful PCs into a high-performance parallel computing
environment. With free versions of UNIX such as Linux and FreeBSD and public domain
packages from various sources, such a system yields a better price/performance ratio than any
commercial high-performance computing system. It is clear that a PC cluster is an excellent
high-performance platform for Thai researchers whose work involves compute-intensive
applications.
The effort to deliver low-cost high-performance computing platforms to scientific
communities has been going on for many years. A network of workstations is a good candidate
since it has the same architecture as a distributed-memory multicomputer system. An early
approach was to harvest the cycles of idle workstations in a workstation farm [3]. This approach can
provide computing power at low cost but suffers from inadequate network bandwidth and
interference from the users of each workstation. As the prices of PCs and workstations decrease rapidly,
it seems appropriate to consider a dedicated group of workstations, or cluster, since such systems are
now affordable by most research groups. One project that aims toward the exploitation of
workstation networks is the NOW (Network of Workstations) project from the University of California
at Berkeley [4].
Although workstations are robust and offer high performance, their price is still
high compared to PCs. As Intel rapidly improves the performance of PC processors, the gap
between workstation and PC performance has started to disappear. Hence, many
research groups have started to put commodity off-the-shelf PCs and Fast Ethernet together to
build parallel computers. One of the largest initiatives in this area is the NASA Beowulf project
[5,6]. The goal of the Beowulf project is to explore the software and hardware issues
necessary to build a large-scale parallel computing platform at low cost. Currently, many such
systems are operated at NASA centers and collaborating universities, such as the Whitney
system at NASA Ames Research Center, the Hyglac and Naegling systems at Caltech, and the
LoBoS system at the National Institutes of Health. Our group is also one of the collaborating groups
under the Beowulf project.
This paper discusses our experiences in building and operating a parallel PC cluster
named SMILE (Scalable Multicomputer Implementation using Low-cost Equipment), a
Beowulf-class cluster. This system has been operating for more than a year at the Computer and
Network Systems Research Laboratory, Department of Computer Engineering, Kasetsart
University. We use this system as a platform for parallel processing research and computational
science application development.
The rest of this paper is organized as follows. Section 2 describes both the
hardware and software architecture. Section 3 discusses the issues in system installation.
Section 4 presents the software tools we developed to manage the cluster, followed by the
applications of our cluster in Section 5. Finally, Section 6 presents the conclusion and points out
future directions.
2 SMILE Beowulf Cluster System
2.1 Hardware Configuration
The current SMILE system consists of 8 compute nodes and 1 front-end node connected by
a 100 Mbps Fast Ethernet hub. Four of the compute nodes are:
• Pentium II 233MHz, 512 KB L2 Cache. (ASUS P2-L97 440LX)
• 64 MB of main memory, 2 GB EIDE hard disk
• 3COM Ethernet card 3C905TX
The rest of the compute nodes are
• Pentium 133MHz, 256 KB L2 Cache
• 64 MB of memory, 1.2 GB of IDE hard disk
• 3COM Ethernet card 3C905TX.
The front-end node uses
• Pentium 133MHz with 256 KB L2 Cache
• 64 MB of memory, 2 GB of EIDE hard disk
• CD-ROM drive and a 17-inch monitor
• Two Ethernet cards: one Fast Ethernet card and one 10 Mbps Ethernet card.
The configuration is shown in Figure 1. Separating the compute nodes from the outside
world with a front-end node yields several benefits:
• It minimizes the use of public Internet addresses, since internal addresses can be
assigned freely.
• It separates the heavy network traffic of computation tasks from the external network.
• It tightens security.
There are many technologies available to connect a group of PCs into a
cluster. Conventional 10 Mbps Ethernet is too slow for most purposes except application
development. ATM is fast but still very expensive. So the most economical choice is Fast
Ethernet, since this technology can deliver up to 100 Mbps at low cost for both the switching
equipment and the interface cards.
Figure 1 System configuration of the SMILE cluster (compute nodes and the front-end node
connected by a Fast Ethernet hub; the front-end node is also attached to a 10 Mbps Ethernet
hub for external access)
There are two approaches to linking the compute nodes together:
1. Use a fast network switch.
2. Use multiple network cards and link the cluster in a static network topology such
as a hypercube or mesh.
The first approach has the benefits of easy installation and high throughput, while the second
alternative may have a lower cost. For small systems, the first approach is more
effective, since the second suffers from the latency caused by routing packets
across different interfaces on a node. Nevertheless, one factor that helps determine the
applicability of the second alternative is the communication pattern of the application. In fact, many
applications may be able to exploit the redundant communication links to gain even faster
communication. In recent literature [7], 2D torus topologies and Fast Ethernet hubs have
been evaluated. The authors reported that a 2D torus topology can deliver more aggregate bandwidth
and faster barrier synchronization, and for simulated CFD applications the 2D torus also
demonstrated substantially better performance than the hub. In our system, a 100 Mbps Fast
Ethernet hub has been used instead of a switch due to budget limitations.
2.2 Software Configuration
2.2.1 Operating System
The most popular OS for clustered computing is Linux (see http://www.linux.org). The
reasons are:
1. Linux is free and comes with the source code of the system, which makes it suitable for research and
development.
2. Plenty of good public domain software is available.
3. The Linux community responds quickly to technology change; new hardware drivers and
networking code are made available regularly.
We chose Linux RedHat 5.0 for the SMILE system because of its ease of installation and strong
management tools.
2.2.2 Parallel Programming Systems
To use a cluster for parallel processing, a software layer must be installed to build a
virtual parallel machine from a group of nodes. This software usually takes care of parallel task
creation and deletion, passing data among tasks, and synchronizing task execution. Several
systems are publicly available, such as:
• PVM [8] from the University of Tennessee and Oak Ridge National Laboratory.
• MPI [9] from the MPI Forum. Since MPI is only a standard specification, there are
many implementations available, such as MPICH [10] and LAM [11].
• BSP [12] from Oxford University.
PVM is flexible and contains many features that are still not available in MPI, such as
dynamic process creation. However, MPI has recently gained more acceptance since it is
a standard supported by many institutions. BSP is another interesting programming system that has
started to gain wider acceptance lately. MPICH and BSP have been installed on the SMILE system.
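As an illustration of how such a job is typically run on the cluster, the following shell sketch compiles and launches a program with MPICH; the program name, node names, and file locations are hypothetical, and the exact options may vary between MPICH releases.

    # Minimal sketch: compile and run an MPI program with MPICH.
    # List the participating compute nodes, one per line, in a machine file.
    cat > machines <<EOF
    node1
    node2
    node3
    node4
    EOF
    mpicc -O2 -o hello hello.c                     # mpicc wraps the C compiler with the MPI paths
    mpirun -np 4 -machinefile machines ./hello     # start one process per listed node (via rsh)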
Fortran 90 and HPF (High Performance Fortran) are also interesting alternatives.
These languages help ease the process of porting old FORTRAN 77 applications to the cluster by
providing automatic parallelization of code and data under programmer control. We are
currently evaluating this alternative since it is similar to the environment on the IBM SP2 available at
Kasetsart University. The disadvantage is that no public domain compiler is available.
Some parallel math libraries are also available, such as ScaLAPACK and PETSc.
Researchers can easily code programs using these libraries, since no knowledge of parallel
processing is required. Table 1 summarizes our choices.
Category                   Software used
Operating System           Linux RedHat 5.0
Message Passing Library    MPICH, PVM 3.3, BSP
Visualization              VIS5D
Math Library               ScaLAPACK, PETSc

Table 1 Main software components of SMILE
3 System Installation
The SMILE system has a long history of changes. The initial system started from five 486
machines and has evolved into the current configuration. The typical steps to build a cluster are:
1. Install Linux system on the front-end.
2. Install each compute node and link them together to form a cluster.
3. Setup a single system view.
4. Install parallel computing systems.
The following subsections explain these steps in detail.
3.1 Linux Installation on Front-end System
Linux installation can be done from CD-ROM, NFS, or ftp. The front-end node should
have a CD-ROM drive, and installation from CD is usually the simplest method.
On the front-end node, we partition the hard disk into four partitions: the root file system (/), swap space,
user commands (/usr), and home directories (/home). For the swap space, the rule of
thumb is to allocate twice the size of physical memory. One of the trickiest parts of the
installation is setting up the Ethernet cards. We learned that:
• Plug and Play support must be turned off in the BIOS configuration.
• All interrupts must be set so that there are no conflicts.
• The kernel must be compiled to support IP forwarding and IP masquerading.
Once all of these steps have been done, routing is set up automatically by the system. At present,
IP addresses are specified by hand in /etc/hosts. The Internet addresses of the internal nodes can be any valid
addresses since they are separated from the external network.
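As a rough sketch, the internal addressing on the front-end can be set up as follows; the host names and the private 192.168.x.x addresses are only examples, not the actual addresses used on SMILE.

    # Assign private addresses to the internal nodes in /etc/hosts
    # (any internal numbering works, since these nodes sit behind the front-end).
    cat >> /etc/hosts <<EOF
    192.168.1.1    frontend
    192.168.1.11   node1
    192.168.1.12   node2
    192.168.1.13   node3
    EOF
    # IP forwarding and masquerading themselves are compiled into the kernel as
    # noted above. On 2.2 and later kernels forwarding can also be toggled with
    #   echo 1 > /proc/sys/net/ipv4/ip_forward
    # and the masquerading rules for the internal subnet are added with ipfwadm
    # (or ipchains on newer kernels); see their documentation for the exact flags.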
3.2 Installation of Compute Nodes
Since a compute node has no CD-ROM drive, we copy the image of the CD-ROM onto
the hard disk of the front-end system and install the RedHat software over ftp from the front-end.
Another method that we also use is cloning the disk. Since the hard disks of the compute nodes are
identical, a new disk can be plugged into the second IDE interface (set to slave) of a fully installed node.
The entire disk is then cloned by issuing the command "dd if=/dev/hda of=/dev/hdb", which
dumps all data from the first hard disk to the second. The setup files must then be
changed afterwards to give each node its own configuration.
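For example, on a RedHat-style installation the per-node changes after cloning can be made by mounting the cloned disk and editing its configuration files; the partition name, host name, and address below are illustrative.

    # Mount the freshly cloned disk, still attached as the IDE slave.
    mount /dev/hdb1 /mnt
    # Give the new node its own host name and IP address.
    vi /mnt/etc/sysconfig/network                        # HOSTNAME=node3
    vi /mnt/etc/sysconfig/network-scripts/ifcfg-eth0     # IPADDR=192.168.1.13
    umount /mnt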
3.3 Single System View
A single system view is a configuration that makes the use of any node in the cluster appear
the same to users. This feature provides many benefits, such as:
• Single point of software installation
• Single management and security control
• Easier installation of parallel programming systems
To achieve this setup, we need
• Single user authentication. This can be done using NIS (Network Information Service)
• Single file system view, which can be done using NFS (Network File System).
In the SMILE cluster, the front-end node also functions as the NFS file server and the NIS domain
server. Each user of the cluster has an account created on the front-end node. The NIS domain
server (ypserv) is already available in RedHat, so the front-end node runs "ypserv" to
service NIS requests while the compute nodes run "ypbind" to receive login information from the
server. Hence, a user can log in and use any node transparently once the account has been
created on the front-end node.
For the file system, we divide the directories into global and local directories. A global
directory looks the same on all nodes. Table 2 lists the global directories and their
intended purposes.
Directory     Purpose
/usr/local    Installation point for commonly used software packages
/home         User home directories

Table 2 Global directories
These directories are mounted at the corresponding places on each compute node. When a user
logs in to any node, that node sends an NIS message to validate the login with the front-end node.
If the login succeeds, the user sees the same /home and /usr/local directories. This setup enables
us to install major software in a single place (/usr/local on the front-end).
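A minimal sketch of this setup is shown below; the host names, the NIS domain name, and the use of plain shell commands instead of the RedHat configuration tools are all illustrative.

    # --- On the front-end (NFS and NIS server) ---
    cat >> /etc/exports <<EOF
    /home       node*(rw)
    /usr/local  node*(rw)
    EOF
    # restart the NFS daemons so the new exports take effect
    domainname smile               # pick an NIS domain name (example)
    /usr/sbin/ypserv               # normally started from an init script

    # --- On each compute node ---
    cat >> /etc/fstab <<EOF
    frontend:/home       /home       nfs  defaults  0 0
    frontend:/usr/local  /usr/local  nfs  defaults  0 0
    EOF
    domainname smile
    /usr/sbin/ypbind               # bind to the NIS server for login information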
3.4 Installation of Parallel Programming Systems
Most parallel programming systems can be installed easily once the previous steps
have been completed. However, most of these systems use the remote execution facilities of UNIX
(rsh, rlogin, rcp), so the UNIX security settings must allow these operations.
Usually, this is done by putting the name of every node in the ".rhosts" file in each user's
home directory, which grants the remote machines permission. In SMILE this can be
done by changing only one ".rhosts" file, since all users share the same home
directory on all nodes. Another choice is for the administrator to set up
"/etc/hosts.equiv" on every machine to allow transparent login among all participating nodes. After these setups,
we install MPI, PVM, and BSP in /usr/local so that every user has access to them.
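For example (the node names below are illustrative), a shared ".rhosts" file can be created once and tested as follows:

    # Create one .rhosts listing every node; because /home is shared, this single
    # file covers the whole cluster for this user.
    ( echo frontend
      for i in 1 2 3 4 5 6 7 8; do echo node$i; done ) > $HOME/.rhosts
    chmod 600 $HOME/.rhosts        # rsh refuses a .rhosts writable by others
    # Quick check: this should print the remote host name with no password prompt.
    rsh node1 hostname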
4 Managing the Cluster
One of the major hurdles in operating a cluster is the lack of management tools.
Realizing this, we have developed a management system for our cluster. This system
centralizes management activity at a single point. The cluster administrator can easily
perform management tasks such as:
• Shut down or reboot any node.
• Log in remotely and execute remote commands on any node.
• Submit a parallel command that executes cluster-wide and collect the results at the
management point.
• Query important statistics, such as CPU and memory usage, from any node or from the
whole cluster.
• Browse the system configuration of any node from a single point.

Figure 2 Organization of the cluster management system (CMA agents on the compute nodes
report node information to the SMA on the front-end node, which management applications
access through the RMI library)
To achieve these goals, we have developed the resource management system shown
in Figure 2. Each compute node runs a daemon called CMA (Control and Monitoring Agent)
that collects node statistics continuously. These statistics are reported to a central resource
management server called SMA (System Management Agent), which keeps track of overall
system information. Management information is presented to the user by a set of specific
management applications. An easy-to-use API called RMI (Resource Management
Interface) has been provided for developers of management applications. The interface is
available as a C library, a Tcl/Tk binding, and a Java class library, and is used to build
management applications and utilities. We plan to release this software to the public soon.
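The RMI library itself is not described here. As a rough, stand-alone illustration of what cluster-wide command submission does (not the RMI API), the following shell sketch with hypothetical node names runs the same command on every node over rsh:

    #!/bin/sh
    # cluster-run: execute the given command on every compute node (sketch only;
    # the real management system does this through the CMA/SMA agents).
    CMD="$*"
    for node in node1 node2 node3 node4 node5 node6 node7 node8; do
        echo "=== $node ==="
        rsh $node "$CMD"
    done

For instance, "./cluster-run uptime" reports the load on all nodes at once.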
5 Current Applications
Presently, the SMILE cluster is used as a scalable server to support courses and
research in parallel processing. We are developing a parallel simulation and visualization of
particle movement in a fluidized bed. This project is a collaboration with a team of researchers in
the Department of Computer Engineering, Kasetsart University. Simulation of fluidized-bed flow is
very compute intensive: the program must keep track of the movement of a large number of
particles in a fluid, and the movement is modeled by a set of differential equations that must be
solved for each particle in each time step of the simulation. Simulating 10,000 particles takes days
to finish on a workstation. Moreover, this application consumes a lot of memory, which severely limits
the size of the model. Hence, parallelizing this application on the cluster enables us to
tackle much larger models with much greater accuracy.
To develop or parallelize an application on a cluster, the guidelines that we found useful
are:
1. Do not start with the parallel version right away; always start with sequential code.
2. Try to improve the sequential code first.
3. Identify the hot spots using a profiling tool (see the sketch below).
4. Parallelize only the hot spots.
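For step 3, a typical profiling session with g77 and gprof might look like the following sketch; the compiler, flags, and file names are examples and may vary with the toolchain.

    g77 -O2 -pg -o fbed fbed.f       # build the sequential code with profiling support
    ./fbed                           # run one (small) case; this writes gmon.out
    gprof fbed gmon.out | more       # the flat profile lists the hot spots first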
We are parallelizing this application by converting the code from sequential
FORTRAN 77 to Fortran 90 (F90) and then to HPF (High Performance Fortran). F90 is a
superset of FORTRAN 77 that includes many high-level constructs, such as array operations and
parallel loops, which enable the compiler to perform automatic parallelization of the code. HPF
increases the efficiency of F90 on parallel computers by adding primitives that let the
programmer specify how data should be distributed over the processors and how related data
should be grouped together to decrease communication. We decided to use this approach because
we are using the IBM SP2 to develop the code. Since there are several F90 and HPF compilers
available for the cluster, porting the application to the cluster is straightforward.
The simulation program generates a huge amount of numerical data that is hard to interpret.
Therefore, visualization is important to the study of the model. The steps we use to convert
the data into an animation are:
• For each time step, generate an object description file for the POV-Ray rendering
software.
• Render each object description file to obtain a bitmap image for that time step.
• Use an MPEG encoder to convert the series of bitmap images into an MPEG-1
movie.
We found that the most time-consuming step is rendering the large number of object description
files, so we developed a script that distributes the rendering tasks to multiple nodes in the cluster. This
speeds up the rendering process dramatically. Some results of the visualization are given in
Figure 3.
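A minimal sketch of such a distribution script is given below; the node names, directory, image size, and POV-Ray options are illustrative, and it assumes that povray is installed on every node and that the frame directory lives on the shared /home file system.

    #!/bin/sh
    # Distribute rendering of the per-time-step scene files over the compute nodes.
    NODES="node1 node2 node3 node4 node5 node6 node7 node8"
    DIR=/home/user/frames            # example location of the .pov files
    i=0
    for node in $NODES; do
        i=`expr $i + 1`
        # Node number i renders every 8th frame (round-robin split of the work).
        frames=`cd $DIR && ls *.pov | awk -v i=$i 'NR % 8 == i % 8'`
        ( for f in $frames; do
              rsh $node "cd $DIR && povray +I$f +O${f%.pov}.tga +W320 +H240"
          done ) &
    done
    wait        # all frames rendered; the images can now be fed to the MPEG encoder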
Figure 3 Visualization of particle movement in the fluidized bed
6 Conclusion and Future Work
Although we have gained a better understanding of the setup and operation of a parallel
Beowulf-class cluster computer, the following important issues still need to be resolved:
• The limited number of applications available. Researchers should be encouraged to develop
applications that can exploit this powerful platform.
• The lack of tools, especially in the areas of ensemble management, parallel programming and
debugging, and visualization of parallel algorithm execution.
The future work that we plan is a direct response to these issues. Since applications
are the key factor, we are trying to collaborate with scientists in various disciplines to co-develop
parallel applications and enhance the usefulness of the system. We also plan to continue
enhancing our management tools in order to ease the management tasks of the cluster
administrator.
There are also many issues involving system configuration that need to be mastered.
We are exploring how to set up diskless nodes so that any computer can be brought into the
cluster easily; an example of similar automatic configuration can be found in [13].
Many tools need to be installed in order to make efficient use of the system. The Parallel Tools
Consortium, a group of researchers that focuses on various aspects of parallel tools, has
issued a guideline for the tools necessary on a parallel supercomputer [14]. The full list of
components and the components supported by the SMILE cluster are shown in Table 3. Notice
that the weaknesses of the cluster are in debugging, resource management and job scheduling, and
parallel I/O. We plan to enhance our system to meet this guideline.
Main components: Application Development; Low-Level Programming Interface; Operating
Systems Services; Parallel System Administration Tools.
Sub-components: Shell and Utilities; Language Support; Debugging/Tuning Tools;
Documentation; Stack Traceback Utilities; Interactive Debugger; Performance Tuning Tools;
Programming Libraries; Math Libraries; Performance Measurement Libraries; Parallel I/O;
Authentication/Security and Namespace Management; File System; Job Management and
Scheduling; Resource Management and Accounting.
Supported by SMILE: 1a, 1b, 1c, 1d, 1e, 1f, 1g, 1h, 2a, 3a, 3c, 3d, 3g, 3h, 3l, 3m, 4f,
BDE-5a (partial), BDE-6a (partial).
Unsupported by SMILE: 2b, 2c, 2d, 2e, 2f, 2g, 2h, 2i, 3b, 3e, 3f, 3i, 3j, 3k, 3n, 3o, 3p,
4a, 4b, 4c, 4d, 4e, 4g, 4h, 4i, 4j, 4k, 4l, 4m, 4n, 4o, 4p, 4q, 4r, 4s, 4t.

Table 3 Necessary parallel tools identified by the Parallel Tools Consortium [14] and their
support status on SMILE
7 Acknowledgement
We would like to acknowledge the support from the Department of Computer Engineering,
Faculty of Engineering, and the Kasetsart University Research and Development Institute. The
contributions of all students in the Parallel Research Group also made this project possible and are
gratefully acknowledged.
8 References
1. ASCI Red: The World's First Teraflop Ultracomputer, Sandia National Laboratory,
"http://www.sandia.gov/ASCI/Red"
2. Top 500 Supercomputer Sites, "http://www.netlib.org/benchmark/top500.html"
3. M. Litzkow, M. Livny, and M. W. Mutka, "Condor - A Hunter of Idle Workstations,"
Proceedings of the 8th International Conference on Distributed Computing Systems, pp.
104-111, June 1988.
4. D. E. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring,
R. Martin, C. Yoshikawa, and F. Wong, "Parallel Computing on Berkeley NOW", Computer
Science Division, University of California, Berkeley, 1997.
5. T. Sterling, D. J. Becker, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer,
"BEOWULF: A Parallel Workstation for Scientific Computation", Proceedings of ICPP95, 1995.
6. M. Ewing, "Beowulf at SC97", "http://www.yale.edu/secf/etc/sc97/beowulf.htm"
7. K. T. Pedretti and S. A. Fineberg, "Analysis of 2D Torus and Hub Topologies of
100 Mb/s Ethernet for the Whitney Commodity Computing Testbed", NAS Technical
Report NAS-97-017, NASA Ames Research Center, Moffett Field, California, 1997.
8. V. S. Sunderam, "PVM: A Framework for Parallel Distributed Computing", Concurrency:
Practice and Experience, December 1990, pp. 315-339.
9. MPI Forum, "MPI: A Message Passing Interface", Supercomputing '93, 1993.
10. MPICH - A Portable Implementation of MPI, "http://www-c.mcs.anl.gov/mpi/mpich"
11. LAM/MPI Parallel Computing, "http://www.osc.edu/lam.html"
12. J. M. D. Hill and D. B. Skillicorn, "Lessons Learned From Implementing BSP",
Technical Report PRG-TR-21-96, Oxford University Computing Laboratory, Oxford.
13. S. A. Fineberg, "A Scalable Software Architecture for Booting and Configuring Nodes
in the Whitney Commodity Computing Testbed", NAS Technical Report NAS-97-024,
NASA Ames Research Center, Moffett Field, California, 1997.
14. Specification of Baseline Development Environment, Guidelines for Writing Systems
Software and Tools Requirements, Parallel Tools Consortium,
"http://www.cs.orst.edu/~pancake/SSTguidelines/baseline.html"