Building a Parallel Computer from Cheap PCs: SMILE Cluster Experiences

Putchong Uthayopas, Thara Angskun, Jullawadee Maneesilp
Parallel Research Group, Computer and Network Systems Research Laboratory,
Department of Computer Engineering, Faculty of Engineering,
Kasetsart University, Bangkok, Thailand, 10900.
Phone: 9428555 Ext. 1416  Fax: 5796245
E-mail: {pu,b38tra,b38jdm}@nontri.ku.ac.th
http://smile.cpe.ku.ac.th

ABSTRACT

This paper presents our experiences in constructing a parallel computer from a cluster of cheap PCs. This parallel computer can be programmed using the PVM and MPI standard message-passing interfaces, so applications developed on it are portable to most commercial supercomputers such as the IBM SP system and the SGI Power Challenge. The steps of system integration and the applications of this system are presented. We found that the key factor in building this kind of system is the integration of suitable hardware and software components. This technology is important in providing affordable supercomputing power for the research and academic communities in Thailand.

1 Introduction and Background

Advances in microprocessor technology enable Intel-based personal computers to deliver performance comparable to a single node of a supercomputer. In fact, the most powerful supercomputer currently in use, at Sandia National Laboratories [1,2], is built from a collection of 200 MHz Pentium Pro microprocessors, a processor that is already a generation behind the one used in current PCs (the 300 MHz Pentium II). The availability of low-cost, fast interconnection networks and interfaces such as 100 Mbps Fast Ethernet also makes it cheap and easy to combine these powerful PCs into a high-performance parallel computing environment. With free versions of UNIX such as Linux and FreeBSD, and public-domain packages from various sources, such a system yields a better price/performance ratio than any commercial high-performance computing system. It is clear that a PC cluster is an excellent high-performance platform for Thai researchers whose work involves compute-intensive applications.

The effort to deliver a low-cost high-performance computing platform to scientific communities has been going on for many years. A network of workstations is a good candidate since it has the same architecture as a distributed-memory multicomputer. An early approach was to steal the cycles of idle workstations in a workstation farm [3]. This approach can provide computing power at low cost, but it suffers from inadequate network bandwidth and from interference by the users of each workstation. As the prices of PCs and workstations decrease rapidly, it seems appropriate to consider a dedicated group of workstations, or cluster, since such systems are now affordable by most research groups. One project that aims to exploit workstation networks is the NOW (Network of Workstations) project at the University of California at Berkeley [4]. Although workstations are robust and offer high performance, their price is still high compared to PCs. As Intel rapidly improves the performance of PC processors, the gap between workstation and PC performance has started to disappear. Hence, many research groups have started to put together commodity off-the-shelf PCs and Fast Ethernet to build parallel computers. One of the largest initiatives in this area is NASA's Beowulf project [5,6]. The goal of the Beowulf project is to explore the software and hardware issues involved in building a large-scale parallel computing platform at low cost.
Currently, many such systems operate at NASA centers and collaborating universities, for example the Whitney system at NASA Ames Research Center, the Hyglac and Naegling systems at Caltech, and the LoBoS system at the National Institutes of Health. Our group is also one of the collaborating groups under the Beowulf project.

This paper discusses our experiences in building and operating a parallel PC cluster named SMILE (Scalable Multicomputer Implementation using Low-cost Equipment), a Beowulf-class cluster. The system has been operating for more than a year at the Computer and Network Systems Research Laboratory, Department of Computer Engineering, Kasetsart University. We use it as a platform for parallel processing research and computational science application development. The rest of this paper is organized as follows. Section 2 describes the hardware and software architecture. Section 3 discusses the issues in system installation. Section 4 presents the software tools we developed to manage the cluster, followed by the applications of our cluster in Section 5. Finally, Section 6 presents the conclusion and points out future directions.

2 SMILE Beowulf Cluster System

2.1 Hardware Configuration

The current SMILE system consists of 8 compute nodes and 1 front-end node connected by one 100 Mbps Fast Ethernet hub. Four of the compute nodes are:
• Pentium II 233 MHz, 512 KB L2 cache (ASUS P2-L97, 440LX chipset)
• 64 MB of main memory, 2 GB EIDE hard disk
• 3Com 3C905TX Ethernet card
The rest of the compute nodes are:
• Pentium 133 MHz, 256 KB L2 cache
• 64 MB of memory, 1.2 GB IDE hard disk
• 3Com 3C905TX Ethernet card
The front-end node uses:
• Pentium 133 MHz with 256 KB L2 cache
• 64 MB of memory, 2 GB EIDE hard disk
• CD-ROM drive and 17-inch monitor
• Two Ethernet cards: one Fast Ethernet card and one 10 Mbps Ethernet card

The configuration is shown in Figure 1. Separating the compute nodes from the outside world with a front-end node yields several benefits:
• It minimizes the use of public Internet addresses, since internal addresses can be assigned freely.
• It separates the heavy network traffic of computation tasks from the external network.
• It tightens security.

[Figure 1: System configuration of the SMILE cluster. The compute nodes and the front-end node are connected by a Fast Ethernet hub; the front-end node also connects to the external network through a 10 Mbps Ethernet hub.]

There are many technologies available for connecting a group of PCs into a cluster. Conventional 10 Mbps Ethernet is too slow for most purposes except application development. ATM is fast but still very expensive. The most economical choice is therefore Fast Ethernet, since this technology can deliver up to 100 Mbps at low cost for both the switching equipment and the interface cards.

There are two approaches to linking the compute nodes together:
1. Use a fast network switch.
2. Use multiple network cards per node and link the cluster with a static network topology such as a hypercube or mesh.
The first approach has the benefit of easy installation and high throughput, while the second may have a lower cost. For small systems the first approach is more effective, since the second suffers from the latency caused by routing packets across the different interfaces of a node. Nevertheless, one factor that helps determine the applicability of the second alternative is the communication pattern of the application; in fact, many applications may be able to exploit the redundant communication links to achieve even faster communication.
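To make the second approach more concrete, each node in such a static topology carries routing-table entries that relay traffic for distant nodes through a neighbour. The following is a minimal illustrative sketch only; the addresses, netmasks, and interface names are hypothetical and do not describe any actual SMILE configuration.

    # Illustrative static routes for one node under the second approach:
    # packets for two distant "rows" of a mesh are relayed through two
    # different neighbours reached over two different interfaces.
    route add -net 192.168.2.0 netmask 255.255.255.0 gw 192.168.1.2 dev eth1
    route add -net 192.168.3.0 netmask 255.255.255.0 gw 192.168.1.3 dev eth2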
In the recent literature [7], 2D torus topologies and a Fast Ethernet hub have been evaluated. The authors reported that a 2D torus topology delivers more aggregate bandwidth and faster barrier synchronization, and for simulated CFD applications the 2D torus also demonstrated substantially better performance than the hub. In our system, a 100 Mbps Fast Ethernet hub is used instead of a switch due to budget limitations.

2.2 Software Configuration

2.2.1 Operating System

The most popular OS for cluster computing is Linux (see http://www.linux.org). The reasons are:
1. Linux is free and comes with the source code of the system, which makes it suitable for research and development.
2. Plenty of good public-domain software is available.
3. The Linux community responds quickly to technology changes; new hardware drivers and networking code are made available regularly.
We chose Linux RedHat 5.0 for the SMILE system because of its ease of installation and strong management tools.

2.2.2 Parallel Programming Systems

To use a cluster for parallel processing, a software layer must be installed to build a virtual parallel machine from a group of nodes. This software usually takes care of parallel task creation and deletion, passing data among tasks, and synchronizing task execution. Several systems are publicly available, such as:
• PVM [8] from the University of Tennessee and Oak Ridge National Laboratory.
• MPI [9] from the MPI Forum. Since MPI is only a standard specification, there are many implementations available, such as MPICH [10] and LAM [11].
• BSP [12] from Oxford University.
PVM is flexible and contains many features that are still not available in MPI, such as dynamic process creation. However, MPI has recently gained more acceptance since it is a standard supported by many institutions. BSP is another interesting programming system that has started to gain wider acceptance lately. MPICH and BSP have been installed on the SMILE system.

Fortran 90 and HPF (High Performance Fortran) are also interesting alternatives. These languages ease the process of porting old FORTRAN 77 applications to a cluster by providing automatic parallelization of code and data under the control of the programmer. We are currently evaluating this alternative since it is similar to the IBM SP2 environment available at Kasetsart University. The disadvantage is that no public-domain compiler is available. Some parallel math libraries, such as ScaLAPACK and PETSc, are also available; researchers can easily code programs using these libraries since no knowledge of parallel processing is required. Table 1 summarizes our choices.

Category                   Software used
Operating System           Linux RedHat 5.0
Message Passing Library    MPICH, PVM 3.3, BSP
Visualization              VIS5D
Math Library               ScaLAPACK, PETSc

Table 1: SMILE main software components
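To give a concrete feel for how the message-passing software in Table 1 is used, the following sketches a typical MPICH workflow on such a cluster. The install prefix, machine file contents, host names, and program name are illustrative assumptions, not the actual SMILE settings.

    # The machine file lists the hosts that will run MPI processes,
    # one per line, e.g.:
    #   compute1
    #   compute2
    #   compute3
    #   compute4
    #
    # Compile with the MPICH wrapper and start one process per listed host
    # (the install prefix and program name are hypothetical):
    /usr/local/mpich/bin/mpicc -O2 -o hello hello.c
    /usr/local/mpich/bin/mpirun -np 4 -machinefile machines ./hello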
3 System Installation

The SMILE system has a long history of changes; the initial system started from five 486 machines and evolved into the current configuration. The typical steps to build a cluster are:
1. Install Linux on the front-end node.
2. Install each compute node and link them together to form a cluster.
3. Set up a single system view.
4. Install the parallel programming systems.
The following subsections explain these steps in detail.

3.1 Linux Installation on the Front-end System

Linux can be installed from CD-ROM, NFS, or ftp. The front-end node should have a CD-ROM drive, since installation from CD is usually the simplest method. On the front-end node, we partition the hard disk into four partitions: the root file system (/), swap space, user commands (/usr), and home directories (/home). For the swap space, the rule of thumb is to allocate twice the size of physical memory.

One of the trickiest parts of the installation is setting up the Ethernet cards. We learned that:
• Plug-and-Play support must be turned off in the BIOS configuration.
• All interrupts must be set so that there are no conflicts.
• The kernel must be compiled with support for IP forwarding and IP masquerading.
Once all of this has been done, routing is set up automatically by the system. At present, IP addresses are specified by hand in /etc/hosts. The Internet addresses of the internal nodes can be any valid addresses, since they are separated from the external network.

3.2 Installation of Compute Nodes

Since a compute node has no CD-ROM drive, we copy the image of the CD-ROM onto the hard disk of the front-end system and install the RedHat software over ftp from the front-end. Another method that we also use is cloning the disk. Since the hard disks of all nodes are identical, a blank disk can be plugged into the second IDE interface (set to slave) of a fully installed node and the entire disk cloned by issuing the command "dd if=/dev/hda of=/dev/hdb", which copies all data from the first hard disk to the second. The configuration files must then be changed afterwards to match the new node.

3.3 Single System View

A single system view is a configuration that makes any node in the cluster appear the same to users. This feature provides many benefits, such as:
• a single point of software installation,
• single management and security control, and
• easier installation of parallel programming systems.
To achieve this setup we need:
• single user authentication, which can be done using NIS (Network Information System), and
• a single file system view, which can be done using NFS (Network File System).
In the SMILE cluster, the front-end node also functions as the NFS file server and the NIS domain server. Each cluster user has an account created on the front-end node. The NIS domain server (ypserv) is already available in RedHat, so the front-end node runs "ypserv" to service NIS requests while the compute nodes run "ypbind" to receive login information from the server. Hence, a user can log in and use any node transparently once the account has been created on the front-end node.

For files, we divide the directories into global and local directories. The global directories look the same on all nodes. Table 2 lists the global directories and their intended purposes.

Directory     Purpose
/usr/local    Commonly used software packages
/home         User home directories

Table 2: Global directories

These directories are mounted at the corresponding places on each compute node. When a user logs in to any node, that node sends an NIS message to validate the login with the front-end node. If the login succeeds, the user sees the same /home and /usr/local directories. This setup enables us to install major software in a single place (/usr/local on the front-end), as the sketch below illustrates.
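The sketch below shows one plausible way this wiring looks, assuming a front-end host named "fe", a hypothetical NIS domain "smile", and common RedHat file locations; the actual SMILE configuration may differ in detail.

    # On the front-end, /etc/exports makes the global directories available
    # to the compute nodes, e.g.:
    #   /home        compute*(rw)
    #   /usr/local   compute*(rw)
    # ypserv is then run there to answer NIS lookups for the accounts
    # created on the front-end.

    # On each compute node, the global directories are NFS-mounted from the
    # front-end (normally via /etc/fstab) and ypbind is bound to the NIS
    # domain served by the front-end:
    mount -t nfs fe:/home /home
    mount -t nfs fe:/usr/local /usr/local
    domainname smile
    ypbind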
3.4 Installation of Parallel Programming Systems

Most of the parallel programming systems can be installed easily once the previous steps have been completed. However, most of these systems rely on the remote execution support in UNIX (rsh, rlogin, rcp), so the UNIX security settings must allow these operations. Usually this is done by putting the name of every node in the ".rhosts" file in each user's home directory, which grants permission to the listed remote machines. A setup like this is particularly easy in SMILE: only one ".rhosts" file needs to be changed, since all users share the same home directory on all nodes. Another choice is for the administrator to set up "/etc/hosts.equiv" on every machine to allow transparent login among all participating nodes. After these setups, we install MPI, PVM, and BSP in /usr/local so that every user has access to them.

4 Managing the Cluster

One of the major hurdles in operating a cluster is the lack of management tools. Realizing this, we have developed a management system for our cluster that centralizes management activity at a single point. The cluster administrator can easily perform management tasks such as:
• shut down or reboot any node;
• log in remotely and execute commands on any node;
• submit a parallel command that executes cluster-wide and collect the results at the management point;
• query important statistics such as CPU usage and memory usage from any node or from the whole cluster; and
• browse the system configuration of any node from a single point.
To achieve these goals, we have developed the resource management system shown in Figure 2. Each compute node runs a daemon called the CMA (Control and Monitoring Agent) that collects node statistics continuously. These statistics are reported to a central resource management server called the SMA (System Management Agent), which keeps track of overall system information. Management information is presented to the user by a set of specific management applications. A set of easy-to-use APIs called the RMI (Resource Management Interface) is provided for developers of management applications; the interface is available as a C library, as TCL/TK bindings, and as a Java class library, and is used to develop the management applications and utilities. We plan to release this software to the public soon.

[Figure 2: Organization of the cluster management system. CMAs on the compute nodes report node information to the SMA on the front-end node; management applications on client machines access the system information through the RMI library.]

5 Current Applications

At present, the SMILE cluster is used as a scalable server to support courses and research in parallel processing. We are developing a parallel simulation and visualization of particle movement in a fluidized bed, in collaboration with a team of researchers in the Department of Computer Engineering, Kasetsart University. Simulation of fluidized-bed flow is very compute-intensive. The program must keep track of the movement of a large number of particles in a fluid; the movement is modeled by a set of differential equations that must be solved for each particle in each time step of the simulation. Simulating 10,000 particles takes days to finish on a workstation. Moreover, this application consumes a lot of memory, which severely limits the size of the model. Hence, parallelizing this application on the cluster enables us to tackle a much larger and more accurate model.

To develop or parallelize an application on a cluster, the guidelines that we found useful are:
1. Do not start with the parallel version right away; always start with the sequential code.
2. Try to improve the sequential code first.
3. Identify the hot spots using a profiling tool (see the sketch after this list).
4. Parallelize only the hot spots.
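For step 3, any profiler available on the nodes will do. The following small sketch uses the GNU profiling tools; the compiler and program names are merely illustrative.

    # Build the sequential code with profiling enabled, run it once, and
    # inspect which routines dominate the run time (names are hypothetical).
    g77 -O2 -pg -o fbed fbed.f
    ./fbed
    gprof ./fbed gmon.out | more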
We are parallelizing this application by converting the code from sequential FORTRAN 77 to Fortran 90 (F90) and then to HPF (High Performance Fortran). F90 is a superset of FORTRAN 77 that includes many high-level constructs, such as array operations and parallel loops, which enable the compiler to perform automatic parallelization of the code. HPF increases the efficiency of F90 on a parallel computer by adding primitives that let the programmer specify how data should be distributed over the processors and how related data should be grouped together to reduce communication. We chose this approach because we are using the IBM SP2 to develop the code, and since F90 and HPF compilers are also available for the cluster, porting the application to the cluster is straightforward.

The simulation program generates a huge amount of numerical data that is hard to interpret, so visualization is important to the study of the model. The steps we use to convert the data into an animation are:
• For each time step, generate an object description file for the POV-Ray rendering software.
• Render each object description file to obtain a bitmap image of that time step.
• Use an MPEG encoder to convert the series of bitmap images into an MPEG-1 movie.
We found that the most time-consuming step is rendering the large number of object description files, so we developed a script that distributes the rendering tasks to multiple nodes in the cluster. This speeds up the rendering process dramatically.
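Such a frame-distribution script can be quite small. The listing below is a simplified sketch rather than the actual SMILE script; it assumes four compute nodes, an NFS-shared working directory, the rsh access set up in Section 3.4, and POV-Ray's command-line options, with all host names and paths hypothetical.

    #!/bin/sh
    # Farm the rendering of each .pov frame out to the compute nodes in
    # round-robin fashion, four frames at a time.
    NODES="compute1 compute2 compute3 compute4"
    cd /home/user/frames       # shared over NFS, so output collects in one place
    i=0
    for f in *.pov; do
      k=`expr $i % 4 + 1`
      node=`echo $NODES | cut -d" " -f$k`
      out=`basename $f .pov`.tga
      rsh $node "cd /home/user/frames && povray +I$f +O$out +W320 +H240" &
      i=`expr $i + 1`
      if [ `expr $i % 4` -eq 0 ]; then wait; fi   # wait for each batch of four
    done
    wait
    # The rendered frames are then fed to an MPEG-1 encoder on one node.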
Some results of the visualization are shown in Figure 3.

[Figure 3: Visualization of particle movement in a fluidized bed.]

6 Conclusion and Future Work

Although we have gained a better understanding of the setup and operation of a Beowulf-class parallel cluster, the following important issues still need to be resolved:
• The limited number of available applications. Researchers should be encouraged to develop applications that can exploit this powerful platform.
• The lack of tools, especially in the areas of ensemble management, parallel programming and debugging, and visualization of parallel algorithm execution.
Our planned future work responds directly to these issues. Since applications are the key factor, we are trying to collaborate with scientists in various disciplines to co-develop parallel applications that enhance the usefulness of the system. We also plan to continue enhancing our management tools in order to ease the management tasks of the cluster administrator. There are also many system configuration issues that still need to be mastered; for example, we are exploring how to set up diskless nodes so that any computer can be brought into the cluster easily. An example of similar automatic configuration can be found in [13]. In addition, many tools need to be installed to make efficient use of the system. The Parallel Tools Consortium, a group of researchers focusing on various aspects of parallel tools, has issued a guideline listing the tools necessary on a parallel supercomputer [14]. Table 3 lists these components and indicates which are supported by the SMILE cluster. Notice that the weaknesses of the cluster are in debugging, resource management and job scheduling, and parallel I/O. We plan to enhance our system to meet this guideline.

[Table 3: List of necessary parallel tools identified by the Parallel Tools Consortium, with columns marking which baseline items are supported or unsupported by SMILE. The main components and sub-components are: Application Development (shell and utilities; language support; debugging/tuning tools, including documentation, stack traceback utilities, an interactive debugger, and performance tuning tools); Low-Level Programming Interface (programming libraries, math libraries, performance measurement libraries, parallel I/O); Operating System Services (authentication/security and namespace management, file system, job management and scheduling, resource management and accounting); and Parallel System Administration Tools.]

7 Acknowledgements

We would like to acknowledge the support of the Department of Computer Engineering, Faculty of Engineering, and the Kasetsart University Research and Development Institute. The contributions of all the students in the Parallel Research Group made this project possible and are gratefully acknowledged.

8 References

1. ASCI Red: The World's First Teraflop Ultracomputer, Sandia National Laboratories, http://www.sandia.gov/ASCI/Red
2. Top 500 Supercomputer Sites, http://www.netlib.org/benchmark/top500.html
3. M. Litzkow, M. Livny, and M. W. Mutka, "Condor – A Hunter of Idle Workstations," Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 104-111, June 1988.
4. D. E. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong, "Parallel Computing on the Berkeley NOW," Computer Science Division, University of California, Berkeley, 1997.
5. T. Sterling, D. J. Becker, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer, "BEOWULF: A Parallel Workstation for Scientific Computation," Proceedings of ICPP95, 1995.
6. M. Ewing, "Beowulf at SC97," http://www.yale.edu/secf/etc/sc97/beowulf.htm
7. K. T. Pedretti and S. A. Fineberg, "Analysis of 2D Torus and Hub Topologies of 100 Mb/s Ethernet for the Whitney Commodity Computing Testbed," NAS Technical Report NAS-97-017, NASA Ames Research Center, Moffett Field, California, 1997.
8. V. S. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, December 1990, pp. 315-339.
9. MPI Forum, "MPI: A Message Passing Interface," Supercomputing '93, 1993.
10. MPICH – A Portable Implementation of MPI, http://www-c.mcs.anl.gov/mpi/mpich
11. LAM/MPI Parallel Computing, http://www.osc.edu/lam.html
12. J. M. D. Hill and D. B. Skillicorn, "Lessons Learned From Implementing BSP," Technical Report PRG-TR-21-96, Oxford University Computing Laboratory, Oxford.
13. S. A. Fineberg, "A Scalable Software Architecture for Booting and Configuring Nodes in the Whitney Commodity Computing Testbed," NAS Technical Report NAS-97-024, NASA Ames Research Center, Moffett Field, California, 1997.
14. Specification of Baseline Development Environment, Guidelines for Writing Systems Software and Tools Requirements, Parallel Tools Consortium, http://www.cs.orst.edu/~pancake/SSTguidelines/baseline.html