Supercomputing with Heterogeneous Cores and Operating Systems: How to Efficiently Communicate Between Systems and Load Balance
By Miguel Garcia

Because it has become harder and harder to improve the speed of individual processor cores, modern computers increase processing speed by combining multiple processor cores into a single processor (Supercomputer). Using a similar idea, modern supercomputers use racks of processors working in massively parallel systems, or even distributed systems in which large groups of computers process data under the direction of a master computer (Supercomputer). Many of these systems are homogeneous, using identical processors or computers running identical operating systems, but the ease with which such systems can be built from limited resources has led to systems with heterogeneous cores, or even computers running several different operating systems. In these cases, the OS or application has to communicate among the various processors and kernels, not only to run the software but also to keep the system running efficiently. This has led some researchers to focus on improving communication between the different cores, and others to work on dynamic load balancing for the mismatched parts.

The communication specialists break into two groups: those trying to improve the OS's ability to communicate among systems, and those using applications to improve communication. For example, one group wrote new libraries to improve Java's intercommunication capabilities (Taboada). Their goal was to bring Java's communications up to par with other systems so that more programmers can take advantage of Java's built-in multithreading support (Taboada). In theory, this would make it easier to write more efficient programs (Taboada). Others are focusing on creating stable systems that can work on any mix of cores and operating systems, setting communication protocols that make it easier to coordinate heterogeneous architectures (Massetto). Another group chose to create a wrapper for interprocess communication (IPC) system calls that translates closed-source IPC system calls into open-source equivalents (Sharifi). This creates a universality that lets systems running multiple operating systems communicate with fewer issues. Finally, one group has written algorithms that assume a common OS but help the programmer distribute processes among the different cores based on the capabilities of each individual processor (Martinez): faster cores get the more complex threads, and slower cores get the simpler, faster threads.

Modified MPI Protocol

The group working on stability recognized that many distributed systems were already using LANs and Wi-Fi to allow communication between the various nodes (Massetto). They focused on the Message Passing Interface (MPI), which allows fairly quick and stable passing of messages among homogeneous computer networks (Massetto). However, different operating systems have different protocols, and the system runs into issues as soon as a few different computers are added (Massetto). So they set out to find a way to keep the quicker MPI messaging while maintaining stability in a cluster made up of different operating systems (Massetto). Massetto's group realized that, because the systems were already using LANs and Wi-Fi, they could use those links to bridge the gap (Massetto). So Massetto's group designed a library that could be added to existing MPI implementations (Massetto). Their library used TCP/IP channels to translate messages between the different operating systems (Massetto). In this way, efficient MPI communication occurred between the homogeneous portions of the computer network, but when communication needed to cross into sections running a different operating system, the nodes fell back on the TCP/IP protocol, which all the systems understood (Massetto). This increased overhead compared with systems that could use MPI for all messaging, but it allowed stable and continuous communication among the various components of a heterogeneous computing cluster (Massetto).
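Conceptually, the scheme is a send routine that prefers native MPI inside a homogeneous sub-cluster and falls back to a plain TCP socket whenever the destination runs a different operating system. The Java sketch below is a hypothetical illustration of that fallback decision only, using standard sockets; the class, interface, and map names are invented for this example and are not Massetto's actual API.

    import java.io.DataOutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Map;

    /**
     * Hypothetical sketch of the hybrid messaging idea: use fast native MPI
     * inside a homogeneous sub-cluster, and fall back to plain TCP/IP when the
     * destination node runs a different operating system. Names are illustrative.
     */
    public class HybridSender {

        /** Minimal stand-in for whatever native MPI binding the sub-cluster uses. */
        public interface MpiLikeTransport {
            void send(int destRank, byte[] payload) throws Exception;
        }

        private final String localOs;                                 // e.g. "linux"
        private final Map<Integer, String> rankToOs;                  // OS of every rank in the cluster
        private final Map<Integer, InetSocketAddress> rankToAddress;  // TCP endpoints of every rank
        private final MpiLikeTransport mpi;                           // fast path within the homogeneous group

        public HybridSender(String localOs,
                            Map<Integer, String> rankToOs,
                            Map<Integer, InetSocketAddress> rankToAddress,
                            MpiLikeTransport mpi) {
            this.localOs = localOs;
            this.rankToOs = rankToOs;
            this.rankToAddress = rankToAddress;
            this.mpi = mpi;
        }

        /** Send a message, choosing MPI or the slower but universal TCP bridge. */
        public void send(int destRank, byte[] payload) throws Exception {
            if (localOs.equals(rankToOs.get(destRank))) {
                // Same OS: stay on the efficient MPI path.
                mpi.send(destRank, payload);
            } else {
                // Different OS: bridge over TCP/IP, which every node understands.
                try (Socket socket = new Socket()) {
                    socket.connect(rankToAddress.get(destRank));
                    DataOutputStream out = new DataOutputStream(socket.getOutputStream());
                    out.writeInt(payload.length);   // simple length-prefixed framing
                    out.write(payload);
                    out.flush();
                }
            }
        }
    }

The extra connection setup and framing on the TCP path is where the added overhead comes from: the design trades some speed on cross-OS messages for a cluster whose components can always reach one another.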
Java Based Protocol

Taboada and his group sought to tackle both the efficiency and the communication problems of heterogeneous computer networks by taking advantage of user-level programming in Java (Taboada). Taboada and his team noted that while Java has "built-in networking and multithreading support, object orientation, automatic memory management, platform independence, portability, security, an extensive API and a wide community of developers," it has often been avoided in high-end computing because of its inefficient communication protocols (Taboada). However, those protocols do allow Java applications on multiple systems to communicate with each other (Taboada). So his team created a library that can be added to Java to improve that communication efficiency (Taboada). Their initial effort focused on improving the scalability of Java's MPJ (Message-Passing in Java), which was notably slower than MPI and not optimized to drop unnecessary protocols depending on the type of system (Taboada). They also sought to minimize communication among the various processors as much as possible, so as to limit how often the less efficient Java communication protocols would be activated (Taboada). They did this by "favoring multithreading-based solutions" over "inter-node communication," choosing an efficient algorithm to parcel out the processes based on their number and the required message size, and adding an "automatic performance tuning process" designed to create an "optimal configuration file" that "maximizes the collectives performance in a given system" (Taboada). In other words, they scaled down the messaging system to improve its overall efficiency, and then tried to make the system work out how to assign its processes most efficiently at the beginning of a run so that as few messages as possible had to be sent (Taboada). After running programs with Java on various multicore systems, they found significant improvements in the speed of Java-based software on those systems (Taboada). In fact, the new protocols came fairly close in efficiency to the MPI protocols used by many homogeneous systems (Doallo). Doallo concludes that because this messaging library improves Java messaging to the point that it is comparable to MPI, Java is coming closer to displacing native languages like Fortran that don't provide the "built-in multithreading and networking support" that Java does (Taboada).
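To make the "favoring multithreading-based solutions" idea concrete, the sketch below (plain Java, standard library only; it does not use Taboada's library, and all names are invented) performs the intra-node part of a sum reduction with shared-memory threads, so that only a single partial result per node would ever need to travel over the slower inter-node channel.

    import java.util.Arrays;
    import java.util.concurrent.*;

    /**
     * Illustrative sketch only: do the intra-node part of a reduction with
     * shared-memory threads, so just one partial result per node has to be
     * sent over the comparatively slow inter-node channel.
     */
    public class NodeLocalReduce {

        /** Sum 'data' using all local cores; the caller then sends only this one double. */
        public static double localSum(double[] data) throws Exception {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            try {
                int chunk = (data.length + cores - 1) / cores;
                CompletionService<Double> results = new ExecutorCompletionService<>(pool);
                int tasks = 0;
                for (int start = 0; start < data.length; start += chunk) {
                    final int from = start;
                    final int to = Math.min(start + chunk, data.length);
                    results.submit(() -> {
                        double s = 0.0;
                        for (int i = from; i < to; i++) s += data[i];
                        return s;
                    });
                    tasks++;
                }
                double total = 0.0;
                for (int i = 0; i < tasks; i++) total += results.take().get();
                return total; // one value crosses the network instead of the whole array
            } finally {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            double[] data = new double[1_000_000];
            Arrays.fill(data, 1.0);
            System.out.println("node-local partial sum = " + localSum(data));
        }
    }

In a real collective, that one value per node would then take part in a small inter-node exchange; the benefit being chased is that the comparatively expensive Java messaging carries one number per node instead of the whole array.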
IPC Wrapper

The previous teams focused on applications that coordinated the various operating systems at the user level, but the final communication team looked at making universal system calls by using a wrapper. Sharifi's team noted that more and more systems requiring large amounts of computing power were relying on distributed computing models, and that these models made use of commercially available machines that were often run with off-the-shelf operating systems (Kahsyan). This meant that systems were put together with whatever computers happened to be available and could end up with some computers running one operating system and others running a completely different one (Sharifi). However, all of these machines still have to be able to talk to each other and to distribute processes among themselves in order to run programs, and because none of them use the same IPC protocols, this led to problems in building efficient systems (Sharifi). Sharifi noted that while user-level IPC is the easiest to implement in heterogeneous systems, it is more difficult to program and not terribly efficient (Sharifi). Kernel-level IPC, however, is far simpler to program and a great deal more efficient (Sharifi). Unfortunately, because there is no single agreed-upon protocol for IPC in commercial operating systems, it is very hard to write a program that can issue IPC system calls across a system comprised of multiple operating systems (Sharifi). This is why many of the existing approaches to heterogeneous systems, such as Condor, IPC layer, and LAM, tend to use user-level protocols (Sharifi). They translate system calls between the application and the kernel for the programmer, and even then they can run into difficulties when two operating systems don't share a system call (Sharifi). Their purpose is to ensure stable communication and function across the distributed network, not to attain high performance (Sharifi). Other approaches, such as socket-based ones, run into portability issues that limit how easily they can be applied to new and different clusters of computers (Sharifi).

In order to allow groups to take full advantage of existing computing resources without having to homogenize them, Sharifi and his team developed a wrapper specifically for systems comprised of a mix of Windows- and Linux-based computers (Sharifi). The wrapper translates Windows system calls to their Linux equivalents (Sharifi). Using it, programmers can write software that uses kernel-level IPC calls without having to navigate multiple system architectures and determine which system gets which instructions (Sharifi). Basically, one computer, a Linux machine, is designated as the lead computer (Sharifi). The programmers develop the software to run on this Linux main computer and to be parceled out to all of its distributed counterparts as if they were all running Linux as well (Sharifi). While programming for the Linux system, the programmer sees a list of Windows-equivalent calls that the computer generates based on its knowledge of both systems (Sharifi). In this way, the Linux program and its Windows translation are created at the same time (Sharifi). So even though the software was developed for Linux, when it is introduced to the cluster, which is actually a mix of Linux- and Windows-based computers, it is still able to run without issue (Sharifi). This is because the wrapper has allowed the program to be developed with the Windows translation being created alongside it (Sharifi). When the work moves between Windows- and Linux-based systems within the cluster, the wrapper translates the data into the form that the receiving computer needs in order to process it at the kernel level (Sharifi). The wrapper performed on par with an established ONC RPC-based system, one of the faster available heterogeneous system-call translators, but both were significantly slower than shared memory and pipes, the methods used on homogeneous systems (Sharifi).
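Sharifi's wrapper operates on kernel-level IPC system calls, which cannot be reproduced faithfully in a short example, but the translation idea itself can be sketched at user level. In the hypothetical Java sketch below (every name is invented), application code is written once against a single Linux-flavored interface, and a small translation layer substitutes the Windows-style operation on nodes that need it.

    /**
     * User-level sketch of the translation idea only; the actual wrapper
     * operates on kernel IPC system calls. All names here are invented.
     */
    public class IpcTranslationSketch {

        /** The single set of operations the programmer writes against. */
        public interface IpcChannel {
            void send(String message);
            String receive();
        }

        /** Backend for nodes whose native IPC resembles POSIX message queues. */
        static class LinuxLikeChannel implements IpcChannel {
            private final java.util.ArrayDeque<String> queue = new java.util.ArrayDeque<>();
            public void send(String message) { queue.add(message); }   // stands in for mq_send
            public String receive()          { return queue.poll(); }  // stands in for mq_receive
        }

        /** Backend for nodes whose native IPC resembles Windows mailslots or named pipes. */
        static class WindowsLikeChannel implements IpcChannel {
            private final java.util.ArrayDeque<String> slot = new java.util.ArrayDeque<>();
            public void send(String message) { slot.add(message); }    // stands in for WriteFile on a mailslot
            public String receive()          { return slot.poll(); }   // stands in for ReadFile on a mailslot
        }

        /** The "wrapper": picks the right backend so application code never changes. */
        static IpcChannel openChannel(String osName) {
            boolean isWindows = osName.toLowerCase().contains("windows");
            return isWindows ? new WindowsLikeChannel() : new LinuxLikeChannel();
        }

        public static void main(String[] args) {
            IpcChannel channel = openChannel(System.getProperty("os.name"));
            channel.send("hello from the lead node");
            System.out.println(channel.receive());
        }
    }

The application only ever sees the one interface; the per-OS choice is made once by the wrapper, which is roughly the role Sharifi's system plays when it generates the Windows-equivalent call list alongside the Linux program.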
ALBIC

ALBIC, unlike the previous three examples, does not focus on the communication protocols between the various parts of the supercomputer. Instead, ALBIC focuses on load balancing processes among the cores of a heterogeneous-core system for programs designed for homogeneous parallel processing systems (Martinez). Basically, ALBIC assumes that the network already has effective communication protocols and instead tries to recover the efficiency lost by programs that were initially designed for systems with homogeneous cores (Martinez). Because those cores are all the same, such a program is designed to send threads of similar size to each of the cores (Martinez). In a heterogeneous system, however, each core has its own capabilities (Martinez). To address this, Martinez and his team assumed all the cores ran a Linux OS, modified that OS to take more frequent samples of the processor stack length, and added a system call at the beginning and at the end of the section of code to be balanced (Martinez). This lets the computer know which code sections to pay attention to, and allows it to dynamically assess the processing capability of each processor by checking how much of its stack has been completed between samplings (Martinez). Processors that are moving through their stacks more slowly than the others are assigned fewer tasks, and the system becomes more efficient (Martinez). The best part is that the modifications require very little user input and can quickly adapt a program written for a homogeneous system to a heterogeneous-core system, without days spent studying the computer architecture and reprogramming (Martinez). So their system is advantageous not only because it showed increased efficiency compared with other proposals, but because it does so without hours or days of coding time spent re-optimizing the program for a new system (Martinez).
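ALBIC's measurements come from a modified Linux kernel, but its feedback loop can be imitated at user level. The hypothetical Java sketch below (invented names, plain threads, a made-up workload) times how quickly each worker finished its previous chunk and then hands out the next round's iterations in proportion to those measured speeds, which is the same dynamic idea in miniature.

    import java.util.concurrent.*;

    /**
     * User-level sketch of the dynamic idea behind ALBIC (not its kernel
     * implementation): measure how fast each worker processed its previous
     * chunk, then give faster workers proportionally more of the next round.
     */
    public class DynamicBalanceSketch {

        public static void main(String[] args) throws Exception {
            final int workers = Runtime.getRuntime().availableProcessors();
            final int iterationsPerRound = 4_000_000;
            double[] measuredRate = new double[workers];   // iterations per nanosecond
            java.util.Arrays.fill(measuredRate, 1.0);      // start by assuming equal cores

            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int round = 0; round < 5; round++) {
                // Split the round's work in proportion to each worker's measured speed.
                double totalRate = 0;
                for (double r : measuredRate) totalRate += r;
                CountDownLatch done = new CountDownLatch(workers);
                for (int w = 0; w < workers; w++) {
                    final int id = w;
                    final int share = (int) (iterationsPerRound * measuredRate[w] / totalRate);
                    pool.submit(() -> {
                        long start = System.nanoTime();
                        double sink = 0;
                        for (int i = 0; i < share; i++) sink += Math.sqrt(i); // stand-in workload
                        long elapsed = Math.max(1, System.nanoTime() - start);
                        measuredRate[id] = (double) share / elapsed;          // feedback for next round
                        done.countDown();
                    });
                }
                done.await();
                System.out.printf("round %d measured rates: %s%n",
                        round, java.util.Arrays.toString(measuredRate));
            }
            pool.shutdown();
        }
    }

ALBIC gets the equivalent speed estimate from the kernel's more frequent sampling between the two bracketing system calls, so the application itself does not need its own timing loop.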
Because computer systems have advanced so quickly, we are finally reaching the point where it is easy to create a powerful computing system out of readily available equipment with very little specialized hardware or software. Unfortunately, because these components aren't all standardized, we have to work out how to coordinate components and software that weren't necessarily meant to work together. How to ensure a stable system while efficiently making use of all resources is still a work in progress.

Bibliography

Martínez, J., et al. "Adaptive Load Balancing of Iterative Computation on Heterogeneous Nondedicated Systems." Journal of Supercomputing 58.3 (2011): 385-393. Academic Search Complete. Web. 24 Apr. 2014.

Massetto, F.I., L.M. Sato, and K.-C. Li. "A Novel Strategy for Building Interoperable MPI Environment in Heterogeneous High Performance Systems." Journal of Supercomputing 60.1 (2012): 87-116. Scopus. Web. 24 Apr. 2014.

Sharifi, Mohsen, et al. "A Platform Independent Distributed IPC Mechanism in Support of Programming Heterogeneous Distributed Systems." Journal of Supercomputing 59.1 (2012): 548-567. Academic Search Complete. Web. 24 Apr. 2014.

"Supercomputer Architecture." Wikipedia. Wikipedia. Web. 22 Apr. 2014.

Taboada, G.L., et al. "Design of Efficient Java Message-Passing Collectives on Multi-Core Clusters." Journal of Supercomputing 55.2 (2011): 126-154. Scopus. Web. 24 Apr. 2014.