Supercomputing with Heterogeneous Cores and Operating Systems:
How to Efficiently Communicate Between Systems and Load Balance
Miguel Garcia
Because it has become increasingly difficult to improve the speed of individual
processor cores, modern computers increase processing speed by combining multiple
processor cores into a single processor (Supercomputer). Using a similar idea, modern
supercomputers use racks of processors working as massively parallel systems, or
even distributed systems in which large groups of computers process data under the
direction of a master computer (Supercomputer).
Many of these systems are homogeneous, using identical processors or computers
running identical operating systems, but the ease with which such systems can be built
from limited resources has led to systems with heterogeneous cores, or even computers
running several different OSes. In these cases, the OS or application has to be able to
communicate between the various processors and kernels, not only to run the software but
also to ensure that the system is running efficiently. This has led some researchers to
focus on improving the communication between the different cores, and others to work on
dynamic load balancing for the mismatched parts.
The communication specialists break up into two groups: those trying to improve
the OS’s ability to communicate among systems, and those using applications to improve
communication. For example, one group wrote new libraries to improve Java’s
intercommunication capabilities (Taboada). Their goal was to bring Java’s
communications up to par with other systems so that more programmers can take
advantage of Java’s built-in multithreading support (Taboada). Theoretically, this would
make it easier to write more efficient programs (Taboada). Others are focusing on
creating stable systems that can work on any mix of cores and OSes. They are setting
communication protocols to make it easier for people to coordinate heterogeneous
architectures (Massetto). Another group chose to create a wrapper for Interprocess
Communication (IPC) system calls that translates closed source IPC system calls into
open source equivalents (Sharifi). This gives systems that might be running multiple
OSes a common interface, allowing them to communicate with fewer issues. Finally, one
group has written algorithms that assume a common OS but help the programmer
distribute processes among the different cores based on the capabilities of each
individual processor (Martinez). Faster cores get the more complex threads, and
slower cores get the simpler, quicker threads.
Modified MPI Protocol
The group working on stability recognized that many distributed systems
were already using LANs and Wi-Fi to allow communication between the
various nodes (Massetto). They focused on the Message Passing Interface (MPI), which
allows for fairly quick and stable passing of messages among homogeneous computer
networks (Massetto). However, different OSes have different protocols, and the system
runs into issues as soon as a few different computers are added to it (Massetto).
So they set out to find a way to keep the quicker MPI messaging while maintaining
stability in a cluster made up of different operating systems (Massetto). Massetto’s group
realized that because the systems were already using LANs and Wi-Fi, they could use
those networks to bridge the gap (Massetto). They therefore designed a library that
could be added to existing MPI protocols (Massetto). Their
library used TCP/IP channels to translate messages between the different operating
systems (Massetto). In this way, efficient MPI communication occurred between the
homogeneous portions of the computer network, but when communication needed to occur
with sections using different operating systems, the OSes fell back on the TCP/IP protocol,
which all the systems understood (Massetto). This increased overhead compared with systems
that were able to use MPI for all messaging, but it allowed stable and continuous
communication between the various components of a heterogeneous computing cluster.
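The fallback rule described above can be sketched in a few lines. This is only an illustration, written in Python for brevity, and it is not Massetto's library: the Node class, the choose_transport and plan_routes functions, and the "mpi" and "tcp" labels are all hypothetical names invented for the sketch. It assumes the only thing that decides the transport is whether two nodes run the same OS.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A cluster node (hypothetical model: just a name and an OS label)."""
    name: str
    os: str  # e.g. "linux" or "windows"


def choose_transport(sender: Node, receiver: Node) -> str:
    """Use the native MPI path for same-OS peers, else fall back to TCP/IP."""
    if sender.os == receiver.os:
        return "mpi"
    return "tcp"


def plan_routes(nodes: list[Node]) -> dict[tuple[str, str], str]:
    """Map every ordered pair of distinct nodes to the transport it would use."""
    return {
        (a.name, b.name): choose_transport(a, b)
        for a in nodes
        for b in nodes
        if a.name != b.name
    }


if __name__ == "__main__":
    cluster = [Node("n0", "linux"), Node("n1", "linux"), Node("n2", "windows")]
    for pair, transport in plan_routes(cluster).items():
        print(pair, transport)
```

In this toy cluster, the two Linux nodes keep the faster MPI path between themselves, and only traffic to or from the Windows node pays the extra TCP/IP overhead.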
Java Based Protocol
Ramon Taboada and his group sought to tackle both the efficiency
and the communication of heterogeneous computer networks by taking advantage of
user-level programming via Java (Taboada). Taboada and his team noted that while Java has
“built-in networking and multithreading support, object orientation, automatic memory
management, platform independence, portability, security, an extensive API and a wide
community of developers,” it has often been avoided in high-end computing because of
its inefficient communication protocols (Taboada). However, those protocols do allow
Java applications on multiple systems to communicate with each other (Taboada). So his
team created a library that can be added to Java to improve that communication efficiency
(Taboada). Their initial effort focused on improving the scalability of Java’s MPJ, which
was notably slower than MPI and not optimized to drop unnecessary protocols depending
on the type of system (Taboada). They also sought to minimize the need for
communication among the various processors so as to limit how often
the less efficient Java communication protocols would be activated (Taboada). They
did this by “favoring multithreading-based solutions” over “inter-node communication,”
choosing an efficient algorithm to parcel out the processes based on their number and the
required message size, and adding an “automatic performance tuning process” designed
to create an “optimal configuration file” that “maximizes the collectives performance in a
given system” (Taboada). In other words, they scaled down the messaging system to
improve its overall efficiency, and then tried to make the system recognize how to assign
its processes most efficiently at the beginning of a run so that as few messages as
possible had to be sent (Taboada). After running programs with Java on various multicore
systems, they found significant improvements in the speed with which Java-based
software ran on these systems (Taboada). In fact, the new protocols were fairly close in
efficiency to the MPI protocols used by many homogeneous systems (Doalla). Doalla
concludes that because this messaging library improves Java messaging enough
that it is comparable to MPI, Java is coming closer to displacing native languages like
Fortran that don’t provide “built-in multithreading and networking support” as Java does.
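The idea of a tuning pass that records the best algorithm for each combination of process count and message size can be sketched as follows. This is a simplified illustration in Python, not Taboada's Java library: the tune and select functions, the algorithm labels, and the timing numbers are all made up for the example, and a real tuning process would measure the timings on the target system.

```python
from __future__ import annotations


def tune(timings: dict[tuple[str, int, int], float]) -> dict[tuple[int, int], str]:
    """Build a configuration table from measured timings.

    timings maps (algorithm, processes, message_bytes) to seconds.
    The result maps (processes, message_bytes) to the fastest algorithm.
    """
    best: dict[tuple[int, int], tuple[str, float]] = {}
    for (algorithm, processes, size), seconds in timings.items():
        key = (processes, size)
        if key not in best or seconds < best[key][1]:
            best[key] = (algorithm, seconds)
    return {key: algorithm for key, (algorithm, _) in best.items()}


def select(config: dict[tuple[int, int], str], processes: int, size: int,
           default: str = "flat_tree") -> str:
    """Look up the tuned algorithm, using a default for untuned cases."""
    return config.get((processes, size), default)


if __name__ == "__main__":
    # Hypothetical measurements, invented for this sketch.
    samples = {
        ("flat_tree", 8, 1024): 0.004,
        ("binomial_tree", 8, 1024): 0.002,
        ("flat_tree", 8, 1048576): 0.05,
        ("binomial_tree", 8, 1048576): 0.09,
    }
    config = tune(samples)
    print(select(config, 8, 1024))
    print(select(config, 8, 1048576))
```

The table plays the role of the configuration file described above: it is built once, and every later collective call only needs a cheap lookup.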
IPC Wrapper
The previous teams focused on applications that coordinated among the various OSes
at the user level, but the final communication team looked at making universal system
calls by using a wrapper. Sharifi’s team noted that more and more systems that required
large amounts of computing power were relying on distributed computing models, and
that these models made use of commercially available systems that were often run with
off-the-shelf operating systems (Kahsyan). This meant that systems were put together
with whatever computers happened to be available and could end up with some
computers running one operating system and others running a completely different
one (Sharifi).
However, all of these systems still have to be able to talk to each other and
distribute processes among themselves in order to run programs. Because the operating
systems do not share the same IPC protocols, this led to issues in creating efficient systems
(Sharifi). Sharifi noted that while user-based IPC is the easiest to implement in
heterogeneous systems, it is more difficult to program and not terribly efficient (Sharifi).
Kernel-based IPC, however, is far simpler to program and a great deal more efficient
(Sharifi). Unfortunately, because there is no one
agreed-upon protocol for IPC in commercial operating systems, it is very
hard to implement a program that can make IPC system calls across systems composed of
multiple operating systems (Sharifi). This is why many of the existing tools for
heterogeneous systems, like Condor, IPC layer, and LAM, tend to use user-level protocols
(Sharifi). They translate system calls between the application and the kernel for the
programmer, and even then can run into difficulties when two OSes don’t share a system
call (Sharifi). Their purpose is to ensure stable communication and function across the
distributed network, not to attain high performance (Sharifi). Other systems using
other approaches, such as sockets, run into portability issues that limit how easily
they can be applied to new and different clusters of computers (Sharifi).
In order to allow groups to take full advantage of existing computing resources
without having to homogenize them, Sharifi and his team developed a wrapper
specifically for systems composed of a mix of Windows- and Linux-based
computers (Sharifi). This wrapper translates Windows-based system calls to their
Linux-based equivalents (Sharifi). Using it, programmers can write software that uses
kernel-based IPC calls without having to navigate multiple system architectures and
determine which system gets which instructions (Sharifi).
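At its simplest, a translation layer of this kind is a lookup from one system's call names to another's. The sketch below is written in Python for illustration and is not Sharifi's wrapper: the table, the translate function, and the UnsupportedCall exception are hypothetical, and the pairings are only rough counterparts chosen for the example (some of the Linux names are library functions, not raw system calls).

```python
from __future__ import annotations

# Hypothetical pairing of Windows IPC calls with rough Linux counterparts.
# Chosen for illustration only; not taken from Sharifi's wrapper.
WINDOWS_TO_LINUX = {
    "CreatePipe": "pipe",
    "CreateNamedPipe": "mkfifo",
    "CreateFileMapping": "shm_open",
    "CreateSemaphore": "sem_open",
}


class UnsupportedCall(Exception):
    """Raised when a call has no known equivalent on the target OS."""


def translate(call: str, target_os: str) -> str:
    """Return the call name to issue on target_os for a Windows-style call."""
    if target_os == "windows":
        return call  # already native, nothing to translate
    if target_os == "linux":
        try:
            return WINDOWS_TO_LINUX[call]
        except KeyError:
            raise UnsupportedCall(f"no Linux equivalent known for {call}") from None
    raise ValueError(f"unknown target OS: {target_os}")


if __name__ == "__main__":
    print(translate("CreatePipe", "linux"))
    print(translate("CreatePipe", "windows"))
```

The UnsupportedCall case corresponds to the difficulty noted earlier, where two operating systems do not share a system call and the translator has nothing to map to.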
Basically, the system is designed so that one computer, a Linux machine, is
designated as the lead computer (Sharifi). The programmers develop the software to be
run on this Linux lead computer and parceled out to all of its distributed counterparts as
if they were all Linux as well (Sharifi). While programming for the Linux system, the
programmer sees a list of Windows-equivalent calls that the computer creates based
on its knowledge of both systems (Sharifi). In this way, the Linux program and its
Windows translation are created at the same time (Sharifi).
So even though the software was developed for Linux, when it is introduced to
a cluster that is actually composed of a mix of Linux- and Windows-based
computers, it is still able to run without issue (Sharifi). This is because the wrapper
allowed the program to be developed with the Windows translation created at
the same time (Sharifi). When switching between Windows- and Linux-based systems
within the cluster, the wrapper is able to translate the data into the form each computer
needs to process it at the kernel level (Sharifi). The wrapper performed on par
with an established ONC-RPC-based system, which is one of the faster available
heterogeneous system call translators, but both were significantly slower than shared
memory and pipes, the methods used on homogeneous systems (Sharifi).
ALBIC
ALBIC, unlike the previous three examples, does not focus on the communication
protocols between the various parts of the supercomputer. Instead, ALBIC focuses on
load balancing processes among the cores of heterogeneous systems for programs
designed for homogeneous parallel processing systems (Martinez). Basically, ALBIC
assumes that the network already has effective communication protocols and instead
attempts to deal with the loss of efficiency suffered by programs that were initially
designed for systems with homogeneous cores (Martinez). Because those cores are all the
same, the program is designed to send threads of similar size to each of the cores
(Martinez). However, in a heterogeneous system, each core has its own
capabilities (Martinez). To address this issue, Martinez and his team assumed all the
cores ran a Linux OS, modified that OS to take more frequent samples of the processor
stack length, and added a system call at the beginning and at the end of the section of
code to be balanced (Martinez). This lets the computer know which code sections to pay
attention to and allows it to dynamically assess the processing capability of
each processor by checking how much of its stack has been completed between
samplings (Martinez). Processors that are moving through their stack more slowly than other
processors are assigned fewer tasks, and the system becomes more efficient (Martinez).
The best part is that the modifications to the program require very little user input and
can quickly adapt a program written for a homogeneous system to a heterogeneous core
system without days spent studying the computer architecture and
reprogramming (Martinez). Their system is therefore advantageous not only because they were
able to show increased efficiency compared to other proposals, but also because it
achieves this without hours or days of coding time spent re-optimizing the
program for a new system (Martinez).
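The balancing idea, measuring how much each core completes between two samples and then handing slower cores fewer tasks, can be sketched as below. This is an illustration in Python and not the ALBIC implementation: the measure_rates and share_tasks functions are hypothetical, and the proportional split is an assumed policy chosen for simplicity.

```python
from __future__ import annotations


def measure_rates(completed_before: list[int], completed_after: list[int],
                  interval: float) -> list[float]:
    """Work items each core finished per second between two samples."""
    return [(after - before) / interval
            for before, after in zip(completed_before, completed_after)]


def share_tasks(total_tasks: int, rates: list[float]) -> list[int]:
    """Split total_tasks across cores in proportion to their measured rates."""
    total_rate = sum(rates)
    if total_rate <= 0:
        raise ValueError("at least one core must be making progress")
    shares = [int(total_tasks * rate / total_rate) for rate in rates]
    # Hand out the tasks lost to rounding down, fastest cores first.
    leftover = total_tasks - sum(shares)
    fastest_first = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical samples: core 0 finished 40 items in 2 s, the others 20 each.
    rates = measure_rates([0, 0, 0], [40, 20, 20], interval=2.0)
    print(share_tasks(100, rates))
```

With these made-up samples, the core that worked twice as fast receives half of the 100 tasks and the two slower cores receive a quarter each.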
Conclusion
Because computer systems have advanced so quickly, we are finally reaching the
point where it is becoming easy to create a powerful computing system out of readily
available equipment with very little specialized hardware or software. Unfortunately,
because these components aren’t all standardized, we have to address how to coordinate
components and software that weren’t necessarily meant to work together. How to ensure
a stable system while efficiently making use of all resources is still a work in progress.
