Hypercube Multiprocessors: Architecture & Algorithms

Jomo Kenyatta University of Agriculture and Technology
ICS 2410: Parallel Systems
Hypercube Multiprocessors
Group 3
Dr. Eunice Njeri
February 21, 2022
Table of Contents
ABSTRACT
INTRODUCTION
DEFINITIONS
HYPERCUBE INTERCONNECTION
THE WHY BEHIND HYPERCUBE MULTIPROCESSORS
OPERATING PRINCIPLES OF HYPERCUBE MULTIPROCESSORS
PROCESSOR ALLOCATION IN HYPERCUBE MULTIPROCESSORS
  MAJOR STEPS IN PROCESSOR ALLOCATION
  TYPES OF PROCESSOR ALLOCATION POLICIES
  APPROACHES TO PROCESSOR ALLOCATION
    Top-down Approach
    Bottom-up Approach
TASK MIGRATION IN HYPERCUBE MULTIPROCESSORS
  MAJOR STEPS IN TASK MIGRATION
FAULT-TOLERANT HYPERCUBE MULTIPROCESSORS
HYPERCUBE MULTIPROCESSORS: ROUTING ALGORITHMS
  SOURCE VS. DISTRIBUTED ROUTING
  DETERMINISTIC VS. ADAPTIVE ROUTING
  MINIMAL VS. NONMINIMAL ROUTING
  TOPOLOGY DEPENDENT VS. TOPOLOGY AGNOSTIC ROUTING
  ILLUSTRATION: ADAPTIVE ROUTING
PERFORMANCE MEASURES FOR HYPERCUBE MULTIPROCESSORS
  SPEEDUP
    Limits to Speedup
  EFFICIENCY
  SCALABILITY
CASE STUDY: CEDAR MULTIPROCESSOR
  CEDAR MACHINE ORGANIZATION
    Cedar Cluster
    Memory Hierarchy
    Importance of Cedar
CONCLUSION
REFERENCES
Abstract
Various parallel computing machines have been developed over the years to achieve the main
goal of faster processing times at the lowest cost. One such class of computers is the hypercube
multiprocessor, which comprises D = 2^d processing elements (PEs). Hypercubes were a popular
technology in the late 1980s, but they proved to be too expensive for the performance merits they
offered. Regardless, there is still a lot to learn from the technology to inform the field of parallel
computing. This paper focuses on the architecture, processor allocation methods, and task
migration algorithms for these machines. It is established that activities within a hypercube are
coordinated through messages among processors. Therefore, various message routing algorithms
were developed to ensure that this process was undertaken effectively. Additionally, this paper
looks at the Cedar Multiprocessor, an early-generation hypercube that provides a practical view
regarding how the technology worked. Overall, this research shows the advances that parallel
processing has made over the years, with each technology contributing to making the field better.
Keywords: hypercube multiprocessors, processing elements, processor allocation, parallel
Hypercube Multiprocessors
Hypercubes were a popular class of parallel machines that emerged in the late 1980s and
early 1990s (Matloff, 2011). A hypercube computer consisted of several ordinary Intel
processors – each processor had memory and some serial I/O hardware for connections to its
neighboring processors. However, hypercubes proved to be too expensive for the performance
merits they offered. Combined with their small market share, this cost led to hypercube
computers being slowly phased out. Nonetheless, they remain important for historical reasons, as old
techniques are often recycled in the computing field, and the algorithms developed for hypercube
computers have become popular on general-purpose machines.
This report therefore discusses the architecture, processor allocation methods, and task
migration algorithms for these machines.
Definitions
A hypercube of dimension d consists of D = 2^d processing elements (PEs) and is called a
d-cube. Processing elements describe processor-memory pairs that enjoy fast serial I/O
The PEs in a d-cube are usually numbered 0 through D-1 in binary. For the 4-cube shown in
Figure 1 below, where D = 2^4 = 16, the PEs are numbered 0000 through 1111, and the PE
numbered 1011 has four neighbors: 0011, 1111, 1001, and 1010. Flipping a given bit defines
neighbor pairs across a specific dimension (Ostrouchov, 1987).
Figure 1: Hypercubes of dimensions 0 through 4 (Ostrouchov, 1987).
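The bit-flipping rule for neighbors can be illustrated with a minimal Python sketch. The function name `neighbors` and the most-significant-bit-first ordering are assumptions made for illustration and do not come from the cited sources.

```python
def neighbors(pe: int, d: int) -> list:
    """Return the d neighbors of PE `pe` in a d-cube as d-bit binary labels."""
    # XOR with (1 << i) flips bit i, which yields the neighbor across dimension i.
    # Dimensions are visited from the most significant bit down (an assumption).
    return [format(pe ^ (1 << i), f"0{d}b") for i in reversed(range(d))]


# The PE numbered 1011 in the 4-cube of the text.
print(neighbors(0b1011, 4))  # ['0011', '1111', '1001', '1010']
```

Each PE therefore has exactly d neighbors, one per dimension.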
At times, it is easier to build up cubes from lower-dimensional cases using a simple two-step procedure:
1. Take a d-dimensional cube and duplicate it – the two duplicates can be referred to
as subcubes 0 and 1. Therefore, a subcube is a sub-graph of a hypercube that
preserves the hypercube's properties (Rai et al., 1995).
2. For each pair of same-numbered PEs, add 0 to the front of the label for the PE in
subcube 0 and 1 for the PE in subcube 1 (Matloff, 2011). Create a link between
the two same-numbered PEs.
Figure 2 below shows how a 4-cube may be constructed from two 3-cubes:
Figure 2: How a 4-cube can be constructed from two 3-cubes (Matloff, 2011).
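The two-step construction can be sketched in Python as follows. The function name `build_cube` and the use of label strings for PEs are illustrative assumptions; the sketch starts from a 0-cube (a single PE with an empty label) and applies the duplicate-and-link step d times.

```python
def build_cube(d: int):
    """Build a d-cube by repeatedly duplicating a lower-dimensional cube.

    Returns (labels, edges), where labels are d-bit strings and edges are
    pairs of labels joined by a link.
    """
    labels, edges = [""], set()  # 0-cube: one PE, no links
    for _ in range(d):
        # Step 1: duplicate the cube into subcube 0 and subcube 1.
        sub0 = ["0" + v for v in labels]
        sub1 = ["1" + v for v in labels]
        new_edges = {("0" + a, "0" + b) for a, b in edges}
        new_edges |= {("1" + a, "1" + b) for a, b in edges}
        # Step 2: link each pair of same-numbered PEs across the two subcubes.
        new_edges |= set(zip(sub0, sub1))
        labels, edges = sub0 + sub1, new_edges
    return labels, edges


labels, edges = build_cube(4)
print(len(labels), len(edges))  # 16 32
```

A d-cube built this way has 2^d PEs and d * 2^(d-1) links, since every PE has d links and each link is shared by two PEs.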
Hypercube Interconnection
A hypercube, or binary d-cube, multiprocessor is a loosely coupled system with
D = 2^d interconnected processors. Each processor forms a PE (a vertex) of the cube, and the direct
communication paths between neighboring processors form the cube's edges.
Additionally, a distinct d-bit binary address is assigned to each processor so
that every processor's address differs from that of each of its d neighbors in exactly one bit position. Hence, a
Boolean d-cube consists of 2^d vertices, with an edge between two vertices if and only if their
binary labels differ in exactly one bit.
Each link between two PEs acts as a dedicated connection, meaning that if a PE needs to
communicate with a non-neighbor PE, multiple links (as many as d of them) have to be
traversed. This may lead to high communication costs, since routing a
message through a d-cube may require anywhere from one to d links from the source
to the destination PE (Matloff, 2011).
Also, a hypercube employs the multiple-instruction multiple-data (MIMD) architecture
by allowing direct point-to-point communication between PEs (Das et al., 1990). Since each PE
has its local memory, shared global memory is unnecessary, implying that messages are passed
over the hypercube network.
The Why behind Hypercube Multiprocessors
A multiprocessor's speedup is one of the most objective methods of measuring a parallel
processing system's progress. Therefore, the desire for cost-effective computing became the
ultimate driving force for the hypercube industry. Advancing microelectronic technology by
developing the hypercube was considered progress towards the universal objective of boosting
machine architecture and performance. The hypercube computer was more impressive because it
separated brute force from skillful design (Chen et al., 1988). The hypercube was meant to
address the von Neumann bottleneck without disrupting the existing computer industry. The
existing multiprocessor systems were plagued by generality, implying that a machine model that
attained better performance proportional to a multiprocessor's implementation costs was
Additional features that make hypercube multiprocessors valuable and popular as
general-purpose parallel machines include:
The hypercube topology is isotropic, meaning a d-cube appears the same from each PE,
leading to edge and node symmetry (Das et al., 1990). There are no edges, borders, or
apexes where a specific PE needs to be considered differently, and no particular resource
may cause a bottleneck.
The geometry leading to a logarithmic diameter also provides a beneficial tradeoff
between the high connectivity costs of a completely-connected scheme and significant
diameter issues associated with a ring geometry.
A hypercube also supports multiple interconnection topologies, including meshes, rings,
and trees, that may be embedded within any d-cube (see Figure 3 below). In keeping with
its topology flexibility, a hypercube supports multiple popular algorithms integrated onto
a d-cube to direct communication between neighboring PEs (Das et al., 1990).
Figure 3: 3-d hypercube embeddings of some interconnection schemes (Ostrouchov, 1987).
Lastly, routing messages between non-adjacent PEs is straightforward after dividing the
hypercube into smaller cube arrays to ensure multi-programming and simpler fault-tolerant designs, as discussed in later sections (Das et al., 1990).
For these four reasons, the hypercube is suitable for a wide range of applications and
often provides an excellent test subject for parallel algorithms.
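The diameter tradeoff named in the list above can be made concrete with a small Python sketch that compares a ring, a hypercube, and a completely connected network on the same number of PEs. The function name `topology_stats` is hypothetical, and the ring and completely connected figures are standard graph-theory values rather than numbers taken from the cited sources.

```python
def topology_stats(d: int) -> dict:
    """Return {topology: (degree, diameter)} for N = 2**d PEs (assumes d >= 2)."""
    n = 2 ** d
    return {
        "ring": (2, n // 2),                 # low degree, large diameter
        "hypercube": (d, d),                 # both grow as log2(N)
        "completely connected": (n - 1, 1),  # diameter 1, but N - 1 links per PE
    }


for name, (degree, diameter) in topology_stats(4).items():
    print(f"{name:22s} degree={degree:2d} diameter={diameter:2d}")
```

For 16 PEs the hypercube needs 4 links per PE and at most 4 hops, whereas the ring needs up to 8 hops and the completely connected scheme needs 15 links per PE.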
Operating Principles of Hypercube Multiprocessors
The processor activities in a hypercube parallel computer are coordinated through
messages among processors. Consequently, if a message needs to be sent between two non-neighbor PEs, it is routed through the necessary intermediate PEs. The routing process
is simple:
The message is sent to a neighbor whose binary label is one bit closer to the destination
PE (Ostrouchov, 1987). Thus, the path length of such a message is the number of bit
positions in which the binary labels of the two PEs differ.
Typically, a communication co-processor exists on each PE to free the central processor
from too much communication overhead, meaning that messages only pass through the
co-processor on each PE.
Although all PEs are identical, a separate host processor exists to manage the hypercube
PEs. It has communication links with all the PEs and executes its management role
loosely by sending messages over these links.
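The routing rule described in the list above can be sketched in Python. The function name `route` is an assumption, and so is the choice to correct differing bits from the lowest dimension upward; the text only requires that each hop moves one bit closer to the destination.

```python
def route(src: int, dst: int, d: int) -> list:
    """Return the PEs visited from src to dst, as d-bit labels.

    At each hop the message moves to a neighbor whose label is one bit
    closer to the destination. The lowest-dimension-first order is an
    assumption of this sketch.
    """
    path, cur = [src], src
    for i in range(d):
        if ((cur ^ dst) >> i) & 1:  # labels still differ in bit i
            cur ^= 1 << i           # cross dimension i
            path.append(cur)
    return [format(p, f"0{d}b") for p in path]


print(route(0b0000, 0b1011, 4))  # ['0000', '0001', '0011', '1011']
```

The number of hops equals the number of bit positions in which the two labels differ, so no route produced this way is longer than d links.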
The two operating principles mentioned above guarantee that a hypercube can attain a
satisfactory balance between the number of channels per PE (degree) and the diameter
(maximum path length between any two PEs) (Ostrouchov, 1987).
Ideally, an interconnection scheme should have a small diameter, to ensure fast
communication, and PEs of small degree, so that the machine is easily scalable. It is also
important to note that if only a single path exists between any pair of PEs, then a communication
bottleneck may occur at the cube's root.
Processor Allocation in Hypercube Multiprocessors
A hypercube can run multiple separate jobs simultaneously on different
subcubes/subgraphs of its processors. As a result, subcube allocation and deallocation techniques
are critical as they help maximize processor utilization by minimizing task durations.
An efficient communication scheme in a hypercube involves the PEs performing more
computation than communication tasks. Nevertheless, most hypercube architectures comprise
PEs primarily involved in communicating messages between neighbors (Ahuja & Sarje, 1995).
Therefore, the processor allocation issue in a hypercube is identifying and locating a free
subcube that can accommodate a specific request of a particular size while minimizing system
fragmentation and maximizing hypercube utilization. The issue also extends to subcube
deallocation and the reintegration of these released processors into the set of available hypercube
Fragmentation may be internal or external. Internal fragmentation occurs when the
allocation scheme fails to recognize available subcubes. In contrast, external fragmentation
happens when a sufficient number of available processors cannot form a subcube large enough to
accommodate the incoming task request as they are scattered. External fragmentation is depicted
in Figure 4 below, where four available nodes cannot form a 2-dimensional cube, meaning if a
task requiring such a cube arrives, it is either rejected or queued:
Figure 4: An example of hypercube fragmentation (Chen & Shin, 1990).
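External fragmentation can be checked mechanically with an exhaustive search over all subcubes of a given size. The Python sketch below is illustrative only: the names `subcube_nodes` and `find_free_subcube` are hypothetical, the brute-force search is not one of the allocation policies discussed in this paper, and the example free set is made up rather than taken from Figure 4.

```python
from itertools import combinations


def subcube_nodes(base: int, dims: tuple) -> list:
    """List the nodes of the subcube that varies over `dims` and contains `base`."""
    nodes = []
    for offset in range(2 ** len(dims)):
        node = base
        for j, dim in enumerate(dims):
            if (offset >> j) & 1:
                node |= 1 << dim
        nodes.append(node)
    return nodes


def find_free_subcube(d: int, k: int, free: set):
    """Return the nodes of some fully free k-subcube of a d-cube, or None."""
    for dims in combinations(range(d), k):  # the k dimensions that vary
        varying = sum(1 << i for i in dims)
        for base in range(2 ** d):
            if base & varying:
                continue  # visit each subcube once, via the node with varying bits 0
            nodes = subcube_nodes(base, dims)
            if all(n in free for n in nodes):
                return nodes
    return None


# Hypothetical 3-cube with four free PEs that are scattered.
scattered = {0b000, 0b011, 0b101, 0b110}
print(find_free_subcube(3, 2, scattered))               # None
print(find_free_subcube(3, 2, {0b000, 0b001, 0b010, 0b011}))  # [0, 1, 2, 3]
```

In the first case four PEs are free, but every 2-dimensional face of the cube contains only two of them, so a request for a 2-cube cannot be served.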
Major Steps in Processor Allocation
1. Determining subcube size to accommodate the incoming task.
In this stage, each incoming request is represented using a graph in which a PE denotes a
task module, and each link represents the inter-module communication (Ahuja & Sarje,
2. Locating the subcube of the size determined in step (1) within the hypercube
The second step establishes whether a subcube can accommodate the request given
specific constraints.
Types of Processor Allocation Policies
a) Processor allocation may occur online or offline:
In offline allocation, the operating system collects a batch of requests before allocating
subcubes (Rai et al., 1995).
On the other hand, in online allocation, each subcube request is served or rejected
as soon as it arrives, regardless of any subsequent requests (Rai et al., 1995).
This strategy requires the largest-sized subcubes to be maintained after every allocation
and relinquished after processing.
b) Processor allocation may also be static or dynamic:
A policy is static if the only incoming requests are for assignment, meaning deallocation
is not considered at any time.
But a dynamic technique handles both allocation and relinquishment depending on job
arrival and completion, leading to better resource utilization. However, finding a perfect
dynamic policy is difficult due to the increased operational overhead.
Approaches to Processor Allocation
Top-down Approach
The top-down approach maintains a set or list to keep track of the free subcubes.
A standard top-down method is the free list strategy, which maintains a list for each
hypercube dimension and achieves perfect subcube recognition (Yoon et al., 1991). A free list
records all disjoint free subcubes, meaning that a requested subcube can only be allocated
from the list. Although the allocation procedure is simple, the deallocation step is complicated: a
deallocated cube has to be compared with all idle, adjacent cubes to form a higher-dimension
subcube, leading to a very high worst-case time complexity.
Secondly, Yoon et al. (1991) proposed another top-down technique, the Heuristic Processor
Allocation (HPA) strategy, which uses an undirected graph. The graph's vertices represent the
available system subcubes and are used to allocate and deallocate subcubes. A heuristic algorithm
is also used to ensure the available subcube is as large as possible during deallocation. The HPA
technique minimizes external fragmentation and reduces search time by modifying only the related
nodes in the graph when addressing an allocation or deallocation request.
Bottom-up Approach
The bottom-up approach uses 2^d allocation bits to keep track of the availability of all hypercube processors, meaning
the allocation policy examines the allocation bits when forming subcubes (Yoon et al., 1991). A
binary tree structure records the allocated and free nodes, which requires extensive storage and
extended execution time to collapse the binary tree representation. The collapsed tree
successively forms a subcube by bringing together distant PEs starting from the lowest
Since subcube computation and information are distributed among idle nodes, the host's
burden is reduced, increasing the subcube recognition capacity.
Disadvantages of the bottom-up approach:
It is more challenging to eradicate external fragmentation as the allocation policy only
relies on allocation bit data.
Also, it is tedious to search for free subcubes under heavy system load.
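A minimal Python sketch of a bit-based allocator shows how allocation bits are scanned and why external fragmentation is hard to avoid. It is written in the style of the buddy strategy from the subcube-allocation literature, which is an assumption of this sketch and is not described in the text above; the names `buddy_allocate` and `release` are hypothetical.

```python
def buddy_allocate(bits: list, k: int):
    """Allocate a k-cube using one allocation bit per PE (1 = busy).

    Only aligned blocks of 2**k consecutive addresses are examined; such a
    block shares its high-order address bits and therefore forms a k-subcube.
    Returns the first PE address of the block, or None if no block is free.
    """
    size = 2 ** k
    for start in range(0, len(bits), size):
        if not any(bits[start:start + size]):
            bits[start:start + size] = [1] * size
            return start
    return None


def release(bits: list, start: int, k: int) -> None:
    """Mark the k-cube that begins at address `start` as free again."""
    bits[start:start + 2 ** k] = [0] * (2 ** k)


bits = [0] * 8                                   # 3-cube, all PEs free
print([buddy_allocate(bits, 1) for _ in range(4)])  # [0, 2, 4, 6]
release(bits, 2, 1)
release(bits, 4, 1)
print(buddy_allocate(bits, 2))                   # None
```

After the two releases, PEs 2, 3, 4, and 5 are free, yet they do not form a 2-cube, so the request fails even though enough processors are idle.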
The three central elements that impact the superiority of a top-down or bottom-up
allocation scheme are subcube recognition capability, time complexity associated with subcube
allocation, and optimality (whether the policy can allocate a d-cube whenever there are at least 2^d free PEs)
(Yoon et al., 1991). Optimality can either be static or dynamic:
A statically optimal scheme can allocate a d-cube whenever at least 2^d PEs are free, given that
none of the allocated subcubes are released.
In contrast, a dynamically optimal policy can accommodate processor allocation and
relinquishment at any time.
Task Migration in Hypercube Multiprocessors
As mentioned above, subcube allocation and deallocation may result in a fragmented
hypercube even when a sufficient number of PEs is available, since the available PEs may not
form a subcube large enough to accommodate an incoming task request. Such fragmentation
results in poor utilization of a hypercube's PEs, implying that task migration is necessary to
address this problem.
Task migration involves relocating active tasks and compacting them toward one end of the
hypercube to make room for larger subcubes at the other end. It is essential to note that considerable
dependence exists between task migration and the subcube allocation policy (Chen & Shin,
1989). The relationship is necessary as tasks can only be relocated so that the allocation
technique being used can detect subcube availability.
Since a collection of occupied subcubes is called a configuration, it is essential to
establish the goal configuration to which a given fragmented hypercube must change after active
task relocation. The portion of a functional task located at each PE after the migration is called a
task module, implying that a moving step describes a PE's action to move its task module to
neighboring processors. Thus, the cost of task migration is measured using the moving steps
required since task migrations between different source and destination cubes can occur in
However, for parallel task migration to occur, it is critical to avoid deadlocks. This adds an
extra stage to task migration: finding a routing procedure that uses the shortest
deadlock-free route while formulating the PE-mapping between each pair of source and
destination subcubes.
Major Steps in Task Migration
Determining a goal configuration.
The allocation scheme used must be statically optimal before a goal configuration is
Determining the node mapping between source and destination subcubes.
Determining the shortest deadlock-free routing for moving task requests (Chen & Shin,
Complete parallelism and minimal deadlock instances are possible when using
stepwise disjoint paths for task migration.
Additional considerations during task migration:
An active task must be suspended while it is moved to another subcube location (Chen &
Shin, 1990). The task then resumes execution upon reaching its destination.
Task migration induces operational overhead, which degrades system performance. As
such, determining an optimal threshold is necessary – such a value is arrived at by
compromising between task admissibility and resulting operational overhead.
The host processor tracks every PE's status, meaning it can easily design an optimal
threshold to decide when to execute task migration.
Fault-Tolerant Hypercube Multiprocessors
Hypercubes generally have high resilience since their parallel algorithms can be
partitioned into multiple tasks run on separate PEs' processors to attain high performance. Such
computers can run numerous tasks concurrently and individually on each processor within them.
For instance, a d-cube's extensive connectivity prevents non-faulty processors from being
disconnected, meaning that if the number of faulty PEs is less than d, the surviving network's
diameter can be bounded reasonably (Latifi, 1990). Nonetheless, the hypercube's universality
may be compromised if the topology changes after a fundamental algorithm is reorganized in an
inefficient and non-uniform way.
As such, a single PE's failure may destroy one dimension, meaning the loss of as few as
two PEs could compromise more than one dimension, which limits the subcube's reliability.
Without making any adjustments, the best choice when a link fails in a d-cube is to extract the
operational (d-1)-cube from the damaged section (Latifi, 1990). However, this leaves more than
half the nodes and connections within the network underused.
Thus, alternative fault-tolerance techniques are necessary to utilize the hypercube's
inherent redundancy and symmetry to reconfigure the architecture:
1. Use of hypercube free dimension
A hypercube is partitioned into several subcubes such that each partition contains at most
one faulty PE using the free dimension concept. A dimension in any hypercube is considered free
if no pair of faulty PEs exists across that dimension (Abd-El-Barr & Gebali, 2014). For example,
Figure 5 demonstrates this idea using four different hypercube structures with differing free
Figure 5: Illustration of the free-dimension concept. In (a), all three dimensions are free. In (b),
only dimensions 2 and 3 are free. In (c), only dimensions 1 and 3 are free. In (d), only
dimensions 1 and 2 are free (Abd-El-Barr & Gebali, 2014).
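The free-dimension test follows directly from the definition and can be sketched in Python. The function name `free_dimensions`, the convention that dimension 1 corresponds to the least significant address bit, and the example fault set are all assumptions of this sketch; they are not taken from Figure 5.

```python
def free_dimensions(d: int, faulty: set) -> list:
    """Return the free dimensions (numbered 1..d) of a d-cube.

    A dimension is free if no two faulty PEs are neighbors across it.
    Dimension 1 is taken to be the least significant bit (an assumption).
    """
    free = []
    for i in range(d):
        # f ^ (1 << i) is the neighbor of f across dimension i + 1.
        if not any((f ^ (1 << i)) in faulty for f in faulty):
            free.append(i + 1)
    return free


print(free_dimensions(3, set()))            # [1, 2, 3]
print(free_dimensions(3, {0b000, 0b001}))   # [2, 3]
```

In the second call the two faulty PEs face each other across dimension 1, so splitting the cube along dimension 1 would not separate them, whereas splitting along dimension 2 or 3 would leave them in the same subcube as well; a partition with at most one fault per subcube therefore has to use a dimension in which the faulty labels differ.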
2. Use of spanning trees (STs)
STs are used to configure a specific hypercube to bypass faulty PEs. The initial step is to
construct an ST for a given d-cube (Figure 6 depicts a 4-cube and its corresponding ST in Figure
Figure 6: A 4-cube multiprocessor with one faulty node [1010] (Abd-El-Barr & Gebali, 2014).
Figure 7: A spanning tree of a 4-cube (Abd-El-Barr & Gebali, 2014).
The ST construction is followed by reconfiguring the hypercube's architecture. First, the
faulty PE is removed from the ST, and the children of this removed PE are reconnected (Abd-El-Barr & Gebali, 2014). Figure 8 shows the removal of PE 1010 from the ST and the
reconnection of its children. The reconfiguration defines new parents and children to circumvent
the faulty PE, 1010. For example, in Figure 8, the new children of PE 1001 become 1000 and
1011, while the parent of 1011 becomes 1001.
Figure 8: How a spanning tree is reconfigured under a faulty PE (Abd-El-Barr & Gebali, 2014).
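The idea of a spanning tree that bypasses a faulty PE can be illustrated with a short Python sketch. Note that this is a substitute technique: instead of removing the faulty PE from an existing ST and reconnecting its children, as described above, the sketch simply rebuilds a breadth-first spanning tree over the non-faulty PEs. The function name `spanning_tree` and the choice of PE 0000 as root are assumptions, and the resulting parent links need not match those in Figure 8.

```python
from collections import deque


def spanning_tree(d: int, root: int = 0, faulty=frozenset()) -> dict:
    """Return {pe: parent} for a breadth-first spanning tree of the non-faulty PEs.

    Assumes the root itself is not faulty. The root's parent is None.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for i in range(d):
            v = u ^ (1 << i)  # neighbor of u across dimension i
            if v not in parent and v not in faulty:
                parent[v] = u
                queue.append(v)
    return parent


tree = spanning_tree(4, root=0b0000, faulty={0b1010})
print(len(tree))       # 15 non-faulty PEs are all reached
print(0b1010 in tree)  # False
```

Every parent-child pair in the result differs in exactly one bit, so each tree edge is a physical link of the hypercube.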
Hypercube Multiprocessors: Routing Algorithms
Message routing in large interconnection networks has attracted much attention, with
various approaches being proposed. Some of the fundamental distinctions among routing
algorithms involve the length of the messages injected into the network, the static or dynamic
nature of the injection model, particular assumptions on the semantics of the messages, the
architecture of the network and router, and the degree of synchronization in the hardware (Pifarre
et al., 1994). In terms of message length, several issues have been studied concerning the ways to
handle long messages (of potentially unknown size) and short messages (typically of 150 to
300 bits). In packet-switching routing, the messages are of constant (and small) size, and they are
stored entirely in every node they visit. In wormhole routing, messages of unknown size are
routed in the network. These messages are never wholly stored in a node. Only pieces of the
messages, called flits, are buffered when routing.
According to Holsmark (2009), two subjects of long-standing interest in routing are
deadlock and live-lock freedom. In a deadlock, packets are involved in a circular wait that cannot
be resolved, whereas, in live-lock, packets wander in the network forever without reaching the
destination. Another condition that networks strive to avoid is starvation, whereby packets
never get service in a node. Several strategies can be applied to handle deadlocks. Deadlock
avoidance techniques, for example, ensure that deadlock can never occur. On the other hand,
deadlock recovery schemes allow the formation of deadlocks but resolve them after the
Live-lock may occur in networks where packets follow paths that do not always lead
them closer to the destination. A standard solution to live-lock is to prioritize traffic based on
hop-counters. For each node a packet traverses, a hop counter is incremented. If several packets
request a channel, the one with the largest hop-counter value is granted access. This
way, packets that have long circled the network will receive higher priority and eventually reach
the destination. An example of starvation is when packets with higher priorities constantly outrank lower priority packets in a router. As a result, the lower prioritized packets are stopped from
advancing in the network. Starvation is a critical aspect when designing the router arbitration
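The hop-counter rule for resolving live-lock can be sketched in a few lines of Python. The `Packet` class, its field names, and the `arbitrate` function are hypothetical; the sketch only shows the priority rule and does not model a real router.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ident: str
    hops: int = 0  # incremented at every node the packet traverses


def arbitrate(requests: list) -> Packet:
    """Grant the channel to the requesting packet with the largest hop counter."""
    # On ties, max() returns the first such packet in the request list.
    return max(requests, key=lambda p: p.hops)


waiting = [Packet("a", hops=2), Packet("b", hops=9), Packet("c", hops=5)]
print(arbitrate(waiting).ident)  # b
```

Because a packet's counter only grows while it circulates, a packet that keeps losing arbitration eventually holds the largest value and is granted the channel.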
Various routing algorithms can be used in hypercube multiprocessors to address the above issues.
Routing is the mechanism that determines message routes, that is, which links and nodes each message
will visit from a source node to a destination node (Holsmark, 2009). A routing algorithm is a
description of the method that determines the route. Standard classifications of routing
algorithms are:
Source vs. distributed routing
Deterministic vs. adaptive (Static vs. Dynamic) routing
Minimal vs. nonminimal routing
Topology dependent vs. topology agnostic routing.
Source vs. Distributed Routing
This classification is based on where the routing decisions are made. In source routing,
the source node decides the entire path for a packet and appends it as a field in the packet. After
leaving the source, routers switch the packet according to the path information. As routers are
passed, route information that is no longer necessary may be stripped off to save bandwidth.
Source routing allows for a straightforward implementation of switching nodes in the network.
However, the scheme does not scale well since header size depends on the distance between
source and destination. Allowing more than a single path is also inconvenient using source
In distributed routing, routes are formed by decisions at each router. Based on the packet's
destination address, each router decides whether the packet should be delivered to the local
resource or forwarded to a neighboring router. Distributed routing requires that more information is processed in
network routers. On the other hand, the header size is smaller and less dependent on network
size. It also allows for a more efficient way of adapting the route, depending on network and
traffic conditions, after a packet has left the source node.
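The difference between the two schemes can be sketched in Python for a hypercube. In the source-routed version the header carries the full list of dimensions to cross and shrinks at each hop; in the distributed version the packet carries only the destination address and each router picks the next hop. All function names are hypothetical, and the lowest-dimension-first order is an assumption of the sketch.

```python
def source_route_header(src: int, dst: int, d: int) -> list:
    """Source routing: the source lists every dimension the packet must cross."""
    return [i for i in range(d) if ((src ^ dst) >> i) & 1]


def forward_source_routed(node: int, header: list):
    """A router follows the first header entry and strips it from the header."""
    dim, rest = header[0], header[1:]
    return node ^ (1 << dim), rest


def forward_distributed(node: int, dst: int, d: int) -> int:
    """Distributed routing: the router decides from the destination address alone."""
    for i in range(d):
        if ((node ^ dst) >> i) & 1:
            return node ^ (1 << i)
    return node  # node == dst: deliver to the local resource


def walk_source_routed(src: int, dst: int, d: int) -> list:
    node, header, path = src, source_route_header(src, dst, d), [src]
    while header:
        node, header = forward_source_routed(node, header)
        path.append(node)
    return path


def walk_distributed(src: int, dst: int, d: int) -> list:
    node, path = src, [src]
    while node != dst:
        node = forward_distributed(node, dst, d)
        path.append(node)
    return path


print(source_route_header(0b0000, 0b1011, 4))  # [0, 1, 3]
print(walk_distributed(0b0000, 0b1011, 4))     # [0, 1, 3, 11]
```

The header of the source-routed packet grows with the distance to the destination, while the distributed packet always carries a single d-bit address.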
Deterministic vs. Adaptive Routing
Another popular classification divides routing algorithms into deterministic (oblivious,
static) or adaptive (dynamic) types. Deterministic routing algorithms provide only a single fixed
path between a source node and a destination node. This scheme allows for the simple
implementation of network routers.
Adaptive routing allows several paths between a source and a destination. The final path
selection is determined at run-time, often depending on network traffic status.
Minimal vs. Nonminimal Routing
Route lengths determine if a routing algorithm is minimal or nonminimal. A minimal
algorithm only permits paths that are the shortest possible, also known as profitable routes. A
nonminimal algorithm can temporarily allow paths that, in this sense, are non-profitable. Even
though nonminimal routes result in a longer distance, the time for a packet transmission can be
reduced if the longer route avoids congested areas. Nonminimal paths may also be required for
fault tolerance.
Topology Dependent vs. Topology Agnostic Routing
Several routing algorithms are developed for specific topologies. Some are only usable
on regular topologies like meshes, whereas others are explicitly created for irregular topologies.
There is also a class of fault-tolerant routing algorithms, designed to keep working when
faults turn a regular topology into an irregular one.
Illustration: Adaptive Routing
A desirable feature of routing algorithms is adaptivity, that is, the ability of messages to
use multiple paths toward their destinations. In this way, alternative paths can be followed based
on factors that are local to each node, such as conflicts arising from messages competing for the
same resources, faulty nodes, or faulty links. For example, consider the 16-node hypercube shown in
Figure 1. A message starting from node (0, 0, 0, 0) with (1, 0, 1, 1) as its destination can move
to any of the following nodes as its first step: (1, 0, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1). An adaptive
routing function will allow more than one choice among the three possible nodes. The decision
about which node is selected is based on an arbitration mechanism that optimizes available
resources. If one of the involved links out of (0, 0, 0, 0) is congested or faulty, a message can use
another link to reach its destination. Thus, adaptivity may yield more throughput by distributing
the traffic over the network's resources better.
Figure 9: The paths available in the 4-hypercube between node (0, 0, 0, 0) and node (1, 0, 1, 1)
(Pifarre et al., 1994).
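A minimal adaptive routing step for this example can be sketched in Python. The function names are hypothetical, and the `blocked` set of congested or faulty links is a stand-in for the arbitration mechanism, which the text does not specify; the sketch considers only minimal (profitable) next hops.

```python
def profitable_neighbors(cur: int, dst: int, d: int) -> list:
    """Neighbors of `cur` that are one bit closer to `dst`."""
    return [cur ^ (1 << i) for i in reversed(range(d)) if ((cur ^ dst) >> i) & 1]


def adaptive_next_hop(cur: int, dst: int, d: int, blocked: set):
    """Pick the first profitable neighbor whose link is not congested or faulty.

    `blocked` holds (from, to) pairs; returns None if every profitable link
    is blocked (a nonminimal router could then try a detour).
    """
    for nxt in profitable_neighbors(cur, dst, d):
        if (cur, nxt) not in blocked:
            return nxt
    return None


src, dst = 0b0000, 0b1011
print([format(n, "04b") for n in profitable_neighbors(src, dst, 4)])
# ['1000', '0010', '0001']
print(format(adaptive_next_hop(src, dst, 4, blocked={(0b0000, 0b1000)}), "04b"))
# 0010
```

With the link from 0000 to 1000 blocked, the message leaves through 0010 instead and still reaches 1011 in three hops.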
Performance Measures for Hypercube Multiprocessors
In hypercube multiprocessors, as in all parallel systems, time and memory are the
dominant performance metrics. Between alternative methods that use different amounts of
memory, the faster one is preferred, provided there is enough memory to run either. There is
no advantage to using less memory than might be made available to the application unless using less
memory results in a reduction in execution time. Performance measurement in the parallel domain
is made more complex by the desire to know how much faster an application runs
on a parallel computer: what benefit arises from the use of parallelism, or how large is the
speedup that results from it? According to Sahni and Thanvantri (2002), there are varied
definitions of speedup, one of which is actual analytical speedup. This metric can be computed
using the number of operations or workload. For example, the effective speed of the computer
may vary with the workload as larger workloads may require more memory and may eventually
require the use of slower secondary memory.
Table 1: Connected components average actual speedup on an nCube hypercube.
Limits to Speedup
Since each problem instance may be assumed to be solvable by a finite amount of work,
it follows that as the number of processors increases indefinitely, a point is reached at which there
is no work left to distribute to the newly added processors, and no further speedup is
possible (Sahni & Thanvantri, 2002). Therefore, to attain ever-increasing amounts of speedup,
one must solve ever-larger problem instances. Perhaps the oldest and most quoted
observation about limits to attainable speedup is Amdahl's law, which observes that if a problem
contains both serial (s) and parallel (p) components, then the observed speedup will be
(s + p) / (s + p/P), where P is the number of processors. As P grows, this approaches (s + p) / s,
so the serial component bounds the attainable speedup.
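Amdahl's law is easy to evaluate numerically. The following is a minimal sketch that takes P to be the number of processors, consistent with the definition of efficiency in this section; the function name `amdahl_speedup` and the sample values s = 0.1 and p = 0.9 are assumptions chosen for illustration, not measurements.

```python
def amdahl_speedup(s: float, p: float, P: int) -> float:
    """Speedup predicted by Amdahl's law: (s + p) / (s + p / P).

    s -- time (or fraction of work) that must run serially
    p -- time (or fraction of work) that can run in parallel
    P -- number of processors (must be at least 1)
    """
    if P < 1:
        raise ValueError("P must be at least 1")
    return (s + p) / (s + p / P)


if __name__ == "__main__":
    s, p = 0.1, 0.9  # hypothetical workload: 10% serial, 90% parallel
    for P in (1, 2, 16, 1024):
        print(P, round(amdahl_speedup(s, p, P), 3))
    # The upper bound as P grows without limit is (s + p) / s.
    print("limit:", (s + p) / s)
```

With these sample values the predicted speedup on 16 processors is 6.4, and no number of processors can push it past 10, which is the bound set by the serial component.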
Efficiency is a performance metric closely related to speedup. It is the ratio of speedup
to the number of processors P. Depending on the variety of speedup used, one gets a
different sort of efficiency. Since speedup can exceed P, efficiency can exceed 1 (Sahni &
Thanvantri, 2002). However, like speedup, efficiency should not be used as a performance
metric independent of run time. The reason is that speedup favors slow processors, and therefore
so does efficiency. Additionally, the easiest way to get a relative efficiency of 1 is to use P = 1,
which results in no speedup.
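The relationship between run time, speedup, and efficiency can be sketched as follows. This is an illustrative sketch only: it uses the common definition of speedup as serial run time divided by parallel run time, which is one of the several definitions the text mentions, and the run times in the example are hypothetical values rather than measurements.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup as the ratio of serial run time to parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, P: int) -> float:
    """Efficiency as speedup divided by the number of processors P."""
    return speedup(t_serial, t_parallel) / P


if __name__ == "__main__":
    # Hypothetical run times in seconds.
    print(speedup(120.0, 10.0))          # run on 16 processors
    print(efficiency(120.0, 10.0, 16))
    # With P = 1 the parallel run time equals the serial run time, so the
    # efficiency is 1 even though nothing has been gained.
    print(efficiency(120.0, 120.0, 1))
```

The last call illustrates the caution in the text: an efficiency of 1 says nothing on its own, which is why efficiency should be read together with run time.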
The term scalability of a parallel system refers to the change in performance of the
parallel system as the problem size and computer size increase. Intuitively, a parallel system is
scalable if its performance continues to improve as the system, both the problem and machine, is
scaled in size (Sahni & Thanvantri, 2002).
Case Study: Cedar Multiprocessor
The Cedar multiprocessor was the first scalable, cluster-based, hierarchical shared-memory
multiprocessor of its kind in the 1980s (Yew, 2011). It was designed and built at the
Center for Supercomputing Research and Development (CSRD). The project successfully
created a complete, scalable shared-memory multiprocessor system with a working 4-cluster
(32-processor) hardware prototype, a parallelizing compiler for Cedar Fortran, and an operating
system called Xylem for scalable multiprocessors. The Cedar project was started in 1984, and the
prototype became functional in 1989 (Yew, 2011).
Cedar Machine Organization
Cedar consists of multiple clusters of processors connected through
two high-bandwidth unidirectional global interconnection networks (GINs) to a globally
shared memory system (GSM). One GIN carries memory requests from the clusters to the GSM.
The other GIN carries data and responses from the GSM back to the clusters.
Figure 8: Cedar machine organization (Yew, 2011).
Cedar Cluster
In each cluster, there are eight processors, called computational elements (CEs). Those eight CEs
are connected to a four-way interleaved shared cache through an 8 x 8 crossbar switch and four
ports to a global network interface (GNI) that provide access to GIN and GSM. On the other side
of the shared cache is a high-speed shared bus connected to multiple cluster memory modules
and interactive processors (IPs). IPs handle input/output and network functions.
Memory Hierarchy
The 4 GB physical memory address space of Cedar is divided into two halves between the
cluster memory and the GSM. The Cedar prototype has 64 MB of GSM and 64 MB of cluster memory in
each cluster. Cedar also supports a virtual memory system with a 4 KB page size. The GSM can
be directly addressed and shared by all clusters, but cluster memory is addressable only by the
CEs within its cluster. Data coherence among multiple copies of data in different cluster
memories is maintained explicitly through software by either the programmer or the compiler. The
GSM is double-word interleaved and aligned among all global memory modules. Each GSM
module has a synchronization processor that can execute each atomic Cedar synchronization
instruction issued from a CE and staged at the GNI.
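Two of the addressing ideas above, 4 KB pages and double-word interleaving across memory modules, can be sketched with simple integer arithmetic. This is only an illustration of the general techniques, not Cedar's actual address mapping: the double-word size of 8 bytes and the module count of 32 are assumptions (the source gives neither), and the function names are hypothetical.

```python
PAGE_SIZE = 4 * 1024        # 4 KB page size, as stated in the text
DOUBLE_WORD_BYTES = 8       # ASSUMPTION: a double word is 8 bytes
NUM_GSM_MODULES = 32        # ASSUMPTION: hypothetical number of GSM modules


def page_and_offset(address: int) -> tuple:
    """Split an address into (page number, offset within the page)."""
    return divmod(address, PAGE_SIZE)


def gsm_module(address: int, num_modules: int = NUM_GSM_MODULES) -> int:
    """Module holding `address` under double-word interleaving.

    Consecutive double words are placed in consecutive modules, wrapping
    around after the last module.
    """
    return (address // DOUBLE_WORD_BYTES) % num_modules


if __name__ == "__main__":
    page, offset = page_and_offset(0x12345)
    print(hex(page), hex(offset))
    # Walking through memory one double word at a time visits the modules
    # in round-robin order.
    print([gsm_module(a) for a in range(0, 40, 8)])
```

Interleaving at this granularity spreads a sequential sweep through memory across all modules, so consecutive accesses can be serviced by different modules rather than queuing at one.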
Importance of Cedar
Cedar had many features that were later used extensively in large-scale multiprocessor
systems, such as:
1. Software-managed cache memory to avoid costly cache coherence hardware support.
2. Vector data prefetching to cluster memories for hiding long memory latency.
3. Parallelizing compiler techniques that take sequential applications and extract task-level
parallelism from their loop structures.
4. Language extensions that included memory attributes of the data variables to allow
programmers and compilers to manage data locality more easily.
5. Parallel dense/sparse matrix algorithms that could speed up the most time-consuming part of
many linear systems in large-scale scientific applications.
Conclusion
Hypercubes were a popular class of parallel computers that have had a significant impact
on parallel computing. Generally, a hypercube is an arrangement of several ordinary processors,
each with its own memory and serial I/O hardware for connections with neighboring processors. The
processors can be arranged in various topologies, such as rings, trees, and meshes, each with
advantages and disadvantages. However, underlying each arrangement is the desire that drives
all parallel systems: to achieve maximum processing output at the least cost. The processor
activities in a hypercube parallel computer are coordinated through messages among processors.
Therefore, deadlock and livelock prevention are essential in a hypercube. Different
message routing approaches have been developed to address these issues, including source
vs. distributed, static vs. dynamic, minimal vs. nonminimal, and topology-dependent vs.
topology-agnostic routing. These approaches have varying strengths and weaknesses,
but, overall, they seek to enhance the three performance measures of hypercubes: speedup,
efficiency, and scalability. While hypercubes may no longer be implemented commercially, they
have been the foundation of many modern-day parallel computing systems.
References
Abd-El-Barr, M., & Gebali, F. (2014). Reliability analysis and fault tolerance for hypercube
multi-computer networks. Information Sciences, 276, 295-318.
Ahuja, S., & Sarje, A. K. (1995). Processor allocation in extended hypercube
multiprocessor. International Journal of High Speed Computing, 7(4), 481-488.
Chen, M. S., & Shin, K. G. (1989, April). Task migration in hypercube multiprocessors.
In Proceedings of the 16th annual international symposium on Computer
architecture (pp. 105-111).
Chen, M. S., & Shin, K. G. (1990). Subcube allocation and task migration in hypercube
multiprocessors. IEEE Transactions on Computers, 39(9), 1146-1155.
Chen, M., DeBenedictis, Y. E., Fox, G., Li, J., & Walker, Y. D. (1988). Hypercubes are
General-Purpose Multiprocessors with High Speedup. CALTECH Report.
Das, S. R., Vaidya, N. H., & Patnaik, L. M. (1990). Design and implementation of a hypercube
multiprocessor. Microprocessors and Microsystems, 14(2), 101-106.
Holsmark, R. (2009). Deadlock free routing in mesh networks on chip with regions (Doctoral
dissertation, Linköping University Electronic Press).
Latifi, S. (1990). Fault-tolerant hypercube multiprocessors. IEEE Transactions on
Reliability, 39(3), 361-368.
Matloff, N. (2011). Programming on parallel machines. University of California, Davis.
Ostrouchov, G. (1987, March). Parallel computing on a hypercube: an overview of the
architecture and some applications. In Computer Science and Statistics, Proceedings of
the 19th Symposium on the Interface (pp. 27-32).
Pifarre, G. D., Gravano, L., Denicolay, G., & Sanz, J. L. C. (1994). Adaptive deadlock- and
livelock-free routing in the hypercube network. IEEE Transactions on Parallel and
Distributed Systems, 5(11), 1121-1139.
Rai, S., Trahan, J. L., & Smailus, T. (1995). Processor allocation in hypercube
multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 6(6), 606-616.
Sahni, S., & Thanvantri, V. (1996). Parallel computing: Performance metrics and models. IEEE
Parallel and Distributed Technology, 4(1), 43-56.
Yew, P-C. (2011). Cedar multiprocessor. In D. Padua (Ed.), Encyclopedia of Parallel
Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_112
Yoon, S. Y., Kang, O., Yoon, H., Maeng, S. R., & Cho, J. W. (1991, December). A heuristic
processor allocation strategy in hypercube systems. In Proceedings of the Third IEEE
Symposium on Parallel and Distributed Processing (pp. 574-581). IEEE.