To appear in Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1995.
On Characterizing Bandwidth Requirements of Parallel Applications
Anand Sivasubramaniam
Aman Singla
Umakishore Ramachandran
H. Venkateswaran
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332-0280.
{anand, aman, rama, venkat}@cc.gatech.edu
Abstract
Synthesizing architectural requirements from an application viewpoint can help in making important architectural design decisions
towards building large scale parallel machines. In this paper, we
quantify the link bandwidth requirement on a binary hypercube
topology for a set of five parallel applications. We use an execution-driven simulator called SPASM to collect data points for system
sizes that are feasible to be simulated. These data points are then
used in a regression analysis for projecting the link bandwidth requirements for larger systems. The requirements are projected as
a function of the following system parameters: number of processors, CPU clock speed, and problem size. These results are also
used to project the link bandwidths for other network topologies.
Our study quantifies the link bandwidth that has to be made available to limit the network overhead in an application to a specified
tolerance level. The results show that typical link bandwidths (200-300 MBytes/sec) found in current commercial parallel architectures
(such as Intel Paragon and Cray T3D) would have fairly low network overhead for the applications considered in this study. For two
of the applications, this overhead is negligible. For the other applications, this overhead can be limited to about 30% of the execution
time provided the problem sizes are increased commensurate with
the processor clock speed. The technique presented can be useful
to a system architect to synthesize the bandwidth requirements for
realizing well-balanced parallel architectures.
This work has been funded in part by NSF grants MIPS-9058430 and MIPS-9200005, and grants from DEC, SYSTRAN, and IBM.

1 Introduction
Parallel machines promise to solve computationally intensive problems that cannot be feasibly solved due to resource limitations
on sequential machines. Despite this promise, the delivered performance of these machines often falls short of the projected peak
performance. The disparity between the expected and observed
performance may be due to application deficiencies such as serial
fraction and work-imbalance, software slow-down due to scheduling and runtime overheads, and hardware slow-down stemming
from the synchronization and communication requirements in the
application. For building a general-purpose parallel machine, it
is essential to identify and quantify the architectural requirements
necessary to assure good performance over a wide range of applications. Such a synthesis of requirements from an application
view-point can help us make cost vs. performance trade-offs in important architectural design decisions. The network is an important artifact in a parallel machine limiting the performance of
many applications, and is the focus of our study. Using five parallel
applications, we quantify bandwidth requirements needed to limit
the overheads arising from the network to an acceptable level for
these applications.
Latency and contention are two defining characteristics of a network from the application viewpoint. Latency is the sum of the
time spent in transmitting a message over the network links and the
switching time assuming that the message did not have to contend
for any network links. Contention is the time spent by a message
in the network waiting for links to become free. Both latency and
contention depend on a number of factors including the connectivity, the bandwidth of the links in the network, the switching delays,
and the length of the message. Of the architectural factors, link
bandwidth and connectivity are the most crucial network parameters impacting latency and contention. Hence, in order to quantify
requirements that limit network overheads (latency and contention)
to an acceptable level, it is necessary to study the impact of link
bandwidth and network connectivity on these overheads.
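For a single message, this decomposition can be summarized as follows (our notation: L is the message length, b the link bandwidth, h the number of links traversed, and t_sw the per-hop switching delay):

\[
  T_{msg} \;=\; T_{lat} + T_{cont},
  \qquad
  T_{lat} \;=\; \frac{L}{b} \;+\; h\,t_{sw},
\]

where T_{lat} is the contention-free latency and T_{cont} is the additional time the message spends waiting for busy links.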
Dally [8] and Agarwal [2] present analytical models to study the impact of network connectivity and link bandwidth for k-ary n-cube networks. The results suggest that low-dimensional networks are
networks. The results suggest that low-dimensional networks are
preferred (based on physical and technological constraints) when
the network contention is ignored or when the workload (the application) exhibits sufficient network locality; and that higher dimensional networks may be needed otherwise. Adve and Vernon [1]
show using analytical models that network locality has an important role to play in the performance of the mesh. Since network
requirements are sensitive to the workload, it is necessary to study
them in the context of real applications.
The RISC ideology clearly illustrates the importance of using
real applications in synthesizing architectural requirements. Several researchers have used this approach for parallel architectural
studies [20, 7, 13]. Cypher et al. [7] use a range of scientific applications in quantifying the processing, memory, communication and
I/O requirements. They present the communication requirements
in terms of the number of messages exchanged between processors
and the volume (size) of these messages. As identified in [22],
communication in parallel applications may be categorized by the
following attributes: communication volume, the communication
pattern, the communication frequency and the ability to overlap
communication with computation. A static analysis of the communication as conducted in [7] fails to capture the last two attributes,
making it very difficult to quantify the contention in the system.
The importance of simulation in capturing the dynamics of parallel system (an application-architecture combination) behavior has
been clearly illustrated in [12, 22, 25]. In particular, using an
execution-driven simulator, one can faithfully capture all the attributes of communication that are important to network requirements synthesis. For example, in [12] the authors use an execution-driven simulator to study k-ary n-cube networks in the context of
applications drawn from image understanding, and show the impact
of application characteristics on the choice of the network topology.
We take a similar approach to deriving the network requirements in
this study.
Using an execution-driven simulation platform called SPASM
[27, 25], we simulate the execution of five parallel applications on
an architectural platform with a binary hypercube network topology. We vary the link bandwidth on the hypercube and quantify its
impact on application performance. From these results, we derive
link bandwidths that are needed to limit network overheads to an
acceptable level. We also study the impact of the number of processors, the CPU clock speed and the application problem size on
bandwidth requirements. Using regression analysis and application
knowledge, we extrapolate requirements for larger systems of 1024
processors and other network topologies. The results suggest that
existing link bandwidth of 200-300 MBytes/sec available on machines like Intel Paragon [14] and Cray T3D [17] can easily sustain
the requirements of two applications (EP and FFT) even on high-speed processors of the future. For the other three, one may be able
to maintain network overheads at an acceptable level if the problem
size is increased commensurate with the processing speed.
The technique outlined in this paper may be used to derive the
bandwidth requirements of an application as a function of the system
parameters (such as processor clock speed, number of processors,
and expected problem size of applications). This information can
then be used by an architect to deduce the network requirements for
a specific set of system parameters. When cost and technological
factors prohibit supporting the required bandwidth, the information
may be useful to find out the efficiency that would result from a
lower bandwidth or the factor by which the problem size needs to
be scaled up to maintain reasonable efficiency.
Section 2 gives an overview of our approach and details of the
simulation platform; section 3 briefly describes the hardware platform and applications used in this study; section 4 presents results
from our experiments along with an analysis of bandwidth requirements as a function of system parameters; section 5 summarizes
the implication of these results; and section 6 presents concluding
remarks.
2 Approach
As observed in [22], communication in an application may be characterized by four attributes. Volume refers to the number and size
of messages. The communication pattern in the application determines the source-destination pairs for the messages, and reflects
on the application’s ability to exploit network locality. Frequency
pertains to the temporal aspects of communication, i.e., the interval
between successive network accesses by each processor as well as
the interleaving in time of accesses from different processors. This
temporal aspect of communication would play an important role
in determining network contention. Tolerance is the ability of an
application to hide network overheads by overlapping computation
with communication. Modeling all these attributes of communication in a parallel application is extremely difficult by simple static
analysis. Further, the dynamic access patterns exhibited by many
applications make modeling more complex. Several researchers
[22, 25, 12] have observed the importance of simulation for capturing the communication behavior of applications.
In this study, we use an execution-driven simulator called SPASM
(Simulator for Parallel Architectural Scalability Measurements)
[27, 25] that enables us to accurately model the behavior of applications on a number of simulated hardware platforms. SPASM has
been written using CSIM [16], a process-oriented sequential simulation package, and currently runs on SPARCstations. The inputs to the simulator are parallel applications written in C. These programs
are preprocessed (to label shared memory accesses), the compiled
assembly code is augmented with cycle counting instructions, and
the assembled binary is linked with the simulator code. As with
other recent simulators [5, 9, 6, 19], the bulk of the instructions is executed at the speed of the native processor (the SPARC in this case),
and only instructions (such as LOADs and STOREs on a shared
memory platform or SENDs and RECEIVEs on a message-passing
platform) that may potentially involve a network access are simulated. The input parameters that may be specified to SPASM are
the number of processors, the CPU clock speed, the network topology, the link bandwidth and switching delays. We can thus vary a
range of system parameters and study their impact on application
performance. The main problem with the execution-driven simulation approach is the tremendous resource (both space and time)
requirement in simulating large parallel systems. Related studies
[28, 24] address this problem and show how it may be alleviated by
augmenting simulation with other evaluation techniques.
SPASM gives a wide range of statistical information about the
execution of the program. The novel feature of SPASM is its ability
to provide a complete isolation and quantification of different overheads in a parallel system that limit its scalability. The algorithmic
overhead (arising from factors such as the serial part and work-imbalance in the algorithm) and the network overheads (latency
and contention) are the important overheads that are of relevance to
this study. SPASM gives the total time (simulated time) which is the
maximum of the running times of the individual parallel processors.
This is the time that would be taken by an execution of the parallel
program on the target parallel machine. SPASM also gives the ideal
time, which is the time taken by the parallel program to execute on
an ideal machine such as the PRAM [31]. This metric includes the
algorithmic overheads but does not include any overheads arising
from architectural limitations. Of the network overheads, the time
that a message would have taken in a contention free environment
is charged to the latency overhead, while the rest of the time is
accounted for in the contention overhead.
To synthesize the communication requirements of parallel applications, the separation of the overheads provided by SPASM is crucial. For instance, an application may have an algorithmic deficiency due to either a large serial part or due to work-imbalance, in which case 100% efficiency is impossible regardless of other architectural parameters. (Efficiency is defined as Speedup(p)/p, where p is the number of processors and Speedup(p) is the ratio of the time taken to execute the parallel application on 1 processor to the time taken to execute it on p processors.) The separation of overheads provided by SPASM enables us to quantify the bandwidth requirements as a function of acceptable network overheads (latency and contention). We thus quantify the bandwidth needed to limit the network overheads to 10%, 30% and 50% of the overall execution time. This also amounts to quantifying the bandwidth needed to attain an efficiency which is 90%, 70% and 50% of the ideal efficiency (on an ideal machine with zero network overhead).
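This correspondence can be written out explicitly. Treating the network overhead as the only overhead beyond the ideal time, and writing T_total for the simulated execution time, T_ideal for the ideal (PRAM) time, T_net for the network overhead (latency plus contention), and T_seq for the single-processor time (the notation is ours):

\[
  T_{total} = T_{ideal} + T_{net},
  \qquad
  \frac{T_{net}}{T_{total}} \le x
  \;\Longrightarrow\;
  T_{ideal} \ge (1 - x)\,T_{total},
\]
\[
  \text{Efficiency} = \frac{T_{seq}}{p\,T_{total}}
  \;\ge\; (1 - x)\,\frac{T_{seq}}{p\,T_{ideal}}
  = (1 - x) \times \text{(ideal efficiency)},
\]

so x = 0.10, 0.30 and 0.50 correspond to 90%, 70% and 50% of the ideal efficiency respectively.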
3 Experimental Setup
We have chosen a CC-NUMA (Cache Coherent Non-Uniform Memory Access) shared memory multiprocessor as the architectural platform for this study. Since uniprocessor architecture is getting standardized with the advent of RISC technology, we fix most of the processor characteristics by using a SPARC chip as the baseline for each processor in the parallel system. But to study the impact of processor speed on network requirements, we allow the clock speed of a processor to be varied. Each node in the system has a piece of the globally shared memory and a 2-way set-associative private cache (64 KBytes with 32-byte blocks). The cache is maintained sequentially consistent using an invalidation-based fully-mapped directory-based cache coherence scheme. Rothberg et al. [20] show that a cache of moderate size (64 KBytes) suffices to capture the working set in many applications, and Wood et al. [30] show that for several applications the execution time is not significantly different across cache coherence protocols. In an earlier study [28], we show that for an invalidation-based protocol with a full-map directory, the coherence maintenance overhead due to the protocol is not significant for a range of applications. Thus, in our approach to synthesizing network requirements, we fix the cache parameters and vary only the clock speed of the processor and the network parameters. The synchronization primitive supported in hardware is a test-and-set operation, and applications use test-test-and-set to implement higher level synchronization.
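For concreteness, the lock structure implied by this primitive looks roughly like the sketch below (ours, with C11 atomics standing in for the simulated SPARC test-and-set): waiting processors spin on a cached copy of the lock and issue the atomic test-and-set, which generates coherence traffic, only when the lock appears free.

#include <stdatomic.h>

typedef atomic_int spinlock_t;                 /* 0 = free, 1 = held */

static void lock_acquire(spinlock_t *l)
{
    for (;;) {
        /* "test" phase: spin locally until the lock looks free */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;
        /* "test-and-set" phase: one atomic (coherence-visible) transaction */
        if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
    }
}

static void lock_release(spinlock_t *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

IS uses locks of this kind for mutual exclusion, and the dynamically maintained task queue in CHOLESKY is built on the same primitive.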
The study is conducted for a binary hypercube interconnect. The hypercube is assumed to have serial (1-bit wide) unidirectional links and uses the e-cube routing algorithm [29]. Messages are circuit-switched using a wormhole routing strategy, and the switching delay is assumed to be zero. We simulate the network in its entirety in the context of the given applications. Ideally, we would like to simulate other networks as well in order to study the change in link bandwidth requirements with network connectivity. Since these simulations take an inordinate amount of time, we have restricted ourselves to simulating the hypercube network in this study. We use the results from the hypercube study in hypothesizing the requirements for other networks using analytical techniques coupled with application knowledge.
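As a small illustration of the routing assumed here, the sketch below (ours, in C) traces the path a message follows under e-cube (dimension-order) routing: the differing address bits between source and destination are corrected in a fixed order, so a message traverses as many links as the Hamming distance between the two node labels.

#include <stdio.h>

/* print the sequence of nodes visited from src to dst on a d-dimensional cube */
static void ecube_route(unsigned src, unsigned dst, int d)
{
    unsigned cur = src;
    printf("%u", cur);
    for (int dim = 0; dim < d; dim++) {
        unsigned bit = 1u << dim;
        if ((cur ^ dst) & bit) {        /* addresses differ in this dimension */
            cur ^= bit;                 /* traverse the link along dimension 'dim' */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void)
{
    ecube_route(5, 12, 4);   /* e.g., 0101 -> 1100 on a 16-node (4-D) cube */
    return 0;
}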
We have chosen five parallel applications for this study that exhibit different characteristics and are representative of many scientific computations. Three of the applications (EP, IS and CG) are from the NAS parallel benchmark suite [4]; CHOLESKY is from the SPLASH benchmark suite [23]; and FFT is the well-known Fast Fourier Transform algorithm. EP and FFT are well-structured applications with regular communication patterns determinable at compile time, with the difference that EP has a higher computation to communication ratio. IS also has a regular communication pattern, but in addition it uses locks for mutual exclusion during the execution. CG and CHOLESKY differ from the other applications in that their communication patterns are not regular (both use sparse matrices) and cannot be determined at compile time. While a certain number of rows of the matrix in CG is assigned to each processor at compile time (static scheduling), CHOLESKY uses a dynamically maintained queue of runnable tasks. Further details on the applications can be found in [26].
4 Results and Analysis
In this section, we present results from our simulation experiments.
Using these results, we quantify the link bandwidth necessary to
support the efficient performance of these applications and project
the bandwidth requirements for building large-scale parallel systems with a binary hypercube topology.
The experiments have been conducted over a range of processors (p = 4, 8, 16, 32, 64), CPU clock speeds (s = 33, 100, 300 MHz) and link bandwidths (1, 20, 100, 200, 600 and 1000 MBytes/sec). The problem size of the applications has been varied as n = 16K, 32K, 64K, 128K and 256K for EP, IS and FFT, 1400x1400 and 5600x5600 for CG, and 1806x1806 for CHOLESKY. In studying the effect of each parameter (processors, clock speed, problem size), we keep the other two constant.
4.1 Impact of System Parameters on Bandwidth Requirements
As the link bandwidth is increased, the efficiency of the system is also expected to increase, as shown in Figure 1. But we soon reach a point of diminishing returns beyond which increasing the bandwidth does not have a significant impact on application performance (the curves flatten), since the network overheads (both latency and contention) are sufficiently low at this point. In all our results, we observe such a distinct knee. One would expect the efficiency beyond this knee to be close to 100%. But owing to algorithmic overheads such as the serial part or work-imbalance, the knee may occur at a much lower efficiency (e0 in Figure 1 may be much lower than 1.0). These algorithmic overheads may also cause the curves for each configuration of system parameters to flatten out at entirely different levels in the efficiency spectrum. The bandwidth corresponding to the knee (b0) still represents an ideal point at which we would like to operate, since the network overheads beyond this knee are minimal and the network is no longer the bottleneck for any loss of efficiency. We track the horizontal movement of this knee to study the impact of system parameters (processors, CPU clock speed, problem size) on link bandwidth requirements.

[Figure 1: Efficiency vs. link bandwidth, showing the knee of the curve at efficiency e0 and bandwidth b0.]

Impact of Number of Processors

Figures 2, 3, 4, 5 and 6 show the impact of varying link bandwidth on the efficiency of EP, IS, FFT, CG and CHOLESKY respectively, across different numbers of processors (p). The knees for EP and FFT, which display a high computation to communication ratio, occur at low bandwidths and are hardly noticeable in these figures. The algorithmic overheads in these applications are negligible, yielding efficiencies that are close to 100%. For the other applications, the knee occurs at a higher bandwidth. Further, the curves tend to flatten at different efficiencies, suggesting the presence of algorithmic overheads. For all the applications, the knee shifts to the right as the number of processors is increased, indicating the need for higher bandwidth. As the number of processors is increased, the network accesses incurred by a processor in the system may increase or decrease depending on the application, but each such access would incur a larger overhead from contending for network resources (due to the larger number of messages in the network as a whole for the chosen applications). Further, the computation performed by a processor is expected to decrease, lowering the computation to communication ratio, thus making the network requirements more stringent.

[Figures 2-6: Efficiency vs. link bandwidth (MBytes/sec) for p = 4, 8, 16, 32, 64. Figure 2: EP; Figure 3: IS; Figure 4: FFT (n=64K, s=33 MHz); Figure 5: CG (n=1400x1400, s=33 MHz); Figure 6: CHOLESKY (n=1806x1806, s=33 MHz).]

Impact of CPU clock speed

Figures 7, 8, 9, 10 and 11 show the impact of link bandwidth on the efficiency of EP, IS, FFT, CG and CHOLESKY respectively, across different CPU clock speeds (s). As the CPU clock speed is increased, the computation to communication ratio decreases. In order to sustain the same efficiency, communication has to be sped up to keep pace with the CPU speed, thus shifting the knee to the right uniformly across all applications.

[Figures 7-11: Efficiency vs. link bandwidth (MBytes/sec) for s = 33, 100, 300 MHz. Figure 7: EP (p=64, n=128K); Figure 8: IS (p=64, n=64K); Figure 9: FFT (p=64, n=64K); Figure 10: CG (p=64, n=1400x1400); Figure 11: CHOLESKY (p=64, n=1806x1806).]

Impact of Problem Size

Figures 12, 13, 14 and 15 show the impact of link bandwidth on the efficiency of EP, IS, FFT and CG respectively, across different problem sizes. An increase in problem size is likely to increase the amount of computation performed by a processor. At the same time, a larger problem may also increase the network accesses incurred by a processor. In EP, FFT and CG, the former effect is more dominant, thereby increasing the computation to communication ratio and making the knee move to the left as the problem size is increased. The two counteracting effects nearly compensate each other in IS, showing a negligible shift in the knee.

[Figures 12-15: Efficiency vs. link bandwidth (MBytes/sec) across different problem sizes. Figure 12: EP (p=64, s=33 MHz); Figure 13: IS (p=64, s=33 MHz); Figure 14: FFT (p=64, s=33 MHz); Figure 15: CG (p=64, s=33 MHz).]
4.2 Quantifying Link Bandwidth Requirements
We analyze bandwidth requirements using the above simulation results in projecting requirements for large-scale parallel systems. We track the change in the knee with system parameters by quantifying the link bandwidth needed to limit the network overheads to a certain fraction of the overall execution time. This fraction determines the closeness of the operating point to the knee. For instance, if the network overhead is less than 10% of the overall execution time, then we are achieving an efficiency that is within 90% of the ideal efficiency (on an ideal machine with zero network overhead). Ideally, one would like to operate as close to the knee as possible. But owing to either cost or technological constraints, one may be forced to operate at a lower bandwidth, and it is interesting to investigate whether one may still obtain reasonable efficiencies under these constraints. Figure 16 shows the trade-off between the tolerable network overhead and the bandwidth that needs to be sustained to keep the overhead within the specified level. With the ability to tolerate a larger network overhead, the bandwidth requirements are expected to decrease, as shown by the curve labeled "Actual" in Figure 16. To calculate the bandwidth needed to limit the network overhead (both the latency and contention components) to a certain value, we simulate the applications and the network in their entirety over a range of link bandwidths. We plot the bandwidths and the resulting network overheads as shown by the curve labeled "Simulated" in Figure 16. We perform a linear interpolation between these data points to calculate the bandwidth (b_s) required to limit the network overhead to x% of the total execution time. This bandwidth represents a good upper bound on the corresponding "Actual" bandwidth (b_a) required. Instead of presenting all the interpolated graphs, we simply tabulate the requirements for x = 10%, 30% and 50% in the following discussions. These requirements are expected to change with the number of processors, the CPU clock speed and the application problem size. The rate of change in the knee is used to study the impact of these system parameters. In cases where the analysis is simple, we use our knowledge of the application and architectural characteristics in extrapolating the performance for larger systems. In cases where such a static analysis is not possible (due to the dynamic nature of the execution), we perform a non-linear regression analysis of the simulation results using a multivariate secant method with a 95% confidence interval in the SAS [21] statistics package.

[Figure 16: Required bandwidth vs. tolerable network overhead, showing the "Actual" (b_a) and "Simulated" (b_s) curves and the tolerance level x%.]
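The interpolation step just described is simple enough to spell out. The following is a minimal sketch in C (ours, not part of SPASM) of how b_s is obtained for a tolerance of x% from a set of simulated (bandwidth, overhead) points; the sample values in main() are hypothetical.

#include <stdio.h>

/* one simulated data point: link bandwidth (MBytes/sec) and the resulting
   network overhead (latency + contention) as a fraction of total time */
typedef struct { double bw; double ovhd; } Sample;

/* linear interpolation on the "Simulated" curve; points must be ordered by
   increasing bandwidth, so the overhead fraction is non-increasing */
double bandwidth_for_overhead(const Sample *s, int n, double target)
{
    if (s[0].ovhd <= target)
        return s[0].bw;                      /* already within tolerance */
    for (int i = 1; i < n; i++) {
        if (s[i].ovhd <= target) {           /* tolerance crossed in (i-1, i] */
            double t = (s[i-1].ovhd - target) / (s[i-1].ovhd - s[i].ovhd);
            return s[i-1].bw + t * (s[i].bw - s[i-1].bw);
        }
    }
    return s[n-1].bw;                        /* tolerance never reached */
}

int main(void)
{
    /* hypothetical simulated points for one (p, s, n) configuration */
    Sample sim[] = { {1, 0.92}, {20, 0.55}, {100, 0.28},
                     {200, 0.17}, {600, 0.08}, {1000, 0.05} };
    int n = sizeof sim / sizeof sim[0];
    double x[] = { 0.50, 0.30, 0.10 };       /* tolerated overhead fractions */

    for (int i = 0; i < 3; i++)
        printf("b_s for %2.0f%% overhead: %.2f MBytes/sec\n",
               100 * x[i], bandwidth_for_overhead(sim, n, x[i]));
    return 0;
}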
IS

Tables 3, 4 and 5 show the impact of the number of processors, the CPU clock speed and the problem size, respectively, on the link bandwidth requirements of IS; Table 6 projects the requirements for a 1024-node system with n = 2^23.

Table 3: IS: Impact of Processors on Link Bandwidth (in MBytes/sec); n=128K, s=33 MHz

            50% ovhd.              30% ovhd.              10% ovhd.
  p=4       7.75                   12.91                  68.69
  p=8       13.38                  30.75                  92.87
  p=16      22.00                  66.44                  168.71
  p=32      38.65                  78.61                  211.45
  p=64      47.03                  84.61                  293.44
  B/w Fns.  23.60 p^0.28 - 28.79   74.41 p^0.22 - 88.68   91.21 p^0.34 - 82.12
  p=1024    143.25                 251.91                 907.80

Table 4: IS: Impact of CPU speed on Link Bandwidth (in MBytes/sec); p=64, n=64K

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   47.03       84.61       293.44
  s=100MHz  102.49      224.69      770.16
  s=300MHz  356.14      649.95      1344.72

Table 5: IS: Impact of Problem Size on Link Bandwidth (in MBytes/sec); p=64, s=33 MHz

            50% ovhd.                 30% ovhd.                 10% ovhd.
  n=16K     46.60                     83.80                     270.08
  n=32K     47.16                     84.52                     286.98
  n=64K     47.03                     84.61                     293.44
  n=128K    47.48                     85.09                     303.41
  n=256K    48.67                     85.53                     307.75
  B/w Fns.  0.007(n/1K)^1.00 + 46.61  0.006(n/1K)^0.99 + 84.10  19.57(n/1K)^0.26 + 230.19
  n=8192K   110.75                    133.19                    441.66

Table 6: IS: Link Bandwidth Projections (in MBytes/sec); p=1024, n=2^23

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   337.34      396.55      1366.34
  s=100MHz  735.15      1053.08     3586.08
  s=300MHz  > 5000      > 5000      > 5000

Keeping the number of buckets constant would cause no change in communication in the communication phases of IS, while the computation is expected to grow linearly with the problem size. Hence, if we employ such a scaling strategy and increase the problem size linearly with the CPU clock speed, we may be able to limit the network overheads to within 30-50% for this application with existing technology.
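For configurations whose behavior cannot be analyzed statically, the requirements above are fitted with a non-linear regression (a multivariate secant method in SAS, as described at the start of Section 4.2). The sketch below is a rough, self-contained stand-in for that step, assuming the power-law form a*p^b + c that appears in the "B/w Fns." rows: it fits the five measured points of one column (here the 50%-overhead column of Table 3) and projects the requirement to p = 1024. The coefficients it reports depend on this simplified fitting procedure, not on SAS.

#include <stdio.h>
#include <math.h>

#define NPTS 5

/* least-squares fit of log(y - c) = log(a) + b*log(p) for a fixed offset c;
   returns the sum of squared residuals of a*p^b + c against the data */
static double fit_for_c(const double *p, const double *y, double c,
                        double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < NPTS; i++) {
        double x = log(p[i]), t = log(y[i] - c);
        sx += x; sy += t; sxx += x * x; sxy += x * t;
    }
    *b = (NPTS * sxy - sx * sy) / (NPTS * sxx - sx * sx);
    *a = exp((sy - *b * sx) / NPTS);

    double sse = 0;
    for (int i = 0; i < NPTS; i++) {
        double r = y[i] - (*a * pow(p[i], *b) + c);
        sse += r * r;
    }
    return sse;
}

int main(void)
{
    double p[NPTS] = { 4, 8, 16, 32, 64 };
    double y[NPTS] = { 7.75, 13.38, 22.00, 38.65, 47.03 };  /* Table 3, 50% column */

    double best_a = 0, best_b = 0, best_c = 0, best_sse = HUGE_VAL;

    /* crude search over the additive constant c; c must stay below min(y) */
    for (double c = -100.0; c < 7.70; c += 0.05) {
        double a, b;
        double sse = fit_for_c(p, y, c, &a, &b);
        if (sse < best_sse) {
            best_sse = sse; best_a = a; best_b = b; best_c = c;
        }
    }

    printf("fit: %.2f * p^%.2f + %.2f\n", best_a, best_b, best_c);
    printf("projection for p = 1024: %.2f MBytes/sec\n",
           best_a * pow(1024.0, best_b) + best_c);
    return 0;
}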
FFT

The implementation of FFT has been optimized so that all the communication takes place in one phase where every processor communicates with every other processor, and the communication in this phase is skewed in order to reduce contention. The computation performed by a processor in FFT grows as (n/p) log n, while the communication grows as n/(2p). Thus, these components decrease at comparable rates with an increase in the number of processors. As the number of processors is increased, the contention encountered by each message in the network is expected to grow. However, due to the implementation strategy, the bandwidth requirements of the network grow slowly with the number of processors, as shown in Table 7. These requirements can be satisfied even for faster processors (see Table 8). As we mentioned earlier, the computation to communication ratio is proportional to log n, and the network requirements are expected to become even less stringent as the problem size is increased. Table 9 confirms this observation. Hence, in projecting the requirements for a 1024-node system, link bandwidths of around 100-150 MBytes/sec would suffice to limit the network overheads to less than 10% of the execution time (see Table 10). The results shown in these tables agree with theoretical results presented in [10], where the authors show that FFT is scalable on the cube topology and that the achievable efficiency is limited only by the ratio of the CPU clock speed and the link bandwidth.

Table 7: FFT: Impact of Processors on Link Bandwidth (in MBytes/sec); n=64K, s=33 MHz

            50% ovhd.   30% ovhd.            10% ovhd.
  p=4       < 1.0       6.40                 16.35
  p=8       < 1.0       6.52                 16.40
  p=16      < 1.0       7.52                 16.75
  p=32      < 1.0       7.83                 16.87
  p=64      < 1.0       8.65                 17.19
  B/w Fns.  -           0.75 p^0.36 + 5.11   0.01 p^0.99 + 16.37
  p=1024    -           14.85                29.93

Table 8: FFT: Impact of CPU speed on Link Bandwidth (in MBytes/sec); p=64, n=64K

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   < 1.0       8.65        17.19
  s=100MHz  8.65        13.81       29.86
  s=300MHz  17.19       29.20       88.81

Table 9: FFT: Impact of Problem Size on Link Bandwidth (in MBytes/sec); p=64, s=33 MHz

            50% ovhd.   30% ovhd.                 10% ovhd.
  n=16K     < 1.0       9.42                      17.63
  n=32K     < 1.0       9.03                      17.45
  n=64K     < 1.0       8.65                      17.19
  n=128K    < 1.0       8.38                      17.03
  n=256K    < 1.0       7.97                      16.84
  B/w Fns.  -           11.02 - 0.4 log2(n/1K)    18.43 - 0.2 log2(n/1K)
  n=2^30    -           3.02                      14.43

Table 10: FFT: Link Bandwidth Projections (in MBytes/sec); p=1024, n=2^30

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   -           5.15        21.64
  s=100MHz  -           8.22        48.58
  s=300MHz  -           17.38       144.50
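Using the per-processor growth rates quoted above (our symbols, with constants folded away), the reason the FFT requirements ease with problem size can be written in one line:

\[
  T_{comp} \;\propto\; \frac{n}{p}\log n,
  \qquad
  T_{comm} \;\propto\; \frac{n}{p}
  \qquad\Longrightarrow\qquad
  \frac{T_{comp}}{T_{comm}} \;\propto\; \log n .
\]

Both terms scale as 1/p, so raising p leaves the ratio unchanged, while raising n improves it logarithmically, which is why the fitted functions in Table 9 fall off only slowly with n.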
CG

The main communication in CG occurs in the multiplication of a sparse matrix with a dense vector [26]. Each processor performs this operation for a contiguous set of rows allocated to it. The elements of the vector that are needed by a processor to perform this operation depend on the distribution of non-zero elements in the matrix and may involve external accesses. Once an element is fetched, a processor may reuse it for a non-zero element in another row at the same column position. As the number of processors is increased, the number of rows allocated to a processor decreases, thus decreasing the computation that it performs. Increasing the number of processors has a dual impact on communication. Since the number of rows that need to be computed decreases, the probability of external accesses decreases. There is also a decreased probability of reusing a fetched data item for computing another row. These complicated interactions are to a large extent dependent on the input data and are difficult to analyze statically. We use the data sets supplied with the NAS benchmarks [4]. The results from our simulation are given in Table 11. We observe that the effect of lower local computation and lesser data reuse has a more significant impact in increasing the communication requirements for larger systems. Increasing the clock speed has an almost linear impact on increasing bandwidth requirements, as given in Table 12. As the problem size is increased, the local computation increases, and the probability of data reuse also increases. The rate at which these factors impact the requirements depends on the sparsity factor of the matrix. Table 13 shows the requirements for two different problem sizes. For the 1400x1400 problem, the sparsity factor is 0.04, while the sparsity factor for the 5600x5600 problem is 0.02. The corresponding factor for the 14000x14000 problem suggested in [4] is 0.01, and we scale down the bandwidth requirements accordingly in Table 14 for a 1024-node system. The results suggest that we may be able to limit the overheads to within 50% of the execution time with existing technology. As the processors get faster than 100 MHz, a considerable amount of bandwidth would be needed to limit overheads to within 30%. But with faster processors and larger system configurations, one may expect to solve larger problems as well. If we increase the problem size (number of rows of the matrix) linearly with the clock speed of the processor, one may expect the bandwidth requirements to remain constant, and we may be able to limit network overheads to within 30% of execution time even with existing technology.

Table 11: CG: Impact of Processors on Link Bandwidth (in MBytes/sec); n=1400x1400, s=33 MHz

            50% ovhd.            30% ovhd.   10% ovhd.
  p=4       1.74                 2.90        8.71
  p=8       3.25                 5.41        16.23
  p=16      5.81                 9.68        52.03
  p=32      9.73                 16.22       82.39
  p=64      15.63                46.10       124.19
  B/w Fns.  1.25 p^0.62 - 1.28   -           18.80 p^0.51 - 33.07
  p=1024    94.79                393.32      618.28

Table 12: CG: Impact of CPU speed on Link Bandwidth (in MBytes/sec); p=64, n=1400x1400

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   15.63       46.10       124.19
  s=100MHz  43.50       96.75       386.12
  s=300MHz  120.89      262.84      1022.14

Table 13: CG: Impact of Problem Size on Link Bandwidth (in MBytes/sec); p=64, s=33 MHz

                50% ovhd.   30% ovhd.   10% ovhd.
  n=1400x1400   15.63       46.10       124.19
  n=5600x5600   9.48        25.47       78.55

Table 14: CG: Link Bandwidth Projections (in MBytes/sec); p=1024, n=14000x14000

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   34.87       120.04      247.33
  s=100MHz  97.05       251.93      1200.49
  s=300MHz  269.7       684.41      2035.64
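For concreteness, the sketch below (a schematic in C, not the NAS benchmark code) shows the per-processor kernel the CG discussion above refers to: a sparse matrix-vector product over a statically assigned block of rows in compressed sparse row form. Function and parameter names are ours, and the example matrix in main() is hypothetical.

#include <stdio.h>

/*
 * row_lo/row_hi delimit the contiguous block of rows assigned to this
 * processor; elements of the shared vector x that live on other nodes are
 * fetched through the memory system, and a fetched x[col[j]] is reused
 * whenever another local row has a non-zero in the same column (it stays
 * in the private cache).
 */
void spmv_block(int row_lo, int row_hi,
                const int *rowptr, const int *col, const double *val,
                const double *x,      /* globally shared dense vector  */
                double *y)            /* local slice of the result     */
{
    for (int i = row_lo; i < row_hi; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* may be a local or remote access */
        y[i - row_lo] = sum;
    }
}

int main(void)
{
    /* tiny 4x4 example; this "processor" owns rows 2 and 3 */
    int    rowptr[5] = { 0, 2, 3, 5, 6 };
    int    col[6]    = { 0, 2, 1, 0, 3, 3 };
    double val[6]    = { 2.0, 1.0, 3.0, 1.0, 4.0, 5.0 };
    double x[4]      = { 1.0, 1.0, 1.0, 1.0 };
    double y[2];

    spmv_block(2, 4, rowptr, col, val, x, y);
    printf("y = %.1f %.1f\n", y[0], y[1]);      /* expect 5.0 5.0 */
    return 0;
}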
CHOLESKY

This application performs a Cholesky factorization of a sparse positive definite matrix [26]. Each processor, while working on a column, will need to access the non-zero elements in the same row position of other columns. Once a non-local element is fetched, the processor can reuse it for the next column that it has to process. The communication pattern in processing a column is similar to that of CG. The difference is that the allocation of columns to processors in CHOLESKY is done dynamically. As with CG, an increase in the number of processors is expected to decrease the computation to communication ratio as well as the probability of data reuse. Further, the network overheads for implementing dynamic scheduling are also expected to increase for larger systems. Table 15 reflects this trend, showing that bandwidth requirements for CHOLESKY grow modestly with an increase in processors. Still, the requirements may be easily satisfied with existing technology for 1024 processors. Even with a 300 MHz clock, one may be able to limit network overheads to around 30%, as shown in Table 16. Owing to resource constraints, we have not been able to simulate other problem sizes for CHOLESKY in this study. But an increase in problem size is expected to increase the computation to communication ratio (this has been experimentally verified on the KSR-1), which suggests that bandwidth requirements are expected to decrease with problem size. Hence, as the processors get faster, one may still be able to maintain network overheads at an acceptable level with existing technology if the problem size is increased correspondingly.

Table 15: CHOLESKY: Impact of Processors on Link Bandwidth (in MBytes/sec); n=1806x1806, s=33 MHz

            50% ovhd.            30% ovhd.            10% ovhd.
  p=4       5.62                 11.98                76.56
  p=8       6.86                 13.11                78.32
  p=16      7.77                 14.48                80.83
  p=32      8.91                 16.02                84.12
  p=64      10.49                17.48                87.35
  B/w Fns.  1.47 p^0.37 + 3.43   2.14 p^0.33 + 8.82   6.77 p^0.27 + 66.60
  p=1024    23.44                31.26                110.70

Table 16: CHOLESKY: Impact of CPU speed on Link Bandwidth (in MBytes/sec); p=64, n=1806x1806

            50% ovhd.   30% ovhd.   10% ovhd.
  s=33MHz   10.49       17.48       87.35
  s=100MHz  16.56       60.13       278.92
  s=300MHz  29.51       171.29      712.60
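The dynamic scheduling that distinguishes CHOLESKY from CG can be pictured as below; this is a schematic sketch (not the SPLASH code) of a shared pool of runnable columns guarded by a lock, with C11 atomics standing in for the simulated test-and-set primitive. All names are placeholders.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READY 4096

/* shared pool of columns that are ready to be factored; the atomic_flag
   must be initialized with ATOMIC_FLAG_INIT before use */
typedef struct {
    atomic_flag lock;              /* built on the test-and-set primitive   */
    int cols[MAX_READY];
    int head, tail;                /* head == tail means the pool is empty  */
} ReadyQueue;

static void rq_lock(ReadyQueue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                          /* spin until the flag is clear          */
}

static void rq_unlock(ReadyQueue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* a column whose pending updates have all arrived becomes runnable */
static void add_ready_column(ReadyQueue *q, int col)
{
    rq_lock(q);
    q->cols[q->tail++ % MAX_READY] = col;
    rq_unlock(q);
}

/* each processor grabs the next runnable column, if any */
static bool next_ready_column(ReadyQueue *q, int *col)
{
    bool got = false;
    rq_lock(q);
    if (q->head != q->tail) {
        *col = q->cols[q->head++ % MAX_READY];
        got = true;
    }
    rq_unlock(q);
    return got;
}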
5 Discussion
In the previous section, we quantified the link bandwidth requirements of five applications for the binary hypercube topology as a
function of the number of processors, CPU clock speed and problem size. Based on these results we also projected the requirements
of large systems built with 1024 processors and CPU clock speeds
up to 300 MHz. We observed that EP has negligible bandwidth
requirements and FFT has moderate requirements that can be easily
sustained. The network overheads for CG and CHOLESKY may be
maintained at an acceptable level for current day processors, and as
the processor speed increases, one may still be able to tolerate these
overheads provided the problem size is increased commensurately.
The network overheads of IS are tolerable for slow processors,
but the requirements become unmanageable as the clock speed increases. As we observed, the deficiency in this problem is in the
way the problem is scaled (the number of buckets is scaled linearly
with the size of the input list to be sorted). On the other hand, if
the number of buckets is maintained constant, it may be possible
to sustain bandwidth requirements by increasing the problem size
linearly with the processing speed.
In [18], the authors show that the applications EP, IS, and CG
scale well on a 32-node KSR-1. Although our results suggest that
these applications may incur overheads affecting their scalability,
this does not contradict the results presented in [18] since the implications of our study are for larger systems built with much faster
processors.
All of the above link bandwidth results have been presented
for the binary hypercube network topology. The cube represents
a highly scalable network where the bisection bandwidth grows
linearly with the number of processors. Even though cubes of 1024
nodes have been built [11], cost and technology factors often play an important role in their physical realization. Agarwal [2] and Dally
[8] show that wire delays (due to increased wire lengths associated
with planar layouts) of higher dimensional networks make low
dimensional networks more viable. The 2-dimensional [15] and 3-dimensional [17, 3] toroids are common topologies used in current
day networks, and it would be interesting to project link bandwidth
requirements for these topologies.
A metric that is often used to compare different networks is the
bisection bandwidth available per processor. On a k-ary n-cube,
the bisection bandwidth available per processor is inversely proportional to the radix of the network. One may use a simple rule
of thumb of maintaining per processor bisection bandwidth constant in projecting requirements for lower connectivity networks.
For example, considering a 1024-node system, the link bandwidth
requirement for a 32-ary 2-cube would be 16 times the 2-ary 10-cube bandwidth; similarly the requirement for a 3-D network would
be around 5 times the 10-D network. Such a projection assumes
that the communication in an application is devoid of any network locality and that each message crosses the bisection. But we
know that applications normally tend to exploit network locality
and the projection can thus become very pessimistic [1]. With a
little knowledge about the communication behavior of applications,
one may be able to reduce the degree of pessimism. In both FFT
and IS, every processor communicates with every other processor,
and thus only 50% of the messages cross the bisection. Similarly,
instrumentation in our simulation showed that around 50% of the
messages in CG and CHOLESKY traverse the bisection. To reduce the degree of pessimism in these projections, one may thus
introduce a correction factor of 0.5 that can be multiplied with the
above-mentioned factors of 16 and 5 in projecting the bandwidths
for 2-D and 3-D networks respectively. EP would still need negligible bandwidth and we can still limit network overheads of FFT to
around 30% on these networks with existing technology. But the
problem sizes for IS, CG and CHOLESKY would have to grow by
a factor of 8 compared to their hypercube counterparts if we are to
sustain the corresponding efficiency on a 2-D network with current
technology. Despite the correction factor, these projections are still
expected to be pessimistic since the method ignores the temporal
aspect of communication. The projection assumes that every message in the system traverses the bisection at the same time. If the
message pattern is temporally skewed, then a lower link bandwidth
may suffice for a given network overhead. It is almost impossible
to determine these skews statically, especially for applications like
CG and CHOLESKY where the communication pattern is dynamic.
It would be interesting to conduct a detailed simulation for these
network topologies to confirm these projections.
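As a concrete check of the factors used above (our notation): the per-processor bisection bandwidth of a k-ary n-cube with p = k^n nodes is inversely proportional to the radix k, so matching the binary hypercube (k = 2) at p = 1024 means scaling the link bandwidth by the ratio of the radices,

\[
  \frac{b_{\text{2-D}}}{b_{\text{cube}}} \approx \frac{32}{2} = 16,
  \qquad
  \frac{b_{\text{3-D}}}{b_{\text{cube}}} \approx \frac{10}{2} = 5,
\]

and the 0.5 correction for messages that do not cross the bisection brings these factors down to 8 and 2.5 respectively.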
6 Concluding Remarks
In this study, we undertook the task of synthesizing the bandwidth
requirements of five parallel applications. Such a study can help in
making cost-performance trade-offs in designing and implementing
networks for large scale multiprocessors. One way of conducting
such a study would be to examine the applications statically, and
develop simple analytical models to capture their communication
requirements. But as we mentioned in Section 2, it is difficult to
faithfully model all the attributes of communication by a simple
static analysis for all applications. On the other hand, simulation
can faithfully capture all the attributes of communication. We used
an execution-driven simulator called SPASM for simulating the
applications on an architectural platform with a binary hypercube
topology. The link bandwidth of the simulated platform was varied
and its impact on application performance was quantified. From
these results, the link bandwidth requirements for limiting the network overheads to a specified tolerance level were identified. We
also studied the impact of system parameters (number of processors,
processing speed, problem size) on link bandwidth requirements.
Using regression analysis and analytical techniques, we projected
requirements for large scale parallel systems with 1024 processors
and other network topologies. The results show that existing link
bandwidth of 200-300 MBytes/sec available on machines like Intel
Paragon [14] and Cray T3D [17] can sustain these applications with fairly low network overhead, even on high-speed processors. For applications like EP
and FFT, this overhead is negligible. For the other applications,
this overhead can be limited to about 30% of the execution time
provided the problem sizes are increased commensurate with the
processor clock speed.
Using the technique outlined in this paper, it would be possible
for an architect to synthesize the bandwidth requirements of an application as a function of system parameters. For instance, given
a set of applications, the system size (number of processors) and
the CPU speed, an architect may use this technique to calculate the
bandwidth that he needs to support in hardware. In cases where
cost/technological problems prohibit supporting this bandwidth, the
architect may use the results to find out the efficiency that would
result from a lower hardware bandwidth or the factor by which the
problem size needs to be scaled to maintain good efficiency. The
results may also be used to quantify the rate at which the network
(which is often custom-built) capabilities have to be enhanced in
order to accommodate the rapidly improving off-the-shelf components used in realizing the processing nodes.
References
[1] V. S. Adve and M. K. Vernon. Performance analysis of mesh
interconnection networks with deterministic routing. IEEE
Transactions on Parallel and Distributed Systems, 5(3):225–
246, March 1994.
[2] A. Agarwal. Limits on Interconnection Network Performance. IEEE Transactions on Parallel and Distributed Systems, 2(4):398–412, October 1991.
[3] R. Alverson et al. The Tera Computer System. In Proceedings
of the ACM 1990 International Conference on Supercomputing, pages 1–6, Amsterdam, Netherlands, 1990.
[4] D. Bailey et al. The NAS Parallel Benchmarks. International
Journal of Supercomputer Applications, 5(3):63–73, 1991.
[5] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl.
PROTEUS : A high-performance parallel-architecture simulator. Technical Report MIT-LCS-TR-516, Massachusetts
Institute of Technology, Cambridge, MA 02139, September
1991.
[6] R. G. Covington, S. Madala, V. Mehta, J. R. Jump, and J. B.
Sinclair. The Rice parallel processing testbed. In Proceedings
of the ACM SIGMETRICS 1988 Conference on Measurement
and Modeling of Computer Systems, pages 4–11, Santa Fe,
NM, May 1988.
[7] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina. Architectural requirements of parallel scientific applications with
explicit communication. In Proceedings of the 20th Annual
International Symposium on Computer Architecture, pages
2–13, May 1993.
[8] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Transactions on Computers,
39(6):775–785, June 1990.
[9] H. Davis, S. R. Goldschmidt, and J. L. Hennessy. Multiprocessor Simulation and Tracing Using Tango. In Proceedings
of the 1991 International Conference on Parallel Processing,
pages II 99–107, 1991.
[10] A. Gupta and V. Kumar. The scalability of FFT on parallel
computers. IEEE Transactions on Parallel and Distributed
Systems, 4(8):922–932, August 1993.
[11] J. L. Gustafson, G. R. Montry, and R. E. Benner. Development
of Parallel Methods for a 1024-node Hypercube. SIAM Journal on Scientific and Statistical Computing, 9(4):609–638,
1988.
[12] W. B. Ligon III and U. Ramachandran. Simulating interconnection networks in RAW. In Proceedings of the Seventh
International Parallel Processing Symposium, April 1993.
[13] W. B. Ligon III and U. Ramachandran. Evaluating multigauge
architectures for computer vision. Journal of Parallel and
Distributed Computing, 21:323–333, June 1994.
[14] Intel Corporation, Oregon. Paragon User’s Guide, 1993.
[15] D. Lenoski, J. Laudon, K. Gharachorloo, W-D Weber,
A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The
Stanford DASH multiprocessor. IEEE Computer, 25(3):63–
79, March 1992.
[16] Microelectronics and Computer Technology Corporation,
Austin, TX 78759. CSIM User’s Guide, 1990.
[17] W. Oed. The Cray Research Massively Parallel Processor
System Cray T3D, 1993.
[18] U. Ramachandran, G. Shah, S. Ravikumar, and J. Muthukumarasamy. Scalability study of the KSR-1. In Proceedings
of the 1993 International Conference on Parallel Processing,
pages I–237–240, August 1993.
[19] S. K. Reinhardt et al. The Wisconsin Wind Tunnel : Virtual
prototyping of parallel computers. In Proceedings of the ACM
SIGMETRICS 1993 Conference on Measurement and Modeling of Computer Systems, pages 48–60, Santa Clara, CA, May
1993.
[20] E. Rothberg, J. P. Singh, and A. Gupta. Working sets, cache
sizes and node granularity issues for large-scale multiprocessors. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 14–25, May 1993.
[21] SAS Institute Inc., Cary, NC 27512. SAS/STAT User’s Guide,
1988.
[22] J. P. Singh, E. Rothberg, and A. Gupta. Modeling communication in parallel algorithms: A fruitful interaction between
theory and systems? In Proceedings of the Sixth Annual ACM
Symposium on Parallel Algorithms and Architectures, 1994.
[23] J. P. Singh, W-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Technical Report CSL-TR-91-469, Computer Systems Laboratory, Stanford University, 1991.
[24] A. Sivasubramaniam, U. Ramachandran, and H. Venkateswaran. A comparative evaluation of techniques for studying parallel system performance. Technical Report GIT-CC94/38, College of Computing, Georgia Institute of Technology, September 1994.
[25] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. An Approach to Scalability Study of
Shared Memory Parallel Systems. In Proceedings of the ACM
SIGMETRICS 1994 Conference on Measurement and Modeling of Computer Systems, pages 171–180, May 1994.
[26] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. On characterizing bandwidth requirements
of parallel applications. Technical Report GIT-CC-94/31,
College of Computing, Georgia Institute of Technology, July
1994.
[27] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. A Simulation-based Scalability Study of
Parallel Systems. Journal of Parallel and Distributed Computing, 22(3):411–426, September 1994.
[28] A. Sivasubramaniam, A. Singla, U. Ramachandran, and
H. Venkateswaran. Abstracting network characteristics and
locality properties of parallel systems. In Proceedings of the
First International Symposium on High Performance Computer Architecture, pages 54–63, January 1995.
[29] H. Sullivan and T. R. Bashkow. A large scale, homogenous,
fully-distributed parallel machine. In Proceedings of the 4th
Annual Symposium on Computer Architecture, pages 105–
117, March 1977.
[30] D. A. Wood et al. Mechanisms for cooperative shared memory.
In Proceedings of the 20th Annual International Symposium
on Computer Architecture, pages 156–167, May 1993.
[31] J. C. Wyllie. The Complexity of Parallel Computations. PhD
thesis, Department of Computer Science, Cornell University,
Ithaca, NY, 1979.