Understanding the Design Trade-offs among Current Multicore Systems for Numerical Computations

Seunghwa Kang, David A. Bader, Richard Vuduc
Georgia Institute of Technology
Atlanta, GA 30332, USA
Abstract
In this paper, we empirically evaluate fundamental design trade-offs among the most recent multicore processors
and accelerator technologies. Our primary aim is to aid
application designers in better mapping their software to
the most suitable architecture, with an additional goal of
influencing future computing system design. We specifically examine five architectures, based on: the Intel quad-core Harpertown processor, the AMD quad-core Barcelona
processor, the Sony-Toshiba-IBM Cell Broadband Engine
processors (both the first-generation chip and the second-generation PowerXCell 8i), and the NVIDIA Tesla C1060
GPU. We illustrate the software implementation process on
each platform for a set of widely-used kernels from computational statistics that are simple to reason about; measure
and analyze the performance of each implementation; and
discuss the impact of different architectural design choices
on each implementation.
1 Introduction
Driven by the limits to single-core performance due to
power constraints, multicore processors and accelerator architectures are becoming ubiquitous in every computing
area. These systems are attractive to application developers because of their impressive peak computational potential and (in several cases) their energy-efficient processing
capabilities. However, the architectures themselves are diverse and reflect a wide variety of design trade-offs. Consequently, we might expect the performance of a given application to be an even more sensitive function of the architecture than on previous-generation single-core general-purpose processors, which in turn is expected to affect software development costs significantly. The longer-term goal
of our research effort is to better understand the impact of
the architectural design trade-offs on software.
The research literature on software optimization for
these multicore systems is growing rapidly, particularly
for platforms based on the Cell/B.E. [5, 12, 27] and for
GPUs [21, 30]. Ryoo, et al. [25, 26] publish extensively on
generalizing optimization principles for the NVIDIA GPUs
based on the CUDA framework [19]. In addition to the
optimization research for a single platform, several inter-architecture comparisons have been published as well. Williams
et al. [31] compare the performance of emerging multicore
platforms, including the AMD dual-core Opteron processor, the Intel quad-core Harpertown processor, the SUN
Niagara processor, and the Cell/B.E. processor for sparse
matrix-vector multiplication. Also, there are a few studies
that compare the performance of CPUs, GPUs, FPGAs, the
Cell/B.E. processor, and the Cray MTA-2 [16, 7, 5].
The focus of this paper is on evaluating the impact
of fundamental design trade-offs on a particular class of
widely-used but simple-to-analyze software kernels. We
study a range of systems with native double-precision support, described in detail in Section 2, including a two processor (2P) system with the Intel quad-core Harpertown 2.5
GHz E5420 processors, a four processor (4P) system based
on the AMD quad-core Barcelona 2.0 GHz 8350 processors, an IBM QS20 blade with two Cell/B.E. 3.2 GHz processors, an IBM QS22 blade with two PowerXCell 8i 3.2
GHz processors, and a desktop system equipped with one
NVIDIA Tesla C1060 GPU. As the test systems have varying numbers of chips, clock frequencies, prices, and power consumption levels, we focus on architectural design trade-offs and their impacts on different kernels rather than on identifying the best-performing processor.
We evaluate these systems experimentally using three
kernels from computational statistics with differing computational characteristics. First, we create two kernels by
extracting the most computationally intensive part of the covariance/correlation computation from the R statistics package [3]. These kernels are based on Pearson’s method and
Kendall’s method [13], which we hereafter refer to as Kernel1 and Kernel2, respectively. The third kernel (Kernel3) is created by modifying Kernel2 to highlight each system’s capability in processing highly floating-point-intensive computation. Section 3 describes these kernels in more detail.
We also discuss our implementation and software optimization process, to highlight the challenges and complexities of software development for each architecture (Section 4). However, we consider the main contribution of
this paper to be our inter-architectural analysis, not the optimization work. Our experimental results highlight the performance of each system in executing different mixes of instructions for compute-bound and data-intensive cases (Section 5). We consider both single-precision and double-precision performance for floating-point operations and attempt to characterize the resulting performance in terms of
each system’s design choices.
2 Inter-architectural Design Trade-offs
This section describes five multicore systems of interest
in this study, which are summarized in Table 1. In particular, each subsection considers a particular design dimension, and qualitatively summarizes the differences among
the architectures. For additional processor details, we refer
interested readers elsewhere [9, 8, 14].
Among “conventional” general purpose multicore
microprocessors, we consider the Intel quad-core E5420
Harpertown processor and AMD’s quad-core 8350
Barcelona. The Intel Harpertown and AMD Barcelona processors share a similar micro-architecture, but take distinct
approaches to cache hierarchy and memory subsystem design. The Intel Harpertown has 12 MB of L2 cache per chip; each pair of the four cores in a chip shares 6 MB of L2 cache. The AMD Barcelona has a dedicated 512 KB L2 cache per core in addition to a 2 MB L3 cache shared by all four cores in a chip. Also, the AMD Barcelona adopts a NUMA (Non-Uniform Memory Access) architecture, while the Intel Harpertown is based on a UMA (Uniform Memory Access) architecture.
Among the multicore accelerator systems, we consider
two generations of the STI Cell/B.E. processor and an
NVIDIA Tesla C1060 GPU. The Cell/B.E. processor is
a heterogeneous multicore processor with one conventional PowerPC core (“PPE”) and eight specialized single-instruction, multiple-data (SIMD) accelerators (“SPEs”).
The Tesla C1060 is the latest GPU from NVIDIA and delivers 933 Gflop/s in single-precision with native 78 Gflop/s
double-precision support. The C1060, based on Tesla architecture [14], has 30 SMs (Streaming Multiprocessors),
and each SM has 8 SPs (Streaming Processors) for single-precision and one SP for double-precision support. Also,
each SM has two SFUs (Special Function Units) for transcendental functions and attribute interpolation.
2.1 Requirements for Parallelism
The Intel Harpertown, AMD Barcelona, and Cell/B.E. processors have four to nine cores, and each core supports SIMD instructions for acceleration. To exploit these processors fully, applications must expose coarse-grain parallelism across the multiple cores in addition to SIMD parallelism.
In contrast, the Tesla C1060 has 240 SPs and 60 SFUs
for single-precision. In addition, as detailed in Section 2.4, the Tesla architecture adopts massive hardware multithreading to hide the DRAM access latency. The Tesla
C1060 requires at least several thousand-way parallelism to
exploit its architectural features. Also, the Tesla C1060 requires SIMT (Single Instruction Multiple Threads) parallelism, and every thread in a single warp (a group of 32
threads) needs to agree on the execution path to maximize
chip utilization.
2.2 Computation Units
Each core of the Intel Harpertown or AMD Barcelona processor can retire up to two SIMD floating-point instructions per cycle (one SIMD add and one SIMD multiply) [4, 10]; thus each core can deliver 4 double-precision floating-point operations per cycle. Also, the Intel and AMD cores can execute integer instructions in parallel with floating-point instructions. Each SPE in the Cell/B.E. processor has even and odd pipelines. The even pipeline can execute floating-point instructions, fixed-point arithmetic, and logical and word-granularity shift and rotate instructions. The odd pipeline executes load/store instructions and fixed-point byte-granularity shift, rotate mask, and shuffle instructions. Each SPE can retire one SIMD FMA (fused multiply-add) instruction per cycle. Each SP in the Tesla C1060 executes one scalar instruction per cycle, including a single-precision FMA instruction. Each SFU can also execute four single-precision multiply instructions per cycle. The one SP for double-precision support can retire one double-precision FMA instruction per cycle.
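For concreteness, these per-cycle rates, combined with the chip counts and clock frequencies in Table 1, reproduce the peak figures listed there; for example:

2 chips × 4 cores/chip × 4 DP flops/cycle × 2.5 GHz = 80 DP Gflop/s (2P Harpertown)
2 chips × 8 SPEs/chip × 8 SP flops/cycle (one 4-wide SIMD FMA) × 3.2 GHz = 409.6 SP Gflop/s (QS20 and QS22)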
The sustainable flop rate is highly affected by the mix of different instruction types in an execution stream and by the structure of the computation units. The Intel and AMD cores have separate multiply and add units instead of an FMA unit and can also run integer instructions in parallel; thus, they can more flexibly execute different combinations of integer and floating-point instructions. Still, to fully utilize both the floating-point multiply unit and the add unit, these chips require a 1:1 ratio of multiply and add instructions. One SPE of the Cell/B.E. can issue only one floating-point instruction in a single cycle (whether it is an FMA or not), so if a multiply cannot be fused with an add, the achievable peak flop rate is halved. Still, the Cell/B.E. can run several types of
fixed point instructions in parallel with floating-point operations. Fully exploiting the SMs in the Tesla C1060 requires FMA instructions and additional multiply instructions for the SFUs. Also, one SP can execute only one instruction per cycle.

|  | 2P Harpertown E5420 | 4P Barcelona 8350 | QS20-Cell/B.E. | QS22-PowerXCell 8i | Tesla C1060 |
| Clock | 2.5 GHz | 2.0 GHz | 3.2 GHz | 3.2 GHz | 1.296 GHz |
| Num. chips | 2 | 4 | 2 | 2 | 1 |
| Num. cores / chip | 4 | 4 | 1 PPE + 8 SPEs | 1 PPE + 8 SPEs | 30 SMs × (8 single-precision SPs + 1 double-precision SP + 2 SFUs) |
| DP Gflop/s | 80 | 128 | 29.2 | 204.8 | 78 |
| SP Gflop/s | 160 | 256 | 409.6 | 409.6 | 933 |
| On-chip memory | 6 + 6 MB L2 cache per chip | 512 KB L2 cache per core and 2 MB shared cache per chip | 256 KB local store per SPE | 256 KB local store per SPE | 16 KB shared memory + 8 KB constant cache + 16 KB texture cache per SM |
| DRAM type | DDR2 | DDR2 | Rambus XDR | DDR2 | GDDR3 |
| Shared DRAM access | UMA | NUMA | NUMA | NUMA | N/A |
| Latency hiding | cache + prefetching | cache + prefetching | double (or triple) buffering | double (or triple) buffering | hardware multithreading |
| Theoretical peak bandwidth (GB/s) | 21.4 | 42.8 | 51.2 | 51.2 | 102 |
| Measured bandwidth (GB/s) | 8.1 (baseline STREAM) | 9.9 (baseline STREAM) | N/A | N/A | 73.7 (NVIDIA SDK bandwidthTest) |
| Power (W) | 2 × 80 (per chip) | 4 × 75 (per chip) | 315 (per blade) | 250 (per blade) | 160 (typical, per board), 200 (max., per board) |
| Compiler | Intel icc (10.1.018) | Intel icc (10.1.015) | IBM xlc (10.1) | IBM xlc (10.1) | NVIDIA nvcc (release 2.0 V0.2.1221) |
| Optimization flag | -fast | -fast | -O5 -qarch=cellspu -qtune=cellspu | -O5 -qarch=edp -qtune=edp | -O3 -arch sm_13 |

Table 1: Summary of the test systems.
2.3 Start-up Overhead
Thread creation or kernel launching is faster on the Intel and AMD processors than on the Cell/B.E. or the Tesla C1060. Thus, for a small amount of computation, these general-purpose processors outperform the accelerators, while the accelerator architectures often exhibit impressive performance for larger data [16].
The test systems also incur different levels of data off-loading overhead. In the Harpertown, Barcelona, and Cell/B.E. processors, the off-loading overhead is largely determined by the off-chip memory bandwidth. In contrast, GPU systems incur an additional data transfer from host memory to on-board device memory over a slower PCI Express bus.
While GPUs can partially hide the off-loading overhead
with asynchronous data transfer (i.e., double-buffering), this
mechanism currently works only for page-locked memory
and incurs additional programming overhead [20]. To amortize the off-loading overhead, GPUs require higher computational intensity than other processors [6, 28, 16]. However, the Tesla C1060’s on-board memory is much larger (4
GB) than the Harpertown or Barcelona’s cache memory (12
or 2 MB) or the Cell/B.E.’s local store (256 KB per SPE ×
8 SPEs). Accordingly, the Tesla C1060 can fit larger data
into its on-board memory to minimize the data transfer over
the PCI Express bus.
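To make the off-loading mechanism concrete, the following is a minimal sketch (not taken from our kernels; the chunk count, the helper name, and the commented-out kernel launch are illustrative assumptions) of overlapping host-to-device transfers with page-locked memory and a CUDA stream, as described above:

#include <cuda_runtime.h>
#include <string.h>

void offload_in_chunks(const float *src, size_t n_floats, float *d_buf)
{
    const int NCHUNKS = 4;               /* illustrative chunk count            */
    size_t chunk = n_floats / NCHUNKS;   /* assumes n_floats % NCHUNKS == 0     */
    float *pinned;
    cudaStream_t stream;

    /* Page-locked (pinned) host memory is required for asynchronous copies. */
    cudaMallocHost((void **)&pinned, n_floats * sizeof(float));
    memcpy(pinned, src, n_floats * sizeof(float));
    cudaStreamCreate(&stream);

    for (int c = 0; c < NCHUNKS; c++) {
        /* Each chunk copy is queued on the stream and can overlap with
           computation already enqueued on earlier chunks (or with host work). */
        cudaMemcpyAsync(d_buf + c * chunk, pinned + c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, stream);
        /* process_chunk<<<grid, block, 0, stream>>>(d_buf + c * chunk, chunk);
           -- hypothetical kernel launch consuming the chunk just transferred */
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
}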
2.4 Memory Latency Hiding
The Intel Harpertown and AMD Barcelona processors
hide memory latency via cache memory and prefetching
mechanisms. The Cell/B.E. overlaps computation with
communication via double- or triple-buffering. Double
buffering efficiently hides the latency but requires explicit
software intervention. The Tesla C1060 tolerates the several-hundred-cycle DRAM access latency via massive hardware multithreading. The Tesla C1060 also has per-SM shared memory (16 KB), constant cache (8 KB), and texture cache (16 KB). Yet, as each SM can run hundreds of threads in parallel, these on-chip memories have minimal performance impact if there is only a low degree of data sharing among different threads.
2.5 Control over On-chip Memory
For the cache-based multicore processors, cache memory is managed by hardware using an LRU (Least Recently Used) policy (or one of its variants), and programmers have essentially no control over cache partitioning.[1] By contrast,
programmers can explicitly manage on-chip (“local store”)
memory on the Cell/B.E. The Tesla C1060, along with the
NVIDIA CUDA framework, also allows programmers to
control the placement of data arrays to the chip’s different
types of memories.
2.6 Main Memory Access Mechanisms and Bandwidth Utilization

Figure 1: The 2P Intel Harpertown system with UMA architecture (left) and the 4P AMD Barcelona system with NUMA architecture (right).
The AMD Barcelona and Cell/B.E. processors use a
NUMA (Non-Uniform Memory Access) architecture, while
the Intel Harpertown processor adopts a UMA (Uniform
Memory Access) architecture. UMA is conceptually simpler but NUMA has scalability advantages if cores running
on different chips access distinct data arrays. In particular, by placing data in a chip's local main memory, we
can minimize the contention and interference in the main
memory interface. For instance, if a computation accesses
read-only data multiple times, we can replicate the data to
each processor’s local DRAM to maximize the bandwidth
for accessing the data.[2] The Tesla C1060’s device memory
does not support shared memory access over two or more
GPUs.
[1] Intel also announced that their upcoming Larrabee processor will better support streaming access by providing a mechanism to mark streaming cache lines for early eviction [29].
[2] Intel has announced a shift to a NUMA approach starting with the future Nehalem processor.
The systems also differ in their deliverable memory
bandwidth [22]. In terms of peak aggregate bandwidth,
the 2P Harpertown system can deliver 21.4 GB/s, the 4P
Barcelona system supports 42.8 GB/s, the QS20 and QS22
blades support 51.2 GB/s for main memory access. The
Tesla C1060 supports 102 GB/s peak bandwidth to its 4
GB device memory. However, there is often a significant
gap between the peak bandwidth and the sustainable bandwidth [11, 18]. The gap is even larger for the multicore
processors due to the interference among multiple threads
performing data accesses [17, 23, 24].
The systems adopt different memory controller architectures. In general, most memory controllers are designed to
deliver the highest bandwidth when accessing a large contiguous chunk of data, in particular by exploiting maximum
locality of a row buffer and bank level parallelism. Switching between DRAM read and write should also be minimized to achieve the highest bandwidth utilization. However, even for simple computations in which each thread
is reading a linear array with stride one, memory requests
coming from multiple cores can be intermixed, thereby destroying the locality and parallelism of a DRAM chip. For
the Intel Harpertown and AMD Barcelona processors, the
granularity of memory access is the lowest-level cache line size (64 bytes). For data-intensive applications, 64-byte memory access requests from multiple cores
can be heavily interleaved. The situation is even worse for
the Intel Harpertown processor with its UMA configuration,
as the memory controller hub must mix memory access requests coming from two different chips. For NVIDIA
GPUs, the access granularity is 32, 64, or 128 bytes [20].
By contrast, the Cell/B.E. adopts a different memory
subsystem. First, each SPE generates DMA requests with a
significantly larger size of up to 16 KB. Even for the data-intensive applications, each SPE issues DMA requests in
an intermittent fashion. This minimizes the inter-core interference in DRAM accesses. Therefore, programmers can
maximize the bandwidth utilization by increasing the size
of DMA accesses [12, 27]. Thus, the Cell/B.E. architecture
fundamentally lends itself to higher bandwidth utilization
than other systems, in spite of the significant effort toward
increasing bandwidth utilization in general-purpose multicore processors [17, 15, 11]. The AMD Barcelona processor is also claimed to adopt scheduling algorithms optimized especially for interleaved DRAM access streams [4]. Williams et al. also demonstrate the first-generation Cell/B.E.’s high bandwidth utilization [31], though an open question is what the ultimate impact of adopting different DRAM technologies will be (the first- and second-generation Cell/B.E. architectures adopt different DRAM technologies, namely XDR and DDR2, respectively). Finally, the Cell/B.E. has an additional advantage in optimizing its memory controller, as this processor targets streaming applications that are highly latency tolerant. The Cell/B.E.’s memory controller can focus on bandwidth utilization, while general-purpose multicore processors attempt
to address the significantly more difficult problem of balancing bandwidth utilization with fairness and latency issues [18, 17, 23].
2.7 Ideal Software Implementations
To optimize code for the Intel Harpertown, AMD Barcelona, and Cell/B.E. processors, one first needs to identify coarse-grain parallelism and partition data to exploit all the cores. One then needs to consider data layout, for higher data-transfer and vectorization efficiency, and vectorization to exploit the SIMD units. The Harpertown and Barcelona processors are significantly less sensitive to data alignment than the Cell/B.E., since they support additional instructions for unaligned data accesses; however, data layout still affects performance to a non-negligible degree. At a high level, optimizing for the Cell/B.E. does not differ much from optimizing for the Harpertown and Barcelona processors, but the actual implementation is significantly more complex, as programmers need to explicitly program data transfers within the local store size limit of 256 KB. In addition, the gap between the performance of baseline and optimized code is significantly larger for the Cell/B.E., which often necessitates manual optimization.
The optimization process for the Tesla C1060 is largely different from that for the above three processors. For the Tesla
C1060, easily identifiable coarse-grain parallelism does not
suffice to fully exploit the chip. Thus, the optimization
should focus on extracting additional levels of parallelism.
To benefit from the high-bandwidth and low-latency on-chip memories, programmers need to modify an algorithm to maximize data sharing among multiple cores. Data coalescing and broadcasting mechanisms are also crucial for high performance, and they also need to be considered during algorithm design. For the Tesla C1060 or other CUDA-enabled GPUs, the key challenge arises from high-level algorithm design, and the actual implementation is less complex in terms of code size.
For the NUMA-based systems, one can gain significant
speedups for bandwidth-intensive algorithms by controlling
thread binding and data allocation. The optimization result for the Cell/B.E. is often more predictable than for the x86-based architectures or the Tesla C1060, owing to its simpler architecture. For the x86-based architectures, the multi-level memory hierarchy, with different latencies, sizes, and associativities at each level, and the complex, adaptive prefetching mechanisms across the memory hierarchy significantly
complicate the performance analysis. The Tesla C1060 optimization is complicated by its large search space as well,
which is non-linear in nature [26].
3 Kernel Descriptions and Qualitative Analysis
For our evaluation, we consider two versions of covariance computation based on Pearson’s method and Kendall’s
method, as implemented in the open-source R statistics
package [3]. We also create a third kernel by modifying the second kernel, which is based on Kendall’s method. Given two test data sets, represented by an nX × n matrix X, an nY × n matrix Y, and pre-computed mean vectors x̄ and ȳ of length nX and nY, respectively, the basic covariance computation (based on Pearson’s method) produces an nX × nY matrix C such that

$$C_{ij} \leftarrow \frac{1}{n-1} \sum_{k=1}^{n} (X_{ik} - \bar{x}_i) \cdot (Y_{jk} - \bar{y}_j)$$
In this section, we describe the three kernels we consider,
and explain their high-level characteristics.
3.1 Conventional Sequential Code
// p_x:   a pointer for X
// p_y:   a pointer for Y
// p_xm:  a pointer for x̄
// p_ym:  a pointer for ȳ
// p_ans: a pointer for C
for (i = 0; i < nX; i++) {
    p_xx = &p_x[i * n];
    xxm = p_xm[i];
    for (j = 0; j < nY; j++) {
        p_yy = &p_y[j * n];
        yym = p_ym[j];
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += (p_xx[k] - xxm) * (p_yy[k] - yym);
        p_ans[i * nY + j] = sum / (n - 1);
    }
}

Code 1: C code for Kernel1
We show the C implementation of the basic covariance
kernel based on Pearson’s method in Code 1. We refer to
this code as “Kernel1.” “Kernel2” computes covariance using Kendall’s method and “Kernel3” is artificially created
by modifying Kernel2. Kernel2 and Kernel3 are shown in
Code 2. The Kernel1 and Kernel2 codes are adapted from the R project source code [3].
In Kernel1, we can first subtract the mean vector from each matrix operand to remove the redundant subtractions. Then, transposing matrix Y converts this algorithm into a dense matrix multiplication problem, which is extensively studied and for which highly optimized BLAS libraries exist. Kernel2 and Kernel3 have more complex data access patterns but can still be optimized with a cache-blocking approach. Initially, we intentionally ignore these particular optimization opportunities for two reasons. First, we wish to stress the memory systems experimentally, and secondly, we wish to show the more typical and intuitive optimization process that will be common in practice. Then, if memory bandwidth turns out to be a performance bottleneck, we implement the blocking approach. Our focus is on highlighting the impact of architectural design trade-offs on performance and programmability. In particular, we do not intend to conclude which system is the “best” for computing covariance, nor do we claim to have implemented the best possible covariance code.
// p_x:   a pointer for X
// p_y:   a pointer for Y
// p_ans: a pointer for C
for (i = 0; i < nX; i++) {
    p_xx = &p_x[i * n];
    for (j = 0; j < nY; j++) {
        p_yy = &p_y[j * n];
        sum = 0.0;
        for (k = 0; k < n; k++) {
            for (n1 = 0; n1 < n; n1++) {
#if SIGN /* Kernel2 */
                sum += sign((p_xx[k] - p_xx[n1])
                            * (p_yy[k] - p_yy[n1]));
#else    /* Kernel3 */
                sum += (p_xx[k] - p_xx[n1])
                       * (p_yy[k] - p_yy[n1]);
#endif
            }
        }
        p_ans[i * nY + j] = sum;
    }
}

Code 2: C code for Kernel2 and Kernel3
3.2 Basic Algorithmic Analysis
The memory footprint of all three kernels is O((nX + nY) × n), and the two input matrices are typically much larger than the mean vectors or the output matrix. The computational complexity is O(nX × nY × n) for Kernel1 and O(nX × nY × n²) for Kernel2 and Kernel3. While these kernels are compute-intensive in their asymptotic forms, if the entire memory footprint does not fit into the on-chip memory of the test systems, then these kernels can be bandwidth-bound. All three kernels have obvious nX × nY-way parallelism, as every pair of rows from matrices X and Y can be processed independently. Also, if we ignore floating-point associativity issues, we can trivially parallelize the innermost loop of Kernel1 and the second innermost loop of Kernel2 and Kernel3. For Kernel1, if we execute the code sequentially, there is higher temporal locality in the row data of matrix X than in the row data of matrix Y. For Kernel2 and Kernel3, if we can place two rows from matrices X and Y in on-chip memory, we can perform O(n²) computation over O(n) data without off-chip memory accesses. The total number of flops executed by Kernel1 is (nX + nY) × n + nX × nY × n × 2, and nX × nY × n² × 4 for Kernel2 and Kernel3.

4 Baseline Architecture-specific Implementations

For subsequent evaluation, we create a basic parallel implementation for each architecture, described in this section. These implementations include “baseline” parallelization and tuning, meaning they include some degree of platform-specific tuning but are not extensively tuned. Again, as Section 3.1 states, our focus is on system evaluation and not on kernel optimization.
4.1 Intel Harpertown (2P) and AMD Barcelona (4P) Multicore Implementations
We can easily parallelize the outermost loop of all three kernels with OpenMP or pthreads for our 8- and 16-core systems, assuming a sufficiently large nX (and nY). For Kernel1, we apply auto-vectorization with two directives, #pragma unroll(16) and #pragma vector aligned, achieving performance comparable to an intrinsics-based vectorization approach.
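As a concrete reference, a minimal sketch of this outer-loop parallelization (not our exact code; the function name and OpenMP scheduling clause are illustrative, and #pragma vector aligned assumes 16-byte-aligned, padded rows) looks as follows:

#include <omp.h>

void kernel1_omp(const double *p_x, const double *p_y,
                 const double *p_xm, const double *p_ym,
                 double *p_ans, int nX, int nY, int n)
{
    /* Coarse-grain parallelism: rows of X are distributed across threads. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nX; i++) {
        const double *p_xx = &p_x[i * n];
        double xxm = p_xm[i];
        for (int j = 0; j < nY; j++) {
            const double *p_yy = &p_y[j * n];
            double yym = p_ym[j];
            double sum = 0.0;
            /* Auto-vectorization hints (icc). */
            #pragma unroll(16)
            #pragma vector aligned
            for (int k = 0; k < n; k++)
                sum += (p_xx[k] - xxm) * (p_yy[k] - yym);
            p_ans[i * nY + j] = sum / (n - 1);
        }
    }
}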
For Kernel2, the sign() function involves branches, lowering the performance significantly. We replace the branch with SSE compares (e.g., _mm_cmpgt_pd() and _mm_cmplt_pd()) and bitwise operations (e.g., _mm_and_pd() and _mm_or_pd()). The Intel icc compiler fails to perform this replacement automatically, so we hand-code this translation using SIMD intrinsics. The Kernel3 code can be trivially vectorized in the same way.
For the 4P Barcelona system, which is NUMA-based, we replicate the input matrices to each of the four chips' local DRAM and pin the threads to the cores. The replication cost can be amortized over the multiple reads, and this optimization maximizes the available bandwidth while minimizing the interference.
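A minimal sketch of this replication and binding, assuming libnuma is available (the helper names are ours, not part of our kernels), is:

#include <numa.h>      /* link with -lnuma */
#include <stdlib.h>
#include <string.h>

/* Replicate a read-only matrix into each NUMA node's local DRAM. */
double **replicate_matrix(const double *src, size_t bytes, int num_nodes)
{
    double **copies = malloc(num_nodes * sizeof(double *));
    for (int node = 0; node < num_nodes; node++) {
        copies[node] = numa_alloc_onnode(bytes, node);  /* node-local DRAM */
        memcpy(copies[node], src, bytes);
    }
    return copies;
}

/* Each worker thread binds itself to one node and reads only its replica. */
void bind_worker_to_node(int node)
{
    numa_run_on_node(node);
    /* ... the worker then indexes copies[node] for all its matrix reads ... */
}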
Even though our test kernels are asymptotically compute-intensive, if the input matrices do not fit into the on-chip cache memory, these algorithms can be bandwidth-bound. A blocking approach can reduce the amount of off-chip data transfer at the cost of increased implementation complexity. Also, for the Harpertown and Barcelona processors, selecting the optimal block size requires an exhaustive search over the parameter space, as it is a complex function of the multiple levels of the cache hierarchy and their sizes and associativities. This exhaustive search is beyond the scope of our work, and we select the block size based on heuristics.
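For reference, the blocking idea for Kernel1 (after mean subtraction) can be sketched as follows; the block sizes BI, BJ, and BK are hypothetical placeholders rather than the heuristically chosen values discussed above:

#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Cache-blocked Kernel1: partial dot products are accumulated block by
   block so that a (BI x BK) block of X and a (BJ x BK) block of Y stay
   resident in cache while they are reused. */
void kernel1_blocked(const double *p_x, const double *p_y, double *p_ans,
                     size_t nX, size_t nY, size_t n)
{
    const size_t BI = 64, BJ = 64, BK = 256;   /* hypothetical block sizes */

    for (size_t i = 0; i < nX * nY; i++)
        p_ans[i] = 0.0;

    for (size_t ii = 0; ii < nX; ii += BI)
        for (size_t jj = 0; jj < nY; jj += BJ)
            for (size_t kk = 0; kk < n; kk += BK)
                for (size_t i = ii; i < MIN(ii + BI, nX); i++)
                    for (size_t j = jj; j < MIN(jj + BJ, nY); j++) {
                        double sum = 0.0;
                        for (size_t k = kk; k < MIN(kk + BK, n); k++)
                            sum += p_x[i * n + k] * p_y[j * n + k];
                        p_ans[i * nY + j] += sum;
                    }

    for (size_t i = 0; i < nX * nY; i++)        /* final Pearson scaling */
        p_ans[i] /= (double)(n - 1);
}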
4.2 STI Cell/B.E. (2P) Implementation
The Cell/B.E. implementation resembles the 4P AMD Barcelona implementation, though it provides an additional opportunity for fine-tuning owing to the higher level of control over on-chip memory supported by the architecture. In particular, observe that a row of matrix X has, assuming the given loop order, higher temporal locality than a row of matrix Y. Thus, we can assign a larger buffer to matrix X than to Y. Furthermore, to reduce the bandwidth requirement even when a single row does not fit into the local store, we allocate additional small buffers for streaming. In this case, we read data from the larger buffers for matrices X and Y when accessing the initial part of a row (which fits into the local store), and then switch to streaming mode with the smaller buffers for the remaining part.
However, fine-grained control over on-chip memory significantly increases the coding complexity, especially when the on-chip memory requirement varies as a function of the input data size. A blocking approach, even though it adds complexity at a high level, fixes the on-chip memory requirement regardless of the input data size. Accordingly, a blocking approach can reduce the coding complexity for the Cell/B.E. in addition to improving performance. For the Cell/B.E., the impact of different block sizes is easier to understand owing to its simple memory subsystem. A larger block height reduces the amount of traffic, whereas a larger block width increases the iteration count of the innermost loop to improve compute efficiency. We can also simply pick the largest block size that fits into the local store instead of considering the different cache sizes in a memory hierarchy.
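As a concrete reference, the double-buffered streaming on an SPE can be sketched as follows (illustrative only: the chunk size, buffer layout, and helper names are assumptions, and n_elems is assumed to be a multiple of CHUNK):

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096   /* floats per DMA chunk: 16 KB, the maximum DMA size */

static float buf[2][CHUNK] __attribute__((aligned(128)));

/* Stream one row starting at effective address row_ea through the local
   store, overlapping each DMA with computation on the previous chunk.   */
void stream_row(uint64_t row_ea, unsigned int n_elems,
                void (*compute)(const float *, unsigned int))
{
    unsigned int cur = 0, tag = 0;

    /* Prime the pipeline with the first chunk. */
    mfc_get(buf[cur], row_ea, CHUNK * sizeof(float), tag, 0, 0);

    for (unsigned int off = CHUNK; off < n_elems; off += CHUNK) {
        /* Start fetching the next chunk into the other buffer. */
        mfc_get(buf[1 - cur], row_ea + off * sizeof(float),
                CHUNK * sizeof(float), 1 - tag, 0, 0);

        /* Wait only for the chunk we are about to consume. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        compute(buf[cur], CHUNK);      /* overlaps with the pending DMA */

        cur = 1 - cur;
        tag = 1 - tag;
    }

    /* Drain the final chunk. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    compute(buf[cur], CHUNK);
}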
4.3 NVIDIA Tesla C1060 Implementation

For the NVIDIA Tesla C1060, nX × nY-way parallelism may not be sufficient for practical data set sizes. Even when nX × nY is very large, having every thread process a distinct pair of rows can lead to poor bandwidth utilization (no coalescing in data transfer) or low on-chip cache utilization (no data sharing). For Kernel1, we partition the innermost loop into chunks of 16 elements (a half warp, as high memory bandwidth utilization is achieved when the memory accesses from a half warp can be coalesced [20]). Each thread in a half warp processes one of the 16 elements in a chunk to maximize coalescing. For Kernel2 and Kernel3, we partition the second innermost loop identically to the Kernel1 case. In this case, every thread in a half warp traverses the same row data in a synchronized way (in the innermost loop of Code 2, array index k remains constant and only array index n1 changes; in our optimized code, every thread in the same half warp accesses p_xx and p_yy with the same n1 but different k), and we can use the on-chip shared memory to exploit this fact. As each thread accesses the same data element, we can use the shared memory's broadcasting mechanism as well.
One critical issue is the sign() function, which involves branch instructions. The NVIDIA CUDA compiler replaces branch instructions with predicates when the number of instructions controlled by the branch is at or below a threshold value (4 or 7 instructions) [20]. Therefore, when using the CUDA framework, we do not need to manually optimize the sign() function, as we did on the Intel Harpertown, AMD Barcelona, and Cell/B.E. platforms. Optimization for the Tesla C1060 is more involved at a high level, but simpler to program than the Cell/B.E. for these kernels. Also, the proper use of on-chip memory significantly reduces the bandwidth requirement, and Kernel2 and Kernel3 become compute-bound even without explicit blocking.
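To make this strategy concrete, the following is a minimal CUDA sketch (not our production kernel; the block geometry, the final per-block reduction, and the assumption that n is a multiple of 16 are simplifications) of the coalesced chunk loads and the shared-memory broadcast for Kernel2:

#define HALF_WARP 16   /* chunk width used in the text */

__global__ void kernel2_sketch(const float *X, const float *Y, float *C,
                               int nX, int nY, int n)
{
    /* Staging buffers for the n1 chunk and for the final reduction. */
    __shared__ float xn1_s[HALF_WARP], yn1_s[HALF_WARP], partial[HALF_WARP];

    int i = blockIdx.y;      /* row of X; gridDim = (nY, nX)           */
    int j = blockIdx.x;      /* row of Y                               */
    int t = threadIdx.x;     /* 0..HALF_WARP-1; blockDim.x = HALF_WARP */
    float sum = 0.0f;

    for (int k0 = 0; k0 < n; k0 += HALF_WARP) {
        /* Coalesced loads: thread t keeps element k = k0 + t in registers. */
        float xk = X[i * n + k0 + t];
        float yk = Y[j * n + k0 + t];

        for (int m0 = 0; m0 < n; m0 += HALF_WARP) {
            /* Cooperative, coalesced staging of the next n1 chunk. */
            xn1_s[t] = X[i * n + m0 + t];
            yn1_s[t] = Y[j * n + m0 + t];
            __syncthreads();

            for (int m = 0; m < HALF_WARP; m++) {
                /* All threads read the same element: shared-memory broadcast. */
                float p = (xk - xn1_s[m]) * (yk - yn1_s[m]);
                sum += (p > 0.0f) - (p < 0.0f);   /* sign(); left to predication */
            }
            __syncthreads();
        }
    }

    /* Reduce the per-thread partial sums into C[i][j]. */
    partial[t] = sum;
    __syncthreads();
    if (t == 0) {
        float total = 0.0f;
        for (int m = 0; m < HALF_WARP; m++)
            total += partial[m];
        C[i * nY + j] = total;
    }
}

/* Launch: kernel2_sketch<<<dim3(nY, nX), HALF_WARP>>>(dX, dY, dC, nX, nY, n); */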
4.4 Quantitative Comparison of Implementation Costs
|  | kernel code size (# of lines) | approximate coding time |
| Harpertown and Barcelona - initial | 335 | 1 day |
| Harpertown and Barcelona - blocking | 419 | 2 days |
| Cell/B.E. - initial | 1620 | 7 days |
| Cell/B.E. - blocking | 1004 | 2 days |
| Tesla C1060 - initial | 52 (Kernel1) + 97 (Kernel2/3) | 2 days |
| Tesla C1060 - blocking | 88 (Kernel1) | 1 day |

Table 2: Quantitative comparison of implementation costs in terms of code size and implementation time. This excludes the code for the kernel invocation and the residual part computation.
Table 2 summarizes the comparison. For the Cell/B.E., a blocking approach fixes the local store space requirement regardless of the input data size, and simplifies the coding in addition to improving performance. We can also see that the code size for the Tesla C1060 is significantly smaller than for the other architectures. For the Tesla C1060, the challenge is in extracting additional parallelism and best exploiting the memory subsystem (based on data access coalescing and broadcasting mechanisms and efficient use of the shared memory).
5 Experimental Evaluation: Measurements and Analysis
Recall the evaluation platforms from Table 1. To measure the sustained bandwidth of the 2P Harpertown system, we use PAPI [2] and count the number of memory bus
transactions (event code 0x40001045 in PAPI). For the 4P
Barcelona system, we use AMD CodeAnalyst [1] and count
the number of DRAM accesses (event select code 0xE0 for
the AMD Barcelona’s event select register). For the systems with the Cell/B.E. processors, we attach counter variables to every DMA memory request and ignore the PPE-initiated traffic. For the Tesla C1060, we estimate the total bandwidth requirement using the following expressions: nX × nY × n × sizeof(float or double) × 2 for Kernel1, and nX × nY × n × n × sizeof(float or double) × 2 × (1/16) for Kernel2 and Kernel3 (the factor of 1/16 is the half-warp width, owing to data sharing). For the NUMA-based AMD Barcelona and Cell/B.E. architectures, we replicate matrices X and Y, which are read multiple times, for higher bandwidth utilization. This replication cost is included in the timing, and the off-loading overhead to the device memory of the Tesla C1060 is also included.
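For reference, the PAPI-based measurement on the 2P Harpertown can be sketched as follows (a simplified harness, assuming 64 bytes transferred per counted bus transaction; error checking is omitted):

#include <papi.h>

/* Count memory bus transactions around a kernel run and convert to GB/s. */
double measure_bandwidth(void (*run_kernel)(void))
{
    int evset = PAPI_NULL;
    long long count;
    const int BUS_TRANS = 0x40001045;    /* native event code used above */

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, BUS_TRANS);

    long long t0 = PAPI_get_real_usec();
    PAPI_start(evset);
    run_kernel();
    PAPI_stop(evset, &count);
    long long t1 = PAPI_get_real_usec();

    double seconds = (t1 - t0) * 1.0e-6;
    return (count * 64.0) / seconds / 1.0e9;   /* sustained GB/s */
}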
5.1 Kernel1
The top half of Figure 2 depicts the sustained Gflop/s and bandwidth for Kernel1 in single-precision with the initial implementation. Although the algorithm is computationally intensive, the performance is bounded by memory bandwidth since the entire data set does not fit into the on-chip memory. The Tesla C1060 benefits from its high bandwidth to on-board DRAM, but the sustained bandwidth is lower than the theoretical peak and varies significantly for different input matrix sizes. The off-loading overhead accounts for 18% (for the smallest matrix) to 2.4% (for the largest matrix) of the total execution time. The QS20 and QS22 blades achieve the highest bandwidth utilization on average across the different values of n, owing to their DMA-based data transfer mechanism with a large chunk size. The QS20 blade (with Rambus XDR) achieves higher bandwidth utilization than the QS22 blade (with DDR2). The 4P Barcelona system achieves significantly higher sustained bandwidth and bandwidth utilization than the 2P Harpertown system. This exemplifies the scalability benefit of a NUMA architecture. The AMD Barcelona has memory access scheduling algorithms optimized for interleaved streaming accesses, and this also contributes to its higher bandwidth utilization. However, the 2P Harpertown system delivers a higher flop rate per unit of bandwidth consumed, owing to its large shared cache memory.
Figure 2: Sustained Gflop/s (left) and bandwidth utilization (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel1 (single-precision). Missing points for the QS20 are due to memory allocation failure. Here, nX = nY = 1024.

Figure 3: Sustained Gflop/s (left) and bandwidth (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel1 (double-precision). Missing points for the QS20 are due to memory allocation failure. Here, nX = nY = 1024.

The blocked implementations yield significantly better results than the initial implementation, as shown in the bottom half of Figure 2, but still deliver significantly lower performance than the theoretical peak. (We later found a demand-paging-related performance issue with the NUMA-based implementations and updated the performance results of the AMD Barcelona and the QS20 accordingly: a 60%-80% and a 5%-20% speedup for the Barcelona and the QS20, respectively, in the case of Kernel1 with blocking; the speedup is insignificant in other cases. The QS22 blade is not accessible at this time, and for the QS22 we expect a speedup comparable to the QS20's.) In the case of the x86 architectures, there are a number of possible explanations. First, we need to tune the blocking with respect to the different sizes and associativities at all levels of the cache hierarchy to achieve higher performance. This task would be daunting even for skilled programmers. Secondly, the blocked implementation may interact, and may even interfere, with the various hardware mechanisms in unintentionally negative ways. For example, the blocked version has a more complex memory access pattern, which may reduce the effectiveness of the hardware prefetchers. Thirdly, the behavior of the memory system mechanisms is complex and challenging to reason about. For instance, on the
Harpertown, data in DRAM is first read into the lower level
(L2) cache, whereas it is read into the highest level (L1)
cache first and moved to the non-inclusive L2 and L3 caches
(when the cache line is evicted) on the Barcelona. All these
differences can affect the performance in non-intuitive ways and pose challenges for a programmer who wishes to extract the highest achievable flop rate from the chip.
In contrast, for the Cell/B.E.-based systems, it is significantly easier to understand the data-transfer-related performance issues owing to the simpler architecture. Still, to achieve the highest flop rate, programmers need to consider their code at the assembly level. Each iteration of the innermost loop in Kernel1 requires two vector loads, one vector store, two address increments, and one vector FMA (fused multiply-add) instruction. As a result, the fixed-point instructions and load/store instructions can become a performance bottleneck. Extensive low-level tuning is required to balance the even and odd pipelines and minimize the address calculation overhead.
For the Tesla C1060, the delivered performance is lower than 10% of its theoretical peak even in the best case. The off-loading overhead (11%-70% of the total execution time, which increases as n decreases), integer instructions for address increments, and the lack of additional multiply instructions to feed the SFUs lower the deliverable performance. Also, to load data into the block array in the shared memory, the Tesla C1060 needs to calculate an address and issue a load instruction for every single real number, even though the DRAM access latency can in principle be efficiently hidden with the hardware-multithreading mechanism; thus the device-memory-to-shared-memory traffic cannot be perfectly overlapped with the computation, as it can be on the Cell/B.E. The Tesla C1060 architecture is less transparent than the Cell/B.E.'s, and so the performance impact of the tunable parameters (e.g., thread block size, block width and height in a blocking approach, and loop unrolling factors) is more difficult to predict, thereby requiring explicit search and tuning.
Figure 3 shows the results for double-precision. Interestingly, we can see that the sustained bandwidth for the Tesla C1060 is higher than in the single-precision case, largely due to the increased granularity of data transfer from 64 bytes (16 threads in a half warp × 4-byte floating-point numbers) to 128 bytes (16 × 8-byte floating-point numbers). In the case of double-precision, the QS22 blade delivers higher performance than the QS20 blade, and the QS20 blade becomes compute-bound. We can also note that the QS20 and the Tesla C1060 achieve a significantly higher fraction of their theoretical peaks than in the single-precision case. This shows that the QS20's and the Tesla C1060's peak double-precision performance is easier to achieve than their single-precision counterparts.
5.2 Kernel2
Figures 4 and 5 summarize the results for Kernel2 in single- and double-precision, respectively. This kernel is highly compute-intensive, and if a pair of rows can fit into the on-chip memory, Kernel2 will be compute-bound even without blocking. For single-precision, the Tesla C1060 delivers nearly 100 Gflop/s. Still, this number is significantly lower than the advertised peak, for reasons similar to those discussed for Kernel1, in addition to the sign() function. Kernel2's invocation of sign() is replaced with predicated instructions; however, computing the predicates also requires additional cycles. In addition, as each SP has only one execution pipeline and does not have separate integer execution units, performance is more susceptible to the mix of integer instructions than on the other architectures.
Figure 4: Sustained Gflop/s (left) and bandwidth (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel2 (single-precision). Here, nX = nY = 128.

Figure 5: Sustained Gflop/s (left) and bandwidth (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel2 (double-precision). Here, nX = nY = 128.

For the QS20 and QS22 blades, the kernel is compute-bound for small n and bandwidth-bound for large n without
blocking. In contrast, the QS20 and QS22 blades become compute-bound even for large n with blocking. In the bandwidth-limited cases, the QS20 blade achieves nearly full bandwidth utilization while the QS22 blade delivers approximately 80% of its theoretical peak bandwidth. The QS20 blade uses XDR DRAM and the QS22 blade uses DDR2 DRAM, which explains the difference in sustained bandwidth.
For the Tesla C1060, the kernel becomes compute-bound, even without blocking, because of the efficient use of the shared memory. Indeed, blocking does not improve performance, and so we need not consider it further.
5.3 Kernel3

Kernel3 is highly floating-point intensive, and its performance is given in Figures 6 and 7. For single-precision, the QS20 blade, the QS22 blade, and the Tesla C1060 card deliver comparable performance for small n. The QS20 and QS22 blades' performance drops sharply for large n without blocking, but the flop rate remains nearly constant with blocking. Considering that our kernel has more adds and subtracts than multiplies, the QS20 and QS22 blades demonstrate nearly the maximum achievable floating-point performance. Yet, the performance of the Tesla C1060 is only one-quarter of its advertised peak. For double-precision, the Tesla C1060 delivers over half of its peak flop rate, in comparison to one-quarter in the single-precision case, as the double-precision SP can be used solely for floating-point operations while the single-precision SPs can be exploited for integer instructions (e.g., address calculation).

Figure 6: Sustained Gflop/s (left) and bandwidth (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel3 (single-precision). Here, nX = nY = 128.

Figure 7: Sustained Gflop/s (left) and bandwidth (GB/s) (right) for the initial (top) and the blocking-based (bottom) implementations of Kernel3 (double-precision). Here, nX = nY = 128.

6 Conclusions
Based on our detailed comparison of multicore processors and accelerators, we can draw several general conclusions. First, a hardware-multithreading approach lowers
programming complexity by freeing programmers from latency issues at the expense of increased requirements for
parallelism. Second, NUMA has benefits over UMA for
data-intensive algorithms if the underlying system or programmer can properly handle data placement and thread
binding. Third, an explicit DMA-based data transfer mechanism can effectively achieve high bandwidth utilization, at
the cost of higher coding complexity. Fourth, in the case
of the Cell/B.E., blocking can also reduce the programming
complexity, as the programmer is forced to optimize the program within a fixed-size on-chip memory. Fifth, increasing raw floating-point performance alone may not lead to overall system performance improvements in many cases, and the off-chip memory bandwidth is critical even when the algorithm appears to be compute-bound asymptotically. Integer execution units are also important for achieving the peak flop rates. Sixth, the Cell/B.E.'s simple memory subsystem increases the coding complexity required to implement a first correct version of the code with proper data transfer mechanisms. Yet, the subsequent optimization process is easier and more intuitive. The opposite holds for the x86-based architectures. This paper provides quantitative support for these conclusions, presented in a tutorial-
style discussion that we hope can be used to guide future
tuning and architecture selection efforts. Looking forward,
we intend to evaluate different hardware systems with more
complex and irregular kernels, as well as real-world applications.
Acknowledgments
This work was supported in part by an IBM Shared
University Research (SUR) award and NSF Grants CNS0614915 and DBI-0420513. We acknowledge NVIDIA
Corp. for its donation of GPU cards and the professor partnership award, and the Sony-Toshiba-IBM Center of Competence for use of Cell/B.E. resources that have contributed
to this research.
References
[1] AMD CodeAnalyst, 2009. http://developer.amd.
com/cpu/CodeAnalyst.
[2] PAPI, 2009. http://icl.cs.utk.edu/papi.
[3] The R project for statistical computing, 2009. http://
www.r-project.org/.
[4] AMD Corporation. Software Optimization Guide for AMD
Family 10h Processors, 3.06 edition, Apr. 2008.
[5] O. Bockenbach, M. Knaup, and M. Kachelriess. Implementation of a cone-beam backprojection algorithm on the Cell
Broadband Engine processor. In Proc. SPIE Medical Imaging, San Diego, CA, Feb. 2007.
[6] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and
K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. Journal of Parallel and Distributed Computing, 68(10):1370–1380, 2008.
[7] S. Che, J. Li, J. W. Sheaffer, K. Skadron, and J. Lach. Accelerating compute-intensive applications with GPUs and FPGAs. In Proc. 6th IEEE Symp. on Application Specific Processors (SASP), Anaheim, CA, Jun. 2008.
[8] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broadband
Engine Architecture and its first implementation-a performance view. IBM Journal of Research and Developments,
51(5):559–572, Sep. 2007.
[9] P. Gepner, D. L. Fraser, and M. F. Kowalik. Second generation quad-core Intel Xeon processors bring 45 nm technology and a new level of performance to HPC applications.
Lecture Notes in Computer Science, 5101:417–426, 2008.
[10] Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Manual, Nov. 2007.
[11] E. Ipek, O. Mutlu, J. F. Martinez, and R. Caruana. Self-optimizing memory controllers: A reinforcement learning
approach. ACM SIGARCH Computer Architecture News,
36(3):39–50, Jun. 2008.
[12] D. Jimenez-Gonzalez, X. Martorell, and A. Ramirez. Performance analysis of Cell Broadband Engine for high memory bandwidth applications. In Proc. 7th IEEE Int’l Symp.
on Performance Analysis of Systems and Software (ISPASS),
San Jose, CA, Apr. 2007.
[13] M. G. Kendall. A new measure of rank correlation.
Biometrika Trust, 30(1):81–93, Jun. 1938.
[14] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym.
NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, Mar. 2008.
[15] S. A. McKee, W. A. Wulf, J. H. Aylor, R. H. Klenke, M. H.
Salinas, S. I. Hong, and D. A. B. Weikle. Dynamic access
ordering for streamed computations. IEEE Transactions on
Computers, 49(11):1255–1271, Nov. 2000.
[16] J. S. Meredith, S. R. Alam, and J. S. Vetter. Analysis of
a computational biology simulation technique on emerging
processing architectures. In Proc. 6th IEEE Int’l Workshop
on High Performance Computational Biology (HICOMB),
Long Beach, CA, 2007.
[17] O. Mutlu and T. Moscibroda. Enhancing the performance
and fairness of shared DRAM systems with parallelism-aware batch scheduling. In Proc. 35th Ann. Int’l Symp. on
Computer Architecture (ISCA), Beijing, China, Jun. 2008.
[18] C. Natarajan, B. Christenson, and F. Briggs. A study of
performance impact of memory controller features in multiprocessor server environment. In Proc. 3rd Workshop on
Memory Performance Issues (WMPI), Munich, Germany,
Jun. 2004.
[19] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable
parallel programming with CUDA. ACM Queue, 6(2):40–
53, Mar. 2008.
[20] NVIDIA Corporation. NVIDIA CUDA Compute Unified
Device Architecture Programming Guide, 2.0 edition, Jun.
2008.
[21] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris,
J. Kruger, A. E. Lefohn, and T. J. Purcell. A survey of
general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, Mar. 2007.
[22] L. A. Polka, H. Kalyanam, G. Hu, and S. Krishnamoorthy.
Package technology to address the memory bandwidth challenge for tera-scale computing. Intel Technology Journal,
11(3), 2007.
[23] N. Rafique, W. T. Lim, and M. Thottethodi. Effective management of DRAM bandwidth in multicore processors. In
Proc. 16th Int’l Conf. on Parallel Architectures and Compilation Techniques (PACT), Brasov, Romania, Sep. 2007.
[24] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D.
Owens. Memory access scheduling. In Proc. 27th Ann.
Int’l Symp. on Computer Architecture (ISCA), Vancouver,
Canada, Jun. 2000.
[25] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B.
Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using
CUDA. In Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Salt
Lake City, UT, Feb. 2008.
[26] S. Ryoo, C. I. Rodrigues, S. S. Stone, J. A. Stratton, S. Ueng,
S. S. Baghsorkhi, and W. W. Hwu. Program optimization
carving for GPU computing. Journal of Parallel and Distributed Computing, 68(10):1389–1401, 2008.
[27] T. Saidani, S. Piskorski, L. Lacassagne, and S. Bouaziz. Parallelization schemes for memory optimization on the Cell
processor: a case study of image processing algorithm. In
Proc. Workshop on memory performance (MEDEA), Brasov,
Romania, Sep. 2007.
[28] O. Schenk, M. Christen, and H. Burkhart. Algorithmic
performance studies on graphics processing units. Journal
of Parallel and Distributed Computing, 68(10):1360–1369,
2008.
[29] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash,
P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee:
a many-core x86 architecture for visual computing. ACM
Transactions on Graphics, 27(3), Aug. 2008.
[30] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune
dense linear algebra. In Proc. Int’l Conf. on High Performance Computing and Networking (SC), Austin, TX, 2008.
[31] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and
J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proc. Int’l Conf. on
High Performance Computing and Networking (SC), 2007.