Optimizing Sparse Matrix Vector Multiplication on Emerging Multicores

Orhan Kislal, Wei Ding, Mahmut Kandemir
The Pennsylvania State University
University Park, Pennsylvania, USA
omk103, wzd109, kandemir@cse.psu.edu
Ilteris Demirkiran
Embry-Riddle Aeronautical University
San Diego, California, USA
demir4a4@erau.edu
Introduction
Importance of Sparse Matrix-Vector Multiplication (SpMV)
Dominant component in solving eigenvalue problems and large-scale linear systems
Difference from uniform/regular dense matrix computations:
Irregular data access patterns
Compact data structure
Background
SpMV is usually of the form b = Ax + b, where A is a sparse matrix and x and b are dense vectors
x: source vector
b: destination vector
Only x and b can be reused.
One of the most common data structures for A: the Compressed Sparse Row (CSR) format
Background con’t
CSR format
(Figure: (a) the CSR arrays val, col, and ptr for an example matrix; (b) the basic SpMV loop.)
// Basic SpMV implementation,
// b = A*x + b, where A is in CSR format and has m rows
for (i = 0; i < m; ++i) {
  double bi = b[i];
  for (k = ptr[i]; k < ptr[i+1]; ++k)
    bi += val[k] * x[col[k]];
  b[i] = bi;
}
Each row of A is packed one after the other in the dense array val
An integer array col stores the column index of each stored element
ptr keeps track of where each row starts in val and col
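To make the CSR arrays concrete, here is a small self-contained sketch; the 4x4 matrix and the array values are illustrative, not taken from the slides.

#include <stdio.h>

/* Illustrative 4x4 sparse matrix (6 nonzeros) in CSR form:
 *   [5 0 0 1]
 *   [0 2 0 0]
 *   [0 0 3 0]
 *   [4 0 0 6]
 */
int main(void) {
    double val[] = {5, 1, 2, 3, 4, 6};      /* nonzero values, row by row   */
    int    col[] = {0, 3, 1, 2, 0, 3};      /* column index of each nonzero */
    int    ptr[] = {0, 2, 3, 4, 6};         /* start of each row in val/col */
    double x[]   = {1, 1, 1, 1};            /* source vector                */
    double b[]   = {0, 0, 0, 0};            /* destination vector           */
    int m = 4;

    for (int i = 0; i < m; ++i) {
        double bi = b[i];
        for (int k = ptr[i]; k < ptr[i + 1]; ++k)
            bi += val[k] * x[col[k]];
        b[i] = bi;
    }
    for (int i = 0; i < m; ++i)
        printf("b[%d] = %g\n", i, b[i]);    /* expect 6, 2, 3, 10 */
    return 0;
}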
Motivation
Computation mapping and scheduling
(Figure: two mappings, (a) and (b), of computation blocks to cores C0-C5 under a three-level cache hierarchy with per-core L1 caches, shared L2 caches, and a shared L3.)
Mapping assigns the computation that involves one or more rows of A to a core (a computation block)
Scheduling determines the execution order of those computations
How can the on-chip cache hierarchy be taken into account to improve data locality?
Motivation con’t
If two computations share data, it is better to map them to cores that share a cache at some layer (more frequent sharing -> a higher layer). Mapping!
For these two computations, it is also better to have the shared data accessed by the two cores in close temporal proximity. Scheduling!
Motivation con’t
Data Reordering
The source vector x is read-only
Ideally, x can have a customized layout for each row computation r·x, i.e., the data elements of x that correspond to the nonzero elements of row r are placed contiguously in memory (reducing the cache footprint)
However, can we have a smarter scheme?
Framework
(Figure: framework flow — the original SpMV code and a cache hierarchy description are fed through the Mapping, Scheduling, and Data Reordering components to produce the optimized SpMV code.)
Mapping (cache hierarchy-aware)
Scheduling (cache hierarchy-aware)
Data Reordering (seeks the minimal number of layouts for x that keep the cache footprint during computation as small as possible)
Mapping
Only considers the data sharing among the cores
Basic idea: for two computation blocks, higher data sharing means mapping them to a higher level of cache.
We quantify the data sharing between two computation blocks as the sum of the number of nonzero elements at the same column (over those computation blocks); a sketch of this metric is given below.
Mapping con’t
Constructing the reuse graph
(Figure: an example reuse graph on vertices v1-v12; edge weights of 1 and 2 indicate the amount of shared data.)
Vertex: a computation block
Weight on an edge: the amount of data sharing
Mapping-Algorithm
SORT: edges are sorted by their weights in decreasing order.
PARTITION: vertices are visited in the order of the sorted edges, and the reuse graph is hierarchically partitioned. The number of partitions is equal to the number of cache levels.
LOOP: repeat Step 2 until the partition for the LLC is reached. Each partition is then assigned to a set of cores based on the cache hierarchy.
A sketch of these steps is given below.
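A minimal sketch of SORT plus a greedy PARTITION/LOOP, under assumptions the slides do not spell out: the reuse graph is given as an explicit edge list, partitioning is approximated with union-find merging, and the per-level stopping rule is purely illustrative.

#include <stdlib.h>

typedef struct { int u, v, weight; } Edge;

static int cmp_edge_desc(const void *a, const void *b) {
    return ((const Edge *)b)->weight - ((const Edge *)a)->weight;  /* heaviest first */
}

static int find_root(int *parent, int x) {
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
}

/* group[level][v] identifies the partition of computation block v at each
 * cache level; blocks in the same partition are placed on cores that share
 * the corresponding cache. */
void map_blocks(Edge *edges, int num_edges, int num_vertices,
                int num_cache_levels, int **group)
{
    /* SORT: consider the heaviest data sharing first */
    qsort(edges, num_edges, sizeof(Edge), cmp_edge_desc);

    int *parent = malloc(num_vertices * sizeof(int));
    for (int v = 0; v < num_vertices; ++v) parent[v] = v;

    /* PARTITION + LOOP: one greedy merging pass per cache level */
    int next = 0;
    for (int level = 0; level < num_cache_levels; ++level) {
        int budget = num_edges / num_cache_levels;   /* illustrative stopping rule */
        for (int e = 0; e < budget && next < num_edges; ++e, ++next) {
            int ru = find_root(parent, edges[next].u);
            int rv = find_root(parent, edges[next].v);
            if (ru != rv) parent[ru] = rv;           /* co-locate blocks that share data */
        }
        for (int v = 0; v < num_vertices; ++v)
            group[level][v] = find_root(parent, v);  /* record this level's partitions */
    }
    free(parent);
}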
Mapping-example
(Figure: a mapping example — (a) and (b) show two alternative assignments of computation blocks to cores under the L1/L2/L3 hierarchy, with edge weights denoting the amount of data sharing.)
Scheduling
Specifies the order in which each row block is to be executed
Goal: ensure that the data sharing among computation blocks is captured at the expected cache level.
Scheduling con’t
SORT (same as in the mapping component)
INITIAL: assign a logical time slot to the two nodes (vl and vr) connected by the highest-weight edge, and set an offset o(v) for each vertex v (o(vl) = +1, o(vr) = -1).
Purpose of the offset: ensure that nodes with high data sharing that are mapped to the same core are scheduled to execute as close together as possible.
Scheduling con’t
SCHEDULE
CASE 1: vx and vy are mapped to different cores. Assign vx and vy to the same time slot, or so that |T(vx) - T(vy)| is minimized.
CASE 2: vx and vy are mapped to the same core. If vx is already assigned, then vy is assigned to T(vx) + o(vx) and o(vy) = o(vx). Otherwise, initialize vx and vy as in Step 2.
LOOP: repeat Step 3 until all vertices are scheduled.
A sketch combining the INITIAL and SCHEDULE steps is given below.
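A minimal sketch of the offset-based scheduling, under assumptions the slides leave open: the edge list arrives pre-sorted by decreasing weight, core[] holds the mapping result, fresh time slots for unplaced pairs come from a simple counter, and isolated vertices are not handled.

#include <stdlib.h>

typedef struct { int u, v, weight; } Edge;

void schedule_blocks(const Edge *edges, int num_edges,
                     const int *core, int num_vertices,
                     int *T, int *offset)
{
    char *assigned = calloc(num_vertices, 1);
    int next_free_slot = 1;                    /* illustrative source of fresh slots */

    /* INITIAL: the two endpoints of the heaviest edge share slot t0 = 0 */
    T[edges[0].u] = 0; offset[edges[0].u] = +1; assigned[edges[0].u] = 1;
    T[edges[0].v] = 0; offset[edges[0].v] = -1; assigned[edges[0].v] = 1;

    /* SCHEDULE + LOOP: walk the remaining edges in decreasing weight order */
    for (int e = 1; e < num_edges; ++e) {
        int x = edges[e].u, y = edges[e].v;
        if (assigned[x] && assigned[y]) continue;
        if (!assigned[x] && !assigned[y]) {
            /* neither endpoint placed yet: re-run the INITIAL step on them */
            T[x] = next_free_slot; offset[x] = +1; assigned[x] = 1;
            T[y] = next_free_slot; offset[y] = -1; assigned[y] = 1;
            ++next_free_slot;
            continue;
        }
        if (!assigned[x]) { int t = x; x = y; y = t; }  /* make x the placed one */
        if (core[x] != core[y]) {
            /* CASE 1: different cores -> same slot, so the shared data is
               touched by both cores in close temporal proximity */
            T[y] = T[x]; offset[y] = +1;
        } else {
            /* CASE 2: same core -> place y right next to x in the direction
               recorded by x's offset, and propagate that offset */
            T[y] = T[x] + offset[x];
            offset[y] = offset[x];
        }
        assigned[y] = 1;
    }
    free(assigned);
}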
Scheduling-example
(Figure: (a) a portion of the reuse graph over v1, v2, and v3 with edge weights 20 and 19; (b) two candidate time slots around t0 for v3.)
(a) is a portion of the reuse graph and (b) illustrates two schedules for v3: the first places v3 next to v1 and the second places v3 next to v2. Using the offset, our scheme generates the first schedule instead of the second.
Data Reordering
Original layout: x1 x2 x3 ... x10 x11 x12 (the memory blocks in between store x4-x9)
Layout 1: x1 x2 x12 ... (the remaining memory blocks store x3-x11)
Layout 2: x3 x11 x12 ... (the remaining memory blocks store x1, x2, and x4-x9)
Goal: find a customized data layout of x for each set of rows or row blocks such that the cache footprint generated by the computation of these rows is minimized (see the sketch below).
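As an illustration of what a customized layout might look like, the sketch below packs the x-elements touched by one row block contiguously and remaps the column indices accordingly; the function name and interface are assumptions, not the paper's implementation.

#include <stdlib.h>

/* Pack the x-elements touched by rows [row_begin, row_end) contiguously into
 * x_packed and rewrite a private copy of col (col_packed) so the SpMV loop
 * can index the packed vector instead of the original x.  n is the number of
 * columns of A.  Returns the packed length, i.e. the x footprint of the block. */
int pack_x_for_block(const int *ptr, const int *col, const double *x, int n,
                     int row_begin, int row_end,
                     double *x_packed, int *col_packed)
{
    int *position = malloc(n * sizeof(int));     /* old column -> packed slot */
    for (int c = 0; c < n; ++c) position[c] = -1;

    int packed = 0;
    for (int k = ptr[row_begin]; k < ptr[row_end]; ++k) {
        int c = col[k];
        if (position[c] < 0) {                   /* first touch of this column */
            position[c] = packed;
            x_packed[packed] = x[c];             /* place it contiguously */
            ++packed;
        }
        col_packed[k - ptr[row_begin]] = position[c];
    }
    free(position);
    return packed;
}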
Data Reordering con’t
(Figure: (a) when two rows share no nonzero columns, Layout 1 (x1 x2 x3 ...) and Layout 2 (x4 x5 x6 ...) can be merged into a single combined layout; (b) when they do share columns, the memory blocks of size b are split according to p % b, (ni - p) % b, and b - (ni - p) % b - p % b.)
Case 1: if r1 and r2 have no common nonzero elements, then x can use the same data layout for r1·x and r2·x (see (a)). A sketch of this check is given below.
Case 2: otherwise, assume they have p common nonzero elements, the memory block size is b, and the numbers of nonzero elements in r1 and r2 are ni and nj, respectively (see (b)).
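For Case 1, a small sketch of the disjointness test on the CSR col array; the function name is illustrative and the quadratic scan is only for clarity.

#include <stdbool.h>

/* Return true if rows r1 and r2 have at least one nonzero in the same column,
 * i.e. Case 2 applies; if it returns false, the two rows can share a single
 * packed layout of x as in Case 1. */
bool rows_share_column(const int *ptr, const int *col, int r1, int r2)
{
    for (int k1 = ptr[r1]; k1 < ptr[r1 + 1]; ++k1)
        for (int k2 = ptr[r2]; k2 < ptr[r2 + 1]; ++k2)
            if (col[k1] == col[k2])
                return true;
    return false;
}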
Experiment Setup
Platforms (important characteristics of the Intel Dunnington and AMD Opteron architectures):

Intel Dunnington
Number of Cores: 12 cores (2 sockets)
Clock Frequency: 2.40GHz
L1: 32KB, 8-way, 64-byte line size, 3 cycle latency
L2: 3MB, 12-way, 64-byte line size, 12 cycle latency
L3: 12MB, 16-way, 64-byte line size, 40 cycle latency
Off-Chip Latency: about 85 ns
Address Sizes: 40 bits physical, 48 bits virtual

AMD Opteron
Number of Cores: 48 cores (4 sockets)
Clock Frequency: 2.20GHz
L1: 64KB, 64-byte line size
L2: 512KB, 4-way, 64-byte line size
L3: 12MB, 16-way, 64-byte line size
TLB Size: 1024 4K pages
Address Sizes: 48 bits physical, 48 bits virtual

Tested matrices (Name, Structure, Dimension, Non-zeros):
caidaRouterLevel, symmetric, 192244, 1218132
net4-1, symmetric, 88343, 2441727
shallow_water2, square, 81920, 327680
ohne2, square, 181343, 6869939
lpl1, square, 32460, 328036
rmn10, unsymmetric, 46835, 2329092
kim1, unsymmetric, 38415, 933195
bcsstk17, symmetric, 10974, 428650
tsc_opf_300, symmetric, 9774, 820783
ins2, symmetric, 309412, 2751484

For each matrix, we experiment with four different versions.
Experiment Setup con’t
Different versions in our experiments
Default
Mapping
Mapping+Scheduling
Mapping+Scheduling+Layout
Experimental Results
Performance improvement on Dunnington
(Figure: per-matrix performance improvement (%) on the Intel Dunnington system for the Mapping, Mapping+Scheduling, and Mapping+Scheduling+Layout versions.)
Mapping over Default: 8.1%
Mapping+Scheduling over Mapping: 1.8%
Mapping+Scheduling+Layout over Mapping+Scheduling: 1.7%
Experimental Results con’t
Performance improvement on AMD
(Figure: per-matrix performance improvement (%) on the 48-core AMD based system for the Mapping, Mapping+Scheduling, and Mapping+Scheduling+Layout versions.)
Mapping over Default: 9.1%
Mapping+Scheduling over Default: 11%
Mapping+Scheduling+Layout over Default: 14%
Thank you!