A Component Model of Spatial Locality
Xiaoming Gu
Intel China Research Center
xiaoming@cs.rochester.edu
Ian Christoper
Tongxin Bai
Department of Computer Science
University of Rochester
{ichrist2,bai}@cs.rochester.edu
Chengliang Zhang
Microsoft Corporation
chengzh@microsoft.com
Chen Ding
Department of Computer Science
University of Rochester
cding@cs.rochester.edu
Abstract
Good spatial locality alleviates both the latency and bandwidth problems of memory by boosting the effect of prefetching and improving the utilization of cache. However, conventional definitions of spatial locality are inadequate for a programmer to precisely quantify the spatial-locality quality of a program, to identify causes of poor locality, and to estimate the potential by which spatial locality can be improved.
This paper describes a new, component-based model for spatial
locality. It is based on measuring the change of reuse distances
as a function of the data-block size. It divides spatial locality into
components at program and behavior levels. While the base model
is costly because it requires the tracking of the locality of every
memory access, the overhead can be reduced by using small inputs
and by extending a sampling-based tool. The paper presents the
result of the analysis for a large set of benchmarks, the cost of the
analysis, and the experience of a user study, in which the analysis
helped to locate a data-layout problem and improve performance
by 7% with a 6-line change in an application with over 2,000 lines.
Categories and Subject Descriptors C.4 [Computer Systems Organization]: Performance Of Systems—measurement techniques
General Terms Measurement, Performance
Keywords Spatial locality, Reuse distance
1. Introduction
Given a fixed access order, the effect of caching and prefetching
depends on the layout of program data — whether the program has
good spatial locality or not. Conventionally, the term may mean
three different effects at the cache level. Here a memory block is a
unit of memory data that is loaded into a cache block when being
accessed by a program.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee.
ISMM’09, June 19–20, 2009, Dublin, Ireland.
Copyright © 2009 ACM 978-1-60558-347-1/09/06...$5.00.
• Intra-block spatial locality — Successive memory operations access data from the same memory block, resulting in cache-block reuse.
• Inter-block spatial locality — Program operations access mem-
ory blocks that do not map to the same cache set, avoiding cache
conflicts.
• Adjacent-block spatial locality — The program traverses mem-
ory contiguously, maximizing the benefit of hardware prefetching.
Intra-block and adjacent-block locality also play a critical role
in lower levels of memory hierarchy such as virtual memory and
file systems where spatial locality manifests as usage patterns of
memory pages and disk sectors instead of cache blocks. In this
paper we focus on modeling intra-block spatial locality in a way
that can be extended to adjacent-block locality. For brevity, we use
the term spatial locality to mean intra-block spatial locality unless
we specify otherwise.
The preceding notions of spatial locality are not quantitative
enough for practical use. In particular, a programmer cannot use
them to measure the aggregate spatial locality, to identify locations
in a program that may benefit from locality improvement, and to
identify the potential by which spatial locality can be improved.
Numerous techniques have been developed to improve spatial locality. Example models include loop cost [18] at the program level, and access frequency [22], pairwise affinity [6], hot
streams [8], and hierarchical reference affinity [32, 35] at the trace
level. Most techniques show how to improve locality but not how
much locality can be improved. When a program does not improve,
there is no general test to check whether it is due to the limitation
of our technique or whether the spatial locality is already perfect
and admits no improvement.
Another common metric is miss rate — if a new data layout
leads to fewer cache misses, it must have better spatial locality.
It turns out that miss rate is not a complete measure because one
can improve spatial locality without changing the miss rate (see
Section 2.4). A more serious limitation is that the metric evaluates rather than predicts: a programmer cannot easily judge the
quality of a data layout without trying other alternatives. Changing data layout for large and complex code is time consuming and
error prone. After much labor and with or without a positive result, the programmer returns to the starting point facing the same
uncertainty. The problem is worse with contemporary applications
because much of the code may come from external libraries. Poor spatial locality may arise inside a library or from the interaction between programmer code and library code.

[Figure 1: (a) reuse distances — the example trace a b c a a c b, whose reused elements have distances 2, 0, 1, 2; (b) reuse signature — % references vs. reuse distance; (c) miss-rate curve — % miss rate vs. cache size.]

Figure 1. Example reuse distances, reuse signature, and miss-rate curve
In this paper, we define spatial locality based on the distance
of data reuse. Figure 1 illustrates reuse distance as our chief locality metric. In an execution, the reuse distance of a data access
is the number of distinct data elements accessed between this and
the previous access to the same data. Figure 1(a) shows an example trace and the reuse distance of each element. The concept was
defined originally by Mattson et al. in 1970 as one of stack distances [17]. The histogram of all reuse distances in an execution
trace forms its reuse signature, as shown in Figure 1(b) for the example trace. Reuse signature can be used to calculate the miss rates
for fully associative LRU cache of all sizes [17] and can be used to
estimate the effect of limited cache associativity [27]. The miss rate
of all cache sizes can be presented as a miss-rate curve, as shown
in Figure 1(c) for the example trace.
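As a concrete illustration (our sketch, not the authors' tool), reuse distances can be computed with a table of last-access positions. A naive O(n²) Python version, applied to the trace of Figure 1(a); production analyzers use tree structures for better asymptotic cost:

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct elements
    accessed between this access and the previous access to the same
    element.  Naive O(n^2) version for illustration only."""
    last = {}                 # element -> index of its previous access
    dists = []
    for i, x in enumerate(trace):
        if x in last:
            dists.append(len(set(trace[last[x] + 1:i])))
        else:
            dists.append(float('inf'))   # first access: infinite distance
        last[x] = i
    return dists

# The example trace of Figure 1(a):
print(reuse_distances(list("abcaacb")))   # [inf, inf, inf, 2, 0, 1, 2]
```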
The basic idea of the paper is as follows. A reuse signature
includes the effect of both temporal and spatial locality. If we
change the granularity of data and measure the reuse signature
again, temporal locality should stay the same because the access
order is the same. Any change in the reuse signature is the effect of
spatial locality. Our new spatial model is based on this observation.
To measure intra-block spatial locality, we change data-block size
from half cache-block size to full cache-block size. To estimate
adjacent-block spatial locality, we change data-block size from
cache-block size to twice that size.
Our model monitors the change of every reuse distance. The
precision allows an analysis tool to identify components of spatial locality. We consider two types of components. Program components are divided by program constructs such as functions and
loops. An analysis can identify causes of poor spatial locality in
program code. Behavior components are divided by the length of
reuse distance. An analysis can focus the evaluation of spatial locality on memory references that have poor temporal locality, which
is useful since these are the references that cause cache misses.
Measuring the change in every reuse distance is costly. The
paper explores two ways of ameliorating the problem. The first is
using small input sizes, and the second is using sampling.
The new model has a number of limitations. It assumes a fixed
computation order and does not consider computation reordering,
which can significantly improve spatial locality in both regular and
irregular code (e.g. [11, 18, 29]). The behavior reported in training
runs may or may not happen in actual executions. The location
of a locality problem does not mean its solution. In fact, optimal
data layout is not only an NP-hard problem but also impossible
to approximate within a constant factor (if P is not NP) [21]. We
intend our solution to be a part of the toolbox used by programmers.
The rest of the paper is organized as follows. Section 2 describes
the new model. Section 3 describes the profiling analysis for the
new model. The result of evaluation is reported in Section 4, including the cost of the analysis and the experience from a user study.
Finally, Section 5 discusses related work and Section 6 summarizes.
2. Component Model of Spatial Locality
We define spatial locality by the change of reuse distance as a
function of data-block sizes. Consider contiguous memory access,
which has the best spatial locality for sequential computation. Assume we traverse an array twice, and the data-block size is one
array element. The reuse distance of every access in the second
traversal is equal to the array size minus one. If we double the data-block size, the reuse distance is reduced to zero for every other
memory access because of spatial reuse. Next we describe a model
based on measuring the change of reuse distance.
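The two-traversal example can be checked with a small sketch (ours; the address trace is made up). Mapping each address to a block id with integer division gives the reuse distance in units of distinct blocks:

```python
def block_reuse_distances(addrs, bsize):
    """Reuse distance measured in distinct data blocks of size bsize
    (naive O(n^2) sketch for illustration)."""
    blocks = [a // bsize for a in addrs]
    last, dists = {}, []
    for i, b in enumerate(blocks):
        if b in last:
            dists.append(len(set(blocks[last[b] + 1:i])))
        else:
            dists.append(float('inf'))
        last[b] = i
    return dists

trace = list(range(8)) + list(range(8))     # traverse an 8-element array twice
print(block_reuse_distances(trace, 1)[8:])  # [7]*8: array size minus one
print(block_reuse_distances(trace, 2)[8:])  # [3, 0, 3, 0, ...]: every other access drops to 0
```

Doubling the block size halves the surviving distances (3 vs. 7) and shortens every other one to zero, as described above.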
2.1 Effective Spatial Reuse
In our analysis, reuse distance is measured for different data-block
sizes. We refer to them as measurement block sizes or measurement
sizes in short. Our model is based on the change of reuse distance
when the measurement size is doubled. Without loss of generality,
consider data x and y of size b that belong to the same 2b block.
Consider a reuse of x and its reuse distance. The reuse distance
may change in two ways when the measurement size doubles from
b to 2b. The difference is whether y is accessed between the two x
accesses. We call such y access an intercept.
• No intercept — If y is not accessed between the two x accesses,
the reuse distance is changed from the number of distinct b-blocks to the number of distinct 2b-blocks between the two x
accesses.
• Intercept — If y is accessed one or more times in between, the
reuse distance is changed to the number of distinct 2b-blocks
between the last y access and the second x access.
Without intercepts, the reuse distance, measured by the number
of distinct data blocks, can be reduced at most to half of its original
length when the measurement size is doubled. The distance does
not actually decrease if it is measured by the number of bytes. If
the reuse of x is a miss in cache of b-size blocks, it likely remains
a miss in cache of 2b-size blocks.
In comparison, an intercept can shorten a reuse distance to
any length. The best case is zero as it happens for accesses in a
contiguous data traversal as mentioned earlier. Figure 2 shows an
example intercept. At block size b, the two x accesses are connected
by a temporal reuse. At block size 2b, the intercept causes a spatial
reuse and shortens the original reuse distance.
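The effect of an intercept can be reproduced with a hypothetical address trace (our example): let x be address 0 and y address 1, so the two share one 2-byte block when b = 1, and y is accessed between the two x accesses:

```python
def block_reuse_distances(addrs, bsize):
    # Naive reuse distance in units of distinct bsize-blocks (sketch).
    blocks = [a // bsize for a in addrs]
    last, dists = {}, []
    for i, b in enumerate(blocks):
        dists.append(len(set(blocks[last[b] + 1:i])) if b in last else float('inf'))
        last[b] = i
    return dists

# x = address 0 and y = address 1 share one 2-byte block; y is an intercept.
trace = [0, 8, 9, 10, 1, 12, 13, 0]
print(block_reuse_distances(trace, 1)[-1])  # 6: distinct 1-byte blocks between the two x's
print(block_reuse_distances(trace, 2)[-1])  # 1: measured from the last y access only
```

Without the intercept the distance could at best halve; the intercept cuts it from 6 to 1.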
[Figure 2(a): the original reuse when data-block size is b — the trace x ... y ... x, a temporal reuse between the two x accesses. Figure 2(b): the shortened reuse when data-block size is doubled to 2b — the intervening y access becomes an intercept.]
Figure 2. Example spatial reuse. Data X and Y have size b and
reside in the same 2b block. When the data-block size is 2b, the
original reuse in Part (a) is shortened by an intercept as shown in
Part (b).
An effective spatial reuse is one whose reuse distance is reduced
sufficiently so the access is changed from a cache miss to a cache
hit. We consider two criteria for effective spatial reuse.
• Machine-independent criterion — A memory access has effec-
tive spatial reuse if its reuse distance is reduced by a factor of 8
or more when the measurement size doubles. The threshold is
picked because 8 is a power of two and close to being an order
of magnitude reduction.
• Machine-dependent criterion — An access has effective spatial
reuse if its reuse distance is reduced below a cache-size threshold, for example, converting an L1 cache miss into an L1 hit.
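The two criteria can be expressed as a small predicate (our sketch; the function name and the optional cache-size parameter are our own):

```python
def effective_spatial_reuse(dist_b, dist_2b, cache_blocks=None):
    """Classify one access given its reuse distances at block sizes b and 2b.
    Machine-independent: an 8x or larger reduction.
    Machine-dependent (cache_blocks given): the distance drops below the
    cache-size threshold, turning a miss into a hit."""
    if cache_blocks is not None:
        return dist_b >= cache_blocks > dist_2b
    if dist_2b == 0:
        return True                       # shortened to zero: best case
    return dist_b / dist_2b >= 8

print(effective_spatial_reuse(1024, 100))        # True: 10.2x reduction
print(effective_spatial_reuse(1024, 600))        # False: only ~1.7x
print(effective_spatial_reuse(1024, 600, 768))   # True: miss becomes hit in a 768-block cache
```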
2.2 Spatial-locality Score
A reuse signature is a pair <R, P>, where R is a series of bins with consecutive ranges of reuse distance, r_i = [d_i, d_{i+1}), and P is a series of probabilities p_i. Each <r_i, p_i> pair shows that the p_i portion of reuses have distances between d_i and d_{i+1}. In statistical terms, a randomly selected reuse distance has probability p_i of falling between d_i and d_{i+1}. We use logarithmic bin sizes, in particular, d_{i+1} = 2d_i (i ≥ 0).
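A reuse signature with logarithmic bins can be sketched as follows (ours; the bin boundaries are one choice consistent with d_{i+1} = 2d_i — bin 0 holds distance 0, bin i > 0 holds [2^{i-1}, 2^i)):

```python
import math

def reuse_signature(distances):
    """Histogram of finite reuse distances in logarithmic bins:
    bin 0 holds distance 0 and bin i > 0 holds [2^(i-1), 2^i)."""
    finite = [d for d in distances if d != float('inf')]
    bins = {}
    for d in finite:
        i = 0 if d == 0 else int(math.log2(d)) + 1
        bins[i] = bins.get(i, 0) + 1
    return {i: c / len(finite) for i, c in sorted(bins.items())}

# Signature of the Figure 1 trace (finite distances 2, 0, 1, 2):
print(reuse_signature([2, 0, 1, 2]))   # {0: 0.25, 1: 0.25, 2: 0.5}
```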
We use a distribution map to record how the distribution of reuse distances changes from one measurement size to another. Numerically it is a matrix whose rows and columns consist of bins of reuse distances. Each cell p_{ij} is a probability showing that a b-block distance in the ith distance bin (r_i^b) has probability p_{ij} to become a 2b-block distance in the jth distance bin (r_j^{2b}). When read by rows, the distribution map shows the spread of distances in a b-block bin into 2b-block bins. When read by columns, the map (with additional bookkeeping) shows what distances in b-block bins fall into the same 2b-block bin.
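The distribution map can be built from the two distance lists of the same trace (our sketch; the bin count is arbitrary):

```python
import math

def log_bin(d, nbins):
    # Logarithmic bin index: bin 0 for distance 0, bin i for [2^(i-1), 2^i).
    return 0 if d == 0 else min(int(math.log2(d)) + 1, nbins - 1)

def distribution_map(dists_b, dists_2b, nbins=16):
    """p[i][j]: probability that an access whose b-block distance falls in
    bin i has a 2b-block distance in bin j (rows sum to 1)."""
    p = [[0.0] * nbins for _ in range(nbins)]
    for db, d2b in zip(dists_b, dists_2b):
        if db != float('inf'):            # skip first accesses
            p[log_bin(db, nbins)][log_bin(d2b, nbins)] += 1
    for row in p:
        s = sum(row)
        if s:
            row[:] = [c / s for c in row]
    return p

# Two accesses in bin 4 (distance 8): one drops to 0, one only halves to 4.
m = distribution_map([8, 8], [0, 4])
print(m[4][0], m[4][3])   # 0.5 0.5
```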
Taking a row-based view of a distribution map, we can calculate
the probability for a memory access in bin i to have effective
spatial reuse. The best case (or the highest probability) is 0.5 in
contiguous access, because half of the data accesses have effective
spatial reuses. The spatial-locality score, SLQ, is this probability
normalized to the best case. Normally the locality score takes a
value between 0 and 1. Zero means no spatial reuse, and one means
perfect spatial reuse.
For machine-independent scoring, the accesses with effective spatial reuse are those whose reuse distance is reduced by a factor of 8 or more. The locality score is defined as follows:

SLQ(i) = \frac{\sum_{j=0}^{i-3} p_{ij}}{0.5}    (1)
The definition is machine independent and allows spatial-locality scoring based on very small inputs. Usually small inputs
are not effective in cache simulation studies. Program data may fit
in cache for a small input, making memory problems invisible. A
slight change in input size may cause a large change in cache performance, if a large group of reuse distances cross the cache-size
threshold. The machine-independent scoring avoids the sensitivity
to particular cache sizes and enables efficient analysis through the
use of small program inputs.
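Since an 8x reduction spans three logarithmic bins, the row-based score of Equation (1) can be sketched as follows (our code; the clamp at 1 is our addition for rows whose qualifying mass exceeds the 0.5 best case):

```python
def slq(dist_map, i):
    """Spatial-locality score of bin i: the probability that the distance
    shrinks by 8x or more (>= 3 log bins), normalized by the best case
    0.5 seen for contiguous access."""
    prob = sum(dist_map[i][:max(i - 2, 0)])   # bins j = 0 .. i-3
    return min(prob / 0.5, 1.0)               # clamp to 1: our addition

# Bin 5 of a distribution map where 40% of distances drop by >= 8x:
dmap = [[0.0] * 16 for _ in range(16)]
dmap[5][0], dmap[5][5] = 0.4, 0.6
print(slq(dmap, 5))   # 0.8
```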
The locality score in the machine-dependent case can be defined
similarly. The score is sensitive to the program and machine parameters, but the effect of spatial reuse is measured precisely when the
parameters are fixed.
2.3 Spatial Locality Components
Spatial locality score can be defined for any sub-group of memory
accesses in a program. A group of memory accesses is a component
of the overall score. We consider two types of grouping.
Program components We measure the spatial locality score for
program constructs such as functions and loops. We then rank
program components by their contribution to poor spatial locality.
Behavior components We group memory accesses by their reuse
distance. The length of reuse distance for b-block size is considered
the temporal locality at this granularity. Spatial locality scoring can
be done separately for accesses with different temporal locality. If
we divide temporal and spatial locality into two groups, good and
bad, we have four types of locality components: the first has good
temporal and good spatial locality, the second has good temporal
but poor spatial locality, the third has poor temporal but good
spatial locality, and finally the last has poor temporal and poor
spatial locality.
The division of behavior components and the scoring may use
machine-dependent or machine-independent criteria.
• Machine-independent components — We define a trough as
the bin whose size is smaller than its immediate left and right
neighbors. A peak is the group of bins between any two closest
troughs. We consider each peak in the reuse signature as a
group. The effective spatial reuse is one whose reuse distance
is reduced by a factor of 8.
• Machine-dependent components — Since the basic cache pa-
rameters are used by the programmer in performance analysis, it makes sense to compute spatial locality scores based on
these parameters. We consider reuse distances between sizes of
two consecutive cache levels a component (adding the last level
as the cache of infinite size). The effective spatial reuse is one
whose reuse distance is reduced below the smaller cache size.
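The trough-and-peak grouping can be sketched as follows (ours; the text does not say which peak a trough bin itself joins, so we assign it to the peak on its right):

```python
def peak_components(signature):
    """Split a reuse signature (a list of bin sizes) into peaks: maximal
    runs of bins separated by troughs, where a trough is a bin smaller
    than both of its immediate neighbors."""
    troughs = [i for i in range(1, len(signature) - 1)
               if signature[i] < signature[i - 1] and signature[i] < signature[i + 1]]
    groups, start = [], 0
    for t in troughs:
        groups.append(list(range(start, t)))
        start = t                      # trough bin opens the next peak
    groups.append(list(range(start, len(signature))))
    return groups

print(peak_components([1, 5, 2, 1, 4, 6, 2]))   # [[0, 1, 2], [3, 4, 5, 6]]
```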
2.4 Adjacent-block Spatial Locality
Miss rate is not a complete measure of spatial locality when
prefetching is considered. The spatial locality quality for two data
layouts may differ even though they incur the same number of
cache misses. A concrete example was described by White et al.
in 2005 [31]. They studied the effect of data layout transformations in a large (282 files, 68,000 lines of C++), highly tuned and hand-optimized mesh library used at the Lawrence Livermore National Laboratory, and found that a data transformation increased
the number of useful prefetches by 30% and reduced the load latency from 3.2 cycles to 2.8 cycles (a 7% overall performance gain),
without reducing the number of (L1/L2) cache misses [31]. In contrast, two other transformations, although reducing the number of
loads and branches by 20% and 9%, resulted in a higher load latency of 4.4 cycles because the transformations caused the misses
to scatter in non-adjacent memory blocks and interfered with hardware prefetching.
The result from White et al. shows the effect of adjacent-block
spatial locality. With prefetching, not all cache misses are equal.
The misses on consecutive memory blocks cost less. If we view
two consecutive memory blocks as a unit, then adjacent-block
locality becomes an instance of intra-block spatial locality for the
large block size. To evaluate the effect of data layout on hardware
prefetching, we compute the same spatial locality score but based
on memory blocks of size twice the size of cache block. The spatiallocality score can be used to measure adjacent-block spatial locality
as it is for intra-block spatial locality.
2.5 All Block Size Score
Spatial locality is so far defined by the change of reuse signature
between two measurement block sizes. We can measure the change
for all possible block sizes and compute an aggregate metric by
weighing the score from each pair of consecutive sizes with a linear
decay. In particular, the score for all block sizes is defined as:
SLQ = \frac{\sum_{\text{all } b}\left[\sum_{\text{all } i} SLQ^b(i)\, p_i^b\right] 2^{-b}}{\sum_{\text{all } b} 2^{-b}}    (2)
where SLQ^b(i) is the spatial locality score of bin i for block size b, and p_i^b is the probability of bin i for block size b. The weighting ensures that the all-block-size score is between 0 and 1. We have conducted experiments in which the measurement block size ranges from 4 bytes for integer programs or 8 bytes for floating-point programs to 2^13 bytes (8KB). The cumulative score, however, is difficult to interpret because of the weighting process. We discuss all-block-size results in Section 4.1.3.
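Equation (2) can be sketched as follows; note we read b as the index of the doubling step (our assumption), since a literal byte count in the exponent would make the 2^{-b} weights vanishingly small:

```python
def all_block_size_score(per_step):
    """Aggregate spatial-locality score over all block-size doublings.
    per_step maps a doubling-step index b to (scores, probs): the per-bin
    SLQ^b(i) values and bin probabilities p_i^b at that step.  Each step
    is weighted by 2^-b, a geometric decay toward larger blocks."""
    num = sum(sum(s * p for s, p in zip(scores, probs)) * 2.0 ** -b
              for b, (scores, probs) in per_step.items())
    den = sum(2.0 ** -b for b in per_step)
    return num / den

# Step 0 scores a perfect 1.0, step 1 only 0.5:
print(all_block_size_score({0: ([1.0], [1.0]), 1: ([0.5], [1.0])}))   # ~0.833
```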
3. Spatial Locality Profiling
Reuse distance analysis carries a significant overhead that renders
its use largely impractical for relatively long running programs.
With a typical slowdown factor of a couple hundred, a five-minute program takes more than twenty-nine hours. The overhead of large-scale analysis is too high for use in interactive software development cycles. We have developed two ways to reduce the analysis
time: to use full analysis but on a smaller input or to use sampling.
We use the sampling-based tool for interactive analysis. In our future work, we plan to parallelize the profiling analysis and improve
its speed by using multiple processors [13].
3.1 Full Analysis
For full analysis we augment a reuse-distance analyzer by running
two instances in parallel for two block sizes. For each memory access, the analyzer computes reuse distances for the two block sizes
and, based on the difference, it classifies an access as an effective spatial reuse or not. A typical reuse-distance analyzer uses a hash table to store the last access time and a sub-trace to record the last access of each data element. Our new analyzer stores two hash tables and two sub-traces, one for each block
size. With the compression-tree algorithm [9], the space cost of
each sub-trace is logarithmic to the total data size. The hash table
size is linear to the number of data elements being accessed, which
is half as many for the larger block size as for the smaller block
size. We have built full analysis in two tools — one at the binary
level with Valgrind and the other at the source-level with Gcc.
The full-trace analysis itself does not show which part of the
program is responsible for poor spatial locality. We have extended
the locality model to identify program code and data with spatial-locality problems.
CCT-based program analysis During locality profiling, the analyzer determines for each memory access, whether it is an effective
spatial reuse. In addition, the analyzer constructs a calling context
tree [1] by observing the entry and exit of each function at run
time, maintaining a record of the call stack, and attributing the access count for each unique calling context.
For spatial-locality ranking, the analyzer records two basic metrics. The first is size, measured by the number of memory accesses.
The second is quality, measured by the portion of the memory accesses that are effective spatial reuses. The final result reports the calling contexts that have the worst quality with non-trivial size, measured in both inclusive and exclusive counts. The analyzer can take customized level-one and level-two cache sizes as parameters to find the functions with the worst spatial locality. The Valgrind-based tool has trouble recognizing the exits of some functions, which CCT requires, so only the Gcc-based tool implements CCT.
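The CCT bookkeeping described above can be sketched as follows (our illustration; class and method names are made up):

```python
class CCTNode:
    """One node per unique calling context, carrying the two ranking
    metrics: size (access count) and quality (fraction of accesses
    that are effective spatial reuses)."""
    def __init__(self, name):
        self.name = name
        self.children = {}
        self.accesses = 0     # size
        self.effective = 0    # effective spatial reuses

    def quality(self):
        return self.effective / self.accesses if self.accesses else 1.0

class CCT:
    def __init__(self):
        self.root = CCTNode("<root>")
        self.stack = [self.root]

    def enter(self, fn):             # called on function entry
        node = self.stack[-1].children.setdefault(fn, CCTNode(fn))
        self.stack.append(node)

    def exit(self):                  # called on function exit
        self.stack.pop()

    def record(self, is_effective):  # called on each memory access
        node = self.stack[-1]
        node.accesses += 1
        node.effective += bool(is_effective)

cct = CCT()
cct.enter("main"); cct.enter("kernel")
for eff in (True, False, False, False):
    cct.record(eff)
cct.exit(); cct.exit()
kernel = cct.root.children["main"].children["kernel"]
print(kernel.accesses, kernel.quality())   # 4 0.25
```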
3.2 Sampling Analysis
The overhead of full analysis comes from recording every access,
passing the information to the run-time analyzer, and then computing reuse distances. To reduce the cost, we have integrated the
new model to a sampling-based tool — Suggestion of Locality Optimization (SLO), developed by Beyls and D’Hollander at Ghent
University [4]. SLO uses reservoir sampling [14], which has two
distinct properties. First, it keeps a bounded number of samples
in the reservoir, so the collection rate drops as a program execution
lengthens. Second, locality analysis is performed after an execution finishes. The processing overhead is proportional to the size of
the reservoir and independent of the length of the trace. SLO shows
consistent analysis speed, typically within 15 minutes for our tests.
In the current implementation, our addition makes it take twice as
long.
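Reservoir sampling can be illustrated with the classic Algorithm R (our sketch of the general technique [14], not SLO's implementation):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: a uniform random sample of k items from a stream of
    unknown length in one pass.  The acceptance probability k/(n+1)
    drops as the stream lengthens, which bounds the number of samples
    and keeps post-processing cost independent of trace length."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            j = rng.randrange(n + 1)   # keep item with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), 100)
print(len(sample))   # 100
```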
4. Evaluation
This section first reports a series of measurements by the full
analysis (Valgrind-based tool by default) and then discusses our
experience from a user study.
4.1 Full Analysis Results
For full analysis we have both the dynamic binary instrumentor using Valgrind (version 3.2.2) [20] and the source-level instrumentor
using the GCC compiler, to collect the data access trace and measure
reuse distances using the analyzer described in Section 3.1. We set
the precision of the reuse-distance analyzer to 99.9%.
We have applied our tools on all integer programs from
SPEC2000 [28] that we could successfully build and run. In addition, we tested swim to evaluate the effect of a data-layout transformation and milc to try analysis on a larger program from the new
SPEC2006 [28] suite. To measure the effect of different inputs, we
have collected results for multiple reference inputs and different
size inputs, in particular the test and train inputs used by the benchmark set. All of the C/C++ programs are compiled using the GCC
compiler with the “-O3” flag, and the Fortran programs using “f95
-O5”. The version of the GNU compiler is 4.1.2. The executions,
28 in total, have different characteristics, as shown in Table 1. The
data size ranges from less than 1MB to over 80MB, and the trace
length, measured by the number of memory accesses, ranges from
3.4 million to 400 billion.
program        input                                                data size (bytes)   trace len.
art test       (test)                                               2.4e+6              5.9e+8
art train      (train)                                              2.7e+6              1.5e+10
art ref1       -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 110 -starty 200 -endx 160 -endy 240 -objects 10    3.7e+6    1.1e+10
art ref2       -scanfile c756hel.in -trainfile1 a10.img -trainfile2 hc.img -stride 2 -startx 470 -starty 140 -endx 520 -endy 180 -objects 10    3.7e+6    1.2e+10
bzip2 train    i.compressed                                         3.5e+7              1.6e+10
bzip2 ref      i.source                                             1.0e+8              2.2e+10
crafty ref     < crafty.in                                          1.3e+6              5.0e+10
equake ref     < inp.in                                             5.0e+7              5.9e+10
gzip test      i.compressed                                         9.3e+5              6.6e+8
gzip train     i.combined                                           1.1e+7              1.0e+10
gzip ref1      input.source 60                                      4.2e+7              1.5e+10
gzip ref2      input.log 60                                         3.9e+7              7.7e+9
gzip ref3      input.graphic 60                                     6.5e+7              2.4e+10
gzip ref4      input.random 60                                      7.4e+7              1.9e+10
gzip ref5      input.program 60                                     5.2e+7              2.6e+10
mcf test       test                                                 2.8e+6              3.4e+6
mcf train      train                                                8.2e+7              2.2e+9
mcf ref        inp.in                                               8.0e+7              1.8e+10
milc ref       < su3imp.in                                          7.2e+8              4.0e+11
parser test    test                                                 2.1e+7              7.9e+8
parser train   train                                                5.3e+7              2.0e+9
parser ref     ref                                                  8.3e+8              7.9e+10
swim ref       < swim.in                                            2.0e+8              9.2e+10
swim.opt ref   < swim.in                                            2.0e+8              9.2e+10
twolf train    train                                                3.0e+6              3.4e+9
twolf ref      ref                                                  1.1e+6              1.1e+11
vpr train      train                                                7.0e+5              2.6e+9
vpr ref        ref                                                  3.8e+6              2.1e+10

Table 1. The input, data size, and length of 28 executions of 11 benchmarks

4.1.1 All Benchmark Results

Our analysis has identified 16 components in the 28 executions of the 12 programs with ref inputs¹, including two components (in the reuse signature) for each run of the 4 programs equake, mcf, swim, and swim.opt, and one for each of the other 8 programs. Figure 3 shows two weighted attributes for each spatial locality component: spatial locality score and temporal reuse distance. The temporal reuse distance results are based on the block size of 64 bytes, and the spatial locality scores are based on the changes of the reuse signatures with block size doubled from 64 bytes to 128 bytes. In the names of components, we use 'c' for multiple components in a single input and 'r' for the same component in multiple inputs with the same program. For example, swim-c2 is the second component of the swim execution, and gzip-r3 is the (only) component of the third input of gzip.

¹ The two versions of swim are different enough to be treated as two programs.

The x-axis of Figure 3 shows the weighted average reuse distance of each component. The range of the reuse distance differs from component to component and program to program. But different inputs of the same program show similar reuse distance, as in gzip and art.

Based on the summarized results, we classify the locality of the 16 components into four categories.

• Components with good temporal locality — Two components, crafty and equake-c1 (13% of 16), have good temporal locality because they have short reuse distances (shorter than 256 blocks or 16KB).

• Components with good spatial locality — Five components (31% of 16), equake-c2, mcf-c2, swim-c2, swim.opt-c1 and swim.opt-c2, have almost perfect spatial locality (a score greater than 0.97).

• Components with poor spatial locality — A component has a serious spatial locality problem if it meets the following three conditions:
  the component has a significant size (component sizes are shown in Figure 4),
  it has long reuse distances (poor temporal locality), and
  it is low in spatial locality quality (poor spatial locality).
Seven components (44% of 16), art, mcf-c1, milc, parser, swim-c1, twolf and vpr, meet these conditions. They contain between 5.13% and 33% of accesses. Their reuse distance ranges from 64KB to 2MB. Their spatial locality score is between 0.250 and 0.657. Art has identical components with two inputs, suggesting a static data access pattern and a good chance for compiler optimization.

• Components with possible spatial locality problems — The remaining two components (13% of 16) meet some but not all three conditions. Gzip with different inputs has low spatial locality scores of 0.140 to 0.387. However, the component in all inputs has relatively short reuse distances. While their sizes are from 5.82% to 21.5% of their references, almost all have a reuse distance of less than 8K blocks or half a megabyte, which fits in the level-two cache of most modern machines. The component of bzip2 with the reference input has relatively long reuse distances, 12K blocks, and a low locality score, 0.32, but the size is only 2.3%, below our 5% threshold. It is interesting that the two compression programs appear in the same category. They are likely tuned by their designers to make the most use of cache, hence showing a borderline status.
4.1.2 The Effect of Input Size
Table 2 compares the locality components of different size inputs.
All but one program show consistency in the component size,
the spatial locality score, or both. The locality component in the
three inputs of art all has a size of 33%, although interestingly
the locality score decreases. The component in the two inputs of
vpr has similar locality scores, although the size differs. Most
programs, gzip, mcf, parser (train and ref), and twolf have similar
component size and locality in all inputs. For example, the first
component in the three inputs of mcf has a spatial locality score
between 0.38 and 0.41, and the second component between 0.99
and 1.00. Bzip2 is an exception, where both the size and locality
score differ significantly between the train and reference inputs.
In comparison, the temporal locality is almost never similar
among the inputs of any program except for gzip. In parser, the
average reuse distance sometimes decreases when the input size increases. The test inputs of bzip2, twolf, and vpr do not show any locality component in our analysis.

[Figure 3: a scatter plot of the 16 components, with the 64-byte-block spatial locality score on the y-axis and log2 of the weighted average reuse distance on the x-axis. Labeled points include art-r1, art-r2, bzip, crafty, equake-c1, equake-c2, gzip-r1 through gzip-r5, mcf-c1, mcf-c2, milc, parser, swim-c1, swim-c2, swim.opt-c1, swim.opt-c2, twolf, and vpr.]

Figure 3. The spatial and temporal locality of spatial locality components. The y-axis shows the spatial-locality score, the higher the better. The score of one means perfect spatial locality. The x-axis shows the reuse distance in a logarithmic scale. The further to the right, the poorer the temporal locality. Data layout transformations are most cost effective when targeted for components on the bottom-right half of the plot, which have poor spatial and temporal locality.

[Figure 4: the same components plotted with the spatial-locality score on the y-axis and the component size, as a percentage of references, on the x-axis.]

Figure 4. The size of spatial-locality components shown in Figure 3

4.1.3 The Effect of Data Block Sizes

The preceding results are for a single change of data block size. We have examined the components for block sizes from 16 to 128. As the size of data blocks increases, the spatial locality of the four components changes in three different patterns. Art-r2 increases from 0.21 to 0.74, swim-c1 and gzip-r4 decrease from 0.7 to 0.13 and from 0.25 to 0.11 respectively, and mcf-c1 alternates between 0.3 and 0.5. The lack of consistency may be due to the nature of the computations and the manual tuning by programmers. It suggests that spatial locality depends on the specified block size, which is in contrast to the stable locality quality for the same block size with different input sizes.

4.1.4 Effect of Array Regrouping for Swim
Swim is a floating-point benchmark program from SPEC2000. It
simulates shallow water using a two-dimensional grid, represented
by a set of 14 arrays. We use two versions — the original version
and the version after array regrouping, which is designed to improve spatial locality [23, 35].
Figure 5 shows the spatial locality score for both versions when
the measurement block size increases from 32 bytes to 64 bytes.
The score for each bin is marked by a cross for the original version
and by a downward triangle for the transformed version. The size
of the bin is shown by the size of the circle enclosing the mark. The
plot does not group bins, so each bin is one component. There are
two components with reuse distance larger than 32 blocks that are
of a significant size, as pointed out on the graph.
The component model shows the effect of array regrouping on
Swim. The first component, which accounts for 5.1% and 4.4% of memory accesses (bins 11 and 12) in the two versions, has been improved from below 0.2 to close to perfect. The second component is almost identical (0.99) in the two versions. An earlier study showed that array regrouping improved performance by 14% on IBM Power4 [23]. For this study, we compared GCC-compiled 64-bit binaries on a 3.2GHz Intel Xeon and observed an 8.1% performance improvement. With the new spatial-locality model, we now see that the improvement is due to better spatial locality in about 4% of the memory accesses.
On the specific machine we tested, with 64-byte cache lines, the L1 cache is 32K and the L2 cache is 1M. Assuming fully associative caches with a 64-byte block size, the predicted cache miss rates of the original swim benchmark are 10.4% and 5.33% at the two cache levels, respectively. The miss rates for the optimized version are 9.7% and 5.33%. Hence the performance improvement comes mainly from fewer L1 cache misses. However, we should point out that the 6.7% reduction in the L1 miss rate may not completely explain the 8.1% performance improvement. Our spatial-locality model at a 128-byte block size shows good spatial reuse. This suggests that the optimized version also benefits from prefetching due to better adjacent-block locality. Figure 6 shows the effect of adjacent-block spatial locality. Most of the text in the graph is too small to see, but it is the same as in Figure 5; the plot shows a similar improvement for component 1, although its score is lower at 128B than at 64B.

Figure 5. The effect of array regrouping on the spatial-locality score of each reuse-distance bin of Swim. The improvement comes mainly from better spatial locality for the first component.

Figure 6. The effect of array regrouping on adjacent-block spatial locality, measured by the spatial-locality score when the measurement block size increases from 64 bytes to 128 bytes.
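The predicted miss rates above follow the classic stack-distance argument (Mattson et al. [17]): in a fully associative LRU cache of C blocks, a reuse hits exactly when its reuse distance, counted in distinct blocks, is less than C. A minimal sketch, assuming the analyzer supplies a block reuse-distance histogram and a count of cold (first-touch) misses:

```c
#include <stddef.h>

/* Predict the miss rate of a fully associative LRU cache with
   cache_blocks blocks from a block reuse-distance histogram:
   hist[d] = number of accesses whose reuse distance is d blocks.
   A reuse misses iff d >= cache_blocks; first touches always miss. */
double predicted_miss_rate(const long *hist, size_t max_dist,
                           long cold_misses, size_t cache_blocks) {
    long total = cold_misses, misses = cold_misses;
    for (size_t d = 0; d <= max_dist; d++) {
        total += hist[d];
        if (d >= cache_blocks)       /* too far apart to stay cached */
            misses += hist[d];
    }
    return total ? (double)misses / (double)total : 0.0;
}
```

With 64-byte lines, the 32K L1 corresponds to cache_blocks = 512 and the 1M L2 to 16384.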
The program Swim demonstrates three useful features of the
model. First, the model is based on components, so it can reveal
different locality patterns within the same application. Second, the
model is based on different data-block sizes, so it can evaluate either cache-block reuse or the prefetching effect. Finally, it shows the potential for improvement: after array regrouping, little opportunity remains for further gains.
4.1.5 Analysis Time
The time cost of the GCC-based analyzer is around 350 times that of a normal execution, which is especially significant for long executions.
For example, the reference input of twolf has 110 billion memory
accesses, which takes 5 minutes 40 seconds in a normal execution
but over 32 hours to analyze. However, as we have observed from
the results in Section 4.1.2, we can identify locality components
and their spatial locality quality using much smaller inputs. Table 3
compares the analysis time needed for a large-enough input and the
time taken for the full analysis of the reference input.
The timing results show that the analysis time needed is at most around one hour, for twolf and vpr. Among the other programs, crafty and parser take half an hour; art and gzip need under 15 minutes; and mcf needs only 57 seconds with a very small number of accesses.
In the current implementation, we let the compiler insert a
function call for each memory reference in a program. The purpose
of the call is to store the data address in a buffer, and when the
buffer is full, invoke reuse-distance computation in a batch. We are
in the process of re-implementing the GCC-based instrumentor so
it inserts inlined, pre-optimized code instead of function calls.
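The buffered scheme described above can be sketched as follows; the function names are hypothetical, and the real GCC pass inserts the equivalent of record_access before every memory reference:

```c
#include <stddef.h>

#define BUF_SIZE 4096

static void *addr_buf[BUF_SIZE];   /* addresses awaiting analysis */
static size_t buf_len = 0;

/* Stand-in for the batched reuse-distance computation. */
static void process_batch(void **addrs, size_t n) {
    (void)addrs; (void)n;          /* ...feed addresses to the analyzer... */
}

/* Called once per memory reference by compiler-inserted code. */
void record_access(void *addr) {
    addr_buf[buf_len++] = addr;
    if (buf_len == BUF_SIZE) {     /* flush when the buffer fills */
        process_batch(addr_buf, buf_len);
        buf_len = 0;
    }
}
```

Batching amortizes the cost of entering the analyzer; inlining the recording code itself is the re-implementation mentioned above.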
4.1.6 A User Study
Computational methods are heavily used today in natural language processing (NLP), both in research and in publicly accessible (online) translation systems. Most methods build large-scale probabilistic models mapping the syntactic and semantic structure of the source language to the target language. The translation quality depends completely on the structure and the parameters of the model, which are obtained through exhaustive training over as many sentences as are available. A corpus typically contains many articles in the two languages.
The NLP group at Rochester has built an analyzer [33], which
is typically trained in 10 iterations, over 70,000 sentence pairs (in parallel) per iteration, at an average speed of about 4 seconds per sentence pair per iteration on PC clusters (an improvement from the over 1200 CPU hours per iteration reported in the original publication). For research, the model is improved as frequently as computationally possible. This analyzer consumes perhaps the most cycles on department computer servers.
Our effort was in part spurred by a request from the NLP group.
They have hand-optimized the code, about 2200 lines in C++, as
much as they could but were unsure about the memory performance, which they recognized as the greatest factor in running time. Once we built the GCC-based context-sensitive analyzer, we applied the tool to their code the next day. Here is a short account of what happened on that day.

Our analyzer, after hours of training the previous evening, showed the ranked list.² The worst-ranked function had about 10 statements, and less than 1% of their memory references had poor spatial reuse. The function was part of a library commonly used in the NLP community to improve numerical stability and running speed by representing and computing floating-point numbers using integer exponents. The poor spatial reuse was due to the access to different numbers and a table lookup. Working together with the NLP group, we reduced the table size by reducing the number of entries and by shrinking each entry from a 4-byte integer to a 2-byte integer. The results differed only marginally — the reported likelihood numbers from the revised program were no more than 1/2300 different from those of the original. However, the running time was reduced from 40.1 seconds to 37.4 seconds for a 6-sentence run. An improvement of over 7% was obtained with only 6 lines of code change — all in the library code.

This user study demonstrates the practical value of a spatial-locality model. First, a small change in spatial locality may have a significant performance impact. Second, a trace-based model can be used to analyze programs of arbitrary size and complexity to capture aggregate and composite behavior. Most applications today use components from external sources, and the tool can analyze external code for users. Finally, the user interface assists a programmer, who can improve an application based on high-level understanding and algorithmic changes that go beyond the limits of purely automatic techniques.

program (input): art test; art train; art ref 1; bzip2 train; bzip2 ref; gzip test; gzip train; gzip ref 1; mcf test; mcf train; mcf ref; parser test; parser train; parser ref; swim train; swim ref; twolf train; twolf ref; vpr train; vpr ref
component: 1; 2; 1; 2; 1; 2; 1; 2; 1; 1; 1; 2; 1; 2
component size: 33%; 33%; 33%; 1.1%; 2.3%; 3.1%; 3.0%; 4.3%; 22%; 3.1%; 37%; 3.8%; 42%; 2.1%; 1.3%; 2.8%; 5.2%; 5.3%; 5.1%; 5.2%; 5.1%; 5.2%; 8.0%; 8.0%; 5.0%; 8.4%
spatial locality: 1.00; 0.86; 0.65; 0.74; 0.32; 0.21; 0.25; 0.33; 0.40; 1.00; 0.41; 0.99; 0.38; 0.99; 0.80; 1.00; 0.80; 0.70; 0.25; 1.00; 0.25; 0.99; 0.51; 0.47; 0.21; 0.27
temporal locality: 23K; 25K; 31K; 23K; 12K; 2.7K; 2.5K; 2.0K; 3.6K; 66K; 13K; 1.1M; 33K; 2.8M; 2.9K; 38K; 14K; 11K; 2.0K; 446K; 2.9K; 2.1M; 5K; 10K; 3.3K; 8.2K

Table 2. Comparison of locality components across different input sizes (column values listed in order)

program   large-enough input   prof / exe time   ref-input prof / exe time
art       test                 11m8s / 4s        6h23m / 5m9s
crafty    test                 37m18s / 3s       18h56m / 1m15s
gzip      test                 11m50s / 2s       4h24m / 32s
mcf       test                 57s / 14s         8h34m / 5m50s
parser    train                38m8s / 6s        25h36m / 3m58s
twolf     train                67m52s / 9s       32h11m / 5m40s
vpr       train                61m30s / 8s       9h24m / 1m27s

Table 3. Comparison of analysis time between large-enough inputs and the reference inputs

4.2 Sampling-based Tool

For sampling, we have integrated our spatial-locality analyzer into the SLO tool developed by Beyls and D'Hollander for temporal locality analysis [4]. We call the combined system SLOR. The spatial component reuses the original implementation of reservoir sampling, and the methods to determine the number of samples to skip are nearly identical. The samples themselves, however, are completely different: SLO samples individual memory accesses, while SLOR collects samples of consecutive basic blocks for spatial locality analysis.

We have built a graphical user interface (GUI) to display spatial-locality information interactively. It is based on the GUI of SLO, which displays temporal-locality results including reuse paths and suggested computation transformations [4]. The temporal results are retained under the tab "Temporal", shown in the upper left corner of the screen shot in Figure 7.

To present spatial locality, we have added two more tabs. The "Spatial" tab, selected in the screen shot, lists the ten program statements with the worst spatial locality. The ranking can be parameterized by cache sizes, with or without a calling-context tree, which a user can specify in text fields. The ranking is shown in the first column, colored by different degrees of redness. The table shows the location of each statement, its spatial-locality score, and its contribution to the total number of poor spatial reuses. When a user selects one of the statements, the relevant code is displayed. In this example, three of the worst ten statements appear in the same loop in Mcf, drawing attention to a small piece of the program in the midst of thousands of lines of code.

5. Related Work
Spatial locality was first modeled using the notion of working
set. Bunt and Murphy considered two choices [5]. By examining
different page sizes, the first model quantified the change in reuse
signatures in terms of its fit to a Bradford-Zipf distribution. The
second model measured the frequency with which a group of h pages was accessed n consecutive times. The locality increased with
h, which means that the smaller the working set is, the better the
spatial locality. Somewhat similar to the first model, many studies
have examined the effect of different page sizes and cache block
sizes. Weinberg et al. defined a spatial locality score ranging from 0 (worst) to 1 (best), based on the physical closeness of the data elements accessed in each time window of size w [30]. It uses a
combination of the working set and the spatial distance. Murphy
and Kogge estimated spatial locality by the portion of data used in
64-byte memory blocks in each interval of 1000 instructions [19].
Berg and Hagersten defined spatial locality without using fixed-size windows but by the change in the miss rate when the cache-block size increases [3]. To enable fast measurement, they used sampling and approximated reuse distance using time. Our model
² The study was done before we implemented sampling analysis.

Figure 7. The GUI of the new analysis, displaying the spatial-locality ranking of program code. The list of the ten program statements in Mcf that contribute most to poor spatial locality is shown in the left half of the screen shot. A user can view these statements in program code, as shown in the right half of the screen shot.
uses the same high-level idea but defines spatial locality by components rather than for the whole program. As a result, we record the change of individual reuse distances and identify behavior components based on the length of the reuse distance. Berg and Hagersten
found that Swim had good overall spatial locality, while we show in
this paper that the program has a component with poor spatial locality, which could be improved and lead to significant performance
gain.
Ding and Zhong used a similar component-based analysis for
predicting the change of whole-program locality across data inputs.
They divided all data accesses of a program into a fixed number
of bins and modeled the pattern in each part by examining reuse
signatures from two different runs [9]. Shen et al. improved their
method by allowing mixed pattern inside each bin and by using
linear regression on more than two inputs [26]. They reported an
average accuracy of over 94% when predicting the (change in)
reuse signature for a new input. The technique was later used
to predict the cache miss rate across program inputs [34]. Marin
and Mellor-Crummey gave an adaptive method based on recursive
division for partitioning the data accesses of a program [15]. They
augmented the model to predict not just the miss rate but program
performance and to consider non-fully associative cache [16, 27].
Fang et al. showed that a linear distribution (rather than a uniform
distribution) inside each bin gave a better precision for integer
code [10].
While these studies developed parameterized models for different access patterns, the goal was to better model their combined
effect rather than to study them individually. They did not distinguish between temporal and spatial locality.
Sampling-based measurement of reuse distance has been tested
as part of a continuous program optimization system [7]. The sampling is made with hardware and operating system support. The
sampling accuracy is checked using a statistical technique, the
Hellinger affinity kernel [7]. Approximating reuse distance with
time distance is used in the SLO tool [4] and systematically studied
as a statistical problem [25]. Recent improvements include an extension to arbitrary-scale histograms and an implementation using the memory-management unit (MMU) [24].
Function and loop based sampling have been developed, where
the program is cloned and the execution switches periodically between the normal execution in the original code and the slower execution in the instrumented clone [2, 12]. Our sampling scheme is
based on basic blocks instead of high-level loop and function call
constructs. As a result, the samples from the previous technique
align with the program structure. The alternation with the original
code makes analysis almost as efficient as uninstrumented execution. For statistical profiling of data accesses, however, we need
to take samples at arbitrary times during execution and take samples of an arbitrary length. The block-based sampling is used in
SLO [4]. We extended it to collect not just individual memory accesses but streams of accesses from consecutively executed basic
blocks.
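For reference, the reservoir sampling that SLOR reuses (Section 4.2; cf. Li [14]) keeps a uniform random sample of K items from a stream of unknown length. A minimal sketch over plain integers (in SLOR, the items are streams of accesses from consecutively executed basic blocks):

```c
#include <stdlib.h>

#define K 4   /* reservoir capacity (illustrative) */

/* Classic reservoir sampling: after processing i+1 items, each item
   seen so far sits in the reservoir with probability K/(i+1). */
void reservoir_sample(const int *stream, long n, int *reservoir) {
    for (long i = 0; i < n; i++) {
        if (i < K) {
            reservoir[i] = stream[i];       /* fill the reservoir first */
        } else {
            long j = rand() % (i + 1);      /* uniform index in [0, i] */
            if (j < K)                      /* keep item with prob K/(i+1) */
                reservoir[j] = stream[i];
        }
    }
}
```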
6. Summary
In this paper, we have presented a new model of spatial locality based on how reuse distance changes as a function of data-block size. We have defined machine-dependent and machine-independent scores of spatial locality and divided the overall score into program- and behavior-level components. The new model is implemented in three tools. The first two perform full-trace analysis using binary- and source-level instrumentation. The third uses sampling analysis based on the SLO tool.
Using these analyzers, we have identified 16 components from 11 commonly used benchmarks. Among these, 2 have good temporal locality, 5 have good spatial locality, and 7 have poor spatial locality. We have examined the effect of inputs and data-block
sizes and shown that analysis time can be reduced by either using
smaller inputs or sampling. Most benchmarks require no more than
one hour of analysis time.
We have used the model-based tool to explain the effect of a data
transformation and estimate the potential for further improvement.
We have developed an interactive tool for program tuning. In a
user study, the tool helped to identify a small routine that had poor
spatial locality. The user was able to improve program performance
by 7% by changing only 6 lines of code.
Acknowledgment
We wish to thank Hao Zhang and Dan Gildea at Rochester for providing the opportunity for our user study, Yaoqing Gao and Roch
Archambault at IBM Toronto Lab for their advice and feedback,
Linlin Chen and Guy Steele for the help with the paper submission,
and the anonymous reviewers for their comments on the presentation.
Xiaoming Gu and Chengliang Zhang were students at the University of Rochester supported by the IBM Center for Advanced
Studies Fellowship during the course of this study. Ian Christopher was supported by an NSF REU grant (a supplement of CCR-0219848). Additional funding came from NSF (contract nos. CNS-0720796 and CNS-0509270).
References
[1] G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware
performance counters with flow and context sensitive profiling. In
Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation, pages 85–96, 1997.
[2] M. Arnold and B. G. Ryder. A framework for reducing the cost of
instrumented code. In Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation, Snowbird,
Utah, June 2001.
[3] E. Berg and E. Hagersten. Fast data-locality profiling of native
execution. In Proceedings of the International Conference on
Measurement and Modeling of Computer Systems, pages 169–180,
2005.
[4] K. Beyls and E. D’Hollander. Discovery of locality-improving
refactoring by reuse path analysis. In Proceedings of HPCC. Springer.
Lecture Notes in Computer Science Vol. 4208, pages 220–229, 2006.
[5] R. B. Bunt and J. M. Murphy. Measurement of locality and the
behaviour of programs. The Computer Journal, 27(3):238–245, 1984.
[6] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data
placement. In Proceedings of the Eighth International Conference
on Architectural Support for Programming Languages and Operating
Systems (ASPLOS-VIII), San Jose, Oct 1998.
[7] C. Cascaval, E. Duesterwald, P. F. Sweeney, and R. W. Wisniewski.
Multiple page size modeling and optimization. In Proceedings of the
International Conference on Parallel Architecture and Compilation
Techniques, St. Louis, MO, 2005.
[8] T. M. Chilimbi. Efficient representations and abstractions for
quantifying and exploiting data reference locality. In Proceedings
of the ACM SIGPLAN Conference on Programming Language Design
and Implementation, Snowbird, Utah, June 2001.
[9] C. Ding and Y. Zhong. Predicting whole-program locality with reuse
distance analysis. In Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation, San Diego,
CA, June 2003.
[10] C. Fang, S. Carr, S. Onder, and Z. Wang. Instruction based
memory distance analysis and its application to optimization. In
Proceedings of the International Conference on Parallel Architecture
and Compilation Techniques, St. Louis, MO, 2005.
[11] H. Han and C.-W. Tseng. Exploiting locality for irregular scientific
codes. IEEE Transactions on Parallel and Distributed Systems,
17(7):606–618, 2006.
[12] M. Hirzel and T. M. Chilimbi. Bursty tracing: A framework for lowoverhead temporal profiling. In Proceedings of ACM Workshop on
Feedback-Directed and Dynamic Optimization, Dallas, Texas, 2001.
[13] K. Kelsey, T. Bai, and C. Ding. Fast track: a software system
for speculative optimization. In Proceedings of the International
Symposium on Code Generation and Optimization, 2009.
[14] K.-H. Li. Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n))). ACM Transactions on Mathematical Software,
20(4):481–493, December 1994.
[15] G. Marin and J. Mellor-Crummey. Cross architecture performance
predictions for scientific applications using parameterized models. In
Proceedings of the International Conference on Measurement and
Modeling of Computer Systems, New York City, NY, June 2004.
[16] G. Marin and J. Mellor-Crummey. Scalable cross-architecture
predictions of memory hierarchy response for scientific applications.
In Proceedings of the Symposium of the Los Alamos Computer Science
Institute, Santa Fe, New Mexico, 2005.
[17] R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger. Evaluation
techniques for storage hierarchies. IBM System Journal, 9(2):78–117,
1970.
[18] K. S. McKinley, S. Carr, and C.-W. Tseng. Improving data locality
with loop transformations. ACM Transactions on Programming
Languages and Systems, 18(4):424–453, July 1996.
[19] R. C. Murphy and P. M. Kogge. On the memory access patterns of supercomputer applications: Benchmark selection and its implications.
IEEE Transactions on Computers, 56(7):937–945, 2007.
[20] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight
dynamic binary instrumentation. In Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and
Implementation, pages 89–100, 2007.
[21] E. Petrank and D. Rawitz. The hardness of cache conscious data
placement. In Proceedings of ACM Symposium on Principles of
Programming Languages, Portland, Oregon, January 2002.
[22] M. L. Seidl and B. G. Zorn. Segregating heap objects by reference
behavior and lifetime. In Proceedings of the Eighth International
Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS-VIII), San Jose, Oct 1998.
[23] X. Shen, Y. Gao, C. Ding, and R. Archambault. Lightweight reference
affinity analysis. In Proceedings of the 19th ACM International
Conference on Supercomputing, pages 131–140, Cambridge, MA,
June 2005.
[24] X. Shen and J. Shaw. Scalable implementation of efficient locality
approximation. In J. N. Amaral, editor, Proceedings of the Workshop
on Languages and Compilers for Parallel Computing, pages 202–216,
2008.
[25] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation
using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, pages 55–61, 2007.
[26] X. Shen, Y. Zhong, and C. Ding. Regression-based multi-model
prediction of data reuse signature. In Proceedings of the 4th Annual
Symposium of the Los Alamos Computer Science Institute, Santa Fe,
New Mexico, November 2003.
[27] A. J. Smith. On the effectiveness of set associative page mapping and
its applications in main memory management. In Proceedings of the
2nd International Conference on Software Engineering, 1976.
[28] SPEC CPU benchmarks. http://www.spec.org/benchmarks.html#cpu.
[29] M. M. Strout, L. Carter, and J. Ferrante. Compile-time composition
of run-time data and iteration reorderings. In Proceedings of the
ACM SIGPLAN Conference on Programming Language Design and
Implementation, pages 245–257, San Diego, CA, June 2003.
[30] J. Weinberg, M. O. McCracken, E. Strohmaier, and A. Snavely. Quantifying locality in the memory access patterns of hpc applications. In
Proceedings of Supercomputing, 2005.
[31] B. S. White, S. A. McKee, B. R. de Supinski, B. Miller, D. Quinlan,
and M. Schulz. Improving the computational intensity of unstructured
mesh applications. In Proceedings of the 19th ACM International
Conference on Supercomputing, pages 341–350, Cambridge, MA,
June 2005.
[32] C. Zhang, C. Ding, M. Ogihara, Y. Zhong, and Y. Wu. A hierarchical
model of data locality. In Proceedings of the ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, Charleston,
SC, January 2006.
[33] H. Zhang and D. Gildea. Stochastic lexicalized inversion transduction
grammar for alignment. In Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics, pages 475–482, 2005.
[34] Y. Zhong, S. G. Dropsho, X. Shen, A. Studer, and C. Ding. Miss
rate prediction across program inputs and cache configurations. IEEE
Transactions on Computers, 56(3):328–343, March 2007.
[35] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping
and structure splitting using whole-program reference affinity. In
Proceedings of the ACM SIGPLAN Conference on Programming
Language Design and Implementation, June 2004.