All-Window Profiling of Concurrent Executions

Chen Ding†‡ and Trishul Chilimbi†
† Microsoft Research
‡ Computer Science Department, University of Rochester
Abstract

This paper first demonstrates the need for all-window profiling by examining the effect of footprint and thread interleaving in a concurrent execution, then presents the basic algorithm for approximate all-window profiling, and finally discusses related work.

Categories and Subject Descriptors C.4 [Performance of Systems]; D.3.4 [Programming Languages]: Processors

General Terms measurement, performance

Keywords data footprint, thread interleaving, concurrent systems
1. Footprints

For a window over an execution trace, the footprint is the amount of data accessed in the window. Footprint is a basic metric of program locality and has been used to compute the lifetime of data blocks once they are loaded into cache [1] as well as the effect of cache sharing by multiple programs [3, 7]. Figuratively, the footprint determines how a program treads out its old data and how multiple programs step over each other in cache.

Consider an execution trace of a commercial server application. The execution consists of 22 concurrent threads for a total of 1.7 billion memory accesses. Figure 1 shows the instruction footprints for thread 40, which accounts for 29% of the instruction accesses. Both axes use a fine-grained logarithmic scale we call the 8-wide logarithmic histogram, where each bin of the (base 2) logarithmic scale is divided into 8 equal-size sub-bins (when the bin size is no smaller than 8). In the figure, the bins 50, 100, 150, and 200 represent the ranges [288, 319], [22.5K, 24.6K], [1.7M, 1.8M], and [126M, 134M] respectively. The x-axis shows 200 logarithmic ranges between 0 and 2^(200/8+2) = 2^27, or 134 million instruction accesses; the y-axis shows 90 logarithmic ranges for the footprint, up to 9,750 instruction blocks.

We used a sampling method to collect the data shown in this and the other figures in the paper. At each memory access, with equal probability the method picks a time range r and a window size x within the range. The size x uniquely determines the window, which includes the current access and the previous x − 1 accesses. The method then measures the volume of data in the window, using the same algorithm as for measuring reuse distance, and it ensures that all time ranges are sampled equally. However, since the total number of windows for a trace of length n is O(n^2), the sampling rate is only n / (n(n−1)/2) = 2/(n−1). The exceedingly low sampling rate raises the question of whether we can measure all O(n^2) windows to verify the accuracy of sampling. This motivated us to develop all-window profiling.
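To make the sampling step concrete, the sketch below shows one way it could be coded. It is only an illustration under stated assumptions: the function name is ours, and it uses plain power-of-two size ranges rather than the finer 8-wide sub-bins used for the figures.

    import random

    def sample_footprint(trace, i, max_log=27):
        """Sketch of one sampling step at access i (0-based): pick a window-size
        range with equal probability, pick a size x uniformly inside it, and
        return x together with the footprint of the window ending at access i,
        i.e. the number of distinct data touched by the current access and the
        previous x - 1 accesses."""
        k = random.randrange(max_log)                 # every size range is equally likely
        x = random.randint(2 ** k, 2 ** (k + 1) - 1)  # window size within the range
        x = min(x, i + 1)                             # clip at the start of the trace
        return x, len(set(trace[i - x + 1 : i + 1]))  # footprint = distinct data in the window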
For a window of x instruction accesses by thread 40, shown on the x-axis, its footprint, or the volume of accessed instructions, is given by the y-axis. It also shows the average lifetime of data in a cache of a given size: for example, for a cache of 128 blocks, or 8KB with 64-byte blocks, the average time for thread 40 to access this much instruction data is around 800 instruction accesses. In a machine with a shared cache, the footprint of thread 40 has a significant effect on the other threads: the faster it accumulates its footprint, the more likely it causes eviction of the other threads' instruction data.
The five curves show the cumulative distribution of footprints: for each time window of size x, up to 0%, 5%, 50%, 95%, or 100% of footprints have a size under the y value marked by these five curves.

Figure 1. The 8-wide histogram of the instruction footprints of server thread 40, measured by sampling at a rate O(1/n).
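For reference, the 8-wide bin index can be computed as in the sketch below. The exact indexing convention is not spelled out in the text, so the offset used here is an assumption chosen so that the function reproduces the bin ranges quoted above; the function name is ours.

    def eight_wide_bin(value):
        """Map a non-negative integer to its bin in the 8-wide logarithmic
        histogram: each power-of-two bin [2^k, 2^(k+1)) is split into 8 equal
        sub-bins once it is at least 8 wide (k >= 3); smaller values get one
        bin each so that the index stays continuous."""
        if value < 8:
            return value + 1                  # bins 1..8 hold the values 0..7
        k = value.bit_length() - 1            # 2^k <= value < 2^(k+1)
        sub = (value - (1 << k)) * 8 >> k     # which of the 8 equal sub-bins
        return 8 * (k - 2) + sub + 1

    # The assumed indexing reproduces the ranges cited in the text:
    assert eight_wide_bin(288) == 50          # bin 50 begins the range [288, 319]
    assert eight_wide_bin(22528) == 100       # ~22.5K, start of [22.5K, 24.6K]
    assert eight_wide_bin(1703936) == 150     # ~1.7M, start of [1.7M, 1.8M]
    assert eight_wide_bin(125829120) == 200   # ~126M, start of [126M, 134M]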
The middle curve is labeled “expected” as it shows the median value; the other four curves show more extreme cases. Many windows were sampled for each window size: the top and bottom curves show the largest and smallest sampled footprints, the second and fourth show the upper watermarks for 95% and 5% of footprints, and the middle curve shows the median footprint. The distribution contains a wealth of information. For example, the “expected” curve shows that half of the windows of 800 instruction accesses touched 128 instruction blocks (8KB) or less, and the “0%” curve shows that there existed long periods of time, 100K accesses, where very few data, about 10 blocks, were accessed.

The median footprint is not smooth and has many small bumps and breaks. However, the area between the “5%” and “95%” curves shows that the middle 90% of footprints follow a smooth, consistent upward trend. The rate of increase slows down and takes a shallower slope as the window size reaches 25K instruction accesses. The bi-linear shape is common in the histogram of reuse distances, where the point of the knee gives the size of the working set of the application and signals a change of locality. The knee in the footprint curve, however, represents not a change of locality but a change of interference: it shows that the rate of interference by the thread drops over longer periods of execution. This is expected, but the important question for each program is where and by how much the rate of interference changes, and we will not have a complete answer unless we can measure the footprint for all windows.
2. Thread Interleaving

The threads in modern multi-threaded applications do not interleave uniformly. This is generally the case for client applications with asymmetrical functions for each thread, where one or a few threads carry out most of the work, some executing ten times more instructions than others, while the remaining threads are invoked only periodically. Even for symmetric server workloads, the relative rate of execution of parallel threads may change from one phase to another. The degree of interleaving strongly affects the use of shared resources such as cache and memory.

Figure 2. The 8-wide histogram of the interleaving between server threads 40 and 88 (thread 88 time on the x-axis, thread 40 time on the y-axis), measured by sampling at a rate O(1/n).
Using the same sampling rate as in the case of footprint, we measured the interleaving between two threads, thread 40 and thread 88, in the same execution trace. We thought that, since the two are active server threads executing a server workload, their executions would be interleaved fairly uniformly. To our surprise, the result, shown in Figure 2, suggests that uniform interleaving is an exception rather than the norm: it happens only for 5% of windows that are larger than 100 accesses. In most cases, the median degree of interleaving is zero, meaning that only one of the two threads was executing. However, this may be an artifact of our measurement, since we sample on average only one in every n possible windows; without all-window statistics, we cannot say for sure whether the interleaving is truly imbalanced in all windows or whether the overall imbalance is the same as what we observe from the samples. Accurate knowledge of this is extremely important in modeling the effect of concurrent executions.
3. Approximate All-Window Profiling

Given an n-element execution trace t1, t2, ..., tn, the basic algorithm traverses the trace from left to right. At each element ti, it counts all the i windows ending at ti. The c-approximate analysis guarantees that the measured result for each window is between c and 100% of the actual result, where c is between 0 and 1.

The trick of the analysis is to count multiple windows at each step. This is done by building on the idea of the approximate profiling algorithm by Ding and Zhong [4]. For each ti, the algorithm maintains a division of the trace t1, t2, ..., ti into O(log i) time ranges, r1, ..., rk. It keeps track of the total count, either the number of data blocks or the number of instructions, for each time range. A backward traversal of the ranges from rk to r1 gives the cumulative count for windows that begin in ri and end at ti. This cumulative count of each ri is used for all windows starting in ri; hence the algorithm counts all i windows in O(log i) instead of i steps.

To profile all footprints, we build on the Ding-Zhong algorithm directly. It represents the trace of the first i elements, which access j distinct data, in an approximate tree of O(log j) nodes, each representing a range of the partial trace, with the ranges organized in a search tree. Each range stores the number of last accesses made during the range: an element a has its last access in range r if a is accessed during r but not again until time ti. In this way, the information in the O(log i) ranges is summarized by a constant number of values per range without compromising the precision, and the i windows are again counted in O(log i) instead of i steps. The idea of storing last accesses is due to Bennett and Kruskal [2], and the use of a search tree is due to Olken [5].

The algorithm maintains the partition of time ranges as follows. As it goes through the trace, it creates a new time range for each access. Periodically, it stops and compresses the time ranges. By choosing the length of the period to be proportional to log i, it bounds the cost of each compression by O(log i) and the amortized cost of each access by O(1). The exact formula for the periodic compression is the same as the one used by Ding and Zhong [4], and it depends on the desired precision c.

For a window starting at an element in range rb and ending at range rk (the range containing the current element i), the footprint is computed as the sum of the last-access counts of the ranges from rb+1 to rk. The Ding-Zhong algorithm ensures that this sum is between c and 100% of the actual footprint. To measure the footprint, we simply use this sum for all windows starting in range rb. Since there are O(log j) ranges, the counting of the i windows takes O(log j) rather than O(i) steps.
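To make the mechanics concrete, the following is a minimal sketch of such a footprint profiler, not the authors' implementation: the class name is ours, a bisect over range start times stands in for the search tree [5], and both the compression trigger and the merge rule are simplified stand-ins for the precision-driven formula of Ding and Zhong [4].

    import bisect
    from collections import defaultdict

    class ApproxFootprintProfiler:
        """Sketch of approximate all-window footprint profiling.

        The trace is divided into time ranges (oldest first). Each range
        records how many trace positions it covers and how many data elements
        had their last access inside it. A window that starts in range r_b and
        ends at the current access is credited with the sum of the last-access
        counts of the ranges after r_b, which never over-counts its footprint.
        """

        def __init__(self, c=0.9):
            self.c = c                    # target precision, 0 < c < 1
            self.starts = []              # start time of each range, oldest first
            self.lengths = []             # trace positions covered by each range
            self.last_counts = []         # last accesses falling in each range
            self.last_access = {}         # datum -> time of its last access
            self.histogram = defaultdict(int)  # (approx. window size, estimate) -> windows
            self.time = 0

        def access(self, datum):
            prev = self.last_access.get(datum)
            if prev is not None:
                # the datum's previous access is no longer a last access there
                self.last_counts[bisect.bisect_right(self.starts, prev) - 1] -= 1
            self.last_access[datum] = self.time
            self.starts.append(self.time)          # open a one-position range
            self.lengths.append(1)
            self.last_counts.append(1)
            self._count_windows()
            if len(self.starts) > 8 * max(1, self.time.bit_length()):
                self._compress()                   # stand-in periodic trigger
            self.time += 1

        def _count_windows(self):
            # credit every window ending at the current access, range by range
            suffix = 0
            for start, length, count in zip(reversed(self.starts),
                                            reversed(self.lengths),
                                            reversed(self.last_counts)):
                self.histogram[(self.time - start + 1, suffix)] += length
                suffix += count           # counts of the ranges after the start range

        def _compress(self):
            # merge a range into its successor when the count given up is small
            # relative to what remains (simplified; not the exact rule of [4])
            i = 0
            while i < len(self.starts) - 1:
                if self.last_counts[i + 1] <= (1.0 / self.c - 1.0) * sum(self.last_counts[i + 2:]):
                    self.lengths[i + 1] += self.lengths[i]
                    self.last_counts[i + 1] += self.last_counts[i]
                    del self.starts[i + 1], self.lengths[i], self.last_counts[i]
                else:
                    i += 1

Feeding it a trace of data-block identifiers (for d in trace: profiler.access(d)) yields an estimate for every window, binned by approximate window size and footprint.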
Hence the overall cost is O(n log m), where m is the number of distinct data accessed by the trace. This is also the cost of maintaining the approximate tree [4]; hence it is the total cost of all-window profiling of footprints.

For thread interleaving, we guarantee the error to be less than 1 − c of the size of the window of the interleaved execution. The algorithm similarly divides the trace up to i into O(log i) time ranges and maintains the division using periodic compression; however, there is no need to organize the time ranges in a search tree, unlike the case of footprint measurement. Each range stores the execution counts: a counter, for each thread, of the number of instructions it has executed up to the start of the range. For all windows starting in range rb, the interleaving is estimated as the difference between the current counts and the counts stored at rb. The ranges need to be dynamically maintained based on the precision c in a similar way as in the Ding-Zhong algorithm [4], so the precision is guaranteed.
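A minimal sketch of this interleaving profiler is given below; it is illustrative rather than the authors' code. The class name is ours, and the compression is reduced to a simple cap on the number of ranges instead of the precision-driven rule described above.

    class ApproxInterleavingProfiler:
        """Illustrative sketch of all-window interleaving profiling.

        Each time range remembers, per thread, how many accesses that thread
        had executed when the range began. Every window that starts in range
        r_b and ends at the current access is credited with the difference
        between the current totals and the totals stored at r_b.
        """

        def __init__(self, num_threads, max_ranges=64):
            self.totals = [0] * num_threads   # accesses executed so far, per thread
            self.ranges = []                  # (positions covered, totals at range start)
            self.max_ranges = max_ranges      # stand-in for the precision-driven compression

        def access(self, thread_id):
            """Process one access; return (window positions, per-thread estimate)
            for every time range, i.e. for all windows ending at this access."""
            self.ranges.append((1, tuple(self.totals)))   # a new range begins here
            self.totals[thread_id] += 1
            estimates = [(length, tuple(now - then for now, then in zip(self.totals, start)))
                         for length, start in self.ranges]
            if len(self.ranges) > self.max_ranges:
                self._compress()
            return estimates

        def _compress(self):
            # merge adjacent pairs, keeping the totals recorded at the earlier start
            merged = []
            for j in range(0, len(self.ranges), 2):
                if j + 1 < len(self.ranges):
                    merged.append((self.ranges[j][0] + self.ranges[j + 1][0], self.ranges[j][1]))
                else:
                    merged.append(self.ranges[j])
            self.ranges = merged

For two threads, an estimate in which one thread's count is zero corresponds to a window with zero degree of interleaving, the common case observed in Figure 2.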
4. Related Work

Agarwal et al. counted the number of cold-start misses for different-size windows starting from the beginning of a trace [1]. For time-sharing environments, Suh et al. used the footprints to evaluate the effect of the scheduling quantum on cache locality [7]. Chandra et al. modeled the parallel execution where the locality of one thread is affected by the footprint of another thread [3]. The last two methods approximated the average footprint by solving a recursive equation. Let E[w_t] be the average footprint for a window of size t, and M(f) be the average miss rate for a cache of size f, estimated from the reuse signature. At each memory access, the footprint either increments by one or stays the same, depending on whether the accessed data is new or not; this is equivalent to checking whether the access is a miss in a cache of infinite size. The expected footprint at time t + 1 can then be computed from the footprint at time t as follows:

E[w_{t+1}] = E[w_t](1 − M(E[w_t])) + (E[w_t] + 1)M(E[w_t])
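As an illustration, the recurrence can be iterated directly. The sketch below assumes only that some miss-rate function M(f) is supplied, for example fitted from a reuse signature; it is not taken from any of the cited papers.

    def expected_footprints(miss_rate, horizon):
        """Iterate the average-footprint recurrence for `horizon` steps.
        `miss_rate` is an assumed function M(f): the average miss rate for a
        cache of size f, e.g. estimated from a reuse signature."""
        w = [0.0]
        for _ in range(horizon):
            m = miss_rate(w[-1])
            # E[w_{t+1}] = E[w_t](1 - m) + (E[w_t] + 1)m, which equals E[w_t] + m
            w.append(w[-1] * (1 - m) + (w[-1] + 1) * m)
        return w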
Suh et al. simplified the recurrence into a differential equation that has a solution [7]. Chandra et al. computed the recursive relation in a bottom-up fashion [3]. A third technique, recently developed by Shen et al., estimated the footprint using statistical equations based only on the distribution of reuse times [6].
The previous methods are limited in two ways. First, they compute or estimate the average footprint but not the complete distribution. The average footprint summarizes basically O(n) values with a single number and can be strongly influenced by a few large values; for example, one footprint of 10000 would have the effect of 1000 footprints of 10. Second, since the previous methods do not actually measure the footprint of all windows, they do not guarantee the accuracy of the result. Our new algorithm, although not yet implemented, would be able to overcome these limitations.

Acknowledgments

The authors wish to thank Bao Bin at Rochester and the reviewers of PPoPP 2008 for their comments, which helped to improve the presentation.

References

[1] A. Agarwal, J. L. Hennessy, and M. Horowitz. Cache performance of operating system and multiprogramming workloads. ACM Transactions on Computer Systems, 6(4):393–431, 1988.
[2] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal of Research and Development, pages 353–357, July 1975.
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2005.
[4] C. Ding and Y. Zhong. Predicting whole-program locality with reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Diego, CA, June 2003.
[5] F. Olken. Efficient methods for calculating the success function of fixed space replacement policies. Technical Report LBL-12370, Lawrence Berkeley Laboratory, 1981.
[6] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 55–61, 2007.
[7] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In Proceedings of the International Conference on Supercomputing (ICS), pages 1–12, 2001.