All-Window Profiling of Concurrent Executions

Chen Ding†‡ and Trishul Chilimbi†
† Microsoft Research
‡ Computer Science Department, University of Rochester

Abstract
This paper first demonstrates the need for all-window profiling in a concurrent execution, then presents an approximate algorithm, and finally discusses related work.

Categories and Subject Descriptors C.4 [Performance of Systems]; D.3.4 [Programming Languages]: Processors
General Terms measurement, performance
Keywords data footprint, thread interleaving, concurrent systems

Copyright is held by the author/owner(s).
PPoPP'08, February 20–23, 2008, Salt Lake City, Utah, USA.
ACM 978-1-59593-960-9/08/0002.

1. Footprints
Given any window over an execution trace, the footprint is the amount of data accessed in the window. Footprint is a basic metric of program locality and has been used to compute the lifetime of data in cache, that is, how a program steps out its old data [1], and to compute the effect of cache sharing by multiple threads, that is, how they step over each other [3, 7].

Consider an execution trace of a commercial server application. The execution consists of 22 concurrent threads for a total of 1.7 billion memory accesses. Figure 1 shows the instruction footprints for thread 40, which accounts for 29% of instruction accesses.
Both axes use a fine-grained logarithmic scale we call the 8-wide histogram, where the logarithmic scale is refined by dividing each power-of-two size bin into 8 equal-size sub-bins (when the size of the bin is no smaller than 8). In the figure, the bins 50, 100, 150, and 200 represent the ranges [288, 319], [22.5K, 24.6K], [1.7M, 1.8M], and [126M, 134M] respectively.

We used a sampling method to collect the data shown in this and the other figures in the paper. At each memory access, with equal probability the method picks a range r and a window size x within the range. The size x uniquely determines the window, which includes the current access and the previous x − 1 accesses. The method then measures the volume of data in the window, using the same algorithm as for measuring reuse distance. It ensures that all time ranges are sampled equally. However, since the total number of windows for a trace of length n is n(n−1)/2, sampling one window per access gives a rate of only n / (n(n−1)/2) = 2/(n−1), that is, O(1/n). The exceedingly low sampling rate raises the question whether we can measure all O(n^2) windows to verify the accuracy of sampling. This motivated us to develop all-window profiling.
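The sampling procedure just described can be sketched as follows. This is a minimal Python sketch, not the instrumentation used in the paper: the trace is assumed to be a sequence of data identifiers, the logarithmic ranges are given as inclusive (lo, hi) pairs, and the function name is ours.

```python
import random

def sample_footprints(trace, ranges, rng=random.Random(0)):
    """At each access, pick one of the time ranges with equal probability,
    then a window size x within it; the window covers the current access
    and the previous x - 1 accesses. Returns (x, footprint) samples."""
    samples = []
    for i in range(len(trace)):
        lo, hi = ranges[rng.randrange(len(ranges))]
        x = rng.randint(lo, hi)
        if x <= i + 1:                        # only full windows are measured
            window = trace[i + 1 - x : i + 1]
            samples.append((x, len(set(window))))  # footprint = distinct data
    return samples
```

Because one window is sampled per access, roughly n of the n(n−1)/2 possible windows are measured, matching the O(1/n) rate above.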
For a window of x instruction accesses by thread 40, shown on the x-axis, its footprint, or the volume of accessed instructions, is given by the y-axis. The x-axis shows 200 logarithmic ranges between 0 and 2^(200/8+2) = 2^27 or 134 million instruction accesses. The y-axis shows 90 logarithmic ranges for the footprint up to 9,750 instruction blocks. The five curves show the cumulative distribution of footprints: for each time window of size x, up to 0%, 5%, 50%, 95%, or 100% of footprints have a size under the y value marked by these five curves. The middle curve is labeled "expected" as it shows the median value. The other four curves show more extreme cases.

The footprint matters for cache sharing. For example, for a cache size of 128 blocks, or 8KB with 64-byte blocks, the average time for thread 40 to access this much instruction data is around 800 instruction accesses. In a machine with a shared cache, the footprint from thread 40 has a significant effect on other threads: the faster it accumulates its footprint, the more likely it causes eviction of other threads' instructions or data.

Figure 1. The 8-wide histogram for the instruction footprints of server thread 40, measured by sampling at a rate O(1/n). Many windows for a given window size were sampled. The top and bottom curves show the largest and smallest footprints, the second and fourth show the upper watermark for 95% and 5% of footprints, and the middle curve shows the median footprint.
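The 8-wide scale admits a direct encoding. The following Python sketch is one plausible bin-index function consistent with the description above (the figures may index bins with an offset of one); the function name is ours.

```python
def bin8(v):
    """8-wide logarithmic bin for a non-negative integer v: each
    power-of-two interval [2^k, 2^(k+1)) is split into 8 equal sub-bins
    once the interval holds at least 8 values (k >= 3); smaller values
    get one bin each."""
    if v < 8:
        return v
    k = v.bit_length() - 1                   # v lies in [2^k, 2^(k+1))
    sub = (v - (1 << k)) * 8 // (1 << k)     # which of the 8 sub-bins
    return 8 + (k - 3) * 8 + sub
```

Under this encoding all values below 2^27 fall into 200 bins, matching the x-axis of Figure 1, and [288, 319] forms a single sub-bin, matching the bin ranges quoted earlier.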
The distribution contains a wealth of information. For example, the "expected" curve shows that half of the windows of 800 instruction accesses touched 128 instruction blocks (8KB) or less, and the "0%" curve shows that there existed long periods of time, 100K accesses, where very little data, about 10 blocks, was accessed.

The median footprint is not smooth and has many bumps and breaks. However, the area between the "5%" and "95%" curves shows that the middle 90% of footprints follow a smooth, consistent upward trend. The rate of increase slows down and takes a shallower slope as the window size reaches 25K instruction accesses. This bi-linear shape is common in the histogram of reuse distances, where the point of the knee gives the size of the working set and signals a change of locality. The knee in the footprint curve, however, represents not a change of locality but a change of interference.
It shows that the rate of interference by the thread drops over longer periods of execution. This is expected, but the important question is when and how much the rate of interference changes, and a complete answer requires measuring the footprint of all windows.

2. Thread Interleaving
In modern concurrent applications, the execution of threads may not interleave uniformly. This is generally the case for client applications with asymmetrical functions for each thread, where one or a few threads carry out most of the work while other threads are invoked periodically. Even for symmetric server workloads, the relative rate of execution of parallel threads may change from one phase to another. The degree of interleaving strongly affects the use of shared resources such as cache and memory.

Using the same 1/n sampling rate as in the case of footprint, we measured the interleaving between two threads, thread 40 and thread 88, in the execution trace mentioned before.
Since the two threads are active server threads, one may expect uniform interleaving. However, the sampling result in Figure 2 shows that uniform interleaving is an exception rather than the norm: it happens only for 5% of windows of a size larger than 100 accesses. In most cases, the median degree of interleaving is zero, meaning that only one of the two threads was executing. This may, however, be an artifact of our measurement, since we sample on average only one in every n of the possible windows. Without all-window statistics, we cannot say for sure whether the interleaving is truly imbalanced in all windows and whether the overall imbalance matches what we observe from the samples. Accurate knowledge of this is extremely important in modeling the effect of concurrent executions.

Figure 2. The 8-wide histogram of the interleaving between server threads 40 and 88, measured by sampling at a rate O(1/n).

3. Approximate All-Window Profiling
Given an n-element execution trace t1, t2, ..., tn, the basic algorithm traverses the trace from left to right. At each element ti, it counts all the i windows ending at ti.
The c-approximate analysis guarantees that the measured result for each window is between c and 100% of the actual result, where c is between 0 and 1.

The trick of the analysis is to count multiple windows at each step. This is done by building on the idea of an approximate profiling algorithm by Ding and Zhong [4]. For each ti, the algorithm maintains a division of the trace t1, t2, ..., ti into O(log i) time ranges, r1, ..., rk. It keeps track of the total count, either the number of data blocks or instructions, for each time range. A backward traversal from rk to r1 gives the cumulative count for windows that begin in ri and end at ti. This cumulative count of ri is used for all windows starting in ri. Hence the algorithm counts all i windows in O(log i) instead of i steps.

To profile all footprints, we build on the Ding-Zhong algorithm directly. It represents the trace of the first i elements accessing j distinct data in an approximate tree of O(log j) nodes, each representing a range of the partial trace. Each range stores the number of last accesses made during the time range.
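The backward-traversal trick can be illustrated in isolation. The following Python sketch assumes the per-range totals are already maintained, ordered from the oldest range r1 to the newest rk; the function name is ours.

```python
def counts_for_windows_ending_here(range_totals):
    """Given per-range totals for r1..rk (oldest first), a backward scan
    from rk to r1 yields the cumulative count for windows that begin in
    each range and end at the current element. Every window starting
    inside a range is approximated by that range's cumulative count."""
    cum = 0
    out = [0] * len(range_totals)
    for j in range(len(range_totals) - 1, -1, -1):
        cum += range_totals[j]
        out[j] = cum
    return out
```

One pass over the O(log i) ranges thus produces a count for every one of the i windows ending at the current element.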
An element a has its last access in time range r if a is accessed during r but not again by time ti. The idea of storing last accesses dates back to Bennett and Kruskal [2], and the use of a search tree is due to Olken [5].

The algorithm maintains the partition of time ranges as follows. As it goes through the trace, it creates a new time range for each access. Periodically, it stops and compresses the time ranges. By choosing the length of the period to be proportional to log i, it bounds the cost of each compression in O(log i) and the amortized cost per access in O(1). The exact formula for the periodic compression is the same as the one used by Ding and Zhong [4], which depends on the desired precision c.

For a window starting at an element in range rb and ending at range rk (which contains the current element i), the footprint is computed as the sum of the last-access counts of the ranges rb+1 to rk. The Ding-Zhong algorithm, using periodic compression, ensures that this sum is between c and 100% of the actual footprint. To measure the footprint, we use this sum for all windows starting in range rb. Since there are O(log j) ranges, the counting of i windows takes O(log j) rather than O(i) steps. Hence the overall cost is O(n log m), where m is the number of distinct data accessed by the trace. This is also the cost for maintaining the approximate tree [4]; hence it is the total cost for all-window profiling of footprints.
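For reference, last-access counting can be checked against a naive exact version. The following Python sketch keeps an exact last-access count per position with no range compression, so it counts the windows ending at each position in O(i) rather than O(log i) steps; it is a correctness baseline, not the paper's algorithm.

```python
from collections import defaultdict

def all_window_footprints(trace):
    """Exact all-window footprints via last-access times. The footprint of
    window trace[b..i] equals the number of elements whose last access in
    trace[0..i] falls at a position >= b, so a backward scan over per-
    position last-access counts yields every window ending at i."""
    last = {}                       # datum -> position of its last access so far
    last_count = defaultdict(int)   # position -> number of last accesses there
    result = {}                     # (start, end) -> footprint of trace[start..end]
    for i, d in enumerate(trace):
        if d in last:
            last_count[last[d]] -= 1   # d's previous access is no longer "last"
        last[d] = i
        last_count[i] += 1
        fp = 0
        for b in range(i, -1, -1):     # windows trace[b..i], newest start first
            fp += last_count[b]
            result[(b, i)] = fp
    return result
```

The approximate algorithm replaces the per-position counts with per-range counts, trading the O(i) inner loop for an O(log i) one at precision c.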
Chandra et al.The algorithm 1−c of the window of thelocality interleaved execution. modeled the parallel execution the locality of one thread divides the trace up to iwhere into O(log i) ranges. Each range stores the was affected by the footprint of another [3]. two of executed execution counts, which have a thread counter forThe thelast number methods approximated average footprint solving recursive instructions fortheeach thread up tobythe starta of the range. For all equation. Let E[wstarting average footprint for ainterleaving window of size t ] be the from windows range rb , the is estimated as t, and M ) be the average miss rate cachecounts of size fand (estimated the(fdifference between the for current the counts stored at from the For each access, maintained the footprint based on the rb .reuse Thesignature). ranges need to bememory dynamically either increments by one or stays the same depending on whether precision c in a similar way as in the Ding-Zhong algorithm [4], so the accessed data is new or not. This is equivalent to checking the precision is guaranteed. whether the access is a miss in an cache with infinite size. The expected footprint at time t + 1 can then be computed from the 4.at t RELATED WORK footprint as follows Agarwal et al. counted the number of cold-start misses for difstarting E[wferent E[wwindows (E[wthe 1)M (E[wtof ]) a trace [1]. For t+1 ] =size t ](1 − M (E[w t ]) +from t ] +beginning time-sharing environments, Suh et al. used the footprints to evaluSuh et ate al. simplified a differential equationon[7]. Chandra et [6]. Chanthe effectitofinto scheduling quantum cache locality al. computed the recursive relation bottom up. A third technique, dra et al. modeled the parallel execution where the locality of one recently developed by Shen et al., estimated the footprint using thread is affected bythe thedistribution footprintofofreuse another statistical equations based on timesthread [6]. [3]. 
The last methods tried to approximate averagebut footprint with the Thetwo previous methods compute or estimate the the average not following recursive equation. isLet footprint t ] be the the complete distribution. A drawback thatE[w the average canaverage be forinfluenced a window size t, values. and MThe (f )problem be theisaverage strongly by aoffew large inherent miss rate for cache of size f (estimated from the reuse all signature), since the previous methods do not actually measure windows.then Our new algorithm, though not yet implemented, would be able to overcome this limitation. E[wt+1 ] = E[wt ](1 − M (E[wt ]) + (E[wt ] + 1)M (E[wt ]) Suh et al. simplified it into a differential equation that has a soAcknowledgments lution [6]. Chandra et al. computed the recursive relation in a The authors wish to thank Bao Bin at Rochester and the reviewers bottom-up A third which technique, developed by Shen of PPOPP 2008 forfashion. their comments, helpedrecently to improve the et al., estimated the footprint using statistical equations based only presentation. on the distribution of reuse times [5]. The previous methods are limited because they do not guarantee References the accuracy nor provide the exact distribution (in addition to the [1] A. average). Agarwal, J. L. Hennessy, and M. Horowitz. Cache performance The average footprint summarizes basically O(n) values of operating system and multiprogramming workloads. ACM with a single number. The average can be misled by few large Transactions on Computer Systems, 6(4):393–431, 1988. values. For example, one footprint of 10000 would have the effect [2] B. T. V. J. Kruskal. LRUSecond, stack processing. IBMprevious Journal methods do ofBennett 1000 and footprints of 10. since the of Research and Development, pages 353–357, July 1975. not measure the footprint of all windows, they do not guarantee [3] D. the Chandra, F. Guo,ofS.the Kim, and Y. Our Solihin. inter-although not yet accuracy result. 
The previous methods compute or estimate the average footprint but not the complete distribution. The average summarizes O(n) values with a single number and can be strongly influenced by a few large values; for example, one footprint of 10,000 would have the same effect as 1,000 footprints of 10. In addition, since these methods do not measure the footprint of all windows, they do not guarantee the accuracy of the result. Our new algorithm, although not yet implemented, would be able to overcome these limitations.

Acknowledgments
The authors wish to thank Bao Bin at Rochester and the reviewers of PPoPP 2008 for their comments, which helped to improve the presentation.

References
[1] A. Agarwal, J. L. Hennessy, and M. Horowitz. Cache performance of operating system and multiprogramming workloads. ACM Transactions on Computer Systems, 6(4):393–431, 1988.
[2] B. T. Bennett and V. J. Kruskal. LRU stack processing. IBM Journal of Research and Development, pages 353–357, July 1975.
[3] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2005.
[4] C. Ding and Y. Zhong. Predicting whole-program locality with reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, San Diego, CA, June 2003.
[5] F. Olken. Efficient methods for calculating the success function of fixed space replacement policies. Technical Report LBL-12370, Lawrence Berkeley Laboratory, 1981.
[6] X. Shen, J. Shaw, B. Meeker, and C. Ding. Locality approximation using time. In Proceedings of the ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 55–61, 2007.
[7] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache models with applications to cache partitioning. In Proceedings of the International Conference on Supercomputing, pages 1–12, 2001.