Proposal for Power and Energy analysis of different prefetching mechanisms for Linked Data Structures

NIKHIL JAGDALE
KARAN KHANNA
AMAN KUMAR

I. PROBLEM DEFINITION AND MOTIVATION
II. BRIEF SURVEY OF RELATED WORK
III. SOME INITIAL IDEAS
IV. EXPERIMENTAL METHODOLOGY/ SETUP
V. TIMETABLE
VI. REFERENCES

I. Problem definition and motivation

Memory latencies in today’s processors are on the order of hundreds of processor clock cycles. In programs that contain linked data structures (pointer-intensive programs), the penalty of cache misses can be costly. Several pre-fetching techniques for pointer-intensive programs exist that improve performance, but at the expense of increased energy consumption. It is unclear which of these techniques gives the best trade-off between performance improvement and energy consumption. Furthermore, knowing which pre-fetching technique is best suited to low energy consumption is vital to domains such as embedded systems, which are striving to keep up with the increased demand for computational performance while maintaining longer battery life in portable devices.

II. Brief survey of related work

Although considerable effort is going into data pre-fetching techniques in the research community, few studies seem to have characterised existing techniques with respect to energy consumption. To the best of our knowledge, the most closely related research is by Yao Guo et al. [1], which evaluates a set of hardware-based data pre-fetching techniques from an energy standpoint. Their study covers two sequential pre-fetching techniques, stride pre-fetching, and dependence-based pre-fetching, of which only the last deals with irregular applications/linked data structures.
However, we acknowledge that no single investigation can explore all proposed techniques from an energy perspective, and we feel that it would be interesting to analyse further techniques not covered in that study. In this project, we intend to focus exclusively on a comparative energy and performance evaluation of pre-fetching approaches that deal with irregular applications/linked data structures, covering both hardware and software approaches. We further want to use our results, along with any existing results such as those from [1], to evaluate the suitability of one or more of these techniques for hand-held embedded devices such as smartphones, where minimizing energy consumption is of relatively high importance.

The approaches we have short-listed so far for our analysis include, but are not limited to, the following: software-based greedy pre-fetching [2], memory-side correlation pre-fetching [3], Address-Value Delta (AVD) prediction [4], and content-directed pre-fetching [5].

III. Some initial ideas

Based on our understanding, we intend to select three different pre-fetching techniques proposed for accelerating pointer-intensive applications, after consultation with the course faculty, and adapt a common minimal simulation environment to each of the three. We then aim to execute a preselected set of pointer-intensive benchmark applications in the unmodified simulated environment and in the three adapted versions with pre-fetching support built into them. In each test, we would gather application execution times (which will reflect relative execution speedups) and the total energy consumed in executing the applications. Depending on how our simulated environment is organised, it might be important to isolate energy contributions from components that are not impacted by pre-fetching.
However, there is a counter-argument: pre-fetching reduces execution time, and therefore components not related to pre-fetching, such as main memory, LCD, SD card etc., also enter the larger system-wide energy equation; if an application finishes earlier, these unrelated modules also have the opportunity to go to sleep earlier. A related study [1] does not take any such effects into account, since it considers the energy of only the memory sub-system of the processor.

There are some interesting metrics related to pre-fetching which we may correlate with energy consumption trends during the course of this investigation. These are:

Pre-fetch coverage: Ratio of the number of misses eliminated by pre-fetching to the total number of misses that would occur in the absence of pre-fetching.

Pre-fetch precision: Ratio of the number of distinct pre-fetched cache lines that are accessed by at least one demand request after being pre-fetched and before being replaced, to the total number of pre-fetched cache lines.

Pre-fetch pollution: Ratio of the number of demand misses that are caused by interference due to pre-fetching (and would not occur without it) to the number of misses that would occur without pre-fetching [6].

Energy cost of performance: Performance improvement per unit increase in energy consumption, where performance improvement may be defined as the percentage reduction in execution time of a benchmark.

IV. Experimental methodology/ setup

1. The best simulation environment for our purposes would be one that most closely represents the ubiquitous ARM-based architectures used in embedded mobile devices such as smartphones and audio players. We need to finalize this in consultation with the course faculty.
2. Modifications for energy and performance instrumentation: It has been suggested that we consider WATTCH and SIMPLESCALAR for our purposes.
[This needs further investigation.]
3. Prospective modifications for implementing software-assisted greedy pre-fetching:
   a. Add a mechanism to support a pre-fetch instruction in the simulator, if need be.
   b. Add pre-fetch instructions to the benchmark suite, or alternatively explore the possibility of using Todd Mowry’s version of the compiler, which automates this part.
4. Prospective modifications for implementing memory-side correlation pre-fetching:
   a. Add a memory-side pre-fetch engine between the main memory and the L2 cache that uses passive push pre-fetching.
   b. Modify the L2 cache to receive lines that it has not requested.
5. Prospective modifications for implementing AVD:
   a. Add AVD prediction logic consisting of adders, comparators and a prediction table.
6. Benchmark suites: From researching several related works, it appears that Olden is the most commonly used benchmark suite for pointer-intensive applications. We intend to stick with it for now.

V. Timetable

Milestone  Due    Description
M1         10/13  1. Investigate and freeze the suitability of WATTCH and/or SIMPLESCALAR.
                  2. Investigate and freeze the closest simulator, and modify it to match a typical ARM-based smartphone architecture.
                  3. Decide upon suitable metrics that can be collected and analysed in this experiment.
                  4. Modify Olden to include pre-fetch instructions, or use Mowry's compiler (in this case, see how it can be modified for ARM platforms).
M2         11/01  1. Modify and incorporate the different techniques into the simulated environment.
                  2. Run a first level of simulation and collect metrics.
                  3. Review any shortfalls in the measurement process and take appropriate action.
                  4. Review whether all identified metrics still make sense or whether there needs to be a change at some point.
M3         11/19  1. Run regression simulations (different benchmarks, and all selected techniques) and collect data.
                  2. Analyse the data (comparative analysis of the various pre-fetching techniques).
                  3. Use this information, and information from related work, to comment upon the suitability of one or another approach for battery-powered mobile computing platforms.

VI. References

[1] Energy Characterization of Hardware-Based Data Prefetching by Yao Guo et al., 2004
[2] Automatic Compiler-Inserted Prefetching for Pointer-Based Applications by Chi-Keung Luk and Todd C. Mowry
[3] Using a User-Level Memory Thread for Correlation Prefetching by Yan Solihin et al.
[4] Address-Value Delta (AVD) Prediction by Onur Mutlu, Hyesoon Kim, Yale N. Patt
[5] Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems by Eiman Ebrahimi, Onur Mutlu, Yale N. Patt
[6] An Adaptive Data Prefetcher for High-Performance Processors by Yong Chen et al., http://www.cs.iit.edu/~scs/psfiles/ccgrid10-adaptpf.pdf
[7] Push vs. pull - Data movement for linked data structures by Chia-Lin Yang and Alvin R. Lebeck
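
As an illustration of the metrics defined in Section III, the following is a minimal sketch of how they might be computed from raw simulator counters. The function and parameter names are hypothetical and do not correspond to any WATTCH or SIMPLESCALAR output field. The sketch also assumes that the "unit increase in energy" is measured in joules; the proposal does not fix a unit, so a percentage increase would be an equally valid choice.

```python
from typing import Optional


def coverage(misses_eliminated: int, baseline_misses: int) -> float:
    """Misses eliminated by pre-fetching over misses without pre-fetching."""
    return misses_eliminated / baseline_misses


def precision(useful_prefetched_lines: int, prefetched_lines: int) -> float:
    """Pre-fetched lines hit by a demand request before eviction, over all pre-fetched lines."""
    return useful_prefetched_lines / prefetched_lines


def pollution(interference_misses: int, baseline_misses: int) -> float:
    """Demand misses caused only by pre-fetch interference, over misses without pre-fetching."""
    return interference_misses / baseline_misses


def energy_cost_of_performance(
    base_time_s: float,
    pf_time_s: float,
    base_energy_j: float,
    pf_energy_j: float,
) -> Optional[float]:
    """Percentage reduction in execution time per extra joule consumed.

    Assumption: energy is compared in joules. Returns None when pre-fetching
    does not increase energy, since the ratio is then undefined.
    """
    time_reduction_pct = 100.0 * (base_time_s - pf_time_s) / base_time_s
    extra_energy_j = pf_energy_j - base_energy_j
    if extra_energy_j <= 0:
        return None
    return time_reduction_pct / extra_energy_j


if __name__ == "__main__":
    # Made-up counter values for illustration only; these are not measurements.
    print("coverage :", coverage(400, 1000))
    print("precision:", precision(300, 500))
    print("pollution:", pollution(50, 1000))
    print("energy cost of performance:",
          energy_cost_of_performance(10.0, 8.0, 5.0, 6.0))
```

Returning None rather than a number when energy does not increase keeps such runs from being ranked alongside techniques that trade energy for speed; how those cases should be reported is left open here.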