Based on the paper by Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu, and Jean-Luc Gaudiot (Fellow, IEEE), IEEE Computer Architecture Letters, Volume 10, Issue 1.

Outline: Introduction; Motivation and Background; Previous Work; Methodology; Prefetcher Performance; Energy Efficiency; Energy Consumption Analysis; Energy Efficiency Model; Conclusion

Introduction

Data prefetching is the process of fetching data the program will need in advance, before the instruction that requires it is executed, thereby hiding apparent memory latency. Prefetching has been a successful technique in modern high-performance computing platforms. It was found, however, that prefetching significantly increases power consumption. Embedded mobile systems typically operate under tight space, cost, and power constraints, so they cannot afford power-hungry mechanisms; prefetching was therefore long considered unsuitable for embedded systems.

Motivation and Background

Embedded mobile systems have now come to be driven by powerful processors, such as Nvidia's dual-core Tegra 2. Smartphone applications include web browsing, multimedia, gaming, and Webtop control, all of which demand very high performance from the computing system. To meet this requirement, methods such as prefetching, which were earlier shunned, can now be used. Moreover, with better, more power-efficient process technology, the energy-consumption behavior of prefetching may itself have changed, which motivates studying and modeling the energy efficiency of different types of prefetchers.

Over the years, the main bottleneck preventing systems from speeding up has been the slowness of memory, not processor speed. Data prefetching can be implemented in hardware by observing fetch patterns, for instance by fetching around the most recently used data. Sequential prefetching takes advantage of spatial locality in memory.
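The sequential prefetching idea above can be sketched with a toy cache model. The block size, prefetch degree, and the simplified cache behavior here are illustrative assumptions, not parameters from the paper:

```python
# A minimal sketch of sequential (next-block) prefetching on a toy cache.
# Block size, prefetch degree, and cache behavior are assumed values.

BLOCK = 64    # bytes per cache block (assumed)
DEGREE = 1    # how many following blocks to prefetch on a miss

class SequentialPrefetchCache:
    def __init__(self):
        self.blocks = set()     # resident block addresses
        self.prefetches = 0     # extra fetches issued (energy overhead)

    def access(self, addr):
        """Return True on a hit; on a miss, prefetch the next blocks."""
        blk = addr // BLOCK
        hit = blk in self.blocks
        self.blocks.add(blk)
        if not hit:             # on a miss, exploit spatial locality
            for i in range(1, DEGREE + 1):
                nxt = blk + i
                if nxt not in self.blocks:
                    self.blocks.add(nxt)
                    self.prefetches += 1
        return hit

cache = SequentialPrefetchCache()
hits = [cache.access(a) for a in range(0, 256, 8)]  # a streaming scan
```

On this streaming scan, only the first access to every other block misses; the prefetched neighbors hit, which is exactly the behavior sequential prefetching exploits, at the cost of the extra fetches counted in `prefetches`.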
Tagged prefetching associates a tag bit with every memory block and prefetches based on that bit's value. Stride-based prefetching detects stride patterns in the address stream, such as fetches from different iterations of the same loop. Stream prefetchers try to capture sequences of nearby misses and prefetch an entire stream of blocks at a time. Correlated prefetchers issue prefetches based on previously recorded correlations between the addresses of cache misses.

Previous Work

There have been some studies focusing on improving the energy efficiency of hardware prefetching. PARE is one such technique: it constructs a power-aware hardware prefetching engine that categorizes memory accesses into different groups, maintains a continuously updated, indexed hardware history table, and bases its prefetching decisions on the information in that table.

Methodology

Modern embedded mobile systems execute a wide variety of workloads, so the benchmarks are drawn from three suites:

Table 1: Benchmark Set
  Xerces-C++     SAX; DOM
  MediaBench II  JPEG2000 Encode; JPEG2000 Decode; H.264 Encode; H.264 Decode
  PARSEC         Fluidanimate; Freqmine

The first set includes two XML data-processing benchmarks taken from Xerces-C++: event-based parsing, which is data-centric (SAX), and tree-based parsing, which is document-centric (DOM). The second set is taken from MediaBench II, which provides application-level benchmarks representing multimedia and entertainment workloads, based on the ISO JPEG-2000 and H.264 video-compression standards. The third set is taken from PARSEC (Princeton Application Repository for Shared-Memory Computers), a benchmark suite for multithreaded processors that is representative of many gaming applications.

In Table 2, "cache hierarchy" indicates the cache levels the prefetcher covers; "prefetching degree" shows whether the prefetching degree is static or dynamically adjusted; "trigger L1" and "trigger L2" show what triggers a prefetch at each level.
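The stride-based prefetching described above can be sketched as a small detector that tracks, per load instruction (identified here by its program counter), the last address and stride, and issues a prefetch once the stride repeats. The table organization and confirmation rule are illustrative assumptions, not the paper's design:

```python
# Minimal sketch of a stride-based prefetcher: one table entry per load
# PC records the last address and last stride; a repeated stride
# triggers a prefetch of the next expected address.

class StridePrefetcher:
    def __init__(self):
        self.table = {}          # pc -> (last_addr, last_stride)

    def observe(self, pc, addr):
        """Return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, None)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        if stride != 0 and stride == last_stride:
            return addr + stride   # stride confirmed: fetch next element
        return None

pf = StridePrefetcher()
# a loop walking an array of 8-byte elements from one load instruction
issued = [pf.observe(pc=0x400, addr=0x1000 + 8 * i) for i in range(4)]
```

After two observations establish the stride, each further iteration of the loop yields a prefetch one element ahead, which is how such prefetchers cover fetches from successive iterations of the same loop.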
Table 2: Summary of Prefetchers
        cache hierarchy  prefetching degree  trigger L1  trigger L2
  P1    L1 & L2          Dynamic             miss        access
  P2    L1               Static              miss        N/A
  P3    L1 & L2          Dynamic             miss        miss
  P4    L2               Static              N/A         miss
  P5    L1 & L2          Dynamic             miss        miss
  P6    L2               Static              N/A         access

To study the performance of the selected prefetchers, the authors use CMP$im, a Pin-based multi-core cache simulator, to model high-performance embedded systems. The simulation parameters, shown in Table 3, resemble modern smartphone and e-book systems.

Table 3: Simulation Parameters
  Frequency           1 GHz
  Issue Width         4
  Instruction Window  128 entries
  L1 Data Cache       32 KB, 8-way, 1 cycle
  L1 Inst. Cache      32 KB, 8-way, 1 cycle
  L2 Unified Cache    512 KB, 16-way, 20 cycles
  Memory              256 MB, 200-cycle latency

To study the impact of prefetching on the energy consumption of the memory subsystem, the authors use CACTI to model the energy parameters of different technology implementations. In the simulator, a hardware prefetcher is defined by a set of hardware tables, so its energy consumption can be modeled from the activity of those tables.

Prefetcher Performance

The prefetching techniques improve performance by more than 5% on average. In detail, the effectiveness of a prefetcher depends on both the prefetching technique itself and the nature of the application. P3 yields the best average performance because it is the most aggressive prefetcher. The JPEG2000 decoding and encoding programs gain up to 22% in performance thanks to their streaming behavior.

Fig. 1: Performance Improvement

Energy Efficiency

The energy efficiency of both 90 nm and 32 nm technologies is studied; the results are summarized in Figures 2 and 3, respectively. The baseline for comparison is energy consumption without any prefetcher, so a positive number means the system dissipates more energy with the prefetcher. For instance, 0.1 means that with the prefetcher the system dissipates 10% more energy than the baseline.
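Because the prefetcher is a set of hardware tables, its energy can be charged per table access plus static leakage, the way a simulator can when per-access energies come from a tool such as CACTI. The per-access energy and static power below are invented placeholders, not the paper's CACTI numbers:

```python
# Hedged sketch: charging a prefetcher's hardware tables for energy.
# Both constants are assumed illustrative values, not measured data.

E_TABLE_ACCESS_J = 2e-11   # assumed energy per prefetch-table access (J)
P_TABLE_STATIC_W = 5e-4    # assumed static power of the table (W)

def prefetcher_energy(table_accesses, exec_time_s):
    """Dynamic energy from table activity plus static leakage energy."""
    dynamic = table_accesses * E_TABLE_ACCESS_J
    static = P_TABLE_STATIC_W * exec_time_s
    return dynamic + static

# e.g. one million table lookups over a 10 ms run
e_pref = prefetcher_energy(1_000_000, 0.01)   # joules
```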
In 90 nm technology, most prefetchers significantly increase overall energy consumption, which confirms the findings of previous studies. Thus, in 90 nm technology, only very conservative prefetchers can be energy-efficient.

Fig. 2: Energy results, 90 nm
Fig. 3: Energy results, 32 nm

In 32 nm technology, P4 is still the most energy-efficient prefetcher, reducing overall energy by almost 4% on average; when running JPEG2000 Decode, it achieves close to 10% energy saving. P2 and P3 are still the most energy-inefficient prefetchers due to their aggressiveness. However, in the worst case they consume only 25% extra energy, a four-fold reduction compared to the 90 nm implementations. Thus most prefetchers are able to provide performance gains with less than 5% energy overhead, and P1 and P4 even yield 2% to 5% energy reductions.

Energy Consumption Analysis

In Equation 1, the total energy consumption has two contributors: static energy (E_static) and dynamic energy (E_dynamic). E_dynamic is the product of the number of read/write memory accesses (N_m) and the energy dissipated on the bus and memory subsystem per access (E'_m). E_static is the product of the overall execution time (t) and the system's static power consumption (P_static). When a prefetcher accelerates the program, the reduced execution time lowers static energy consumption; however, prefetchers also generate a significant number of extra memory-subsystem accesses, which is pure dynamic-energy overhead.

Equation 1: E = E_static + E_dynamic = (P_static × t) + (N_m × E'_m)

Table 4: Energy Categories
  Dynamic memory    dynamic activities of the memory subsystem
  Static memory     memory-subsystem static power consumption
  Dynamic prefetch  dynamic activities of the prefetcher
  Static prefetch   prefetcher hardware static power consumption

Fig. 4: Energy consumption breakdown

In 90 nm technology, dynamic energy contributes up to 66% of total energy consumption: 14% from the prefetcher and 52% from the memory subsystem. Static energy accounts for only 34% of the total.
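The static/dynamic trade-off of Equation 1 can be made concrete with a small numeric sketch. All values below are invented for illustration, not measured data; they simply show how a shorter runtime can outweigh the extra prefetch-induced accesses:

```python
# Hedged numeric sketch of Equation 1: E = (P_static * t) + (N_m * E'_m).
# All inputs are illustrative assumptions, not the paper's measurements.

def total_energy(p_static_w, exec_time_s, n_accesses, e_per_access_j):
    static = p_static_w * exec_time_s        # E_static = P_static * t
    dynamic = n_accesses * e_per_access_j    # E_dynamic = N_m * E'_m
    return static + dynamic

# Without prefetching: longer runtime, fewer memory accesses.
e_base = total_energy(0.2, 1.00, 5_000_000, 2e-8)
# With prefetching: shorter runtime, extra (prefetch-induced) accesses.
e_pref = total_energy(0.2, 0.90, 5_500_000, 2e-8)
```

With these assumed numbers the static-energy saving from the 10% speedup exceeds the added dynamic energy, so `e_pref < e_base`; with a larger dynamic share (as in 90 nm) the same speedup would not be enough.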
Hence, although the prefetchers are able to reduce execution time, there is little room for total energy saving, leading to energy inefficiency for most prefetchers in 90 nm implementations. In 32 nm technology, static energy contributes over 66% of the total energy consumption: 65% from the memory subsystem and 1% from the prefetcher hardware. Dynamic energy is far smaller than static energy. In 32 nm technology, prefetchers therefore become energy-efficient in many cases.

Energy Efficiency Model

An analytical model is proposed to evaluate efficiency: a prefetcher is energy-efficient when

Equation 2: E_no-pref > E_pref

To simplify the model, assume there is only one level in the memory subsystem. Compared to E_no-pref, E_pref has two additional contributors: the static and dynamic energy consumption of the prefetcher hardware. Expanding both sides with Equation 1:

Equation 3: (P_m-static × t1) + (N_m1 × E'_m) > (P_m-static × t2) + (N_m2 × E'_m) + (P_p-static × t2) + (N_p × E'_p)

Equation 4: (t1 - t2)/t1 > [(N_m2 - N_m1) × E'_m + N_p × E'_p + P_p-static × t2] / (P_m-static × t1)

The left-hand side is the performance gain from prefetching. The numerator of the right-hand side contains three terms: the energy overhead incurred by the extra memory accesses, the dynamic energy of the prefetcher, and its static energy consumption. The denominator represents the static energy of the original design without prefetching. As summarized in Equation 5, for a prefetcher to be energy-efficient, the performance gain (G) it brings must be greater than the ratio of the energy overhead (E_overhead) it incurs to the original static energy (E_no-pref-static).

Equation 5: G > E_overhead / E_no-pref-static

Equation 6: EEI = G - E_overhead / E_no-pref-static

A metric, the Energy Efficiency Indicator (EEI), is defined in Equation 6. A positive EEI indicates that the prefetcher is energy-efficient, and vice versa. The analytical results were validated against the empirical results tabulated below, indicating the simplicity and effectiveness of the analytical model.
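Equations 4 and 6 can be written directly as a small function. The input values in the example are invented for illustration; only the formula itself follows the model above:

```python
# Hedged sketch of the EEI metric (Equation 6). Inputs in the example
# are illustrative assumptions, not the paper's measurements.

def eei(t_base, t_pref, extra_mem_accesses, e_mem_access,
        pref_accesses, e_pref_access, p_pref_static, p_mem_static):
    gain = (t_base - t_pref) / t_base                  # G (Eq. 4, LHS)
    e_overhead = (extra_mem_accesses * e_mem_access    # extra memory dynamic
                  + pref_accesses * e_pref_access      # prefetcher dynamic
                  + p_pref_static * t_pref)            # prefetcher static
    e_static_base = p_mem_static * t_base              # E_no-pref-static
    return gain - e_overhead / e_static_base           # EEI (Eq. 6)

# A positive EEI indicates an energy-efficient prefetcher.
score = eei(t_base=1.0, t_pref=0.9,
            extra_mem_accesses=1_000_000, e_mem_access=2e-8,
            pref_accesses=2_000_000, e_pref_access=1e-9,
            p_pref_static=0.005, p_mem_static=0.5)
```

Here a 10% speedup outweighs the overhead ratio of about 5.3%, so the EEI comes out positive, i.e. this hypothetical prefetcher would be energy-efficient.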
EEI values per prefetcher:
          P1     P2     P3     P4     P5     P6
  90 nm  -0.10  -0.50  -0.69   0.03  -0.27  -0.31
  32 nm   0.03  -0.05  -0.07   0.05   0.00  -0.14

Conclusion

With the new trend of highly capable embedded mobile applications, it has become conducive to implement high-performance techniques such as prefetching. They no longer place a significant burden on energy consumption and should therefore be implemented. A simple analytical model has been demonstrated that estimates the effects of prefetching and calculates its energy efficiency effectively. System designers can use it to estimate the energy efficiency of their hardware prefetcher designs and make changes accordingly.