15th International Conference on Advanced Computing and Communications
0-7695-3059-1/07 $25.00 © 2007 IEEE. DOI 10.1109/ADCOM.2007.88

Memory Design and Exploration for Low-power Embedded Applications: A Case Study on Hash Algorithms

Debojyoti Bhattacharya, Avishek Saha
Department of Computer Science and Engineering
IIT Kharagpur, WB 721302, India
{debojyoti, avishek}@cse.iitkgp.ernet.in

Abstract

Constraints imposed on various resources make embedded computing a challenging design space. One such important constraint is memory. Proper cache design can overcome this memory bottleneck. In this paper, we propose a methodology for cache design space exploration specific to cryptographic hash functions. The proposed methodology finds a speed-power optimized cache configuration. We also describe the experimental procedure that led to the formulation of the proposed exploration algorithm. Experiments are performed on two cryptographic hash functions, namely SHA-1 and MD5. Our approach reduces the exploration search space and is therefore preferable to traditional exhaustive search.

1 Introduction

Recent years have witnessed a huge growth in the use of embedded systems for cryptographic applications. Cryptographic algorithms can be implemented either as hardware (ASIC) or as software modules. Hardware approaches are preferred for their high speed but suffer from a long time-to-market. Software implementations, on the other hand, are extremely energy- and computationally inefficient. Embedded systems provide moderately high speed and are specialized for small-size, low-power operation. They also offer the added advantages of low implementation time and re-programmability. Thus, continuously evolving cryptographic standards, shorter time-to-implement on embedded processors, re-programmability and shorter time-to-market make embedded systems the preferred platform for implementation.

Some work has been done on the design and optimization of processor cache hierarchies [10], [13], [5]. However, not much work has been done on application-specific cache design for low-power devices. The problem of cache design space exploration is computationally intensive and has received a lot of attention from the research community. Usually, a design-simulate-analyze methodology is used to achieve an optimal cache performance [8], [1], [11], [12]. One approach is to exhaustively simulate all possible cache configurations to find the optimal solution. Other approaches use a one-pass technique, in which numerous cache configurations are evaluated simultaneously during a single simulation run [9], [6]. While these techniques reduce the time taken to obtain cache performance metrics for a given cache configuration, they do not solve the problem of design space exploration in general, primarily because exploring all possible options of the available design parameters can blow up the search space. Two possible approaches to this problem have been suggested in [7]. The first ignores a part of the design space by using an iterative heuristic. The second identifies and avoids the redundancies in the computation involved in exhaustive design space exploration. In this paper, we take the second approach and suggest a cache design space exploration algorithm specific to cryptographic hash applications. Our intuition is that individual applications may offer further scope for optimizing the design-space exploration scheme in ways typical of that type of application. This motivates us to study cache exploration results on two families of cryptographic hash functions.

A detailed analysis of the results suggests trends that prove useful in the formulation of our proposed exploration algorithm. Based on our experimental results, we posit that for typical applications, such as hash functions, we can follow a design exploration strategy whereby we first select an optimal L1 I-cache configuration. Once the L1 I-cache has been fixed, we next find an optimal L1 D-cache configuration and proceed similarly for an L2 U-cache, if one is required at all. In this way, we select a high-speed, low-power optimal cache configuration. The cache parameters explored are cache line size, degree of associativity and total cache size. Improvements in cache performance are measured in terms of power consumption, at the expense of silicon area and miss rate. Our approach has the added advantage of generating a reduced search space.

The remainder of this paper is organized as follows. Section 2 provides background on cryptographic hash functions. Section 3 describes our simulation framework, experimental procedure and case-study results. Section 4 presents the proposed exploration methodology and its complexity. Finally, in Section 5, we conclude with some final remarks and directions of future research.

Table 1. Machine configuration

Processor Core:
  Issue Width: 1
  Decode Width: 1
  Issue: Inorder
  Integer Unit: 1
  FP Unit: 1
  Mem Bus Width: 4
  Processor Clock Frequency: 233 MHz

Memory Hierarchy:
  Memory Latency: 24 cycles
  L1 IC: Size 32KB, Assoc 8, Block 32B
  L1 DC: Size 32KB, Assoc 8, Block 32B
  L1 Cache Latency: 1 cycle
  L2 UC: None
  L2 UCache Latency: 6 cycles (if introduced)

2 Background

In this section, we provide the necessary background on cryptographic hash functions. A cryptographic hash function h maps a bitstring of arbitrary finite length into a string of fixed length. A one-way hash function must provide both preimage resistance and second preimage resistance, i.e., it must be computationally infeasible to find, respectively, any input which hashes to a pre-specified output, and any second input which has the same output as a specified input.
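The fixed-length property of such functions can be illustrated with Python's standard hashlib module (a runnable aside for illustration only; the experiments in this paper use compiled implementations running on a simulated ARM core, not this code):

```python
import hashlib

# Inputs of wildly different lengths...
for msg in [b"", b"abc", b"a" * 1_000_000]:
    sha1 = hashlib.sha1(msg).hexdigest()
    md5 = hashlib.md5(msg).hexdigest()
    # ...always produce fixed-length digests:
    # SHA-1 -> 160 bits (40 hex chars), MD5 -> 128 bits (32 hex chars)
    assert len(sha1) == 40 and len(md5) == 32
    print(f"{len(msg):>7} bytes -> sha1[{len(sha1)}] md5[{len(md5)}]")
```

Preimage resistance demands that recovering any input from one of these digests be computationally infeasible.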
Custom-designed iterative hash functions are the most popular hash functions currently in use. Among them, the two most popular families are the SHA family and the MD family. We select two representative hash functions from these families, namely SHA-1 [3] and MD5 [2]. Both are iterated hash functions that operate on a round-by-round basis, and both use a compression function as their basic building block.

3 Our Approach

In this section, we propose a semi-exhaustive strategy which retains the advantages of an exhaustive exploration but, at the same time, brings down the huge computational cost involved. Our strategy is termed 'semi-exhaustive' because it performs a few initial exhaustive explorations and then gradually narrows down the search space in the later stages.

3.1 Simulation Framework

Our experimental framework is based on the SimpleScalar toolset sim-panalyzer [4], targeted for the ARM processor. The SimpleScalar framework generates a simulator for a parameterized superscalar processor. Our chosen baseline processor is given in Table 1. We choose three performance metrics, namely cache size, number of processor cycles and average power consumption. The parameters varied are cache size, block size and associativity. The experiments are performed on a test input file of size 1MB. For practical purposes, our chosen cache sizes were 2KB, 4KB, 8KB, 16KB, 32KB and 64KB, and block sizes were 8b, 16b and 32b. The chosen associativity values were 1, 2, 4 and 8.

We will denote a cache of size x Kbytes having block size y bytes and associativity z as [xK, yb, z]. The speed of execution will be denoted by the number of processor cycles: the higher the number of processor cycles, the lower the speed.

3.2 Experimental Procedure

Through our exploration strategy, we aim to obtain a power-optimized cache configuration without compromising the speed or performance of the chosen application. This requires us to set realistic initial bounds on the lowest power that can be consumed and the highest speed that can be achieved, which motivates the following experiments:

1. Initially, we try to find the best cache size and block size combination for each of the L1 I-cache, L1 D-cache and L2 U-cache. For each combination, we observe the average power and cycles consumed. When we observe the variations in the L1 I-cache configuration, the L1 D-cache and L2 U-cache are kept fixed at some preset base configuration, and similarly while varying the L1 D-cache and L2 U-cache. Once the cache and block size for a particular cache have been fixed, we next vary its associativity.

2. Next, we carry out similar experiments on the L1 U-cache.

The following observations were made:

1. The fastest speed, i.e., the lowest number of processor cycles, is obtained with the base setting of the L1 cache and a particular configuration of the L2 U-cache.

So, we set 55858267 cycles and 0.1206 mW as our desired parameter bounds for speed and power, respectively. Results from Fig. 1(a)-1(d) are only useful for setting these parameter bounds. Next, we proceed towards selecting an optimal cache configuration which tries to reach the specified levels of performance. It is evident from Fig. 1(a) and 1(b) that the power variation with changes in the L1 I-cache configuration is much greater than that for the L1 D-cache. So, we first try to select the optimal L1 I-cache configuration. Once the cache size with the lowest power consumption has been selected, we next vary the associativity to test whether the power consumed can be reduced further. This is evident from Fig. 1(a) and Fig. 2.

2. The lowest average power consumption is obtained for some L1 U-cache configuration.

3. Keeping the base configuration fixed for the L1 D-cache and varying only the L1 I-cache gives more power variation than varying only the L1 D-cache with the L1 I-cache fixed at the base configuration.
In both cases, no L2 cache is present. As previously mentioned, based on these experiments we aim to fix the parameter bounds. Observation 1 and Observation 2 imply that the fastest speed is achieved for L2 U-cache configurations and the lowest power is consumed for L1 U-cache configurations. So, we select the highest speed from the L2 U-cache configurations and the lowest power consumed from the L1 U-cache configurations as our desired performance levels. Subsequently, we fine-tune the different parameters of our cache hierarchy so as to obtain the specified levels of performance.

Another interesting point is Observation 3: varying the L1 I-cache configuration results in higher fluctuation of power consumption than varying the L1 D-cache configuration. This provides us with an opportunity to narrow down our search space and make huge savings in exploration time. As the variation of power with changes in the L1 I-cache is greater, we first aim to find an optimal L1 I-cache configuration. Based on this optimal L1 I-cache configuration, we next select the best L1 D-cache configuration. Finally, based on these optimal L1 I-cache and D-cache configurations, we select an optimal L2 U-cache configuration, if one is required at all. This approach spares us the burden of selecting an optimal configuration among all possible combinations of cache and block sizes for all combinations of L1 I-cache, L1 D-cache and L2 U-cache. Thus, our proposed methodology fixes an optimal L1 I-cache first, then the L1 D-cache, and finally the L2 unified cache, if required.

3.3 Case Study I: SHA-1

In this section, we give a detailed analysis of the results for SHA-1 and explain our proposed experimental approach. The results of the experiments on the base configuration of SHA-1 are presented in Fig. 1(a)-1(d). The highest speed is achieved for the base-configuration L1 I- and D-cache with a [64K,8b,8] L2 U-cache. The processor cycles and average power (in mW) consumed by SHA-1 for this particular configuration are 55858267 and 0.5808, respectively. Similarly, the lowest power is consumed by a [2K,32b,8] L1 U-cache configuration: in this configuration, the average power consumed is 0.1206 mW and the cycles consumed are 60839194.

[Figure 2. Cache Simulation Results for L1 I-Cache Associativity: processor cycles (x 10e7) and average power (mW) vs. associativity (1, 2, 4, 8) for the [2K,16b] L1 I-cache.]

Finally, we select an L1 I-cache configuration of [2K,16b,2], with the L1 D-cache and L2 U-cache at their base configurations. This configuration consumes the lowest average power, 0.1712 mW, at 55996825 processor cycles. The speed of execution is within .25% of the highest speed, while the average power consumption is nearly 42% higher than the lowest power consumption available. In an effort to further reduce the power consumption, we next vary the L1 D-cache configuration with the L1 I-cache fixed at its new configuration. Fig. 3(a) and Fig. 3(b) present the results for the next two steps of selecting an optimal L1 D-cache. Proceeding as we did for the L1 I-cache, we select [4K,8b,2] as the optimal L1 D-cache configuration. The average power consumption is .1121 mW and the number of processor cycles is 57546994. The speed of execution is nearly 3% lower than the highest speed, but the power consumption is significantly low, nearly 7.05% below our initial bound. Finally, we perform the same experiments for the L2 U-cache (keeping the L1 I- and D-cache at their new configurations) and identify [16K,16b,8] as the lowest power consuming configuration. The number of processor cycles is 55897061 and the average power consumption is .0948 mW.
[Figure 1. Cache Simulation Results: processor cycles (x 10e7) and average power (mW) vs. cache size (2-64 KB) for block sizes 8b, 16b and 32b, at associativity 8, for (a) L1 I-Cache, (b) L1 D-Cache, (c) L1 U-Cache, (d) L2 U-Cache.]

The speed is only .07% lower than the highest speed, and the power consumption is nearly 21.4% lower than our initial bound. Hence, in this case, our suggested optimal cache configuration is as given in Table 2.

Table 2. Optimal cache configuration for SHA-1

Cache Name   Cache Size (KB)   Block Size (B)   Associativity
L1 I-cache   2                 16               2
L1 D-cache   4                 8                2
L2 U-cache   16                16               8

3.4 Case Study II: MD5

Table 3 provides the results of our experiments on the MD5 hash function. The first two rows give the initial parameter bounds. Our suggested optimal cache configuration for MD5 is marked in the table. It has been observed that the introduction of an L2 U-cache improves speed by .04%, but increases power consumption by more than 2%, compared to the optimal cache configuration.
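The percentage figures quoted in the two case studies can be reproduced from the raw cycle and power numbers given in the text. The helper functions below are ours, purely for checking; all constants are taken from Sections 3.3 and 3.4:

```python
def slowdown_pct(cycles, fastest):
    """Percentage speed loss relative to the fastest configuration."""
    return 100.0 * (cycles - fastest) / fastest

def power_delta_pct(mw, bound):
    """Percentage change in power relative to the lower bound
    (positive = above the bound, negative = below it)."""
    return 100.0 * (mw - bound) / bound

# SHA-1 bounds: 55858267 cycles (fastest), 0.1206 mW (lowest power)
print(round(slowdown_pct(55996825, 55858267), 2))  # 0.25  -> "within .25%"
print(round(power_delta_pct(0.1712, 0.1206), 0))   # 42.0  -> "nearly 42% higher"
print(round(slowdown_pct(57546994, 55858267), 1))  # 3.0   -> "nearly 3% lower" speed
print(round(power_delta_pct(0.1121, 0.1206), 2))   # -7.05 -> "7.05% lower"
print(round(slowdown_pct(55897061, 55858267), 2))  # 0.07  -> ".07% lower"
print(round(power_delta_pct(0.0948, 0.1206), 1))   # -21.4 -> "21.4% lower"

# MD5: L2 U-cache row vs. the optimal (no-L2) row of Table 3
print(round(slowdown_pct(22181948, 22190960), 2))  # -0.04 -> ".04%" faster
print(round(power_delta_pct(0.1561, 0.1531), 1))   # 2.0   -> ~2% more power
```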
Hence, our suggested optimal cache configuration contains no L2 U-cache.

Table 3. Results for MD5

Cycles      Power (mW)   Cache Configuration
22174351    .5719        L1 cache default; L2 UCache [64K, 8b, 4]
222353914   .1811        L1 UCache [8K, 8b, 8]
22187700    .2321        L1 ICache [8K, 32b, 4]; L1 DCache default; no L2 UCache
22190960    .1531        L1 ICache [8K, 32b, 4]; L1 DCache [4K, 16b, 8]; no L2 UCache (optimal)
22181948    .1561        L1 ICache [8K, 32b, 4]; L1 DCache [4K, 16b, 8]; L2 UCache [4K, 16b, 8]

4 Proposed Methodology

Section 3 presented some experiments and observations, based on which we propose a new cache design space exploration methodology. Over the past few years, several cache design space exploration methodologies have been suggested. However, most of them have ignored the importance of energy/power as a performance metric.

The scheme takes the base processor configuration and a percentage speed threshold (Th): for the lowest power, speed should be within the threshold limit of the highest speed that can be achieved. Step 1 generates the results for all possible combinations of cache and block sizes for the L1 IC, L1 DC, L1 UC and L2 UC. In each case, the caches other than the one under consideration are fixed at their base configurations. From the results of Steps 1(c) and 1(d), we select two cache configurations: (a) the highest-speed configuration (SPEEDMAX), and (b) the lowest-power configuration (POWERMIN). SPEEDVAR contains the speed corresponding to the POWERMIN configuration. If SPEEDVAR is within the specified threshold of SPEEDMAX, we select POWERMIN as the optimal cache configuration. Else, we select the lowest power consumption from Steps 1(a) and 1(b). The corresponding cache and block size are fixed and associativity results are generated. Next, we fix the L1 IC/DC configuration and update POWERMIN in Explore. Explore [Algorithm 1] updates POWERMIN in case the current configuration has a lower power consumption than that of POWERMIN. It also updates TEMPPOWER, which serves as the lower bound on power consumption in the next iteration. Explore fixes the configuration of the concerned cache to that of POWERVAR.
In a similar manner, Steps 3(c)-(e) and Steps 3(f)-(h) select the optimal cache configuration for the L1 IC/DC and L2 UC, respectively. If the desired speed is not achieved, we proceed to the next iteration with the next-lowest power from 1(a) and 1(b) as POWERVAR.

Complexity Analysis: Let C be the number of cache sizes, B the number of block sizes and A the number of associativity values. We measure complexity by the number of simulations that have to be performed. Our approach has a complexity of k(5(CB + A) + A), where k is the number of iterations; the worst-case value of k is CB. This is much better than exhaustive search, which has a worst-case complexity of (3(C + B + A))!.

[Figure 3. Cache Simulation Results: (a) processor cycles (x 10e7) and average power (mW) vs. L1 D-cache size (block sizes 8b, 16b, 32b, associativity 8); (b) processor cycles and average power vs. associativity for the [4K,8b] L1 D-cache.]

Also, the main aim of most of the previously suggested methodologies is to improve the speed of the concerned application. Our methodology, on the other hand, tries to narrow down the search space and thus reduce the computational cost involved. Moreover, our approach aims to achieve a low-power cache configuration while maintaining a moderately high speed.

5 Conclusions and Future Work

In this paper, we have proposed a 'semi-exhaustive' methodology for application-specific cache design space exploration. Traditionally, exhaustive strategies are used to select an optimal cache configuration from an available design space. Exhaustive strategies guarantee optimality.
But an obvious disadvantage of exhaustive strategies lies in a possible 'blow-up' of the search space. Our strategy, in contrast, offers a speed- and power-optimized cache configuration with the added advantage of a reduced search space. The experimental approach behind the proposed methodology has been described in detail, and results have been presented for the cryptographic hash functions SHA-1 and MD5. Future work lies in applying and validating this methodology on other classes of benchmark applications, namely consumer, networking, telecomm, etc.

Algorithm 1 Explore Function
Input: POWERVAR, SPEEDVAR, cache under consideration
Output: Set configuration for cache under consideration
1. if (POWERVAR < POWERMIN) and (SPEEDVAR <= (1+Th)*SPEEDMAX) then
     POWERMIN <- POWERVAR
   endif
2. TEMPPOWER <- MIN(TEMPPOWER, POWERVAR)
3. Fix configuration of cache to that of POWERVAR

Algorithms 1 and 2 describe the proposed memory exploration scheme. The inputs to our algorithm are the base configuration of the processor and the percentage speed threshold (Th).

Algorithm 2 Proposed memory exploration scheme
Input: Percentage Speed Threshold (Th) and Base Cache Configuration
Output: Optimal Cache Configuration with lowest power and moderate speed
Assumption: Other than the cache under consideration, all others are in their last set configurations, wherever applicable.
Initializations: TEMPPOWER = INFINITY
(Steps continue after the references.)

References

[1] Interactive CACTI. http://www.ece.ubc.ca/ stevew/cacti/.
[2] MD5 specifications. http://theory.lcs.mit.edu/ rivest/Rivest-MD5.txt.
[3] SHA-1 specifications. http://csrc.nist.gov/CryptoToolkit/tkhash.html.
[4] SimpleScalar toolset. www.simplescalar.com.
[5] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In MICRO 33: Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pages 245-257, 2000.
[6] D. Kirovski, C. Lee, M. Potkonjak, and W.
Mangione-Smith. Synthesis of power efficient systems-on-silicon. In Asia and South Pacific Design Automation Conference, 1998.
[7] X. Li, H. S. Negi, T. Mitra, and A. Roychoudhury. Design space exploration of caches using compressed traces. In ICS '04: Proceedings of the 18th ACM Annual International Conference on Supercomputing, pages 116-125, 2004.
[8] Y. Li and J. Henkel. A framework for estimating and minimizing energy dissipation of embedded hw/sw systems. In DAC '98: Proceedings of the 35th Annual Conference on Design Automation, pages 188-193, 1998.
[9] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78-117, 1970.
[10] P. Petrov and A. Orail. Towards effective embedded processors in codesigns: customizable partitioned caches. In CODES '01: Proceedings of the Ninth International Symposium on Hardware/Software Codesign, pages 79-84, 2001.
[11] T. Sato. Evaluating trace cache on moderate-scale processors. IEE Proceedings - Computers and Digital Techniques, 147(6), 2000.
[12] W. Shiue and C. Chakrabarti. Memory exploration for low power embedded systems. Journal of VLSI Signal Processing, pages 167-178, Nov 2001.
[13] C. Su and A. Despain. Cache design trade-offs for power and performance optimization: A case study. In International Symposium on Low Power Electronics and Design, 1995.

Algorithm 2 steps:
1. for (cache sizes) do
     for (block sizes) do
       (a) Generate results for L1 IC
       (b) Generate results for L1 DC
       (c) Generate results for L1 UC
       (d) Generate results for L2 UC
     endfor
   endfor
2. POWERMIN <- Lowest power from Step 1(c)
   SPEEDMAX <- Highest speed from Step 1(d)
   SPEEDVAR <- Speed corresponding to POWERMIN
3.
if (SPEEDVAR > (1+Th)*SPEEDMAX) then
     (a) Fix the L1 IC or L1 DC cache configuration
         POWERVAR <- Lowest power from 1(a) or 1(b)
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L1 IC/DC)
     (b) Generate associativity results for cache configuration 3(a)
         POWERVAR <- Lowest power
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L1 IC/DC)
     (c) Fix the L1 cache configuration other than the one in 3(a)
         for (cache sizes) do
           for (block sizes) do
             Generate results for L1 IC or L1 DC
           endfor
         endfor
     (d) POWERVAR <- Lowest power from 3(c)
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L1 IC/DC)
     (e) Generate associativity results for cache configuration 3(d)
         POWERVAR <- Lowest power
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L1 IC/DC)
     (f) Fix the L2 UC cache configuration
         for (cache sizes) do
           for (block sizes) do
             Generate results for L2 UC
           endfor
         endfor
     (g) POWERVAR <- Lowest power from 3(f)
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L2 UC)
     (h) Generate associativity results for cache configuration 3(g)
         POWERVAR <- Lowest power
         SPEEDVAR <- Speed corresponding to POWERVAR
         OptConfig <- Explore(POWERVAR, SPEEDVAR, L2 UC)
     (i) if (POWERMIN not updated) then
           POWERMIN <- TEMPPOWER
           repeat
         endif
   endif
4. Set the cache configuration for POWERMIN as the optimal configuration
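As a concrete illustration, the scheme above can be sketched in Python. This is our sketch of the main path, not the authors' implementation: simulate() is a stand-in for a sim-panalyzer run (it returns deterministic pseudo-numbers), the 5% threshold is an assumed value, and the fallback iteration of Step 3(i) is reduced to a single greedy pass.

```python
import itertools
import random

# Design space from Section 3.1
CACHE_SIZES_KB = [2, 4, 8, 16, 32, 64]
BLOCK_SIZES_B = [8, 16, 32]
ASSOCIATIVITIES = [1, 2, 4, 8]
TH = 0.05  # assumed 5% speed threshold; the paper leaves Th as an input

BASE = {"L1I": (32, 32, 8), "L1D": (32, 32, 8), "L2U": None}  # Table 1

def simulate(config):
    """Stand-in for one sim-panalyzer run on the hash benchmark.
    Returns (processor_cycles, avg_power_mw); deterministic pseudo-numbers
    keyed on the configuration, so the sketch runs without the simulator."""
    rng = random.Random(repr(sorted(config.items())))
    return rng.randint(55_000_000, 61_000_000), rng.uniform(0.09, 0.60)

def lowest_power(config, cache, candidates):
    """Return (config, cycles, power) for the candidate setting of `cache`
    with the lowest power, all other caches held at their current values."""
    best = None
    for params in candidates:
        trial = dict(config)
        trial[cache] = params
        cycles, power = simulate(trial)
        if best is None or power < best[2]:
            best = (trial, cycles, power)
    return best

def fix_cache(config, cache):
    """Steps 3(a)-(b)/(c)-(e)/(f)-(h): sweep (cache size, block size) first,
    then sweep associativity around the chosen pair."""
    assoc0 = config[cache][2] if config[cache] else ASSOCIATIVITIES[-1]
    pairs = [(s, b, assoc0) for s, b in
             itertools.product(CACHE_SIZES_KB, BLOCK_SIZES_B)]
    config, _, _ = lowest_power(config, cache, pairs)
    s, b, _ = config[cache]
    return lowest_power(config, cache, [(s, b, a) for a in ASSOCIATIVITIES])

# Step 2: speed bound from a full L2 U-cache sweep at the base configuration
all_settings = [(s, b, a) for s in CACHE_SIZES_KB
                for b in BLOCK_SIZES_B for a in ASSOCIATIVITIES]
SPEEDMAX = min(simulate(dict(BASE, L2U=p))[0] for p in all_settings)

# Step 3: greedily fix L1 I-cache, then L1 D-cache, then L2 U-cache
config = dict(BASE)
for cache in ("L1I", "L1D", "L2U"):
    config, cycles, power = fix_cache(config, cache)

# Explore's acceptance test: within Th of the fastest configuration?
if cycles <= (1 + TH) * SPEEDMAX:
    print("accepted:", config, cycles, round(power, 4))
else:
    print("threshold missed; Algorithm 2 would retry with the "
          "next-lowest-power candidate")
```

With C=6 cache sizes, B=3 block sizes and A=4 associativities, one greedy pass above costs 3*(CB + A) = 66 simulations plus 72 for the speed bound, far fewer than jointly sweeping the full configuration space of all three caches.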