15th International Conference on Advanced Computing and Communications
Memory Design and Exploration for Low-power Embedded Applications : A
Case Study on Hash Algorithms
Debojyoti Bhattacharya, Avishek Saha
Department of Computer Science and Engineering
IIT Kharagpur, WB 721302, India
{debojyoti, avishek}@cse.iitkgp.ernet.in
Abstract
sign space exploration is computationally intensive and has
received a lot of attention from the research community.
Usually, a design-simulate-analyze methodology is used to
achieve optimal cache performance [8], [1], [11], [12].
One approach is to exhaustively simulate all possible cache
configurations to find the optimal solution. Other approaches use a one-pass technique, in which numerous
cache configurations are evaluated simultaneously during a
single simulation run [9], [6]. While these techniques reduce the time taken to obtain cache performance metrics
for a given cache configuration, they do not solve the general problem of design space exploration. This is primarily
because cache design space exploration is computationally intensive, and exploring all possible options of the available design parameters can blow up
the search space. Two possible approaches to this problem have been suggested in [7]. One approach tries to
solve the problem by ignoring a part of the design space by
using an iterative heuristic. The second approach tries to
identify and avoid the redundancies in the computation involved in exhaustive design space exploration. In this paper,
we take the second approach.
Here, we suggest a cache design space exploration algorithm specific to cryptographic hash applications. Our
intuition is that individual applications may offer further scope
for optimizing the design-space exploration scheme in ways typical
of that class of application. This motivates us to study
cache exploration results on two families of cryptographic
hash functions. A detailed analysis of the results suggests
trends that prove useful in formulating our proposed exploration algorithm. Based on our experimental
results, we posit that for typical applications, such as hash functions, we can follow a design exploration strategy in which
we first select an optimal L1 I-cache configuration. Once
the L1 I-cache has been fixed, we find an optimal L1
D-cache configuration and proceed similarly for an L2 U-cache, if one is required at all. In this way, we select a high-speed, low-power optimal cache configuration. The cache
parameters explored are cache line size, degree of associa-
Constraints imposed on various resources in embedded
computing make it a challenging design space. One such
important constraint is memory. Proper cache design can
overcome this memory bottleneck. In our paper, we propose
a methodology for cache design space exploration specific
to cryptographic hash functions. The proposed methodology finds a speed-power optimized cache configuration.
We also describe the experimental procedure towards formulation of the proposed exploration algorithm. Experiments are performed on two cryptographic hash functions,
namely, SHA-1 and MD5. Our approach tries to reduce the exploration search space and hence is better than
traditional exhaustive search.
1 Introduction
Recent years have witnessed a huge growth in the use of
embedded systems for cryptographic applications. Cryptographic algorithms can be implemented as hardware (ASIC) or software modules. Hardware approaches
are preferred for their high speed but suffer from long time-to-market. Software implementations, on the other hand, are
extremely energy- and computationally inefficient. Embedded
systems provide moderately high speed and are specialized for small-size, low-power operation. They also offer the added advantages of low implementation time and
re-programmability. Thus, continuously evolving cryptographic standards, shorter implementation time on embedded
processors, re-programmability, and shorter time-to-market
make embedded systems the preferred platform for implementation.
Some work has been done on design and optimization
of processor cache hierarchy [10], [13], [5]. However, not
much work has been done on application-specific cache
design for low-power devices. The problem of cache de-
0-7695-3059-1/07 $25.00 © 2007 IEEE
DOI 10.1109/ADCOM.2007.88
Table 1. Machine configuration
tivity and total cache size. Improvements in cache performance are measured in terms of power consumption at the
expense of silicon area and miss rate. Our approach has the
added advantage of generating a reduced search space.
The remainder of this paper is organized as follows. In
Section 2, we outline our proposed algorithm for memory
space exploration. In Section 3, we describe our methodology and simulation framework. Section 4 presents our results and analyses them. Finally in Section 5, we conclude
with some final remarks and future direction of research.
Processor Core
  Issue Width                 1
  Decode Width                1
  Issue                       Inorder
  Integer Unit                1
  FP Unit                     1
  Mem Bus Width               4
Memory Hierarchy
  Memory Latency              24 cycles
  L1 IC                       Size - 32KB, Asso - 8, Block - 32B
  L1 DC                       Size - 32KB, Asso - 8, Block - 32B
  L1 Cache Latency            1 cycle
  L2 UC                       None
  L2 UCache Latency           6 cycles (if introduced)
  Processor Clock Frequency   233 MHz
2 Background
In this section, we provide the necessary background on
cryptographic hash functions. A cryptographic hash function h maps bitstrings of arbitrary finite length to strings
of fixed length. A one-way hash function must provide both
preimage resistance and second preimage resistance, i.e.,
it must be computationally infeasible to find, respectively,
any input which hashes to any pre-specified output, and any
second input which has the same output as any specified
input.
Custom designed iterative hash functions are the most
popular hash functions currently used. Among them, the
two most popular families are the SHA family and the MD
family. We select two representative hash functions from
these families, namely, SHA-1 [3] and MD5 [2]. Both are
iterated hash functions and operate on a round by round basis. Both SHA-1 and MD5 use a compression function as
their basic building block.
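Both functions are available off the shelf; as a quick illustration of the fixed-length output property described above, Python's standard hashlib module (an implementation detail of this sketch, not part of our experimental setup) computes both digests:

```python
import hashlib

# Illustration of the fixed-length output property: regardless of input
# length, SHA-1 emits a 160-bit digest and MD5 a 128-bit digest.
for msg in (b"", b"abc", b"a" * 1_000_000):
    sha1 = hashlib.sha1(msg).hexdigest()
    md5 = hashlib.md5(msg).hexdigest()
    assert len(sha1) == 40   # 160 bits, written as hex
    assert len(md5) == 32    # 128 bits, written as hex

print(hashlib.sha1(b"abc").hexdigest())
print(hashlib.md5(b"abc").hexdigest())
```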
We will denote a cache of size x Kbytes, block size y bytes, and associativity z as [xK, yb, z]. The speed
of execution will be denoted by the number of processor
cycles: the higher the number of processor cycles, the lower the
speed.
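This shorthand is easy to manipulate programmatically; a small helper (ours, for illustration, not from the original experiments) that unpacks [xK, yb, z] strings might look like:

```python
import re

def parse_config(s):
    """Unpack the [xK, yb, z] cache shorthand into
    (size_bytes, block_bytes, associativity)."""
    m = re.fullmatch(r"\[(\d+)K,\s*(\d+)b,\s*(\d+)\]", s.strip())
    if m is None:
        raise ValueError(f"not a cache configuration: {s!r}")
    size_kb, block_b, assoc = map(int, m.groups())
    return size_kb * 1024, block_b, assoc

print(parse_config("[2K,16b,2]"))     # L1 I-cache chosen for SHA-1
print(parse_config("[16K, 16b, 8]"))  # L2 U-cache chosen for SHA-1
```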
3.2
Experimental Procedure
Through our exploration strategy, we aim to obtain a
power-optimized cache configuration without compromising the speed or performance of the chosen application.
This requires us to set realistic initial bounds on the lowest power that can be consumed and the highest speed that
can be achieved. This motivates us to perform some experiments to set the aforementioned bounds. The experiments
and corresponding observations are described below:
3 Our Approach
1. Initially, we try to figure out the best cache size and
block size combination for each of the L1 I-cache,
L1 D-cache and L2 U-cache. For each combination,
we observe the average power and cycles consumed.
When we observe the variations in L1 I-cache configuration, L1 D-cache and L2 U-cache are kept fixed at
some preset base configuration. The case is similar while varying the L1 D-cache and L2 U-cache. Once the
cache and block size for a particular cache has been
fixed, we next vary its associativity.
In this section, we propose a semi-exhaustive strategy,
which has the advantages of an exhaustive exploration, but
at the same time brings down the huge computational cost
involved. Our strategy is termed ‘semi-exhaustive’ because it performs a few initial exhaustive explorations and
then gradually narrows down the search space in the later
stages.
3.1
Simulation Framework
2. Next, we carry out similar experiments on L1 U-cache.
Our experimental framework is based on the SimpleScalar toolset sim-panalyzer [4], targeted for the ARM
processor. The SimpleScalar framework generates a simulator targeted for a parameterized superscalar processor. Our
chosen baseline processor configuration is given in Table 1. We choose
three performance metrics, namely, cache size, number of
processor cycles, and average power consumption. The parameters varied are cache size, block size, and associativity.
The experiments are performed on a test input file of size
1MB.
For practical purposes, our chosen cache sizes were
2KB, 4KB, 8KB, 16KB, 32KB, and 64KB, and block
sizes were 8b, 16b, and 32b. The chosen associativity values were 1, 2, 4, and 8. The following observations were
made:
1. The fastest speed, i.e., the lowest number of processor
cycles, is obtained with the base setting of the L1 caches
together with a particular configuration of the L2 U-cache.
and 0.1206 mW, as our desired parameter bounds for speed
and power, respectively.
Results from Fig. 1(a)-1(d) are only useful for setting the
parameter bounds. Next, we proceed toward selecting an
optimal cache configuration which tries to reach the specified levels of performance. It is evident from Fig. 1(a) and
1(b) that power variation with changes in L1 I-cache configuration is much greater than that for the L1 D-cache. So, we first try to select the optimal L1 I-cache
configuration. Once the cache size having the lowest power
consumption has been selected, we next vary the associativity to test whether the power consumed can be reduced
further. This is evident from Fig. 1(a) and Fig. 2.
2. The lowest average power consumption is obtained for
some L1 U-cache configuration.
3. Keeping the L1 D-cache fixed at its base configuration,
varying only the L1 I-cache gives more power variation
than varying only the L1 D-cache with the
L1 I-cache fixed at its base configuration. In both
cases, no L2 cache is present.
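The size of the Step-1 sweep above can be tallied directly. The snippet below (an illustration we wrote, not the original experiment scripts) enumerates the size-and-block grid for the four caches considered:

```python
from itertools import product

# Parameter values used in the sweep (Section 3.2).
CACHE_SIZES_KB = [2, 4, 8, 16, 32, 64]
BLOCK_SIZES_B = [8, 16, 32]
ASSOCIATIVITIES = [1, 2, 4, 8]
CACHES = ["L1 IC", "L1 DC", "L1 UC", "L2 UC"]

# Step 1: for each cache in turn, sweep cache size x block size while the
# other caches stay at their base configuration; associativity is swept
# only later, after size and block have been fixed.
size_block_grid = list(product(CACHE_SIZES_KB, BLOCK_SIZES_B))
runs_step1 = len(size_block_grid) * len(CACHES)
print(len(size_block_grid))  # 18 (size, block) points per cache
print(runs_step1)            # 72 simulations before any associativity sweep
```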
As previously mentioned, based on these experiments
we aim to fix the parameter bounds. Observations 1 and 2
imply that the fastest speed is achieved
for some L2 U-cache configuration and the lowest power is consumed for some L1 U-cache configuration. So, we select the
highest speed among the L2 U-cache configurations and the lowest power consumed among the L1 U-cache configurations as
our desired performance levels. Subsequently, we fine-tune
the different parameters of our cache hierarchy so as to attain the specified levels of performance.
Another interesting observation is Observation 3: varying the L1 I-cache configuration results in
higher fluctuation of power consumption than varying the L1 D-cache configuration. This provides us with an opportunity to narrow down our search
space and make huge savings in exploration time. As the variation of power with changes in the L1 I-cache is greater, we
first aim to find an optimal L1 I-cache configuration.
Based on this optimal L1 I-cache configuration, we next select the best L1 D-cache configuration. Finally, based on
these optimal L1 I-cache and D-cache configurations, we
select an optimal L2 U-cache configuration, if one is required
at all. This approach spares us the burden of selecting an
optimal configuration from among all possible combinations of
cache and block sizes for all combinations of the L1
I-cache, L1 D-cache, and L2 U-cache. Thus, our proposed
methodology works by fixing an optimal L1 I-cache first,
then the L1 D-cache, and finally the L2 unified cache, if required.
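The staged narrowing just described can be sketched as follows. This is our reading of the strategy as executable pseudocode, not the original tooling: `simulate` is a stand-in for a sim-panalyzer run, and the fixed associativity of 8 in the first phase mirrors the base configuration.

```python
from itertools import product

SIZES = [2, 4, 8, 16, 32, 64]   # cache sizes in KB
BLOCKS = [8, 16, 32]            # block sizes in bytes
ASSOCS = [1, 2, 4, 8]

def staged_search(simulate, base):
    """Fix the L1 I-cache first, then the L1 D-cache, then the L2 U-cache.
    `simulate(config) -> (cycles, power_mW)` stands in for a simulator run;
    `config` maps cache name -> (size_kb, block_b, assoc)."""
    config = dict(base)
    for cache in ("l1i", "l1d", "l2u"):
        # Phase 1: choose size and block at a fixed associativity of 8,
        # minimising average power, with the other caches held fixed.
        size, block = min(
            product(SIZES, BLOCKS),
            key=lambda sb: simulate({**config, cache: (sb[0], sb[1], 8)})[1])
        # Phase 2: with size and block fixed, sweep associativity.
        assoc = min(
            ASSOCS,
            key=lambda a: simulate({**config, cache: (size, block, a)})[1])
        config[cache] = (size, block, assoc)
    return config
```

With a toy power model that grows with size, block, and associativity, the search correctly settles on the smallest configuration for each cache in turn.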
3.3
Figure 2. Cache Simulation Results, for L1 I-Cache Associativity [two plots for configuration CS2BL16: processor cycles (×10^7) and power (mW), each vs. associativity 1-8]
Finally, we select an L1 I-cache configuration of
[2K,16b,2], with the L1 D-cache and L2 U-cache at their base
configurations. This configuration consumes the lowest
average power, 0.1712 mW, at 55996825 processor cycles.
The speed of execution is within 0.25% of the highest
speed, while the average power consumption is nearly 42% higher
than the lowest power consumption result available. In an
effort to further reduce the power consumption, we next
vary the L1 D-cache configuration with the L1 I-cache fixed
at its new configuration.
Fig. 3(a) and Fig. 3(b) present the results for the next two
steps of selecting an optimal L1 D-cache. Proceeding as for the L1 I-cache, we
select [4K,8b,2] as the optimal L1 D-cache configuration.
The average power consumption is 0.1121 mW and the number
of processor cycles is 57546994. The speed of execution is
nearly 3% lower than the highest speed, but the power consumption is significantly lower, nearly 7.05% below
our initial bound. Finally, we perform the same experiments for the L2 U-cache (keeping the L1 I- and D-caches at their
new configurations) and identify [16K,16b,8] as the lowest-power configuration. The number of processor cycles is 55897061 and the average power consumption in
Case Study I: SHA-1
In this section, we give a detailed analysis of the results
for SHA-1 and explain our proposed experimental approach. The results of the experiments on the base configuration of SHA-1 are presented in Fig. 1(a)-1(d). The
highest speed is achieved for the base-configuration L1 I- and
D-caches with a [64K,8b,8] L2 U-cache. The processor cycles and average power (in mW) consumed by SHA-1 for
this particular configuration are 55858267 and 0.5808, respectively. Similarly, the lowest power is consumed
by a [2K,32b,8] L1 U-cache configuration; here, the average power consumed is 0.1206 mW and the
cycles consumed are 60839194. So, we set 55858267 cycles
Figure 1. Cache Simulation Results, for (a) L1 ICache Size, (b) L1 DCache Size, (c) L1 UCache Size, (d) L2 UCache Size [each panel plots processor cycles (×10^7) and power (mW) vs. cache size (2-64 KB) at associativity 8, for block sizes BL = 8, 16, and 32]
mW is 0.0948. The speed is only 0.07% lower than the highest
speed and the power consumption is nearly 21.4% lower than
our initial bound. Hence, in this case our suggested optimal
cache configuration is as given in Table 2.
Table 3. Results for MD5
Table 2. Optimal Cache configuration for SHA-1

Cache Name    Cache Size (in Kbytes)   Block Size (in Bytes)   Associativity
L1 I-cache    2                        16                      2
L1 D-cache    4                        8                       2
L2 U-cache    16                       16                      8

3.4
Case Study II: MD5
Table 3 provides the results of our experiments on the MD5
hash function. The first two rows give the initial parameter bounds. Our suggested optimal cache configuration for
MD5 is highlighted in Table 3. It has been observed that introducing an L2 U-cache improves speed by 0.04% but increases
power consumption by more than 2%, compared to the optimal cache configuration. Hence, our suggested optimal
cache configuration contains no L2 U-cache.
Cycles      Power Consumption (in mW)   Cache Configuration
22174351    .5719                       L1 Cache Default; L2 UCache [64K, 8b, 4]
222353914   .1811                       L1 UCache [8K, 8b, 8]
22187700    .2321                       L1 ICache [8K, 32b, 4]; L1 DCache Default; No L2 UCache
22190960    .1531                       L1 ICache [8K, 32b, 4]; L1 DCache [4K, 16b, 8]; No L2 UCache   (optimal)
22181948    .1561                       L1 ICache [8K, 32b, 4]; L1 DCache [4K, 16b, 8]; L2 UCache [4K, 16b, 8]
4 Proposed Methodology
Section 3 presents some experiments and observations.
Based on these observations, we propose a new cache design space exploration methodology. Over the past few
years, some cache design space exploration methodologies
have been suggested. However, most of them have ignored
old (Th). For the lowest power, the speed should be within the
threshold limit of the highest speed that can be achieved.
Step 1 generates the results for all possible combinations of
cache and block sizes for the L1 IC, L1 DC, L1 UC, and L2
UC. In each case, the caches other than the one under consideration are fixed at their base configurations. From the
results of Steps 1(c) and 1(d), we select two cache configurations: (a) the highest-speed configuration (SPEEDMAX), and
(b) the lowest-power configuration (POWERMIN).
SPEEDVAR contains the speed corresponding to the POWERMIN configuration. If SPEEDVAR is within the specified
threshold of SPEEDMAX, we select POWERMIN as
the optimal cache configuration. Otherwise, we select the lowest
power consumption from Steps 1(a) and 1(b). The corresponding cache and block size are fixed and associativity
results are generated. Next, we fix the L1 IC/DC configuration and update POWERMIN in Explore. Explore [Algorithm 1] updates POWERMIN if the current configuration has a lower power consumption than that of POWERMIN. It also updates TEMPPOWER, which serves as
the lower bound on power consumption in the next iteration. Explore fixes the configuration of the concerned cache
to that of POWERVAR. In a similar manner, Steps 3(c)-(e)
and Steps 3(f)-(h) select the optimal cache configurations for
L1 IC/DC and L2 UC, respectively. If the desired speed is
not achieved, we proceed to the next iteration with
the next-lowest power from 1(a) and 1(b) as POWERVAR.
Complexity Analysis: Let C be the number of cache sizes,
B the number of block sizes, and A the number of associativity
values. We measure complexity by the number of
simulations that must be performed. Our approach has
complexity k(5(CB + A) + A), where k is the number of iterations; the worst-case value of k is CB. This is much better
than exhaustive search, which must simulate every combination
of the three caches, a worst case of (C · B · A)^3.
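Plugging in the parameter counts used here (C = 6 cache sizes, B = 3 block sizes, A = 4 associativities) makes the gap concrete. The exhaustive count below assumes every combination of the three caches is simulated independently, which is our reading of the exhaustive baseline:

```python
C, B, A = 6, 3, 4          # cache sizes, block sizes, associativity values

# Semi-exhaustive cost: k(5(CB + A) + A) simulations, worst case k = CB.
k = C * B
semi_exhaustive = k * (5 * (C * B + A) + A)
print(semi_exhaustive)     # 18 * (5*22 + 4) = 2052 simulations

# Exhaustive baseline (assumption: each of the three caches independently
# takes any of its C*B*A configurations).
exhaustive = (C * B * A) ** 3
print(exhaustive)          # 72**3 = 373248 simulations
```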
Figure 3. Cache Simulation Results, for (a) L1 DCache Size, (b) L1 DCache Associativity [panel (a): processor cycles (×10^7) and power (mW) vs. cache size (2-64 KB) at associativity 8, for BL = 8, 16, 32; panel (b): the same metrics vs. associativity 1-8 for configuration CS4BL8]
the importance of energy/power as a performance metric. Also, the main aim of most of these suggested methodologies is to improve the speed of the concerned application. Our methodology, on the other hand,
tries to narrow down the search space and thus reduce the
computational cost involved. Moreover, our approach aims
to achieve a low-power cache configuration, while maintaining a moderately high speed.
5 Conclusions and Future Work
In this paper, we have proposed a ‘semi-exhaustive’
methodology for application-specific cache design space
exploration. Traditionally, exhaustive strategies are used
to select an optimal cache configuration from an available
design space. Exhaustive strategies guarantee optimality.
But an obvious disadvantage of exhaustive strategies lies
in the possible ‘blow-up’ of the search space. Thus, our strategy offers a speed- and power-optimized cache configuration
with the added advantage of reduced search space. The experimental approach toward the proposed methodology has
been described in detail and results have been presented for
cryptographic hash functions, SHA-1 and MD5.
Future work lies in applying and validating this methodology on other classes of benchmark applications, namely
consumer, networking, telecom, etc.
Algorithm 1 Explore Function
Input: POWERVAR, SPEEDVAR, Cache under consideration
Output: Set configuration for cache under consideration
1. if (POWERVAR < POWERMIN) and
(SPEEDVAR ≤ (1+Th)*SPEEDMAX) then
POWERMIN ← POWERVAR
endif
2. TEMPPOWER = MIN(TEMPPOWER, POWERVAR)
3. Fix configuration of cache to that of POWERVAR
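Algorithm 1 can be rendered directly in Python. This is a sketch following the pseudocode above: variable names mirror the algorithm, SPEEDMAX and SPEEDVAR are treated as cycle counts (so "within threshold" means at most (1 + Th) times the minimum cycle count), and the default Th of 5% is an arbitrary placeholder:

```python
def explore(state, power_var, speed_var, cache, config, th=0.05):
    """Explore function of Algorithm 1. `state` carries POWERMIN, SPEEDMAX,
    and TEMPPOWER across calls; `config` is the candidate (size, block,
    assoc) tuple for the cache under consideration."""
    # Step 1: accept the candidate as the new power minimum only if it is
    # also fast enough, i.e. within Th of the best speed found in Step 1.
    if power_var < state["POWERMIN"] and speed_var <= (1 + th) * state["SPEEDMAX"]:
        state["POWERMIN"] = power_var
    # Step 2: TEMPPOWER lower-bounds the power search in the next iteration.
    state["TEMPPOWER"] = min(state["TEMPPOWER"], power_var)
    # Step 3: fix the cache under consideration to the candidate config.
    state[cache] = config
    return state
```

A fast-enough candidate lowers POWERMIN; a too-slow one still updates TEMPPOWER and still fixes the cache configuration, exactly as the pseudocode prescribes.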
Algorithms 1 and 2 describe the proposed memory exploration scheme. The input to our algorithm is the base
configuration of the processor and percentage speed thresh-
References
Algorithm 2 Proposed memory exploration scheme
Input: Percentage Speed Threshold (Th) and Base Cache Configuration
Output: Optimal Cache Configuration with lowest power and moderate
speed
Assumption: Other than the cache under consideration, all caches remain in
their last-set configurations, wherever applicable.
Initializations: TEMPPOWER = INFINITY
[1] Interactive CACTI. http://www.ece.ubc.ca/~stevew/cacti/.
[2] MD5 specifications. http://theory.lcs.mit.edu/~rivest/Rivest-MD5.txt.
[3] SHA-1 specifications. http://csrc.nist.gov/CryptoToolkit/tkhash.html.
[4] SimpleScalar toolset. http://www.simplescalar.com.
[5] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas. Memory hierarchy reconfiguration for energy
and performance in general-purpose processor architectures.
In MICRO 33: Proceedings of the 33rd annual ACM/IEEE
international symposium on Microarchitecture, pages 245–
257, 2000.
[6] D. Kirovski, C. Lee, M. Potkonjak, and W. Mangione-Smith.
Synthesis of power efficient systems-on-silicon. In Asian
South Pacific Design Automation Conference, 1998.
[7] X. Li, H. S. Negi, T. Mitra, and A. Roychoudhury. Design space exploration of caches using compressed traces. In
ICS ’04: Proceedings of the 18th ACM Annual International
Conference on Supercomputing, pages 116–125, 2004.
[8] Y. Li and J. Henkel. A framework for estimation and minimizing energy dissipation of embedded hw/sw systems. In
DAC ’98: Proceedings of the 35th annual conference on Design automation, pages 188–193, 1998.
[9] R. L. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation
techniques for storage hierarchies. IBM Systems Journal,
9(2):78–117, 1970.
[10] P. Petrov and A. Orailoglu. Towards effective embedded processors in codesigns: customizable partitioned caches. In
CODES ’01: Proceedings of the ninth international symposium on Hardware/software codesign, pages 79–84, 2001.
[11] T. Sato. Evaluating trace cache on moderate-scale processors. IEE Proceedings - Computers and Digital Techniques, 147(6), 2000.
[12] W. Shiue and C. Chakrabarti. Memory exploration for low
power embedded systems. Journal of VLSI Signal Processing, pages 167–178, Nov 2001.
[13] C. Su and A. Despain. Cache design trade-offs for power and
performance optimization: A case study. In International
Symposium on Low Power Electronics and Design, 1995.
1. for (cache sizes) do
for (block sizes) do
(a) Generate results for L1 IC
(b) Generate results for L1 DC
(c) Generate results for L1 UC
(d) Generate results for L2 UC
endfor
endfor
2. POWERMIN ← Lowest power from Step 1(c)
SPEEDMAX ← Highest speed from Step 1(d)
SPEEDVAR ← Speed corresponding to POWERMIN
3. if (SPEEDVAR > (1+Th)*SPEEDMAX) then
(a) Fix L1 IC or L1 DC cache configuration
POWERVAR ← Lowest power from 1(a) or 1(b)
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L1IC/DC)
(b) Generate associativity results for cache configuration 3(a)
POWERVAR ← Lowest power
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L1IC/DC)
(c) Fix the L1 cache configuration, other than the one in 3(a)
for (cache sizes) do
for (block sizes) do
Generate results for L1 IC or L1 DC
endfor
endfor
(d) POWERVAR ← Lowest power from 3(c)
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L1IC/DC)
(e) Generate associativity results for cache configuration 3(d)
POWERVAR ← Lowest power
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L1IC/DC)
(f) Fix L2 UC cache configuration
for (cache sizes) do
for (block sizes) do
Generate results for L2 UC
endfor
endfor
(g) POWERVAR ← Lowest power from 3(f)
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L2UC)
(h) Generate associativity results for cache configuration 3(g)
POWERVAR ← Lowest power
SPEEDVAR ← Speed corresponding to POWERVAR
OptConfig ← Explore(POWERVAR,SPEEDVAR,L2UC)
(i) if (POWERMIN not updated)
POWERMIN ← TEMPPOWER
repeat
endif
endif
4. Set the cache configuration for POWERMIN as the optimal configuration