Energy-Aware Design Space Exploration of
RegisterFile for Extensible Processors
Amir Yazdanbakhsh1, Mehdi Kamal2, Mostafa E. Salehi1, Hamid Noori1, and Sied Mehdi Fakhraie1
1 Silicon Intelligence and VLSI Systems Laboratory
2 Low-Power High-Performance Nanosystems Laboratory
School of Electrical and Computer Engineering, University of Tehran
Tehran 14395-515, Iran
{a.yazdanbakhsh, m.kamal}@ece.ut.ac.ir, mersali@ut.ac.ir, noori@cad.ut.ac.ir, fakhraie@ut.ac.ir
Abstract - This paper describes an energy-aware methodology
that identifies custom instructions for critical code segments,
given the available data bandwidth constraint between custom
logic and a base processor. Our approach enables designers to
optionally constrain the number of input and output operands
for custom instructions to reach acceptable performance while
considering the energy dissipation of the registerfile. We describe
a design flow to identify promising area, performance, and power
tradeoffs. We study the effect of custom instruction I/O
constraints and registerfile input/output (I/O) ports on overall
performance and energy usage of the registerfile. Our
experiments show that, in most cases, the solutions with the
highest performance are not identified with relaxed I/O
constraints. Results for packet-processing benchmarks covering
cryptography and lookup applications are shown, with speed-ups
between 25% and 40%, and energy reduction between 20% and
30%.
Keywords: custom instruction, extensible processors, network
processors, embedded processors, registerfile, performance, energy
dissipation.
I. INTRODUCTION
Embedded systems are special-purpose systems which perform specific tasks with predefined performance and power requirements. Using a general-purpose processor for such systems often results in a design that does not meet the performance and power demands of the application. On the other hand, the ASIC design cycle is too costly and too time-consuming for the embedded application market. Recent developments in customized processors significantly improve the performance metrics of a general-purpose processor by coupling it with application-specific hardware, and address the issues of ASICs such as lack of flexibility and high design and manufacturing cost and time.
Maximizing the performance while minimizing the chip
area and power consumption is usually the main goal of
embedded processor design. Designers carefully analyze the
characteristics of the target applications and fine tune the
implementation to achieve the best tradeoffs. The most popular
strategy is to build a system consisting of a number of
specialized application-specific functional units coupled with a
low-cost and optimized general-purpose processor as a base
processor with a basic instruction set (e.g., ARM [1] or MIPS [2]).
The base processor is augmented with custom-hardware units
that implement application-specific instructions (custom
instructions).
There are a number of benefits in augmenting the core
processor with new instructions. First, the system is
programmable and supports modest changes to the application,
such as bug fixes or incremental modifications to a standard.
Second, the computationally intensive portions of applications
in the same domain are often similar in structure. Thus, the
customized processors can often be generalized to have
applicability across a set of applications.
In recent years, customized extensible processors have offered the possibility of extending the instruction set for a specific application. A customized processor consists of a microprocessor core that is tightly coupled with functional units (FUs) that allow critical parts of the application to be
implemented in hardware using a specialized instruction set. In
the context of customized processors, hardware/software
partitioning is done at the instruction-level. Basic blocks of the
application are transformed into data-flow graphs (DFGs),
where the graph nodes represent operations similar to those in
assembly languages, and the edges represent data dependencies
between the nodes. Instruction set extension exploits a set of
custom instructions (CIs) to achieve considerable performance
improvements by executing the hot-spots of the application on
hardware.
Extension of the instruction set with new CIs can be
divided into instruction generation and instruction selection
phases. Given the application code, instruction generation
consists of clustering some basic operations into larger and
more complex operations. These complex operations are
entirely or partially identified by subgraphs which can cover
the application graph. Once the subgraphs are identified, these
are considered as single complex operations and they pass
through a selection process. Generation and selection are
performed with the use of a guide function and a cost function
respectively, which take into account constraints that the new
instructions have to satisfy for hardware implementation.
Partitioning an application into base-processor instructions
and CIs is done under certain constraints. First, there is a
limited area available in the custom logic. Second, the data
bandwidth between the base processor and the custom logic is
limited, and the data transfer costs have to be explicitly
evaluated. Next, only a limited number of input and output
operands can be encoded in a fixed-length instruction word.
Since the speed-up obtainable by custom instructions is limited
by the available data bandwidth between the base processor
and the custom logic, extending the core registerfile to support
additional read and write ports improves the data bandwidth.
However, additional ports result in increased registerfile area,
power consumption, and cycle time.
Results in [3]-[6] show that a major share of the power in the processor datapath is consumed in the registerfile. Therefore, reducing the power consumption of the registerfile has a great impact on the overall power consumption and temperature of the processor. On the other hand, augmenting the processor with CIs, and hence improving the performance, requires an increase in the data bandwidth of the registerfile of the base processor. Although extending the registerfile ports increases the power consumption of the system, we expect lower energy dissipation because the number of accesses to the registerfile is reduced. This paper presents a
systematic approach for generating and selecting the most
profitable CI candidates. Our investigations show that
considering the architectural constraints in the custom
instruction selection leads to improvements in the total
performance and energy dissipation of the registerfile.
The remainder of this paper is organized as follows: in the following section, we discuss some existing work and state the main contributions of this paper. Section III presents a motivational example, and Section IV describes the overall approach of the work. Section V describes the experimental setup, and in Section VI the experimental results for some packet-processing benchmarks are presented. Finally, Section VII concludes the paper with some considerations for future work.
II. RELATED WORK
The bandwidth of registerfile is a limitation in improving
the performance of customized processors. Many techniques
have been proposed to cope with this limitation. The Tensilica
Xtensa [7] uses custom state registers to explicitly move
additional input and output operands between the base
processor and custom units. Use of shadow registers [8] and
exploitation of forwarding paths of the base processor [9] can
improve the data bandwidth.
Kim [10] developed two techniques for reducing the number of registerfile ports: a pre-fetch technique with a pre-fetch buffer and request queue, and a delayed write-back technique. Park [11] proposed two techniques for reducing the read and write ports of the registerfile: a bypass hint technique to decrease the required read ports, and register banking to decrease the required write ports. Karuri [12] developed a cluster-based registerfile to overcome the access-port limitation. The proposed technique is adapted from the idea of VLIW architectures and uses more area in comparison with conventional registerfiles. All of the investigated techniques try to deal with registerfile I/O constraints by changing the registerfile architecture. On the other hand, there are some techniques that address CI selection. However, improving the CIs with these techniques has an effect on the processor's performance.
Sun et al. [13] imposed no explicit constraints on the
number of input and output operands for CIs and formulated CI
selection as a mathematical optimization problem. Atasu et
al. [14] introduced constraints on the number of input and
output operands for subgraphs and showed that exploiting
constraint-based techniques could significantly reduce the
exponential search space. The work of Atasu et al. showed that
clustering based approaches (e.g., [15]) or single output
operand restriction, (e.g., [16]) could severely reduce the
achievable speed-up using custom instructions. In [17], Biswas
et al. proposed a heuristic which is based on input and output
constraints. This approach does not evaluate all feasible
subgraphs. Therefore, an optimal solution is not guaranteed.
In [18], Bonzini and Pozzi derived a polynomial bound on the
number of feasible subgraphs while the number of inputs and
outputs for the subgraphs are fixed. However, the complexity
grows exponentially as the I/O constraints are relaxed. Yu and Mitra [19] enumerate only connected subgraphs having up to four input and two output operands and do not allow overlapping between selected subgraphs. Atasu [20] proposed a technique for CI identification that generates a set of candidate templates using integer linear programming (ILP); the best CIs are then selected based on a Knapsack model. Besides, the authors studied the effect of the registerfile ports on the performance of the selected CIs. Additionally, research on CI identification techniques has shown that increasing the number of input and output ports of the selected CIs improves the performance [20].
Although there has been a large amount of work in the
literature to improve the performance and automation of CI
generation, little has been done to examine the power
consumption of the generated custom instructions. Monitoring
this power behavior may provide new directions in the ASIP
design. Bonzini [21] augmented Wattch [22] with a model of
the power consumption of functional units (using a
combination of RTL and gate-level power modeling). They
have shown that a well-designed CI set may reduce register
and memory accesses, and hence, improve the overall power
consumption of an embedded system. Zyuban [23] carried out a thorough study of the energy consumption of registerfiles with different numbers of read and write ports. Additionally, the impact of technology and of different registerfile access techniques was investigated.
In this paper, we target customized architectures for
exploring the data bandwidth between the base processor and
the custom logic considering the available registerfile ports.
Our main contributions are as follows: 1) a novel methodology
for extracting CIs, which optionally constrains the number of
input and output operands of custom instructions and explicitly
evaluates the data-transfer costs in terms of latency, area, and
power consumption; 2) an evaluation of the impact of power,
area, and registerfile port constraints on the execution cycle
count of a packet-processing benchmark.
III. MOTIVATIONAL EXAMPLE
In this paper, we use I/O constraints to control the
granularity of the CIs and manage the power consumption of
the registerfile. When the I/O constraints are tight, we are more
likely to identify fine-grain CIs (i.e., CIs including a small number of operation nodes). Relaxing the constraints results in coarse-grain custom instructions, which are likely to provide higher speed-ups, although at the expense of increased area and power consumption.
The MIPS processor datapath is synthesized with Synopsys® Design Compiler, and the power of each pipeline stage is evaluated with Synopsys® Power Compiler in 90nm technology. The registerfile is considered as a separate hardware (HW) block. The registerfile is accessed in the decode stage to read the instruction operands and in the writeback stage to write back the results. As Figure 1 shows, nearly 55% of the processor datapath power is consumed by the registerfile. These results motivate us to consider registerfile power consumption in the custom instruction selection algorithm.
FIGURE 1. POWER CONSUMPTION OF THE HW BLOCKS IN THE MIPS PROCESSOR (FETCH, DECODE, EXECUTE, MEMORY, REGISTER ACCESS, WB).

We have modeled and synthesized registerfiles with different numbers of input and output ports with Synopsys® Design Compiler in 90nm technology, and compared the area overhead of these registerfiles against the MIPS registerfile. We introduce the (RFin, RFout) notation, meaning (number of read ports, number of write ports), to distinguish different registerfiles. Based on this definition, the MIPS registerfile, which has two read ports and one write port, is a (2,1) registerfile. The area overhead of increasing the read and write ports of the registerfile relative to the MIPS registerfile (i.e., (2,1)) is shown in Figure 2. As shown, each additional read port increases the area of the registerfile by approximately 20%, and each additional write port by nearly 10%.

FIGURE 2. AREA OVERHEAD OF INCREASING INPUT AND OUTPUT PORTS OF DIFFERENT REGISTERFILES VERSUS (2,1) REGISTERFILE.

Beside the area overhead, power consumption is another main concern in designing application-specific processors. Based on Figure 1, the registerfile has the highest power dissipation among the HW blocks of the MIPS processor. On the other hand, just as the number of I/O ports affects the area overhead, it also affects the power consumption of the registerfile. Since custom instructions in extensible processors have different numbers of read and write operands, the power consumption of the registerfile depends on the I/O ports of the registerfile as well as the I/O ports of the CIs. Consequently, the power consumption of the registerfile differs across the access patterns of different CIs. Hence, in the following paragraph, we show the power per access (PPAC) of a registerfile under different numbers of I/Os.

Figure 3 depicts the power consumption per access of registerfiles with different (RFin, RFout) versus custom instructions with different numbers of input ports (CIin) and output ports (CIout). The results are normalized to the power consumption of the base registerfile model (i.e., (2,1)). The power of the base model is the power per access of a CI with two reads and one write to the (2,1) registerfile. Thus, the value of each point is calculated by the following equation:

NormalizedPPAC = ( P((CIin, CIout), (RFin, RFout)) − P((2,1), (2,1)) ) / P((2,1), (2,1))

(EQ. 1) NORMALIZED POWER PER ACCESS.

where P((CIin, CIout), (RFin, RFout)) is the PPAC of a CI with CIin input and CIout output ports accessing a registerfile with RFin read and RFout write ports.
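The normalization of (Eq. 1) is simply the relative change of a measured power-per-access value against the (2,1)-on-(2,1) base access. A minimal Python sketch, with purely illustrative numbers:

def normalized_ppac(p_measured, p_base):
    """Eq. 1: relative change of power per access w.r.t. the (2,1) CI / (2,1) registerfile base."""
    return (p_measured - p_base) / p_base

# e.g., a (3,2) CI on a (3,2) registerfile measured at 1.25 units/access vs a 1.00 base
print(normalized_ppac(1.25, 1.00))   # 0.25 -> 25% more power per access than the base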
The results show that a CI with (CIin, CIout) = (2,2) has the minimum power per access. The results also show that increasing RFout leads to larger improvements in power consumption, while the power per access increases when RFout is less than CIout. Furthermore, in most cases the power consumption is improved when CIin (CIout) is equal to RFin (RFout). On the other hand, when the number of read and write ports is (3,2), (4,2), or (4,4), the average access power is smaller than for the other configurations. The average of the normalized PPAC values is shown in Figure 4. As shown, different (RFin, RFout) configurations yield different PPAC values. Another observation is the strong effect of RFout on the PPAC: for the same RFout, increasing RFin improves the average PPAC, and RFout = 1 leads to the worst PPAC values.
Since the power consumption and performance of a running application depend on the registerfile and custom instruction I/O constraints, in this paper we propose a framework that explores the design space to find the optimum configuration among the competing high-performance and low-power configurations.
FIGURE 3. EFFECT OF IO CONSTRAINT AND REGISTERFILE PORTS ON THE POWER PER ACCESS VERSUS (RFIN, RFOUT) = (2,1) AND (CIIN, CIOUT) = (2,1).
FIGURE 4. THE AVERAGE OF THE NORMALIZED VALUE OF THE POWER PER ACCESS.
IV. OVERALL APPROACH
Figure 5 depicts the complete flow of the proposed framework for obtaining the optimum set of custom instructions. The CI generation and selection step evaluates performance while considering the energy per access of the registerfile for different numbers of read and write ports. Our proposed framework is composed of three sub-frameworks:
1. Custom instruction generation and selection framework.
2. Registerfile power estimation framework.
3. Performance per energy calculation framework.
The custom instruction generation and selection framework contains two main parts: i) match enumeration and ii) template generation. The match enumeration algorithm is based on a binary search tree [24] and explores all possible custom instructions considering the following constraints for each custom instruction:
• The number of input/output ports of the match should be less than or equal to the allowed input/output ports.
• Each custom instruction should not include any memory operations such as Load or Store instructions.
• The identified custom instruction should be convex. Convexity means that no path exists between two selected nodes via another node that is not in the set of selected nodes.
A minimal sketch of these feasibility checks is given below.
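The sketch assumes a DFG stored as successor/predecessor adjacency dictionaries plus an opcode map; the representation and names are illustrative, not the framework's actual data structures, and the check shown is only the feasibility test, not the enumeration algorithm of [24].

MEMORY_OPS = {"load", "store"}

def external_inputs(preds, cand):
    """Operands produced outside the candidate. (Live-in operands with no
    producer node in the DFG would also count toward the input ports.)"""
    return {p for n in cand for p in preds.get(n, []) if p not in cand}

def external_outputs(succs, cand):
    """Values produced inside the candidate and consumed outside it."""
    return {n for n in cand if any(s not in cand for s in succs.get(n, []))}

def is_convex(succs, cand):
    """Convex: no path leaves the candidate through an outside node and re-enters it."""
    reachable = set()
    frontier = [s for n in cand for s in succs.get(n, []) if s not in cand]
    while frontier:
        n = frontier.pop()
        if n in reachable:
            continue
        reachable.add(n)
        frontier.extend(succs.get(n, []))
    return not (reachable & cand)

def is_feasible(succs, preds, opcode, cand, max_in, max_out):
    cand = set(cand)
    if any(opcode[n] in MEMORY_OPS for n in cand):
        return False                                  # no Load/Store inside a CI
    if len(external_inputs(preds, cand)) > max_in:
        return False                                  # input-port constraint
    if len(external_outputs(succs, cand)) > max_out:
        return False                                  # output-port constraint
    return is_convex(succs, cand)                     # convexity constraint

# Example: edges 1->2->4 and 1->3->4; candidate {1, 2, 4} is not convex (path via node 3)
succs = {1: [2, 3], 2: [4], 3: [4]}
preds = {2: [1], 3: [1], 4: [2, 3]}
opcode = {1: "add", 2: "xor", 3: "shift", 4: "and"}
print(is_feasible(succs, preds, opcode, {1, 2, 4}, 4, 2))     # False
print(is_feasible(succs, preds, opcode, {1, 2, 3, 4}, 4, 2))  # True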
After this step, all structurally equivalent valid custom instructions are categorized into isomorphic classes, called templates. Each template is assigned a number that indicates the performance gain of that template for a given number of registerfile read and write ports. The following formula shows the performance estimation for each template:

Performance = Itr × ( SWLatency − HWLatency − ⌈(CIin − RFin)/RFin⌉ − ⌈(CIout − RFout)/RFout⌉ )

(EQ. 2) PERFORMANCE OF EACH TEMPLATE.
Itr: iteration count (execution frequency) of the basic block.
SWLatency: software latency of the template, i.e., the number of cycles the processor needs to execute the selected instructions in pure software without any CIs.
HWLatency: hardware latency of the template, i.e., the number of cycles the processor augmented with CIs needs to execute the selected custom instruction.
CIin: number of custom instruction input ports.
CIout: number of custom instruction output ports.
RFin: number of read ports of the registerfile.
RFout: number of write ports of the registerfile.
The terms ⌈(CIin − RFin)/RFin⌉ and ⌈(CIout − RFout)/RFout⌉ denote the numbers of cycles that are missed due to the pipelining overhead resulting from the limited number of read and write ports of the registerfile.
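A compact restatement of (Eq. 2) in Python; note that the port-overhead terms above are a reconstruction of the original (partially illegible) equation, so this sketch should be read as an approximation of the template performance estimate rather than the exact published formula:

from math import ceil

def extra_transfer_cycles(operands, ports):
    """Cycles lost moving operands that do not fit in the available registerfile ports."""
    return max(0, ceil((operands - ports) / ports))

def template_performance(itr, sw_latency, hw_latency, ci_in, ci_out, rf_in, rf_out):
    """Estimated cycle savings of a template (Eq. 2), weighted by its execution count."""
    overhead = (extra_transfer_cycles(ci_in, rf_in)
                + extra_transfer_cycles(ci_out, rf_out))
    return itr * (sw_latency - hw_latency - overhead)

# Example: a template executed 1,000 times, 6 cycles in software, 2 in hardware,
# with 4 inputs / 2 outputs on a (3, 2) registerfile.
print(template_performance(1000, 6, 2, 4, 2, 3, 2))   # 1000 * (6 - 2 - 1 - 0) = 3000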
FIGURE 5. THE PROPOSED FRAMEWORK (application profiling and DFG extraction feed the custom instruction selection framework, the registerfile power estimation framework, and the performance × energy reduction (PER) calculation framework).
After evaluating the performance of each template, the objective is to find the maximum weighted independent set (MWIS) of these templates. We have exploited the MWIS algorithm proposed in [25].
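As a stand-in illustration of the selection objective only (not the algorithm of [25]), the brute-force sketch below picks the highest-weight set of mutually non-overlapping templates, where two templates conflict when their node sets share a DFG node; it is exponential and intended purely to show what is being optimized:

from itertools import combinations

def compatible(combo):
    """True if no two selected templates share a DFG node."""
    seen = set()
    for nodes, _ in combo:
        if seen & nodes:
            return False
        seen |= nodes
    return True

def select_templates(templates):
    """templates: list of (frozenset_of_dfg_nodes, performance_weight) pairs.
    Brute-force maximum-weight independent set over the overlap-conflict graph."""
    best, best_weight = (), 0
    for k in range(1, len(templates) + 1):
        for combo in combinations(templates, k):
            if compatible(combo):
                weight = sum(w for _, w in combo)
                if weight > best_weight:
                    best, best_weight = combo, weight
    return list(best), best_weight

# Example: three overlapping candidates
t = [(frozenset({1, 2}), 30), (frozenset({2, 3}), 25), (frozenset({4, 5}), 20)]
print(select_templates(t))   # picks the first and third templates, total weight 50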
The registerfile power estimation framework calculates the power per access to the registerfile for varying numbers of read and write ports. The flow of this framework begins with the implementation of registerfiles with various configurations of read and write ports. These registerfiles are synthesized with Synopsys® Design Compiler and the TSMC 90nm library to extract a gate-level netlist. The synthesized netlist of each configuration is simulated with ModelSim® to capture the switching activities and generate value change dump (VCD) files. Meanwhile, the layouts are generated for the different registerfile configurations with Cadence® SoC Encounter and the TSMC 90nm library. After layout generation, the VCD files are imported into SoC Encounter® to calculate accurate power for each configuration. The configurations cover the number of read and write ports and the number of simultaneous read and write accesses to the registerfile.
The energy dissipation of the registerfile is directly dependent on the application. In other words, since the CIs and the number of accesses to the registerfile differ across applications, we study the energy dissipation of the registerfile for each application using the following formula:
EnergyReduction = 1 − ( Σ_{(CIin, CIout) ∈ S} Pwr((CIin, CIout), (RFin, RFout)) × Acs(CIin, CIout) ) / ( Pwr((2,1), (2,1)) × Acs(2,1) )

(EQ. 3) APPLICATION ENERGY REDUCTION.
where EnergyReduction is the reduced energy dissipation and shows how much the energy is reduced when the CIs are added to the processor, S is the set of different I/O port counts of the CIs, Acs(CIin, CIout) is the number of accesses to the registerfile that take place with CIs having (CIin, CIout) input and output ports, and Pwr((2,1),(2,1)) × Acs(2,1) is the Power×Access value of the registerfile when there are no CIs. We also introduce a new metric to evaluate the identified CIs considering both energy and performance. This metric, named PER, is shown in the following equation:
PER = Performance Improvement × Energy Reduction

(EQ. 4) PER METRIC (CONSIDERING BOTH ENERGY REDUCTION AND PERFORMANCE IMPROVEMENT).
This metric means that a set of CIs that improves performance and meanwhile reduces the energy of the registerfile is more profitable to select. Finally, the performance per area overhead is another metric that can be considered in CI selection. This metric is calculated by the following equation:
PPA = Performance Improvement / Area Overhead

(EQ. 5) PPA METRIC (CONSIDERING BOTH AREA OVERHEAD AND PERFORMANCE IMPROVEMENT).
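Taken together, (Eq. 3)-(Eq. 5) can be evaluated from a per-access power table and per-application access counts. The Python sketch below is a minimal illustration with made-up container shapes (dictionaries keyed by port tuples) and toy numbers, not the framework's actual data structures or measured values:

def energy_reduction(pwr, acs, rf_ports, baseline_accesses):
    """Eq. 3: 1 - (registerfile energy with CIs) / (registerfile energy without CIs).
    pwr[((ci_in, ci_out), (rf_in, rf_out))]: power per access
    acs[(ci_in, ci_out)]: registerfile accesses performed with that CI port mix
    baseline_accesses: accesses of the original code on the (2, 1) registerfile."""
    with_cis = sum(pwr[(io, rf_ports)] * n for io, n in acs.items())
    without_cis = pwr[((2, 1), (2, 1))] * baseline_accesses
    return 1 - with_cis / without_cis

def per(performance_improvement, energy_red):
    """Eq. 4: Performance x Energy Reduction."""
    return performance_improvement * energy_red

def ppa(performance_improvement, area_overhead):
    """Eq. 5: performance improvement per unit of area overhead."""
    return performance_improvement / area_overhead

# Toy numbers, only to show the shape of the computation:
pwr = {((2, 1), (2, 1)): 1.00, ((2, 1), (3, 2)): 1.10, ((3, 2), (3, 2)): 1.25}
acs = {(2, 1): 600, (3, 2): 150}
er = energy_reduction(pwr, acs, (3, 2), baseline_accesses=1000)
print(er, per(0.30, er), ppa(0.30, 0.30))   # roughly 0.15 energy reduction for this toy input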
V. EXPERIMENTAL SETUP
We define a single-issue baseline processor based on MIPS
architecture including 32×32-bit general-purpose registers in a
5-stage pipeline. We do not constrain the number of input and
output operands in custom instruction generation. However, we
explicitly account for the data-transfer cycles between the base
processor and the custom logic if the number of inputs or
outputs exceeds the available registerfile ports. CHIPS [20] assumes two-cycle software latencies for integer multiplication instructions and single-cycle software latencies for the rest of the integer operations. Since the latencies of arithmetic and logical operations differ considerably, assuming single-cycle latency for all logical and arithmetic instructions may lead to non-optimal results. We assume more accurate and fair values for the latencies of different operations. In software
we assume single-cycle latency for each operation. However,
in hardware, we evaluate the latency of arithmetic and logic
operators by synthesizing them with a 90-nm CMOS process
and normalize them to the delay of the MIPS processor. Table
1 shows the normalized delay and area of some VEX
operators [26].
TABLE 1. NORMALIZED VALUES OF DELAY AND AREA OF ARITHMETIC AND LOGIC OPERATORS.

VEX Operation     Normalized Delay    Normalized Area
Add               77%                 0.3%
And               1%                  0.03%
Eqt               6%                  0.1%
imul16            77%                 1.6%
imul32            144%                6.3%
wallace-mul16     97%                 1.4%
wallace-mul32     199%                5.8%
Shadd             76%                 0.4%
Shift             8%                  0.3%
Slct              1%                  0.1%
Xor               1%                  0.1%
VEX is composed of many components whose main objective is to simulate, analyze, and compile C programs for VLIW processor architectures. VEX also has the capability to extract DFGs and CFGs from C programs. We use this capability to extract CFGs from the domain-specific benchmarks. The extracted CFGs are converted to an intermediate format that is known to our custom instruction selection framework. In addition, with the help of gcov and gprof [27] in conjunction with GCC, the code coverage and the number of iterations of each basic block are measured for the dynamic execution of the domain-specific benchmarks. These numbers and the intermediate format are processed by our custom instruction selection framework to find a set of CIs that increases the performance of the domain-specific benchmarks.

The accumulated software latency of a custom instruction candidate subgraph estimates its execution time on a single-issue processor. The hardware latency of the custom instruction, executed as a single instruction, is approximated by the number of cycles equal to the ceiling of the sum of hardware latencies over the custom instruction critical path. The difference between the software and the hardware latency is used to estimate the speedup. Input and output violations are taken into account by penalties in the fitness function. We do not include division operations in custom instructions due to their high latency and large area overhead. We have also excluded memory access instructions from custom instructions, to avoid nondeterministic latencies due to the cache memory system. We assume that, given RFin read ports and RFout write ports supported by the registerfile, RFin input operands and RFout output operands can be encoded within a single instruction word.
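As an illustration of this latency model, a candidate's hardware latency can be taken as the ceiling of the accumulated normalized operator delay along the DFG critical path (using Table 1-style delays as weights), and its software latency as one cycle per operation. The graph helpers and delay values below are illustrative assumptions, not the framework's code:

from math import ceil

# Normalized operator delays (fraction of the MIPS cycle), a subset of Table 1.
DELAY = {"add": 0.77, "and": 0.01, "xor": 0.01, "shift": 0.08, "imul16": 0.77}

def hw_latency(nodes, preds, op):
    """Ceiling of the longest accumulated delay path through the candidate subgraph."""
    longest = {}
    def path(n):
        if n not in longest:
            best_pred = max((path(p) for p in preds.get(n, []) if p in nodes), default=0.0)
            longest[n] = best_pred + DELAY[op[n]]
        return longest[n]
    return ceil(max(path(n) for n in nodes))

def sw_latency(nodes):
    """One cycle per operation when executed as plain instructions."""
    return len(nodes)

# Example: an xor -> add -> shift chain plus an independent and
op = {1: "xor", 2: "add", 3: "shift", 4: "and"}
preds = {2: [1], 3: [2]}
nodes = {1, 2, 3, 4}
print(sw_latency(nodes), hw_latency(nodes, preds, op))   # 4 software cycles vs ceil(0.86) = 1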
For calculating the power consumption of the registerfile, we have implemented registerfiles with different read and write ports in Verilog®, synthesized them in 90nm technology, placed and routed them with Cadence® SoC Encounter, and analyzed the power consumption of each of them under random data sets.
VI. EXPERIMENTAL RESULTS
In Figure 6, we analyze the effect of different input and
output constraints (i.e., CIin, CIout) on the achieved performance
improvement of custom instructions versus registerfiles with
different (RFin, RFout). As shown in this figure, when the custom instructions are generated with (CIin, CIout) = (3,2) and the registerfile constraint is (3,2), the performance improvement is higher than when (CIin, CIout) = (∞, ∞) and (RFin, RFout) = (3,2). Therefore, both (CIin, CIout) and (RFin, RFout) should be considered for achieving the best performance improvement. Another important observation from this figure is that the performance improvement almost saturates at (CIin, CIout) = (6,3) and (RFin, RFout) = (6,3), and the performance improvement at this point is only about 3% higher than at (CIin, CIout) = (3,2) and (RFin, RFout) = (3,2). Furthermore, the former achieves this improvement with a greater area overhead than the latter configuration. Therefore, the area and power consumption overhead of each configuration should be considered to reach the best performance per area or performance per power.
When the base processor is augmented with custom
instructions, the registerfile is faced with a wide range of
access patterns instead of simple accesses. There are two
approaches for implementing the registerfile when the number
of CI operands is greater than the number of registerfile read
ports. In the first approach, for each access to the registerfile,
all read ports of the registerfile are activated. In the other
approach, read form registerfile is controlled with an activation
signal and the number of activated read ports are equal to the
number of the operands of the CI; e.g. only 2 read ports of a (4,
1) registerfile is enabled when the CI has only 2 operands.
FIGURE 6. EFFECT OF IO CONSTRAINTS AND REGISTERFILE PORTS ON THE ACHIEVED PERFORMANCE IMPROVEMENT.
Figure 7 compares the power consumption of these two techniques when the read and write ports of the registerfile are 4 and 1, respectively. When we use the latter method, the power consumption of a (2, 1) access to a (4, 1) registerfile is reduced by 18% compared with the former method.
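As a back-of-the-envelope illustration of why the activation signal helps (with made-up per-port energy numbers, not the measured values behind Figure 7):

def read_energy(active_ports, total_ports, e_port=1.0, e_static=0.5, gate_unused=True):
    """Toy per-access read energy model: a fixed part plus a per-port part.
    With gating, only the ports actually used by the CI burn dynamic energy."""
    dynamic_ports = active_ports if gate_unused else total_ports
    return e_static + e_port * dynamic_ports

# A CI reading 2 operands from a (4, 1) registerfile:
print(read_energy(2, 4, gate_unused=False))  # 4.5 -> all four read ports toggle
print(read_energy(2, 4, gate_unused=True))   # 2.5 -> only the two needed ports toggle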
FIGURE 7. POWER CONSUMPTION OF TWO DIFFERENT REGISTERFILE READ METHODS ((4,1) WITH ACTIVATION VS. (4,1) WITHOUT ACTIVATION).
Figure 8 depicts the reduced energy dissipation of the MD5 application. The values are calculated by (Eq. 3). Based on these results, it can be inferred that the (4, 2) registerfile for (CIin, CIout) = (4, 2) and the (3, 2) registerfile for (CIin, CIout) = (3, 2) have the highest energy reduction compared to the other registerfile configurations for the CIs selected in the MD5 application.
Based on the performance improvement and energy reduction, we can claim that the (3, 2) and (4, 2) registerfiles are the best choices. But which of them is better? This question can be answered by the metric proposed in (Eq. 4). The PER of each configuration is calculated relative to the PER of the (2,1) registerfile. The following equation is used to evaluate the normalized PER:
NormalizedPER(RFin, RFout) = PER(RFin, RFout) / PER(2,1)

(EQ. 6) NORMALIZED PER RELATIVE TO PER(2,1).

FIGURE 8. REDUCED ENERGY DISSIPATION COMPARED WITH THE CASE WITHOUT CIS IN THE MD5 APPLICATION.
In Figure 9, the results show that the PER metric of the registerfile with 4 read and 2 write ports is better than that of the other configurations. However, if we also consider the area overhead in our selection (Figure 2), the hardware cost of the (3, 2) registerfile is about 20% lower than that of the (4, 2) registerfile. But, these days, the hardware cost is not as important as it used to be.
FIGURE 9. PER VALUE OF THE MD5 APPLICATION UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2, 1) REGISTERFILE.
Figure 10 shows the normalized PPA value, calculated by (Eq. 5), for different registerfile configurations relative to the (2, 1) registerfile. It can be deduced that the PPA of the (3, 1) and (3, 2) registerfiles is almost 45%-60% higher than that of the other registerfile configurations. If we exclude the hardware cost, it can be concluded that the (3, 2) and (4, 2) registerfiles are the most appropriate registerfile configurations for the MD5 application.
FIGURE 10. PPA VALUE OF THE MD5 APPLICATION UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2, 1) REGISTERFILE.
The performance improvements for the two other packet-processing applications (IPSec, IPv4-Trie) are shown in Figure 11. As shown, the obtainable performance improvements are up to 40% when the base processor is augmented with a set of CIs.
FIGURE 11. THE IPV4-TRIE AND IPSEC PERFORMANCE IMPROVEMENT FOR DIFFERENT CONFIGURATIONS OF REGISTERFILE AND IOS.
The energy reduction of these two applications for different numbers of read and write ports in the registerfile is illustrated in Figure 12. It shows that the energy reduction in IPSec and IPv4-Trie follows the same trend as in MD5. The (3, 2) and (4, 2) registerfiles reduce the energy dissipation more than the other configurations. The PER metric values are shown in Figure 13. It depicts that the highest PER values in IPSec are related to the (3, 2) and (4, 2) registerfiles and in IPv4-Trie to the (3, 2), (4, 2), and (4, 4) registerfiles.
FIGURE 12. ENERGY REDUCTION COMPARED WITH THE CASE WITHOUT CIS.

FIGURE 13. PER VALUE OF THE IPSEC AND IPV4-TRIE APPLICATIONS UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2, 1) REGISTERFILE.

Finally, the PPA metric values are shown in Figure 14. Due to the high area overhead of additional registerfile read and write ports compared to the improvements obtained with CIs, PPA values decrease as the number of read and write ports increases.

FIGURE 14. PPA VALUE OF THE IPSEC AND IPV4-TRIE APPLICATIONS UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2, 1) REGISTERFILE.

VII. CONCLUSION

In this paper, the effects of different numbers of registerfile read and write ports on the performance improvements and energy dissipation of extensible processors are analyzed. It is shown that the power dissipation of the registerfile accounts for almost 55% of the total power dissipation of the embedded processor datapath. Therefore, the power per access of the registerfile is one of the most important issues that designers should consider in custom instruction selection algorithms.
To back up our claim, a framework is proposed that consists of three basic parts: custom instruction selection, registerfile power estimation, and Performance × Energy Reduction (PER) calculation. This framework is used to explore the design space of registerfile I/Os considering both power per access and performance improvements. To evaluate the identified CIs, a new metric, named PER, is introduced that considers both performance improvement and energy dissipation. The set of CIs, identified for the specified number of registerfile I/Os, that gives the higher PER is the most valuable one in terms of performance and energy reduction.
It is also shown that the performance improvements, energy reduction, and PER values of the (3, 2) and (4, 2) registerfiles differ only slightly on the selected packet-processing benchmarks. We are currently working on extending our proposed framework to consider the effect of other parts of embedded processors, in terms of both performance and power dissipation, on custom instruction selection algorithms.
REFERENCES
[1] ARM, The architecture for the digital world, available online: www.arm.com.
[2] MIPS Technologies Inc., available online: www.mips.com.
[3] S. Park, A. Shrivastava, N. Dutt, A. Nicolau, Y. Paek, and E. Earlie, "Register file power reduction using bypass sensitive compiler," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, June 2008, pp. 1155-1159.
[4] R. Nalluri, R. Garg, and P. R. Panda, "Customization of register file banking architecture for low power," 20th International Conference on VLSI Design (VLSID'07), pp. 239-244, 2007.
[5] J. Scott et al., "Designing the low-power m*core architecture," IEEE Power Driven Microarchitecture Workshop, 1998.
[6] V. Zyuban and P. Kogge, "The energy complexity of register files," ISLPED, 1998.
[7] R. E. Gonzalez, "XTENSA: A configurable and extensible processor," IEEE Micro, vol. 20, 2000, pp. 60-70.
[8] J. Cong et al., "Instruction set extension with shadow registers for configurable processors," in Proc. FPGA, Feb. 2005, pp. 99-106.
[9] R. Jayaseelan et al., "Exploiting forwarding to improve data bandwidth of instruction-set extensions," in Proc. DAC, Jul. 2006, pp. 43-48.
[10] N. S. Kim and T. Mudge, "Reducing register ports using delayed write-back queues and operand pre-fetch," in Proceedings of the 17th Annual International Conference on Supercomputing, pp. 172-182, 2003.
[11] I. Park, M. D. Powell, and T. N. Vijaykumar, "Reducing register ports for higher speed and lower energy," in Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 171-182, 2002.
[12] K. Karuri, A. Chattopadhyay, M. Hohenauer, R. Leupers, G. Ascheid, and H. Meyr, "Increasing data-bandwidth to instruction-set extensions through register clustering," Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 166-177, 2007.
[13] F. Sun et al., "A scalable application-specific processor synthesis methodology," in Proc. ICCAD, San Jose, CA, Nov. 2003, pp. 283-290.
[14] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," in Proc. of the 40th Annual Design Automation Conference, Anaheim, CA, USA, Jun. 2003, pp. 256-261.
[15] M. Baleani et al., "HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform," in Proc. 10th Int. Workshop HW/SW Codesign, May 2002, pp. 151-156.
[16] C. Alippi et al., "A DAG based design approach for reconfigurable VLIW processors," in Proc. DATE, Munich, Germany, Mar. 1999, pp. 778-779.
[17] P. Biswas et al., "ISEGEN: Generation of high-quality instruction set extensions by iterative improvement," in Proc. DATE, 2005, pp. 1246-1251.
[18] P. Bonzini and L. Pozzi, "Polynomial-time subgraph enumeration for automated instruction set extension," in Proc. DATE, Apr. 2007, pp. 1331-1336.
[19] P. Yu and T. Mitra, "Satisfying real-time constraints with custom instructions," in Proc. CODES+ISSS, Jersey City, NJ, Sep. 2005, pp. 166-171.
[20] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk, "CHIPS: Custom hardware instruction processor synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 528-541, 2008.
[21] P. Bonzini, D. Harmanci, and L. Pozzi, "A study of energy saving in customizable processors," SAMOS 2007, LNCS 4599, pp. 304-312, 2007.
[22] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: a framework for architectural-level power analysis and optimizations," in Proc. of the 27th Annual International Symposium on Computer Architecture, pp. 83-94, 2000.
[23] V. Zyuban and P. Kogge, "The energy complexity of register files," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 305-310, 1998.
[24] L. Pozzi, K. Atasu, and P. Ienne, "Exact and approximate algorithms for the extension of embedded processor instruction sets," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, July 2006, pp. 1209-1229.
[25] P. M. Pardalos and N. Desai, "An algorithm for finding a maximum weighted independent set in an arbitrary graph," International Journal of Computer Mathematics, vol. 38, pp. 163-175, 1991.
[26] VEX Toolchain, available online: www.hpl.hp.com/downloads/vex.
[27] The GNU operating system, available online: www.gnu.org.