Energy-Aware Design Space Exploration of RegisterFile for Extensible Processors

Amir Yazdanbakhsh1, Mehdi Kamal2, Mostafa E. Salehi1, Hamid Noori1, and Sied Mehdi Fakhraie1
1 Silicon Intelligence and VLSI Systems Laboratory
2 Low-Power High-Performance Nanosystems Laboratory
School of Electrical and Computer Engineering, University of Tehran, Tehran 14395-515, Iran
{a.yazdanbakhsh, m.kamal}@ece.ut.ac.ir, mersali@ut.ac.ir, noori@cad.ut.ac.ir, fakhraie@ut.ac.ir

Abstract - This paper describes an energy-aware methodology that identifies custom instructions for critical code segments, given the available data bandwidth constraint between custom logic and a base processor. Our approach enables designers to optionally constrain the number of input and output operands for custom instructions to reach acceptable performance while considering the energy dissipation of the registerfile. We describe a design flow to identify promising area, performance, and power tradeoffs, and we study the effect of custom instruction I/O constraints and registerfile input/output (I/O) ports on the overall performance and the energy usage of the registerfile. Our experiments show that, in most cases, the solutions with the highest performance are not identified with relaxed I/O constraints. Results for packet-processing benchmarks covering cryptography and lookup applications are shown, with speed-ups between 25% and 40% and energy reductions between 20% and 30%.

Keywords: custom instruction, extensible processors, network processors, embedded processors, registerfile, performance, energy dissipation.

I. INTRODUCTION

Embedded systems are special-purpose systems that perform specific tasks with predefined performance and power requirements. Using a general-purpose processor for such systems often results in a design that does not meet the performance and power demands of the application. On the other hand, the ASIC design cycle is too costly and too time-consuming for the embedded application market. Recent developments in customized processors significantly improve the performance metrics of a general-purpose processor by coupling it with application-specific hardware, and they address the shortcomings of ASICs such as lack of flexibility and high design and manufacturing cost and time.

Maximizing performance while minimizing chip area and power consumption is usually the main goal of embedded processor design. Designers carefully analyze the characteristics of the target applications and fine-tune the implementation to achieve the best tradeoffs. The most popular strategy is to build a system consisting of a number of specialized application-specific functional units coupled with a low-cost, optimized general-purpose processor that serves as a base processor with a basic instruction set (e.g., ARM [1] or MIPS [2]). The base processor is augmented with custom-hardware units that implement application-specific instructions (custom instructions). There are a number of benefits in augmenting the core processor with new instructions. First, the system remains programmable and supports modest changes to the application, such as bug fixes or incremental modifications to a standard. Second, the computationally intensive portions of applications in the same domain are often similar in structure; thus, the customized processors can often be generalized to have applicability across a set of applications.
In recent years, customized extensible processors have offered the possibility of extending the instruction set for a specific application. A customized processor consists of a microprocessor core tightly coupled with functional units (FUs) that allow critical parts of the application to be implemented in hardware through a specialized instruction set. In the context of customized processors, hardware/software partitioning is done at the instruction level. Basic blocks of the application are transformed into data-flow graphs (DFGs), where the graph nodes represent operations similar to those in assembly languages and the edges represent data dependencies between the nodes. Instruction set extension exploits a set of custom instructions (CIs) to achieve considerable performance improvements by executing the hot-spots of the application in hardware.

Extension of the instruction set with new CIs can be divided into an instruction generation phase and an instruction selection phase. Given the application code, instruction generation consists of clustering basic operations into larger and more complex operations. These complex operations are entirely or partially identified by subgraphs that cover the application graph. Once the subgraphs are identified, they are treated as single complex operations and passed through a selection process. Generation and selection are performed with a guide function and a cost function, respectively, which take into account the constraints that the new instructions have to satisfy for hardware implementation.

Partitioning an application into base-processor instructions and CIs is done under certain constraints. First, there is limited area available in the custom logic. Second, the data bandwidth between the base processor and the custom logic is limited, and the data-transfer costs have to be explicitly evaluated. Third, only a limited number of input and output operands can be encoded in a fixed-length instruction word. Since the speed-up obtainable by custom instructions is limited by the available data bandwidth between the base processor and the custom logic, extending the core registerfile to support additional read and write ports improves the data bandwidth. However, additional ports increase the registerfile area, power consumption, and cycle time. Results in [3]-[6] show that a major share of the power in the processor datapath is consumed in the registerfile, so reducing the power consumption of the registerfile has a great impact on the overall power consumption and temperature of the processor. On the other hand, augmenting the processor with CIs and hence improving the performance requires an increase in the data bandwidth of the base processor's registerfile. Although extending the registerfile ports increases the power consumption of the system, we expect lower energy dissipation because the number of accesses to the registerfile decreases.

This paper presents a systematic approach for generating and selecting the most profitable CI candidates. Our investigations show that considering the architectural constraints in custom instruction selection leads to improvements in both the total performance and the energy dissipation of the registerfile. The remainder of this paper is organized as follows: in the following section, we discuss existing work and state the main contributions of this paper. Section III presents a motivational example, and Section IV describes the overall approach of the work.
Section V describes the experimental setup, and Section VI presents the experimental results for several packet-processing benchmarks. Finally, Section VII concludes the paper with some considerations on future work.

II. RELATED WORK

The bandwidth of the registerfile is a limiting factor in improving the performance of customized processors, and many techniques have been proposed to cope with this limitation. The Tensilica Xtensa [7] uses custom state registers to explicitly move additional input and output operands between the base processor and the custom units. The use of shadow registers [8] and the exploitation of the forwarding paths of the base processor [9] can also improve the data bandwidth. Kim and Mudge [10] developed two techniques for reducing the number of registerfile ports: an operand pre-fetch technique with a pre-fetch buffer and request queue, and a delayed write-back technique. Park et al. [11] proposed two techniques for reducing registerfile ports, namely a bypass hint technique to decrease the required read ports and register banking to decrease the required write ports. Karuri et al. [12] developed a clustered registerfile to overcome access-port limitations; the technique is adapted from VLIW architectures and uses more area than conventional registerfiles. All of these techniques deal with registerfile I/O constraints by changing the registerfile architecture. Other techniques instead address CI selection, which also affects the achievable processor performance.

Sun et al. [13] imposed no explicit constraints on the number of input and output operands for CIs and formulated CI selection as a mathematical optimization problem. Atasu et al. [14] introduced constraints on the number of input and output operands for subgraphs and showed that exploiting constraint-based techniques can significantly reduce the exponential search space. The work of Atasu et al. also showed that clustering-based approaches (e.g., [15]) or a single-output-operand restriction (e.g., [16]) can severely reduce the achievable speed-up of custom instructions. In [17], Biswas et al. proposed a heuristic based on input and output constraints; this approach does not evaluate all feasible subgraphs, so an optimal solution is not guaranteed. In [18], Bonzini and Pozzi derived a polynomial bound on the number of feasible subgraphs when the number of inputs and outputs of the subgraphs is fixed; however, the complexity grows exponentially as the I/O constraints are relaxed. Yu and Mitra [19] enumerate only connected subgraphs having up to four input and two output operands and do not allow overlapping between selected subgraphs. Atasu et al. [20] proposed a CI identification technique that generates a set of candidate templates using integer linear programming (ILP); the best CIs are then selected based on a Knapsack model. The authors also studied the effect of the registerfile ports on the performance of the selected CIs. Moreover, research on CI identification techniques has shown that increasing the number of input and output ports of the selected CIs improves performance [20].
Although there has been a large amount of work in the literature to improve the performance and automation of CI generation, little has been done to examine the power consumption of the generated custom instructions. Monitoring this power behavior may provide new directions in ASIP design. Bonzini et al. [21] augmented Wattch [22] with a model of the power consumption of the functional units (using a combination of RTL and gate-level power modeling); they showed that a well-designed CI set may reduce register and memory accesses and hence improve the overall power consumption of an embedded system. Zyuban and Kogge [23] thoroughly studied the energy consumption of registerfiles with different numbers of read and write ports, and also investigated the impact of technology and of different registerfile access techniques.

In this paper, we target customized architectures and explore the data bandwidth between the base processor and the custom logic considering the available registerfile ports. Our main contributions are as follows: 1) a novel methodology for extracting CIs, which optionally constrains the number of input and output operands of custom instructions and explicitly evaluates the data-transfer costs in terms of latency, area, and power consumption; 2) an evaluation of the impact of power, area, and registerfile port constraints on the execution cycle count of packet-processing benchmarks.

III. MOTIVATIONAL EXAMPLE

In this paper, we use I/O constraints to control the granularity of the CIs and manage the power consumption of the registerfile. When the I/O constraints are tight, we are more likely to identify fine-grain CIs (i.e., CIs that include a small number of operation nodes). Relaxing the constraints results in coarse-grain custom instructions that are likely to provide higher speed-up, although at the expense of increased area and power consumption.

The MIPS processor datapath is synthesized with Synopsys® Design Compiler, and the power of each pipeline stage is evaluated with Synopsys® Power Compiler in 90nm technology. The registerfile is treated as a separate hardware (HW) block; it is accessed in the decode stage to read the instruction operands and in the write-back stage to write the results. As Figure 1 shows, nearly 55% of the processor datapath power is consumed by the registerfile. These results motivate us to consider registerfile power consumption in the custom instruction selection algorithm.

FIGURE 1. POWER CONSUMPTION OF THE HW BLOCKS IN THE MIPS PROCESSOR.

We have modeled and synthesized registerfiles with different numbers of input and output ports with Synopsys® Design Compiler in 90nm technology, and compared their area overhead against the MIPS registerfile. We introduce the (RFin, RFout) notation, meaning (number of read ports, number of write ports), to distinguish different registerfiles. Based on this definition, the MIPS registerfile, which has two read ports and one write port, is a (2,1) registerfile. The area overhead of increasing the read and write ports of the registerfile relative to the MIPS registerfile (i.e., (2,1)) is shown in Figure 2. As shown, each additional read port increases the registerfile area by approximately 20%, and each additional write port by nearly 10%.

FIGURE 2. AREA OVERHEAD OF INCREASING THE INPUT AND OUTPUT PORTS OF DIFFERENT REGISTERFILES VERSUS THE (2,1) REGISTERFILE.

Besides the area overhead, power consumption is another main concern in designing application-specific processors. Based on Figure 1, the registerfile has the highest power dissipation among the HW blocks of the MIPS datapath. Moreover, just as the number of I/O ports affects the area overhead, it also affects the power consumption of the registerfile. Since custom instructions in extensible processors have different numbers of read and write operands, the power consumption of the registerfile depends both on the I/O ports of the registerfile and on the I/O ports of the CIs, so the power consumed by the registerfile differs across CI access patterns. Hence, in the following paragraphs, we show the power per access (PPAC) of a registerfile under different numbers of I/O ports. Figure 3 depicts the power consumption per access of registerfiles with different (RFin, RFout) for custom instructions with different numbers of input ports (CIin) and output ports (CIout).
The results are normalized to the power consumption of the base registerfile model, i.e., (2,1); the power of the base model is the power per access of a CI with two reads and one write to the (2,1) registerfile. The value of each point is therefore calculated by the following equation:

\[ \mathit{NPPAC} = \frac{P_{(CI_{in},CI_{out}),(RF_{in},RF_{out})} - P_{(2,1),(2,1)}}{P_{(2,1),(2,1)}} \]  (EQ. 1) NORMALIZED POWER PER ACCESS.

where P_{(CIin,CIout),(RFin,RFout)} is the PPAC of a CI with CIin input and CIout output operands accessing a registerfile with RFin read and RFout write ports.

FIGURE 3. EFFECT OF I/O CONSTRAINTS AND REGISTERFILE PORTS ON THE POWER PER ACCESS VERSUS (RFIN, RFOUT) = (2,1) AND (CIIN, CIOUT) = (2,1).

The results show that a CI with (CIin, CIout) = (2,2) has the minimum power per access. They also show that increasing RFout leads to larger improvements in power consumption, while the power per access increases when RFout is smaller than CIout. Furthermore, in most cases the power consumption improves when CIin (CIout) is equal to RFin (RFout). Moreover, when the registerfile read and write ports are (3,2), (4,2), or (4,4), the average access power is smaller than in the other configurations.

The average of the normalized PPAC values is shown in Figure 4. As shown, different (RFin, RFout) values lead to different average PPAC values. Another observation is the strong effect of RFout on the PPAC: for the same RFout, increasing RFin improves the average PPAC, and RFout = 1 leads to the worst PPAC values.

FIGURE 4. THE AVERAGE OF THE NORMALIZED VALUES OF THE POWER PER ACCESS.

Since the power consumption and performance of a running application depend on the registerfile and custom instruction I/O constraints, in this paper we propose a framework that explores the design space to find the optimum configuration among the competing high-performance and low-power configurations.
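To make the normalization of (EQ. 1) concrete, the short sketch below shows how a normalized PPAC value could be derived from a per-access power library; this is a minimal illustration only, and the dictionary name and all numeric values are hypothetical placeholders rather than the paper's measured data.

```python
# Minimal sketch of the (EQ. 1) normalization, assuming a measured power-per-access
# library indexed by ((CI_in, CI_out), (RF_in, RF_out)). All values are placeholders.
measured_ppac = {
    ((2, 1), (2, 1)): 1.00,   # base model: a (2,1) CI accessing the (2,1) registerfile
    ((2, 2), (3, 2)): 0.85,   # hypothetical example entries
    ((4, 2), (4, 2)): 1.10,
}

BASE = ((2, 1), (2, 1))

def normalized_ppac(ci_ports, rf_ports):
    """Relative change in power per access versus the (2,1)/(2,1) base model (EQ. 1)."""
    p = measured_ppac[(ci_ports, rf_ports)]
    p_base = measured_ppac[BASE]
    return (p - p_base) / p_base

# Example: a (2,2) CI on a (3,2) registerfile; a negative value means lower PPAC.
print(normalized_ppac((2, 2), (3, 2)))
```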
IV. OVERALL APPROACH

Figure 5 depicts the complete flow of the proposed framework for obtaining the optimum set of custom instructions. CI generation and selection evaluates performance while considering the energy per access of the registerfile for different numbers of read and write ports. The proposed framework is composed of three sub-frameworks: 1. the custom instruction generation and selection framework, 2. the registerfile power estimation framework, and 3. the performance per energy calculation framework.

FIGURE 5. THE PROPOSED FRAMEWORK.

The custom instruction generation and selection framework contains two main parts: i) match enumeration and ii) template generation. The match enumeration algorithm is based on a binary search tree [24] and explores all possible custom instructions subject to the following constraints for each custom instruction: the number of input/output operands of the match must be less than or equal to the allowed number of input/output ports; the custom instruction must not include any memory operations such as Load or Store instructions; and the identified custom instruction must be convex, meaning that no path exists between two selected nodes via a node that is not in the set of selected nodes.

After this step, all structurally equivalent valid custom instructions are categorized into isomorphic classes, called templates. Each template is assigned a value that reflects its performance gain given the number of read and write ports of the registerfile. The following formula shows the performance estimation for each template:

\[ \mathit{Performance} = \mathit{Itr} \times \bigl(\mathit{SWLatency} - \mathit{HWLatency} - \mathit{RCycles}(CI_{in}, RF_{in}) - \mathit{WCycles}(CI_{out}, RF_{out})\bigr) \]  (EQ. 2) PERFORMANCE OF EACH TEMPLATE.

Itr: iteration count (execution frequency) of the basic block. SWLatency: software latency of the template, i.e., the number of cycles the processor needs to execute the selected instructions in pure software, without any CIs. HWLatency: hardware latency of the template, i.e., the number of cycles the processor, augmented with CIs, needs to execute the selected custom instruction. CIin and CIout: numbers of input and output operands of the custom instruction. RFin and RFout: numbers of read and write ports of the registerfile. The terms RCycles(CIin, RFin) and WCycles(CIout, RFout) denote the numbers of cycles lost to the pipelining overhead that results from the limited number of read and write ports of the registerfile.

After evaluating the performance of each template, the objective is to find the maximum weighted independent set (MWIS) of these templates. We exploit the MWIS algorithm proposed in [25].
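As an illustration of (EQ. 2) and the subsequent selection step, the sketch below estimates the gain of a template under given CI and registerfile port counts and then picks non-overlapping templates greedily. The read/write overhead model (extra transfer cycles whenever CI operands exceed the available ports) is only one plausible choice, and the greedy pass is merely a stand-in for the exact MWIS algorithm of [25]; all names and numbers are illustrative assumptions, not the framework's actual code.

```python
import math

def extra_cycles(operands, ports):
    """One plausible model of the cycles lost when a CI needs more operands than the
    registerfile ports can serve in a single cycle (the RCycles/WCycles terms of EQ. 2)."""
    if operands <= ports:
        return 0
    return math.ceil(operands / ports) - 1

def template_gain(itr, sw_latency, hw_latency, ci_in, ci_out, rf_in, rf_out):
    """Per-template gain in the spirit of (EQ. 2)."""
    return itr * (sw_latency - hw_latency
                  - extra_cycles(ci_in, rf_in)
                  - extra_cycles(ci_out, rf_out))

def select_templates(templates, rf_in, rf_out):
    """Greedy stand-in for the MWIS-based selection of [25]: keep the highest-gain
    templates whose DFG nodes do not overlap with already selected ones."""
    chosen, used_nodes = [], set()
    ranked = sorted(templates, key=lambda t: template_gain(
        t["itr"], t["sw"], t["hw"], t["ci_in"], t["ci_out"], rf_in, rf_out), reverse=True)
    for t in ranked:
        if used_nodes.isdisjoint(t["nodes"]):
            chosen.append(t)
            used_nodes |= set(t["nodes"])
    return chosen

# Hypothetical templates: node sets, basic-block iteration counts, and latencies.
templates = [
    {"nodes": {1, 2, 3}, "itr": 1000, "sw": 4, "hw": 1, "ci_in": 3, "ci_out": 1},
    {"nodes": {3, 4},    "itr": 1000, "sw": 2, "hw": 1, "ci_in": 2, "ci_out": 1},
]
print(select_templates(templates, rf_in=2, rf_out=1))  # keeps the higher-gain template
```

In the real flow, each template's node set, iteration count, and latencies come from the profiled DFGs rather than hand-written dictionaries.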
The registerfile power estimation framework calculates the power per access of the registerfile for varying numbers of read and write ports. The flow of this framework begins with the implementation of registerfiles with various configurations of read and write ports. These registerfiles are synthesized with Synopsys® Design Compiler and the TSMC 90nm library to obtain gate-level netlists. The synthesized netlist of each configuration is simulated with ModelSim® to capture the switching activities and generate value change dump (VCD) files. Meanwhile, layouts are generated for the different registerfile configurations with Cadence® SoC Encounter and the TSMC 90nm library. After layout generation, the VCD files are imported into SoC Encounter® to calculate accurate power figures for each configuration. The configurations cover both the number of read and write ports and the number of simultaneous read and write accesses to the registerfile.

The energy dissipation of the registerfile depends directly on the application. In other words, since different applications have different CIs and different numbers of registerfile accesses, we study the energy dissipation of the registerfile per application using the following formula:

\[ \mathit{ER} = 1 - \frac{\sum_{(CI_{in},CI_{out}) \in S} \mathit{Pwr}_{(CI_{in},CI_{out}),(RF_{in},RF_{out})} \times \mathit{Acs}_{(CI_{in},CI_{out})}}{\mathit{Pwr}_{(2,1),(2,1)} \times \mathit{Acs}_{(2,1)}} \]  (EQ. 3) APPLICATION ENERGY REDUCTION.

where ER is the reduced energy dissipation and shows how much the energy is reduced when the CIs are added to the processor; S is the set of distinct I/O port counts of the CIs; Acs_(CIin,CIout) is the number of registerfile accesses made by CIs with (CIin, CIout) input and output operands; and Pwr_(2,1),(2,1) × Acs_(2,1) is the Power×Access value of the registerfile when there are no CIs.

We also introduce a new metric, named PER, to evaluate the identified CIs considering both energy and performance:

\[ \mathit{PER} = \mathit{Performance\ Improvement} \times \mathit{Energy\ Reduction} \]  (EQ. 4) PER METRIC (CONSIDERING BOTH ENERGY REDUCTION AND PERFORMANCE IMPROVEMENT).

This metric means that the sets of CIs that improve performance while also reducing the registerfile energy are the most profitable to select. Finally, performance per area overhead is another metric that can be considered in CI selection; it is calculated by the following equation:

\[ \mathit{PPA} = \frac{\mathit{Performance\ Improvement}}{\mathit{Area\ Overhead}} \]  (EQ. 5) PPA METRIC (CONSIDERING BOTH AREA OVERHEAD AND PERFORMANCE IMPROVEMENT).

V. EXPERIMENTAL SETUP

We define a single-issue baseline processor based on the MIPS architecture, with 32 32-bit general-purpose registers and a 5-stage pipeline. We do not constrain the number of input and output operands during custom instruction generation. However, we explicitly account for the data-transfer cycles between the base processor and the custom logic whenever the number of inputs or outputs exceeds the available registerfile ports. CHIPS [20] assumes two-cycle software latencies for integer multiplication instructions and single-cycle software latencies for the remaining integer operations. Since the latencies of arithmetic and logical operations differ considerably, assuming a single-cycle latency for all logical and arithmetic instructions may lead to non-optimal results. We therefore assume more accurate and fair latency values for the different operations. In software, we assume a single-cycle latency for each operation. In hardware, we evaluate the latency of the arithmetic and logic operators by synthesizing them in a 90-nm CMOS process and normalizing their delays to the delay of the MIPS processor. Table 1 shows the normalized delay and area of some VEX operators [26].

TABLE 1. NORMALIZED VALUES OF DELAY AND AREA OF ARITHMETIC AND LOGIC OPERATORS.

VEX Operation  | Normalized Delay | Normalized Area
Add            | 77%              | 0.3%
And            | 1%               | 0.03%
Eqt            | 6%               | 0.1%
imul16         | 77%              | 1.6%
imul32         | 144%             | 6.3%
wallace-mul16  | 97%              | 1.4%
wallace-mul32  | 199%             | 5.8%
Shadd          | 76%              | 0.4%
Shift          | 8%               | 0.3%
Slct           | 1%               | 0.1%
Xor            | 1%               | 0.1%
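The normalized delays in Table 1 feed the hardware-latency estimate described in the next paragraph (the latency of a CI is the ceiling of the accumulated normalized delays along its critical path). A minimal sketch of that calculation follows; the delay table is copied from Table 1, but the operator chain is a hypothetical example rather than an actual benchmark CI.

```python
import math

# Normalized operator delays from Table 1 (fractions of the MIPS processor delay).
NORM_DELAY = {"add": 0.77, "and": 0.01, "eqt": 0.06, "shift": 0.08,
              "slct": 0.01, "xor": 0.01, "imul16": 0.77, "imul32": 1.44}

def hw_latency(critical_path_ops):
    """Hardware latency (in cycles) of a CI: ceiling of the summed normalized
    delays of the operators along its critical path."""
    return math.ceil(sum(NORM_DELAY[op] for op in critical_path_ops))

# Hypothetical CI whose critical path is add -> xor -> shift -> add: fits in 2 cycles.
print(hw_latency(["add", "xor", "shift", "add"]))
```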
The accumulated software latency values of a custom instruction candidate subgraph estimate its execution time on a single-issue processor. The latency of a custom instruction executed as a single instruction is approximated by a number of cycles equal to the ceiling of the sum of the hardware latencies along the custom instruction's critical path. The difference between the software and the hardware latency is used to estimate the speedup. Input and output violations are taken into account through penalties in the fitness function. We do not include division operations in custom instructions because of their high latency and large area overhead. We have also excluded memory access instructions from custom instructions to avoid nondeterministic latencies due to the cache memory system. We assume that, given RFin read ports and RFout write ports in the registerfile, RFin input operands and RFout output operands can be encoded within a single instruction word.

For calculating the power consumption of the registerfile, we implemented registerfiles with different numbers of read and write ports in Verilog®, synthesized them in 90nm technology, placed and routed them with Cadence® SoC Encounter, and analyzed the power consumption of each configuration under random data sets.

VEX is a toolchain whose components simulate, analyze, and compile C programs for VLIW processor architectures. VEX can also extract DFGs and CFGs from C programs, and we use this capability to extract CFGs from the domain-specific benchmarks. The extracted CFGs are converted into an intermediate format understood by our custom instruction selection framework. In addition, with the help of gcov and gprof [27] in conjunction with GCC, the code coverage and the iteration count of each basic block are measured during dynamic execution of the domain-specific benchmarks. These numbers and the intermediate format are processed by our custom instruction selection framework to find a set of CIs that increases the performance of the domain-specific benchmarks.

VI. EXPERIMENTAL RESULTS

In Figure 6, we analyze the effect of different input and output constraints (i.e., CIin, CIout) on the performance improvement achieved by custom instructions for registerfiles with different (RFin, RFout). As the figure shows, when the custom instructions are generated with (CIin, CIout) = (3,2) and the registerfile constraint is (3,2), the performance improvement is higher than when (CIin, CIout) = (∞, ∞) and (RFin, RFout) = (3,2). Therefore, both (CIin, CIout) and (RFin, RFout) should be considered to achieve the best performance improvement. Another important observation from this figure is that the performance improvement nearly saturates at (CIin, CIout) = (6,3) and (RFin, RFout) = (6,3), and the improvement at this point is only about 3% higher than at (CIin, CIout) = (3,2) and (RFin, RFout) = (3,2). Furthermore, the former achieves this improvement with a larger area overhead than the latter configuration. Therefore, the area and power consumption overhead of each configuration should be considered to reach the best performance per area or performance per power.

FIGURE 6. EFFECT OF I/O CONSTRAINTS AND REGISTERFILE PORTS ON THE ACHIEVED PERFORMANCE IMPROVEMENT.

When the base processor is augmented with custom instructions, the registerfile faces a wide range of access patterns instead of simple accesses. There are two approaches for implementing the registerfile when the number of CI read operands is smaller than the number of registerfile read ports.
In the first approach, all read ports of the registerfile are activated for every access. In the second approach, reads from the registerfile are controlled by an activation signal, and the number of activated read ports equals the number of CI operands; e.g., only 2 read ports of a (4,1) registerfile are enabled when the CI has only 2 read operands. Figure 7 compares the power consumption of these two techniques for a registerfile with 4 read ports and 1 write port. With the latter method, the power consumption of a (2,1) access to a (4,1) registerfile is reduced by 18% compared with the former method.

FIGURE 7. POWER CONSUMPTION OF THE TWO DIFFERENT REGISTERFILE READ METHODS.

Figure 8 depicts the reduced energy dissipation of the MD5 application, calculated by (Eq. 3). Based on these results, the (4,2) registerfile for (CIin, CIout) = (4,2) and the (3,2) registerfile for (CIin, CIout) = (3,2) achieve the highest energy reduction among the registerfile configurations for the CIs selected for MD5.

FIGURE 8. REDUCED ENERGY DISSIPATION COMPARED WITH THE CASE WITHOUT CIS IN THE MD5 APPLICATION.

Based on the performance improvement and energy reduction, we can claim that the (3,2) and (4,2) registerfiles are the best choices. But which of them is better? This question can be answered by the metric proposed in (Eq. 4). The PER of each configuration is calculated relative to the PER of the (2,1) registerfile, using the following equation to evaluate the normalized PER:

\[ \mathit{NPER}_{(RF_{in},RF_{out})} = \frac{\mathit{PER}_{(RF_{in},RF_{out})}}{\mathit{PER}_{(2,1)}} \]  (EQ. 6) NORMALIZED PER RELATIVE TO PER(2,1).

In Figure 9, the results show that the PER of the registerfile with 4 read and 2 write ports is better than that of the other configurations. However, if we also consider the area overhead in our selection (Figure 2), the hardware cost of the (3,2) registerfile is about 20% lower than that of the (4,2) registerfile, although nowadays hardware cost is not as critical as it used to be.

FIGURE 9. PER VALUE OF THE MD5 APPLICATION UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2,1) REGISTERFILE.

Figure 10 shows the normalized PPA value, calculated by (Eq. 5), for different registerfile configurations relative to the (2,1) registerfile. It can be deduced that the PPA of the (3,1) and (3,2) registerfiles is almost 45%-60% higher than that of the other registerfile configurations. If we exclude the hardware cost, it can be concluded that the (3,2) and (4,2) registerfiles are the most appropriate registerfile configurations for the MD5 application.
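For clarity, the per-configuration values reported in Figures 8 through 10 combine the metrics of (EQ. 3)-(EQ. 6). A minimal sketch of that bookkeeping is given below; the function and argument names, as well as the example numbers, are illustrative assumptions rather than the framework's actual code or measured results.

```python
def energy_reduction(accesses_by_ci, ppac, rf_ports, base_power_times_access):
    """(EQ. 3): 1 - (sum of Power x Access over all CI port patterns) / (Power x Access without CIs)."""
    with_cis = sum(ppac[(ci_ports, rf_ports)] * n for ci_ports, n in accesses_by_ci.items())
    return 1.0 - with_cis / base_power_times_access

def per(performance_improvement, energy_red):
    """(EQ. 4): performance improvement x energy reduction."""
    return performance_improvement * energy_red

def normalized_per(per_config, per_base_21):
    """(EQ. 6): PER of a configuration relative to the (2,1) registerfile."""
    return per_config / per_base_21

def ppa(performance_improvement, area_overhead):
    """(EQ. 5): performance improvement per unit of area overhead."""
    return performance_improvement / area_overhead

# Hypothetical example: 30% speed-up, 25% registerfile energy reduction, 30% extra area.
print(per(0.30, 0.25), ppa(0.30, 0.30))
```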
FIGURE 10. PPA VALUE OF THE MD5 APPLICATION UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2,1) REGISTERFILE.

The performance improvements for two other packet-processing applications (IPSec and IPv4-Trie) are shown in Figure 11. As shown, the obtainable performance improvements reach up to 40% when the base processor is augmented with a set of CIs.

FIGURE 11. THE IPV4-TRIE AND IPSEC PERFORMANCE IMPROVEMENT FOR DIFFERENT CONFIGURATIONS OF REGISTERFILE AND CI I/OS.

The energy reductions of these two applications for different numbers of registerfile read and write ports are illustrated in Figure 12. The energy reduction in IPSec and IPv4-Trie follows the same trend as in MD5: the (3,2) and (4,2) registerfiles reduce the energy dissipation more than the other configurations.

FIGURE 12. ENERGY REDUCTION COMPARED WITH THE CASE WITHOUT CIS.

The PER metric values are shown in Figure 13. The highest PER values in IPSec are obtained with the (3,2) and (4,2) registerfiles, and in IPv4-Trie with the (3,2), (4,2), and (4,4) registerfiles.

FIGURE 13. PER VALUE OF THE IPSEC AND IPV4-TRIE APPLICATIONS UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2,1) REGISTERFILE.

Finally, the PPA metric values are shown in Figure 14. Because the area overhead of the registerfile grows faster than the improvement the CIs gain from additional registerfile read and write ports, the PPA values decrease as the number of read and write ports increases.

FIGURE 14. PPA VALUE OF THE IPSEC AND IPV4-TRIE APPLICATIONS UNDER DIFFERENT REGISTERFILE PORT NUMBERS RELATIVE TO THE (2,1) REGISTERFILE.

VII. CONCLUSION

In this paper, the effects of different numbers of registerfile read and write ports on the performance improvement and energy dissipation of extensible processors are analyzed. It is shown that the registerfile accounts for almost 55% of the total power dissipation in the embedded processor datapath. Therefore, the power per access of the registerfile is one of the most important issues that designers should consider in custom instruction selection algorithms.

To support this claim, a framework is proposed that consists of three basic parts: a custom instruction selection framework, a registerfile power estimation framework, and a Performance × Energy Reduction (PER) calculation framework. This framework is used to explore the design space of registerfile I/Os considering both power per access and performance improvement. To evaluate the identified CIs, a new metric, named PER, is introduced that considers both performance improvement and energy dissipation. The sets of CIs identified for a given number of registerfile I/Os that give the higher PER are the most valuable ones in terms of performance and energy reduction. It is also shown that the performance improvement, energy reduction, and PER values of the (3,2) and (4,2) registerfiles differ only slightly for the selected packet-processing benchmarks. We are currently extending the proposed framework to consider the effect of other parts of the embedded processor, in terms of both performance and power dissipation, on custom instruction selection algorithms.

REFERENCES

[1] ARM, The architecture for the digital world, available online: www.arm.com.
[2] MIPS Technologies Inc., available online: www.mips.com.
[3] S. Park, A. Shrivastava, N. Dutt, A. Nicolau, Y. Paek, and E. Earlie, "Register file power reduction using bypass sensitive compiler," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 6, pp. 1155-1159, June 2008.
[4] R. Nalluri, R. Garg, and P. R. Panda, "Customization of register file banking architecture for low power," in Proc. 20th International Conference on VLSI Design (VLSID'07), pp. 239-244, 2007.
[5] J. Scott et al., "Designing the low-power M*CORE architecture," IEEE Power Driven Microarchitecture Workshop, 1998.
[6] V. Zyuban and P. Kogge, "The energy complexity of register files," ISLPED, 1998.
[7] R. E. Gonzalez, "Xtensa: A configurable and extensible processor," IEEE Micro, vol. 20, pp. 60-70, 2000.
[8] J. Cong et al., "Instruction set extension with shadow registers for configurable processors," in Proc. FPGA, Feb. 2005, pp. 99-106.
[9] R. Jayaseelan et al., "Exploiting forwarding to improve data bandwidth of instruction-set extensions," in Proc. DAC, Jul. 2006, pp. 43-48.
[10] N. S. Kim and T. Mudge, "Reducing register ports using delayed write-back queues and operand pre-fetch," in Proc. 17th Annual International Conference on Supercomputing, pp. 172-182, 2003.
[11] I. Park, M. D. Powell, and T. N. Vijaykumar, "Reducing register ports for higher speed and lower energy," in Proc. 35th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 171-182, 2002.
[12] K. Karuri, A. Chattopadhyay, M. Hohenauer, R. Leupers, G. Ascheid, and H. Meyr, "Increasing data-bandwidth to instruction-set extensions through register clustering," in Proc. IEEE/ACM International Conference on Computer-Aided Design, pp. 166-177, 2007.
[13] F. Sun et al., "A scalable application-specific processor synthesis methodology," in Proc. ICCAD, San Jose, CA, Nov. 2003, pp. 283-290.
[14] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," in Proc. 40th Annual Design Automation Conference, Anaheim, CA, USA, Jun. 2003, pp. 256-261.
[15] M. Baleani et al., "HW/SW partitioning and code generation of embedded control applications on a reconfigurable architecture platform," in Proc. 10th Int. Workshop on HW/SW Codesign, May 2002, pp. 151-156.
[16] C. Alippi et al., "A DAG based design approach for reconfigurable VLIW processors," in Proc. DATE, Munich, Germany, Mar. 1999, pp. 778-779.
[17] P. Biswas et al., "ISEGEN: Generation of high-quality instruction set extensions by iterative improvement," in Proc. DATE, 2005, pp. 1246-1251.
[18] P. Bonzini and L. Pozzi, "Polynomial-time subgraph enumeration for automated instruction set extension," in Proc. DATE, Apr. 2007, pp. 1331-1336.
[19] P. Yu and T. Mitra, "Satisfying real-time constraints with custom instructions," in Proc. CODES+ISSS, Jersey City, NJ, Sep. 2005, pp. 166-171.
[20] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk, "CHIPS: Custom hardware instruction processor synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 528-541, 2008.
[21] P. Bonzini, D. Harmanci, and L. Pozzi, "A study of energy saving in customizable processors," SAMOS 2007, LNCS 4599, pp. 304-312, 2007.
[22] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," in Proc. 27th Annual International Symposium on Computer Architecture, pp. 83-94, 2000.
[23] V. Zyuban and P. Kogge, "The energy complexity of register files," in Proc. International Symposium on Low Power Electronics and Design, pp. 305-310, 1998.
[24] L. Pozzi, K. Atasu, and P. Ienne, "Exact and approximate algorithms for the extension of embedded processor instruction sets," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1209-1229, July 2006.
[25] P. M. Pardalos and N. Desai, "An algorithm for finding a maximum weighted independent set in an arbitrary graph," International Journal of Computer Mathematics, vol. 38, pp. 163-175, 1991.
[26] VEX Toolchain, available online: www.hpl.hp.com/downloads/vex.
[27] The GNU operating system, available online: www.gnu.org.