J Supercomput (2014) 68:948–977 DOI 10.1007/s11227-013-1075-8

Customized pipeline and instruction set architecture for embedded processing engines

Amir Yazdanbakhsh · Mostafa E. Salehi · Sied Mehdi Fakhraie

Published online: 6 February 2014
© Springer Science+Business Media New York 2014

Abstract Custom instructions can potentially improve the execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous registerfile accesses, and larger registerfiles are more power hungry and require complex forwarding interconnects. Therefore, given the limited ports of the base-processor registerfile, the size and efficiency of custom instructions are generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available registerfile data bandwidth through register-access pipelining. The improvements are achieved by introducing double-word custom instructions whose registerfile accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

A. Yazdanbakhsh · M. E. Salehi (B) · S. M. Fakhraie
Nano Electronics Center of Excellence, University of Tehran, 14395-515 Tehran, Islamic Republic of Iran
e-mail: mostafa.salehi@gmail.com; mersali@ut.ac.ir
A. Yazdanbakhsh e-mail: a.yazdanbakhsh@ece.ut.ac.ir
S. M. Fakhraie e-mail: fakhraie@ut.ac.ir

Keywords Embedded packet-processing engine · Customized application-specific processor · Custom instruction generation · Area performance tradeoffs · Custom instruction data bandwidth

1 Introduction

Higher bandwidth and performance demands, along with the development of new networking services, call for flexible and optimized packet-processing engines. While Moore's Law continues to promise more transistors, some critical problems in processor design, including design complexity and power and thermal concerns, have forced the industry to shift its focus away from complex, power-hungry, high-performance single processors toward multi-core designs [1–4] that provide multiple processing cores on a single chip [5–8]. Azizi et al. [9] show the optimization results under performance and performance-per-area objectives for different processor architectures, including simple single-issue in-order processors and complex high-performance out-of-order architectures. In out-of-order processors, wide issue widths yield high performance by exploiting larger and wider structures in the execution core to process more instructions in parallel. In particular, structures such as the instruction scheduler, registerfile, and bypass network do not scale well and make such designs more complicated. As described in [9], the area and power overheads of implementing complex out-of-order processors can outweigh their performance benefits. Therefore, single- and dual-issue in-order designs are shown to be superior in terms of performance per area.
To boost performance, instead of designing ever more complex high-performance single processors to improve instruction-level parallelism within a single thread, architects can design simple processing elements, compensate for their lower individual performance by replicating them across the chip, and improve overall efficiency by focusing on multi-task parallelism [2,10,11]. Recent multi-core systems target the highest throughput within a specified power and area budget. They can use either more, simpler cores to exploit additional parallelism or fewer, more powerful ones, trading between high-performance and low-power architectures. Amdahl's law in the multi-core era [2,12] suggests that only one core should get more resources, to rapidly execute the sequential part of a code, and the others should be as small as possible to run the scalable parallel parts. This is because replicating the processing cores yields linear speed-up for the parallel parts but sub-linear speed-up for the sequential parts [13]. Eyerman and Eeckhout [14] model critical sections in Amdahl's law [12] and show their implications for multi-core designs. Amdahl's law suggests many tiny cores for optimum performance in asymmetric processors; however, Eyerman and Eeckhout [14] find that fewer but larger cores can yield substantially better performance, supposing that critical sections are executed on a large high-performance core. They model the attainable speed-up as a function of the sizes of the large and tiny cores. The square-root relationship between resources and performance (known as Pollack's law [13]) is also exploited for calculating multi-core performance. Their results show that for low contention rates among threads wanting to enter the same critical section, more tiny cores yield the best speed-up, whereas for higher contention rates, fewer larger cores yield the best performance.
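The tradeoff above can be illustrated with a simple Amdahl-plus-Pollack model. This is a minimal sketch, not the exact critical-section model of Eyerman and Eeckhout [14]: it assumes a symmetric multi-core with a fixed area budget and per-core performance following Pollack's square-root law; the names `f`, `n`, and `r` are our own notation.

```python
import math

# Minimal sketch (not the exact model of [14]): speed-up of a symmetric
# multi-core under Amdahl's law, with per-core performance following
# Pollack's law (performance ~ sqrt of per-core resources).
# f: parallel fraction, n: core count, r: area (base-core-equivalents) per core.

def speedup(f, n, r):
    perf = math.sqrt(r)        # Pollack's law
    seq = (1 - f) / perf       # sequential part runs on one core
    par = f / (perf * n)       # parallel part is spread over n cores
    return 1.0 / (seq + par)

# Fixed area budget of 16 base-core-equivalents, 90 % parallel workload:
print(speedup(0.9, n=16, r=1))   # 16 tiny cores -> 6.4
print(speedup(0.9, n=4, r=4))    # 4 larger cores -> about 6.15
```

Under this simplified model, many tiny cores win for a highly parallel workload; the contention-aware model of [14] shifts the balance toward fewer, larger cores as critical-section contention grows.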
According to this discussion, the architecture of the composing cores of a multi-core system strictly depends on the parallelism level of the application workload. For highly parallel workloads (e.g., packet-processing tasks for different packet flows), the throughput is highly dependent on the number of cores that can fit on a die; therefore, both the performance and the area of the cores must be considered in evaluating the overall system performance. In these cases, the attainable performance of a multi-core system is the product of the performance per core and the number of cores that fit in a given area budget. Thus, in a multi-core system the relevant metric is no longer processor performance alone, but performance per square millimeter of each processor. Workloads with lower amounts of parallelism, which are not the focus of this paper, additionally require a high-performance complex core [12]. Therefore, the focus of this research is on increasing efficiency by customizing a single embedded packet-processing core, with the objective of optimizing performance per area of each core in a multi-core environment for processing parallel network workloads. Current embedded systems are expected to deliver high performance under tight area and power constraints. In today's embedded systems, the conflicting requirements of performance, area, and flexibility make so-called customized general processing cores a proper tradeoff for system designers. Designers carefully explore the target application and adjust the architecture to achieve the best tradeoffs. General-purpose processors alone often cannot adapt to the strict area, performance, and power-consumption requirements of embedded applications.
Hence, a common approach in the design of embedded systems is to implement the control-intensive tasks on a general-purpose processor and the computation-intensive tasks on custom hardware. In this context, general-purpose processors augmented with custom instructions can provide a good balance between performance and flexibility. Augmenting a core processor with new instructions has several benefits. First, the system remains programmable and supports modest changes to the application. Second, the computationally intensive portions of applications in the same domain often have similar structures; thus, customized processors can often be applicable across a set of applications. A customized processor is composed of a general-purpose base processor augmented with application-specific instruction set extensions (ISE) that boost performance by accommodating custom instructions (CI); such extensions have become a vital component in current processor customization flows. A custom instruction encapsulates a group of frequently occurring operation patterns of an application in a single instruction, which is implemented as a custom functional unit (CFU) embedded within the base processor. CIs improve performance and energy dissipation by parallelizing and chaining hot-spot operations of the application, reducing the number of instructions and consequently the number of instruction fetches, cache accesses, and instruction decodes, and also reducing the number of intermediate results stored in the registerfile. One way of building a good ISE is to combine several operations into one single, complex instruction. Combining such operations becomes feasible only when a special instruction is allowed to take many operands from the registerfile.
Therefore, one major bottleneck in maximizing the performance of custom instructions is the limited data bandwidth between the registerfile and the CFU, which can be improved by extending the registerfile to support additional read and write ports. However, additional ports increase registerfile area, power consumption, and cycle time. This suggests that the decision on selecting the most appropriate custom instructions should be made carefully in multi-core platforms. Relaxing the registerfile I/O constraints for specialized custom instructions has a significant impact on the achieved speed-up, at the expense of some area and power overhead [15,16]. Although area is not currently considered a limiting factor in the fabrication of a single processor, performance per area should be optimized in multi-core designs. That is, within a specific silicon area budget there are different architecture designs with different performance gains that consume the specified budget. This motivates the designer to consider different processor design alternatives to gain as much performance per die area as possible. Increasing the processor area to gain more performance at the expense of reducing the number of processors, or exploiting more simple processors, are the two design alternatives that the processor designer faces in the field of multi-core embedded processor design. In this paper, we exploit a custom instruction exploration framework that provides architecture-aware identification of a profitable set of custom instructions. Given the input/output constraints, the available registerfile bandwidth, and the transfer latencies between the CFU and the baseline processor, our design flow identifies area, performance, and code-size tradeoffs. We also propose a novel approach to increase the registerfile bandwidth without increasing the size of the registerfile.
This is achieved by the developed instruction set architecture and by the proposed pipeline backwarding logic, which is named after the forwarding concept common in pipeline architectures. The remainder of this paper is organized as follows: in the following section, we review existing research and state the main contributions of this paper. In Sect. 3, we describe the main stages of the exploited framework. Section 4 describes the proposed instruction set and the pipeline architecture of the developed customized processor. Section 5 provides the evaluation method and experimental results for a set of representative networking benchmarks. Finally, we conclude in Sect. 6.

2 Related work and motivation

Among the variety of designs for customized accelerators, the most common practice is still the instruction-set extensible processor [17,34]. Customization of a processor through instruction set extensions (ISEs) is now a well-known technique in high-performance embedded systems. Instruction set extension can be divided into instruction identification and instruction selection phases. There is significant research on automatic identification and selection of ISEs to create application-specific processors [18–22]. Given the application code, instruction identification consists of encapsulating some basic operations into larger and more complex operations. These complex operations are recognized by their representative subgraphs, which cover the application graph. Once the subgraphs are identified and extracted from the application flow graph, they are considered as single powerful instructions and then pass through a selection process. The identification and selection processes use guide and cost functions to take into account the constraints that the new instructions have to satisfy. Partitioning the required operations of an application into base-processor instructions and CIs is done under certain constraints.
First, there is a limited area allocated to CFUs. Second, the data bandwidth between the base processor (including its registerfile) and the CFU is limited, and the data transfer costs have to be explicitly evaluated. Next, an instruction supports a limited number of encoding bits; therefore, only a limited number of input and output operands can be encoded in a fixed-length instruction word. Most embedded processors today have instruction words no wider than 32 bits. Considering that many of them come with 32 or more general-purpose registers, it is not possible to encode many register operands in an instruction word. Exhaustive CI identification algorithms are not scalable and generally fail for larger applications under the assumption of unlimited registerfile read and write ports. Most earlier techniques for ISE generation limit the number of I/O ports of the registerfile to reduce the set of subgraphs that can be identified. Since the speed-up obtainable by custom instructions is limited by the available data bandwidth between the base processor and the CFU, the limitation on the number of registerfile accesses from a CFU can greatly influence the performance of a custom instruction. For improving performance, it is desirable to have a large data bandwidth from the registerfile to the CFUs. However, researchers have shown that the area and power of the registerfile, as well as its access delay, increase in proportion to the number of registerfile ports [23]. Since most embedded processors are designed under very tight area restrictions, modern embedded processors restrict the number of registerfile I/O ports to save area and power. The studies in [24] and [25] present a technique that uses custom registers, called internal registers (IR), inside the CFU, accompanied by separate move instructions. The move instructions transfer data between the registerfile and the IRs before and after executing a custom instruction.
Similarly, the MicroBlaze processor [26] from Xilinx Inc. provides dedicated Fast Simplex Link (FSL) channels to move operands to the CFU. The Tensilica Xtensa [27] uses custom state registers for moving additional input and output operands between the base processor and the CFUs. Binding base-processor registers to custom state registers at compile time reduces the amount of data transfers. The register clustering method described in [28] exploits the clustering techniques used in VLIW machines as a promising technique for reducing the registerfile ports in the presence of ISEs. Clustered VLIWs try to minimize the number of registerfile ports by partitioning the registerfile into several small pieces and generally restricting a functional unit to access only one of these clusters [29]. Special instructions are used to move data from one cluster to another. Kim and Mudge [30] developed two techniques for reducing the number of registerfile ports, exploiting a pre-fetch method and delayed write-back. The reviewed techniques rely on special move instructions for reading the input operands of the CFU, and extra move instructions are required to transfer the generated results from the CFU to the general registerfile. We refer to these techniques as the MOV model in the rest of the paper. Pozzi and Ienne [31] propose an I/O serialization, or registerfile pipelining, approach to relax custom instruction I/O constraints. They propose a custom instruction identification algorithm that serializes CFU operand reads and result writes that exceed the registerfile I/O constraints. Their method also schedules the registerfile accesses and minimizes the CFU latency by pipelining and multi-cycle registerfile access.
Though the mentioned method improves the data bandwidth of the registerfile by serialization, it does not address the architectural implementation concerns and the related problems, such as encoding multiple operands in a limited fixed-length instruction format, and it neglects the pipeline architecture and data hazard design implications. Atasu et al. [32] present an algorithm to solve the limited registerfile data bandwidth problem with relaxed I/O constraints. However, for each additional I/O access they assume a constant penalty on hardware latency. This leads to an inaccurate calculation of the speed-up, because some I/O accesses can be overlapped with the execution stage. In another work, Atasu et al. [33] introduce a design flow to identify area, performance, and code-size tradeoffs, studying the effect of I/O constraints, registerfile ports, and compiler transformations on application efficiency. They show that in most cases the highest performance is achieved when the I/O constraints are relaxed. Bonzini and Pozzi [17] formalized the problem of recurrence-aware instruction-set extension. They presented a framework to explore the tradeoffs between the achieved speed-up and the CI size. Using this tradeoff, they can obtain optimal solutions by balancing the benefit of higher-gain large CIs that are used less often against smaller CIs that are found many times in the applications. However, these studies neglect both the effect of large registerfiles on the processor area and the limited instruction encoding bits available for accommodating the custom instruction operands. Another approach that addresses the scalability concern of exhaustive ISE generation for larger applications and registerfiles with many read and write ports is proposed in [34]. This approach considerably prunes the list of ISE candidates and hence runs faster.
Adding custom instructions to customized processors increases their performance. However, die area and energy efficiency are as important as performance in embedded systems. The researchers in [35–38] merge a collection of sub-graph datapaths for resource sharing during ISE synthesis to increase the reusability of custom instructions and reduce the imposed customized processor die area. These techniques explore the design space of customized processors to support a wider set of ISEs and increase the number of custom instructions that can be identified within a given area budget. However, the overall area of the customized processor is not affected only by custom instructions. Results in [15] and [16] show that a major percentage of the power and area of a processor datapath is consumed in the registerfile, so reducing the area and power consumption of the registerfile has a great impact on the overall die area of the customized processor. Therefore, maximizing the area savings is strictly dependent on the registerfile area, which is not considered quantitatively in the above studies. Figure 1 depicts the area overhead of different registerfile I/O configurations relative to the area of a (2, 1) registerfile (i.e., two read ports and one write port) in the MIPS processor, evaluated by hardware synthesis with a 90 nm CMOS standard cell library. The selected MIPS processor has a 32-bit, single-issue, in-order, 5-stage pipeline architecture, as introduced in [39]. The figure reveals that each additional read (write) port imposes almost 20 % (10 %) area overhead. It is also observed from the synthesis results (as shown in Fig. 2) that the specified (2, 1) registerfile occupies almost 45.7 % of the base MIPS processor core area.

Fig. 1 Area overhead of increasing the number of read and write ports of the registerfile relative to the (2, 1) registerfile. The registerfiles are synthesized with a 90 nm CMOS library

Fig. 2 MIPS layout using a 90 nm CMOS library
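The trend reported in Fig. 1 can be captured by a simple linear model. This is a minimal sketch based on the stated synthesis numbers (roughly 20 % per extra read port and 10 % per extra write port, relative to the (2, 1) registerfile); the function name and the linear extrapolation are our own, not from the paper.

```python
def rf_area_overhead(reads, writes, read_cost=0.20, write_cost=0.10):
    """Estimated area of a (reads, writes) registerfile relative to the
    (2, 1) baseline, linearly extrapolating the per-port overheads
    reported for the 90 nm synthesis in Fig. 1 (an assumption, since the
    figure may deviate from a straight line for high port counts)."""
    assert reads >= 2 and writes >= 1
    return 1.0 + (reads - 2) * read_cost + (writes - 1) * write_cost

# A (4, 2) registerfile, as needed by a 4-input/2-output custom
# instruction, would cost roughly 50 % more area than the (2, 1) baseline:
print(rf_area_overhead(4, 2))  # -> 1.5
```

Combined with the observation that the (2, 1) registerfile already occupies about 45.7 % of the core, this shows why adding ports quickly dominates the die area of a small embedded core.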
Other researchers in this area, such as Park et al. [40], have proposed techniques for reducing the read and write ports of registerfiles, including the bypass hint technique and register banking. The introduction of shadow registers [41] and the use of the forwarding paths of the base processor [42] can also improve the data bandwidth. While such methods reduce the hardware cost of multi-port registerfiles, they make instruction scheduling and register allocation extremely complex for the compiler. A new architecture introduced in [43] is based on the MIPS processor and supports custom instructions with multiple read and multiple write operands. It uses the method proposed in [31] to relax the constraints on the number of registerfile read/write ports by pipelining the read/write accesses to the registerfile, thereby preserving the size of the base registerfile. [54] also proposes an approach for custom instruction generation that considers area and performance at the same time. The identified CIs in [54] are pipelined, and extra registers are inserted into the datapath to serialize the registerfile accesses. This approach supports CIs that require more operands than the base processor supports. These designs impose little additional hardware cost, at the expense of losing some cycles due to the pipelining of registerfile accesses. Though the studied works improve the data bandwidth of the registerfile, they do not discuss implementation concerns and do not address related problems such as encoding multiple operands in a limited fixed-length instruction format. The fixed-length instruction encoding constraint introduced in RISC processors limits the space for encoding extra opcodes and operands for new custom instructions. Supporting arbitrary custom operations with sufficient operands in a generic base processor incurs large overheads.
This overhead comes from the extra operand and opcode bits of custom instructions; it leads to wider instructions and hence more energy consumption in instruction fetch. The above-mentioned studies do not discuss the issue of encoding extra operands in the limited instruction format. For example, the work based on shadow register files must encode either the shadow-register identifiers or the architectural-register identifiers for the input operands of the custom instruction. She et al. [53] tackle the problem of integrating custom instructions in an embedded processor and reduce the requirements for operand encoding and registerfile ports by exploiting a software-controlled bypass network. Their method relies on a compiler backend for generating and scheduling special instructions. The size of the generated instructions in [53] is limited to patterns of just two operations, and they exploit a look-up table for storing the control signals of only eight custom instructions. The table is visible to the software and can be reconfigured for new custom instructions. Another important challenge in designing customized processors is the pipeline architecture and its hazard design implications, which are also neglected in the studied works. Data hazards in the pipeline are resolved by employing data forwarding. For a multi-operand custom instruction, these hazards can occur on any of the input operands. For multi-cycle register reads or shadow registers, it is not clearly stated how data hazards are resolved for the additional operands. Our work addresses both of these important issues. In addition, our method avoids complex forwarding and omits temporary registers in custom instructions (internal registers, local registerfiles, or shadow registers), using pipeline registers instead.
Towards designing our high-performance embedded packet-processing engines, we have utilized the MIPS base instruction set, enhanced it with new architectural solutions, and applied the pipeline backwarding concept to improve the registerfile bandwidth without increasing the size of the registerfile. We have studied both MIPS and ARM architectures for packet-processing applications in [44] and [45]. The key point of our research is the simultaneous optimization of the search algorithm for finding larger custom instructions while keeping the number of I/O branches of the CI subgraphs at an application-based optimum value. In summary, the design objectives of the proposed architecture are as follows: (1) we increase the data bandwidth between the registerfile and the CFU without increasing the registerfile area. (2) We ensure that the proposed custom instructions do not modify the base-processor registerfile and keep its base instruction set applicable. (3) The proposed pipeline backwarding concept is implemented with simple hardware and does not considerably affect the pipeline architecture of the base processor. (4) The proposed instruction set architecture does not complicate the base-processor compiler. (5) The proposed instruction set can accommodate a large number of custom instructions while leading to more compact code in comparison with traditional methods.

3 Architecture-aware processor customization

In this section, we briefly introduce our instruction set extension framework. Our framework is composed of custom instruction identification and selection, considering the architectural constraints.

3.1 Custom instruction identification framework

The complete flow of our proposed framework for extending the instruction set of the base processor in an extensible processor design is depicted in Fig. 3.
As shown, our framework consists of two main parts: (1) custom instruction identification and (2) custom instruction selection. The process of instruction set identification starts with extracting the data flow graph (DFG) from the application. A DFG is a directed graph G = (V, E) in which V is the finite set of nodes, labeled with the basic operations of the processor, and E is a finite set of edges that shows the data dependencies among these operations. In this part, the VEX [29] configurable compiler and simulation tool is utilized to extract DFGs from applications written in C/C++. In addition, to obtain statistics on the real behavior of the desired applications, the code coverage and the iteration frequency of all basic blocks are calculated during dynamic execution of the domain-specific benchmarks by gprof and gcov in conjunction with gcc [46]. gprof is a profiler that reports the percentage of time spent in each function of an application during its execution. gcov is a coverage program that reports the number of times each line of code and each basic block is executed. The aim of the custom instruction identification framework is to find a set of sub-graphs of the DFG that can be added to the basic instruction set in hardware. Exhaustively extracting all possible sub-graphs from a DFG with n nodes is not computationally feasible, especially as the number of nodes increases. Therefore, an algorithmic generation method is proposed in [21]. We have used this algorithm to limit the search space and to generate valid sub-graphs based on the following three constraints, plus an extra fourth constraint imposed by our modified algorithm:

1. The number of inputs/outputs of each sub-graph must not exceed the maximum allowed number of custom-function input/output ports.
2. Memory operations, e.g., load/store, are considered forbidden nodes and have to be excluded from the generated sub-graphs.
3. All generated sub-graphs must be convex. Convexity is a legality check that assures that when all the sub-graph nodes are merged into one custom instruction, all the input operands are available at the beginning and all the outputs are produced at the end of the execution cycle of the generated custom instruction.
4. The hardware latency of a candidate sub-graph's critical path is compared and normalized to the base-processor critical path.

Fig. 3 The proposed custom instruction identification framework

The proposed algorithm in [21] is modified to extract all possible valid sub-graphs and preserve them in a structure that is used later in the custom instruction selection part. Next, all identified matches are classified into isomorphic classes, called template classes, exploiting the algorithm introduced in our previous work [47]. For evaluating the performance of each template, the number of cycles saved by means of custom instructions is calculated. The performance of each template is calculated based on the actual hardware latency of each operation, obtained after hardware synthesis (as in our previous work [49]), by the following formula:

TemplatePerformance = Σ_(matches in each template) Iteration × (SWLatency − HWLatency)

SWLatency represents the number of clock cycles needed to run the match entirely in software, and HWLatency is the number of cycles its execution takes if it is implemented as a custom instruction in hardware. The term HWLatency is rounded up to the nearest integer to represent the number of execution clock cycles in hardware. Iteration denotes the number of times the specified custom instruction is executed during dynamic execution of the main application.
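The convexity constraint (3) can be checked mechanically: a sub-graph S is convex when no path between two nodes of S passes through a node outside S. This is a minimal sketch, assuming the DFG is stored as an adjacency list; the function names and the toy graph are our own, not from the paper.

```python
# Convexity check for a candidate sub-graph of a DFG (adjacency-list form).
# S is convex iff no node outside S is both reachable from S and able to
# reach S (otherwise a path would leave S and re-enter it).

def _reachable(dfg, starts):
    """All nodes reachable from `starts` via directed edges (excl. starts
    unless revisited)."""
    seen, stack = set(), list(starts)
    while stack:
        n = stack.pop()
        for m in dfg.get(n, []):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

def is_convex(dfg, subgraph):
    subgraph = set(subgraph)
    # Nodes outside S that some node of S can reach...
    outside_succ = _reachable(dfg, subgraph) - subgraph
    for n in outside_succ:
        # ...and that can themselves reach back into S: convexity violation.
        if subgraph & _reachable(dfg, [n]):
            return False
    return True

# a -> b -> c and a -> c: {a, c} is not convex (the path a->b->c leaves it).
dfg = {"a": ["b", "c"], "b": ["c"], "c": []}
print(is_convex(dfg, {"a", "c"}))       # False
print(is_convex(dfg, {"a", "b", "c"}))  # True
```

A non-convex candidate such as {a, c} above cannot be a single instruction, because its intermediate value b would have to leave the CFU mid-execution and come back, violating the read-all-inputs-first, write-all-outputs-last contract.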
Finally, the valid matches are enumerated by the exact algorithm [21] and the conflict graph Gc = (V, E) is built. In the conflict graph, V is the set of generated matches, which are called conflict nodes, and E is the set of edges indicating the conflicts between matches. An edge exists between two conflict nodes if they have at least one common operation in their representative subgraphs. The template selection part consists of two sub-parts, as introduced in our previous work [48]: (1) locally find the maximum-weighted independent set (MWIS) in each template; (2) globally select the MWIS over all of the templates. Because the number of matches in each template is limited, we use the exact MWIS for the first part. For the second part, the GWMIN algorithm proposed in [50] is utilized to find the set of independent templates that yields the best performance. When a template is selected by this algorithm, a label is assigned to it, and all adjacent templates (i.e., the templates that have a common node with the selected one) are labeled as deleted. The algorithm iteratively searches among all unselected and undeleted templates until every template is labeled selected or deleted. In each iteration, the algorithm selects the template u that maximizes the associated merit W(u) / (dGi(u) + 1), where W(u) is the weight of the template (i.e., the template performance) and dGi(u) is the degree of the template (i.e., the number of edges originating from the template). The following code represents the algorithm in pseudo-C notation.
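The pseudo-C listing referenced above did not survive into this text. As a hedged sketch of the described GWMIN-style loop (Python rather than pseudo-C; all names are our own, and the degree is recomputed over surviving templates, which may differ in detail from the paper's exact variant):

```python
def gwmin_select(weights, conflicts):
    """Greedy weighted-independent-set selection in the template conflict
    graph, in the spirit of GWMIN [50]: repeatedly pick the template u
    maximizing W(u)/(d(u)+1) among templates that are neither selected nor
    deleted, then mark its conflicting neighbors as deleted.
    weights:   {template: performance weight W(u)}
    conflicts: {template: set of conflicting templates}"""
    selected, removed = [], set()
    while True:
        candidates = [t for t in weights if t not in removed]
        if not candidates:
            break
        def merit(t):
            live_deg = len(conflicts.get(t, set()) - removed)
            return weights[t] / (live_deg + 1)
        u = max(candidates, key=merit)
        selected.append(u)         # label u as selected
        removed.add(u)
        removed |= conflicts.get(u, set())  # neighbors become 'deleted'
    return selected

# Toy conflict graph: T1 -- T2 -- T3; T1 and T3 are independent.
w = {"T1": 10.0, "T2": 8.0, "T3": 7.0}
c = {"T1": {"T2"}, "T2": {"T1", "T3"}, "T3": {"T2"}}
print(gwmin_select(w, c))  # T1 and T3 are both selected; T2 is deleted
```

The returned set is guaranteed independent in the conflict graph, so the corresponding matches never share a DFG operation.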
Fig. 4 Performance improvement of the selected applications versus different input/output constraints for custom instructions

The next part in the framework flow graph estimates the application performance improvement when the base processor is augmented with the selected templates, by the following formula:

ApplicationPerformance = Σ_(selected templates) TemplatePerformance

ApplicationPerformance stands for the number of cycles that will be saved with the help of custom instruction implementation. The packet-processing applications selected from PacketBench [51] include IPv4 lookup (i.e., IPv4-trie and IPv4-radix), flow classification (Flow-class), and encryption and message-digest algorithms (i.e., IPSec and MD5), all widely used in network processing nodes. The representative benchmark applications are introduced and analyzed in detail in our previous work [44, 45]. The proposed algorithm is applied to the selected packet-processing benchmark applications for different input/output (CIin, CIout) constraints for custom instructions. The performance improvement is shown in Fig. 4; the gain reaches up to 40 %. Due to the high rate of memory operations in some benchmarks (e.g., Flow-class and IPv4-radix), these applications cannot gain much from the proposed simple custom instructions; they could be further improved with complex custom instructions that include memory accesses, such as the one proposed in [52]. It is also noticeable that for (CIin, CIout) constraints higher than (4, 4), the performance is not considerably improved.

3.2 I/O-aware custom instruction selection

The bandwidth of the registerfile limits the achievable performance of customized processors. We extend our framework to analyze the effects of changing the number of registerfile read/write ports on the performance of the selected custom instructions.
In [31], a solution is proposed for resolving the limitation on the number of registerfile read/write ports in customized processors. In this method, simultaneous read/write accesses that exceed the number of available read/write ports are serialized using pipeline registers. Pipeline registers make it possible to perform the extra read/write operations on the base registerfile over multiple cycles. Although this architecture has little area overhead, applications lose some performance due to the extra cycles imposed by the multi-cycle reads/writes. In these architectures, the performance degradation of each template caused by the multi-cycle registerfile reads/writes is evaluated. The following equation is used to calculate the template performance considering the multi-cycle reads/writes imposed by register pipelining:

TemplatePerformance = Σ_(matches in template) Iteration × (SWLatency − HWLatency − ⌈(CIin − RFin) / RFin⌉ − ⌈(CIout − RFout) / RFout⌉)

SWLatency: software latency of the template, calculated as the accumulated number of operations in the candidate subgraph.
HWLatency: hardware latency of the template, calculated as the sum of the operation delays along the critical path of the template.
CIin: number of custom instruction input operands.
CIout: number of custom instruction output operands.
RFin: number of read ports of the registerfile.
RFout: number of write ports of the registerfile.

When the number of CI input/output operands is greater than the available registerfile read/write ports, registerfile pipelining is used for accessing the registerfile. Hence, the terms ⌈(CIin − RFin) / RFin⌉ and ⌈(CIout − RFout) / RFout⌉ denote the numbers of cycles that are lost due to the pipelining overhead resulting from the limited number of read and write ports of the registerfile.
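The I/O-aware variant of the formula can be sketched the same way; the match tuples are again illustrative assumptions:

```python
import math

def io_aware_performance(matches, rf_in, rf_out):
    """I/O-aware template performance (sketch of the formula above).

    matches: list of (iterations, sw_cycles, hw_cycles, ci_in, ci_out)
    Operands beyond the registerfile ports are serialized by register
    pipelining; each extra group of rf_in reads / rf_out writes costs
    one additional cycle.
    """
    total = 0
    for iters, sw, hw, ci_in, ci_out in matches:
        read_penalty = math.ceil(max(ci_in - rf_in, 0) / rf_in)
        write_penalty = math.ceil(max(ci_out - rf_out, 0) / rf_out)
        total += iters * (sw - hw - read_penalty - write_penalty)
    return total

# a (4, 2) custom instruction on a (2, 1) registerfile pays one extra
# read cycle and one extra write cycle per execution
print(io_aware_performance([(100, 6, 1, 4, 2)], rf_in=2, rf_out=1))  # 300
```

With the sample match, the saving per iteration drops from 6 − 1 = 5 cycles to 5 − 1 − 1 = 3 cycles once the port limits are accounted for.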
The evaluated performance is assigned to each template, and the remaining template selection process then proceeds as in the previously described algorithm. I/O-aware custom instruction selection is applied to the packet-processing benchmarks, and the resulting performance improvements are shown in Fig. 5. As shown, when the custom instructions are generated based on (CIin, CIout) = (3, 2) and the registerfile constraint is (3, 2), the performance improvement is higher than when (CIin, CIout) = (6, 3) and (RFin, RFout) = (3, 2). Therefore, both (CIin, CIout) and (RFin, RFout) should be considered for achieving the best performance improvement. Another important observation is that the performance improvement almost saturates at (CIin, CIout) = (6, 3) and (RFin, RFout) = (6, 3); the improvement at this point is on average only about 5 % higher than at (CIin, CIout) = (3, 2) and (RFin, RFout) = (3, 2), while the former configuration achieves this improvement with a considerably greater area overhead than the latter.

Fig. 5 Performance improvement of selected applications versus different registerfile and custom instruction constraints

3.3 Analyzing area overhead

The area overhead is another term that should be analyzed in custom instruction selection. As mentioned in the previous part, designers have two options in the architecture of extensible processors: (1) increasing the registerfile bandwidth, or (2) pipelining the registerfile. The first strategy yields higher performance at the expense of the area overhead of a larger registerfile. The second has no area overhead, at the expense of some performance degradation due to register pipelining. The performance per area of different custom instructions on different registerfiles is shown in Fig. 6, where performance-per-area values are normalized to that of MIPS.
The selected MIPS processor has a 32-bit single-issue in-order 5-stage pipeline architecture, as introduced in [39]. As shown in this figure, since increasing the number of registerfile read/write ports can significantly increase the processor area, we propose the architecture with the (2, 1) registerfile as the optimum one in terms of performance per area. Among registerfiles with the same performance per area, the ones leading to higher performance or lower area can be selected according to the area budget or performance requirements.

Fig. 6 Performance per area for different benchmark applications

3.4 Analyzing long latency custom instructions

Another point that should be considered in the design of customized processors is the execution of long latency custom instructions (those with a critical path longer than the base processor cycle). We applied our modified identification algorithm to select long latency custom instructions for the representative benchmark applications. The results are shown in Fig. 7. Increasing the number of allowed clock cycles improves the performance by at most 5 %, at the expense of controller complexity. We therefore conclude that allowing long latency custom instructions is not profitable in terms of performance improvement versus hardware complexity. The performance of long latency instructions can be improved with the methods proposed in [31] and [34]; however, this is not the focus of this paper, so we use our simple method and omit long latency instructions for the sake of simplicity.

Fig. 7 Performance improvements of long latency custom instructions for different benchmark applications

4 Architecture model

Our proposed architecture exploits "backwarding" logic in the processor pipeline to resolve the hazards of the additional operands per custom instruction.
In addition, our architecture imposes minimal modifications to the regular instruction encoding to accommodate the additional operands of custom instructions. In this section, we describe these modifications in the processor pipeline and the instruction encoding. We assume a RISC-style in-order pipelined processor architecture with extensibility features. For illustration purposes, we use a simple single-issue in-order MIPS-like 5-stage pipeline; however, our technique can easily be applied to other pipeline architectures. The proposed architecture supports multiple-input multiple-output (MIMO) custom instructions, considering the area overhead of the augmented custom instructions as well as the area overhead imposed on the registerfile. As shown in the previous part, the performance of the extensible processor can be improved by relaxing the input/output constraints in custom instruction generation. However, the area overhead of registerfiles with a relaxed number of read/write ports can significantly reduce the achieved performance per area. Chen et al. have proposed an architecture for MIMO custom instructions in [43]. We first analyze this model and describe its shortcomings in terms of performance; next, we propose an architecture that overcomes these shortcomings. The MIMO extension (M2E) architecture model [43], which is in the MOV category (defined previously), provides two custom instruction formats to handle the desired custom instructions. The first one is capable of encoding up to 32 custom instructions which have at most two input operands and a single output operand. This format is designed to support small tasks that can be handled by a single instruction. The shift amount and function fields of the standard MIPS R-type instruction format (i.e., the last ten bits of the R-type instruction) are exploited to specify the fields of custom instructions and custom function units.
The other instruction format provides the possibility of compounding a bundle of instructions to execute complex functions on M2E. It is worth mentioning that executing such complex custom instructions on this machine needs at least two extra instructions: one to read operands from the registerfile and one to get the results from the CFU. The M2E model exploits this approach to provide up to 16 CFUs, each supporting up to 16 custom functions. This instruction format therefore offers up to 256 complex custom functions that have multiple inputs and multiple outputs. The sequence of instructions generated for executing a complex custom function should be arranged in the proper order, meaning that all instructions that put operands into the targeted CFU must be executed before the instructions that retrieve the results. As stated in the previous sections, due to the high area overhead of increasing the number of read and write ports of the registerfile, our proposed architecture is based on a registerfile with dual read ports and a single write port. A modified architecture for traditional registerfile pipelining is used to support the multiple-input multiple-output custom functions. In traditional register pipelining, preparing the operands of a custom function that, for example, needs four inputs and produces two outputs consumes two clock cycles for reading the operands from the registerfile, and two more clock cycles are required for writing the results back into the registerfile. Thus, each additional I/O access imposes a constant penalty. This leads to a sub-optimal speed-up, because some I/O accesses could be performed in parallel with the other pipeline stages. We resolve this problem with the proposed instruction set and pipeline architecture. As demonstrated in Fig. 5, four-input two-output custom instructions [i.e., (4, 2) custom instructions] have a performance gain close to that of custom instructions with relaxed input and output ports.
These results motivate us to restrict the custom function inputs and outputs to (4, 2). Based on this observation, we propose two distinct instruction formats for executing custom instructions on the base processor architecture. The first one, shown in Fig. 8a, is a single 32-bit instruction that can be used for custom instructions where the total number of operands is less than or equal to four [i.e., (2, 1), (2, 2), and (3, 1) custom instructions]. The other custom instructions, shown in Fig. 8b, have more than four operands and are decomposed into two consecutive single instructions. In the proposed double-word custom instructions, the first instruction reads up to two operands and writes one of the results; the second one is overlapped with the first in the pipeline and reads the other two operands and writes the other output operand. Our encoding ensures that the encoding of base processor instructions is not affected. We illustrate our encoding based on the instruction format of the MIPS processor; however, the general idea is applicable to any RISC-style instruction format. The proposed instruction formats are shown in Table 1 and Fig. 8. As shown in Table 1, the custom functions are classified into six groups based on the number of inputs and outputs. Custom functions with (2, 1), (2, 2), or (3, 1) input and output ports fit into a single 32-bit instruction, but the other configurations need two consecutive instructions for execution.

Fig. 8 The proposed instruction formats

Table 1 Classification of custom functions based on the custom instruction inputs and outputs

  Inputs  Outputs  Type
  2       1        Single
  2       2        Single
  3       1        Single
  3       2        Double
  4       1        Double
  4       2        Double

  Opcodes (shared between the two types): 010000, 010001, 010010, 010011, 011100, 011101, 011110, 011111, 110011, 111011
According to this classification, two distinct instruction formats are developed for custom instructions: one for the group that needs only one instruction for execution, and the other for the group that needs two consecutive instructions. As we do not want to affect the encoding of normal instructions, all the information about the operands of the custom instructions is encoded in the unused parts of the MIPS instruction format. As shown in Fig. 8a, the joint opcode and function fields specify the custom function that should be executed on the CFU; Rs1, Rs2, and Rs3 represent the indices of the source operands in the registerfile, and Rd1 and Rd2 indicate the destination operands. Instruction bits 20 to 16 can hold the index of either a source or a destination operand, depending on the type of the instruction. Figure 8b shows the case where two consecutive instructions must be executed for custom instructions of the second group. The opcode and function fields in the first instruction merely specify that three or four operands should be read from the registerfile, and the custom field in the second instruction specifies the custom function that produces the results; Rd1 and Rd2 are used for the destination operands. As demonstrated in [39], there are ten unused opcodes in the MIPS R-type instructions that can be used for both custom instruction groups. The classification of these unused configurations is shown in Table 1. With these ten unused opcodes, the total number of custom instructions that can be added to the base instruction set is calculated with the following formula:

(10 − n) × 2^bitwidth(Function) + n × 2^(bitwidth(Custom) + bitwidth(Function)) = (10 − n) × 2^6 + n × 2^11   (for 0 < n < 10)

where n is the number of opcodes assigned to the double-word group. To meet designer demands, and depending on the required number of custom instructions in each group, the number of custom instructions varies from 2,624 to 18,496 for different values of n.
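The instruction-count formula can be checked directly; the sketch below assumes, as in the text, ten free opcodes, a 6-bit Function field, and an 11-bit combined Custom+Function field:

```python
# Number of encodable custom instructions when n of the ten unused
# MIPS R-type opcodes are assigned to the double-word group and the
# rest to the single-word group.
def custom_instruction_count(n, free_opcodes=10,
                             function_bits=6, custom_plus_function_bits=11):
    assert 0 < n < free_opcodes
    single = (free_opcodes - n) * 2 ** function_bits
    double = n * 2 ** custom_plus_function_bits
    return single + double

print(custom_instruction_count(1))  # 2624
print(custom_instruction_count(9))  # 18496
```

For n = 1 this gives 9 × 64 + 1 × 2048 = 2,624, and for n = 9 it gives 64 + 18,432 = 18,496, matching the range quoted in the text.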
Splitting a complex custom instruction into two consecutive 32-bit instructions in the described manner leads to a new problem: if these two consecutive instructions are executed normally on the processor pipeline, the results cannot be written into the registerfile correctly. Figure 9 shows the pipeline execution of a custom instruction from group 2. The first instruction reads the first and second source operands in the ID1 stage (the second clock cycle). In the third clock cycle, the instruction is executed, and meanwhile the third and fourth source operands are read by the second instruction. This means that the first instruction would use invalid values for the third and fourth source operands. To eliminate this problem, we propose a special logic, called backwarding, which is added to the base processor. Figure 9 demonstrates the use of backwarding in our proposed architecture. The backwarding module passes the appropriate result from the output of the execute (EX) stage of the second instruction back to the write-back (WB) stage of the first one, so that the results calculated by the second instruction are written while the first instruction is in its WB stage. Consequently, the first instruction's EX stage performs a useless operation, because its calculated results are overwritten by the backwarding logic; this is why that stage is darkened in the pipeline timing diagram of Fig. 9.

Fig. 9 Pipeline execution of a custom instruction from group 2 and the backwarding concept

The data-path of the proposed architecture is shown in Fig. 10. Two modules, called the read controller and the write controller, are added to the base processor pipeline.

Fig. 10 Datapath of the proposed single-issue 5-stage pipeline
Fig. 11 Read and write controller flow graphs: a read controller, b write controller

These controllers control and schedule the read/write accesses to the registerfile; their state diagrams are shown in Fig. 11. The read controller schedules the values driving Rs1, Rs2, Rs3, and Rs4 for the different instruction types. Normally, Rs1 and Rs2 are fed as the input operands for normal instructions or for custom instructions that have up to two source operands. For instructions that have more than two source operands, the read controller directs the additional source operands to Rs3 and Rs4. The write controller schedules the writes into the registerfile. According to the previous explanation of the instruction formats and the proposed backwarding concept, the write controller must select the destination value among three sources: Rd1 for normal instructions or single-word custom instructions with one destination operand, and the backwarding value and Rd2 for the first and second destination values, respectively, of multi-output custom instructions. To handle the hazards that may arise in the pipeline stages during the execution of two consecutive instructions, all the states for each combination of the previously classified instructions are shown in Figs. 12 and 13. The pipeline stages are similar to the five-stage pipeline of the MIPS processor, except that the decode and write-back stages may be modified for reading more than two operands and writing two results into the registerfile, respectively. Since our architecture is based on a (2, 1) registerfile, the instruction decode stage (ID) is extended into ID1 and ID2 for instructions that need three or four input operands. In the same manner, the write-back stage (WB) is extended into WB1 and WB2 for instructions that have more than one destination operand. Stages shaded with a gray pattern indicate that nothing useful is done in that stage (i.e., a NOP). Checking for probable structural hazards requires analyzing the execution of two consecutive instructions.
Figure 12 shows the states of the pipeline for the custom instructions that are implemented as two consecutive single instructions or as a single instruction followed by a double instruction; the first and second instructions are separated by a dashed line. The cloud symbol in the pipeline diagrams denotes a stall (bubble) that may be required during execution to eliminate a hazard. According to Fig. 12 part (a), in which the first instruction is a (2, 1) instruction, all combinations are hazard free no matter what the second instruction is. Figure 12 part (b) demonstrates the situation where the first instruction is (2, 2) and the following instruction is (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), or (4, 2); in this case, a stall is required in the write-back stage in most of the sequences, and upon detection of this kind of hazard all pipeline stages must be stalled. As shown in Fig. 12 part (c), all of the stalls occur in the decode stage. All instruction sequences that start with a double-word instruction are shown in Fig. 13; as shown, these sequences never need to stall the pipeline. All of the hazards are summarized in Table 2.

Fig. 12 Instruction sequences which start with a single instruction

Fig. 13 Instruction sequences which start with a double instruction

Table 2 Possible hazards between two consecutive instructions and the affected pipeline stage (entries: hazard? / stalled stage)

  First Ins. \ Second Ins.   (2,1)     (2,2)     (3,1)     (3,2)     (4,1)     (4,2)
  (2, 1)                     No/-      No/-      No/-      No/-      No/-      No/-
  (2, 2)                     Yes/WB2   Yes/WB2   No/-      Yes/WB2   Yes/WB2   Yes/WB2
  (3, 1)                     Yes/ID2   Yes/ID2   Yes/ID2   Yes/ID2   Yes/ID2   Yes/ID2
  (3, 2)                     No/-      No/-      No/-      No/-      No/-      No/-
  (4, 1)                     No/-      No/-      No/-      No/-      No/-      No/-
  (4, 2)                     No/-      No/-      No/-      No/-      No/-      No/-
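The hazard cases for pairs of consecutive custom instructions can be captured as a small lookup; this sketch simply encodes the tabulated Yes/No entries and the stalled stage:

```python
# Hazard lookup for two consecutive custom instructions, each given as
# its (inputs, outputs) pair. Returns the pipeline stage that must
# stall, or None when the sequence is hazard free.
def hazard_stage(first, second):
    if first == (2, 2) and second != (3, 1):
        return "WB2"  # write-back conflict on the second result
    if first == (3, 1):
        return "ID2"  # decode conflict on the extra read cycle
    return None       # (2, 1) first, or any double-word first instruction

print(hazard_stage((2, 2), (4, 2)))  # WB2
print(hazard_stage((3, 2), (4, 2)))  # None
```

A hardware hazard-detection unit would implement the same two conditions combinationally from the decoded instruction classes.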
Table 2 lists the hazards that may happen between two consecutive instructions in the pipeline stages during execution. Another important note is that we execute the double-word instructions atomically in the pipeline; therefore, interrupts, exceptions, or context switches cannot inject any operation between these two instructions.

5 Evaluation and results

In this section, we evaluate our architecture model and compare it to the MOV model. The worst conflict between two consecutive instructions occurs when every fetched instruction has four input and two output operands. Figure 14 depicts the states of the pipeline stages during the execution of such instructions on our proposed architecture and on the MOV model. Our model needs six cycles to execute the first (4, 2) instruction, and thereafter another (4, 2) instruction can be committed every two clock cycles. The MOV model, however, needs eight clock cycles to execute the first custom instruction (comprising four single instructions), and after that an additional instruction can be committed every four clock cycles. In the MOV model, the instructions labeled (a) put the first two operands into the internal registers; the instruction labeled (b) puts the other two operands into two other registers; (c) gets the first result from the internal latches after execution of the desired custom function; and finally, (d) gets the second result from the internal register.

Fig. 14 Pipeline stages for executing consecutive custom instructions: a MOV architecture, b proposed architecture
In other words, the minimum numbers of execution cycles needed to execute n instructions of type (4, 2) on the proposed and MOV architectures are given by the following formulas:

TotalExecutionCycles_proposed = 6 + 2 × (n − 1)
TotalExecutionCycles_MOV = 8 + 4 × (n − 1)

Hence, as the number of instructions increases, our proposed architecture executes the custom instructions about two times faster than the MOV model, assuming that all normal data dependency hazards are resolved by the traditional forwarding logic of the MIPS hardware. The following limit compares the two models:

lim_(n→∞) [6 + 2 × (n − 1)] / [8 + 4 × (n − 1)] = 1/2

Another metric that is important in comparing the proposed architecture with MOV is the code compression rate. In the MOV model, each (4, 2) instruction requires two instructions to put the appropriate operands into the internal registers and two instructions to execute the desired function and retrieve the results from the internal registers. In our proposed model, only two instructions are needed for a (4, 2) custom instruction, so the instruction memory usage of our proposed model is more efficient than that of MOV. The following formula is used to evaluate the code compression for both architecture models after custom instruction generation:

CodeCompression = 1 − (total number of instructions with CIs) / (total number of instructions without any CI)

This metric shows the percentage of instructions that are eliminated with the help of custom instructions.

Table 3 Number of generated instructions for each custom instruction

  (input, output)   MOV   Proposed architecture
  (2, 1)            1     1
  (2, 2)            3     1
  (3, 1)            3     1
  (3, 2)            4     2
  (4, 1)            3     2
  (4, 2)            4     2

Fig. 15 Percentage of code compression for selected applications
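The two cycle-count formulas can be evaluated directly; the sketch below reproduces the asymptotic 2x speed-up:

```python
# Minimum cycles to execute n back-to-back (4, 2) custom instructions
# on the two models, per the formulas above.
def cycles_proposed(n):
    return 6 + 2 * (n - 1)

def cycles_mov(n):
    return 8 + 4 * (n - 1)

for n in (1, 10, 1000):
    ratio = cycles_proposed(n) / cycles_mov(n)
    print(n, cycles_proposed(n), cycles_mov(n), round(ratio, 3))
# the ratio starts at 6/8 = 0.75 for n = 1 and approaches 1/2 as n grows
```

The start-up costs (6 versus 8 cycles) matter only for short runs; the steady-state throughput difference (2 versus 4 cycles per instruction) dominates for long instruction streams.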
Table 3 lists the number of instructions generated for each custom instruction in the proposed and MOV architectures for different numbers of input and output operands. In our proposed architecture, only one or two instructions are generated for each custom function, whereas in the MOV model up to four instructions are generated to execute a custom instruction on the processor. The percentages of code compression for the selected packet-processing applications are summarized in Fig. 15. We also compare the performance improvement of each benchmark for three different architectures. The architecture called "ideal" is assumed to support a registerfile with four read ports and two write ports, with an instruction format that has enough space to encode these six operands in a single instruction; this architecture provides the maximum performance improvement obtainable with custom instructions. The second architecture is based on our proposed instruction set and the proposed idea of data "backwarding". The last architecture, called MOV, is based on the typical register pipelining architecture, where custom "move" instructions are required to transfer data between the registerfile and the internal registers of the CFU. Figure 16 shows the performance improvement obtained from our proposed architecture and from the case where custom "move" instructions are inserted (i.e., MOV). The performance improvement of the proposed architecture is closer to the ideal performance limit. Thus, our technique can overcome the limitation of the number of registerfile ports with negligible effects on instruction encoding, pipeline architecture, and performance.

Fig. 16 Performance improvements of the proposed and MOV architectures normalized to the ideal performance improvement
6 Conclusion

We have proposed a custom instruction set and pipeline architecture for customizing a single embedded packet-processing core, with the objective of optimizing the performance per area of each core for processing parallel network workloads. The focus of this research was on increasing application efficiency by effectively handling larger subgraphs of an application while simultaneously overcoming the limitation on the number of registerfile ports. Custom instructions with more register operands are accommodated without increasing the registerfile I/O ports, through the definition of double-word custom instructions that overlap their numerous registerfile accesses in the pipeline. The potential hazards are resolved through the proposed pipeline backwarding concept, which improves the execution speed of custom instructions in the processor's pipeline. In comparison with previous work, our proposed architecture allows the definition of many more custom instructions by accommodating both single- and double-word custom instructions with enhanced code compression. The MIPS-based design in our case study utilizes unused instruction opcodes and fields so that the proposed custom instructions and pipeline architecture neither significantly affect the base instruction set and pipeline nor complicate the base processor compiler.

Acknowledgments The authors acknowledge the partial support received for this work under contract 149811/140 from the Microelectronic Committee-Research Administration of the University of Tehran.

References

1. Swanson S, Putnam A, Mercaldi M, Michelson K, Petersen A, Schwerin A, Oskin M, Eggers SJ (2006) Area-performance trade-offs in tiled dataflow architectures. In: Proceedings of the 33rd international symposium on computer architecture (ISCA'06), pp. 314–326
2. Nickolls J, Dally WJ (2010) The GPU computing era. IEEE Micro 30(2):56–59
3. Lee SJ (2010) A 345 mW heterogeneous many-core processor with an intelligent inference engine for robust object recognition. In: Proceedings of the IEEE international solid-state circuits conference, pp. 332–334
4. Bell S, et al (2008) TILE64 processor: a 64-core SoC with mesh interconnect. In: Proceedings of the IEEE international solid-state circuits conference, pp. 88–90
5. Jotwani R, et al (2010) An x86-64 core implemented in 32 nm SOI CMOS. In: Proceedings of the IEEE international solid-state circuits conference, pp. 106–107
6. Howard J, et al (2010) A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In: Proceedings of the IEEE international solid-state circuits conference, pp. 108–110
7. Shin JL, et al (2010) A 40 nm 16-core 128-thread CMT SPARC SoC processor. In: Proceedings of the IEEE international solid-state circuits conference, pp. 98–99
8. Johnson C, et al (2010) A wire-speed POWER processor: 2.3 GHz, 45 nm SOI with 16 cores and 64 threads. In: Proceedings of the IEEE international solid-state circuits conference, pp. 104–106
9. Azizi O, Mahesri A, Lee BC, Patel SJ, Horowitz M (2010) Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis. In: Proceedings of the 37th international symposium on computer architecture (ISCA'10), pp. 26–36
10. Kapre N, DeHon A (2009) Performance comparison of single-precision SPICE model-evaluation on FPGA, GPU, Cell, and multi-core processors. In: Proceedings of the international conference on field programmable logic and applications, pp. 65–72
11. Truong DN et al (2009) A 167-processor computational platform in 65 nm CMOS. IEEE J Solid State Circuits 44(4):1130–1144
12. Hill MD, Marty MR (2008) Amdahl's law in the multicore era. IEEE Comput 41(7):33–38
13. Borkar S (2007) Thousand core chips—a technology perspective. In: Proceedings of the design automation conference (DAC), pp. 746–749
14. Eyerman S, Eeckhout L (2010) Modeling critical sections in Amdahl's law and its implications for multicore design. In: Proceedings of the 37th international symposium on computer architecture (ISCA'10), pp. 362–370
15. Park S, Shrivastava A, Dutt N, Nicolau A, Paek Y, Earlie E (2008) Register file power reduction using bypass sensitive compiler. IEEE Trans Comput Aided Des Integr Circuits Syst 27(6):1155–1159
16. Nalluri R, Garg R, Panda PR (2007) Customization of register file banking architecture for low power. In: Proceedings of the 20th international conference on VLSI design (VLSID'07), pp. 239–244
17. Bonzini P, Pozzi L (2008) Recurrence-aware instruction set selection for extensible embedded processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 16(10):1259–1267
18. Atasu K, Pozzi L, Ienne P (2003) Automatic application-specific instruction-set extensions under microarchitectural constraints. In: Proceedings of the design automation conference (DAC), pp. 256–261
19. Clark N, Zhong H, Mahlke S (2003) Processor acceleration through automated instruction set customization. In: Proceedings of the 36th annual IEEE/ACM international symposium on microarchitecture (MICRO), pp. 129–140
20. Yu P, Mitra T (2004) Scalable custom instructions identification for instruction-set extensible processors. In: Proceedings of the CASES, pp. 69–78
21. Pozzi L, Atasu K, Ienne P (2006) Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Trans Comput Aided Des Integr Circuits Syst 25:1209–1229
22. Chen X, Maskell DL, Sun Y (2007) Fast identification of custom instructions for extensible processors. IEEE Trans Comput Aided Des Integr Circuits Syst 26(2):359–368
23. Zyuban VV, Kogge PM (1998) The energy complexity of register files. In: Proceedings of the international symposium on low power electronic design, pp. 305–310
Leupers R, Karuri K, Kraemer S, Pandey M (2006) A design flow for configurable embedded processors based on optimized instruction set extension synthesis. In: Proceedings of the design, automation & test in Europe (DATE) 25. Altera Corp. Nios processor reference handbook 26. Xilinx Inc., MicroBlaze soft processor core 27. Gonzalez RE (2000) Xtensa: a configurable and extensible processor. IEEE Micro 20:60–70 28. Karuri K, Chattopadhyay A, Hohenauer M, Leupers R, Ascheid G, Meyr H (2007) Increasing data bandwidth to instruction-set extensions through register clustering. In: Proceedings of the international conference on computer aided design, pp. 166–177 29. Fisher JA, Faraboschi P, Young C (2005) Embedded computing: a VLIW approach to architecture, compilers, and tools. Elsevier Inc., Amsterdam 30. Kim NS, Mudge T (2003) Reducing register ports using delayed write-back queues and operand prefetch. In: Proceedings of the 17th annual international conference on supercomputing, pp. 172–182 31. Pozzi L, Ienne P (2005) Exploiting pipelining to relax register-file port constraints of instruction set extensions. In: Proceedings of the international conference on compilers, architectures and synthesis for embedded systems, pp. 2–10 32. Atasu K, Dimond R, Mencer O, Luk W, Özturan C, Dündar G (2007) Optimizing instruction-set extensible processors under data bandwidth constraints. In: Proceedings of the design, automation and test in Europe (DATE), Mar 2007, pp. 588–593 33. Atasu K, Ozturan C, Dundar G, Mencer O, Luk W (2008) CHIPS: custom hardware instruction processor synthesis. IEEE Trans Comput Aided Des Integr Circuits Syst 27(3):528–541 34. Verma AK, Brisk P, Ienne P (2010) Fast, nearly optimal ISE identification with I/O serialization through maximal clique enumeration. IEEE Trans Comput Aided Des Integr Circuits Syst 29(3):341–354 35. Brisk P, Kaplan A, Sarrafzadeh M (2004) Area-efficient instruction set synthesis for reconfigurable system-on-chip designs.
In: Proceedings of the design automation conference (DAC), pp. 395–400 36. Moreano N, Borin E, de Souza C, Araujo G (2005) Efficient datapath merging for partially reconfigurable architectures. IEEE Trans Comput Aided Des Integr Circuits Syst 24(7):969–980 37. Dinh Q, Chen D, Wong MDF (2008) Efficient ASIP design for configurable processors with fine-grained resource sharing. In: Proceedings of the ACM/SIGDA 16th international symposium on FPGA, pp. 99–106 38. Zuluaga M, Topham N (2009) Design-space exploration of resource-sharing solutions for custom instruction set extensions. IEEE Trans Comput Aided Des Integr Circuits Syst 28(12):1788–1801 39. Patterson DA, Hennessy JL (2005) Computer organization and design: the hardware/software interface, the Morgan Kaufmann series in computer architecture and design, 3rd edn. Elsevier Inc., Amsterdam 40. Park I, Powell MD, Vijaykumar TN (2002) Reducing register ports for higher speed and lower energy. In: Proceedings of the 35th annual IEEE/ACM international symposium on microarchitecture, pp. 171–182 41. Cong J, et al (2005) Instruction set extension with shadow registers for configurable processors. In: Proceedings of the FPGA, pp. 99–106 42. Liu H, Jayaseelan R, Mitra T (2006) Exploiting forwarding to improve data bandwidth of instruction-set extensions. In: Proceedings of the design automation conference (DAC), pp. 43–48 43. Chen X, Maskell DL (2007) Supporting multiple-input, multiple-output custom functions in configurable processors. J Syst Architect 53:263–271 44. Salehi ME, Fakhraie SM (2009) Quantitative analysis of packet-processing applications regarding architectural guidelines for network-processing-engine development. J Syst Architect 55:373–386 45. Salehi ME, Fakhraie SM, Yazdanbakhsh A (2012) Instruction set architectural guidelines for embedded packet-processing engines. J Syst Architect 58:112–125 46. The GNU operating system, available online: http://www.gnu.org 47.
Yazdanbakhsh A, Salehi ME, Fakhraie SM (2010) Architecture-aware graph-covering algorithm for custom instruction selection. In: Proceedings of the international conference on future information technology (FutureTech), pp. 1–6 48. Yazdanbakhsh A, Salehi ME, Fakhraie SM (2010) Locality considerations in exploring custom instruction selection algorithms. In: Proceedings of the ASQED 49. Yazdanbakhsh A, Kamal M, Salehi ME, Noori H, Fakhraie SM (2010) Energy-aware design space exploration of registerfile for extensible processors. In: Proceedings of the SAMOS 50. Sakai S, Togasaki M, Yamazaki K (2003) A note on greedy algorithms for the maximum weighted independent set problem. Discrete Appl Math 126:313–322 51. Ramaswamy R, Weng N, Wolf T (2009) Analysis of network processing workloads. J Syst Architect 55(10–12):421–433 52. Biswas P, Atasu K, Choudhary V, Pozzi L, Dutt N, Ienne P (2004) Introduction of local memory elements in instruction set extensions. In: Proceedings of the 41st design automation conference, June 2004, pp. 729–734 53. She D, He Y, Corporaal H (2012) Energy efficient special instruction support in an embedded processor with compact ISA. In: Proceedings of the CASES, pp. 131–140 54. Wu D, Ahn J, Lee I, Choi K (2012) Resource-shared custom instruction generation under performance/area constraints. In: Proceedings of the international symposium on system on chip (SoC), pp. 1–6