
J Supercomput (2014) 68:948–977
DOI 10.1007/s11227-013-1075-8
Customized pipeline and instruction set architecture
for embedded processing engines
Amir Yazdanbakhsh · Mostafa E. Salehi ·
Sied Mehdi Fakhraie
Published online: 6 February 2014
© Springer Science+Business Media New York 2014
Abstract Custom instructions potentially improve the execution speed and code compression of embedded applications. However, more efficient custom instructions require a larger number of simultaneous registerfile accesses, and larger registerfiles are more power hungry and require complex forwarding interconnects. Therefore, due to the limited ports of the base-processor registerfile, the size and efficiency of custom instructions are generally limited. Recent research has focused on overcoming this limitation with innovative architectural techniques supplemented by customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This
paper proposes a customized instruction set and pipeline architecture for an optimized
embedded engine. The proposed architecture increases the performance by enhancing the available registerfile data bandwidth through register access pipelining. The
achieved improvements are made by introducing double-word custom instructions
whose registerfile accesses are overlapped in the pipeline. Potential hazards in such
instructions are resolved by the introduced pipeline backwarding concept, yielding
higher performance and code compression. While we study the effectiveness of the
proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded
application domains.
A. Yazdanbakhsh · M. E. Salehi (B) · S. M. Fakhraie
Nano Electronics Center of Excellence, University of Tehran,
14395-515 Tehran, Islamic Republic of Iran
e-mail: mostafa.salehi@gmail.com; mersali@ut.ac.ir
A. Yazdanbakhsh
e-mail: a.yazdanbakhsh@ece.ut.ac.ir
S. M. Fakhraie
e-mail: fakhraie@ut.ac.ir
Keywords Embedded packet-processing engine · Customized application-specific
processor · Custom instruction generation · Area performance tradeoffs · Custom
instruction data bandwidth
1 Introduction
Higher bandwidth and performance demands, together with the development of new networking services, call for flexible and optimized packet-processing engines. While Moore's Law continues to promise more transistors, critical problems in processor design, including design complexity and power and thermal concerns, have forced the industry to shift its focus away from complex, high-performance, power-hungry single processors [1–4] toward multi-core designs that provide multiple processing cores on a single chip [5–8]. Azizi et al. [9]
show the optimization results under performance and performance per area objectives
for different processor architectures including simple single-issue in-order processors
and complex high-performance out-of-order architectures. In out-of-order processors,
wide issue widths yield high performance improvements exploiting larger and wider
structures in the execution core to process more instructions in parallel. In particular,
these structures include the instruction scheduler, registerfile, and bypass network, which do not scale well and make such designs more complicated. As described in [9], the area
and power overheads of implementing complex out-of-order processors can outweigh
their performance benefits. Therefore, single- and dual-issue in-order designs are shown to be superior in terms of performance per area.
To boost performance, instead of designing ever more complex high-performance single processors to improve instruction-level parallelism within a single thread, architects can design simple processing elements, compensate for their lower individual performance by replicating them across the chip, and improve overall efficiency by focusing on multi-task parallelism [2,10,11]. Recent multi-core systems target the highest throughput within a specified power and area budget. However, they could have more, simpler cores to exploit additional parallelism or fewer, more powerful ones, trading off between high-performance and low-power architectures.
Amdahl’s law in the multi-core era [2,12] suggests that only one core should get
more resources to rapidly perform the sequential part of a code and the others should
be as small as possible to run the scalable parallel parts. That is because replicating the
processing cores yields linear and sub-linear speed-up for the parallel and the sequential parts, respectively [13]. Eyerman and Eeckhout [14] model the critical sections
in Amdahl’s law [12] and show its implications for multi-core designs. Amdahl’s
law suggests many tiny cores for optimum performance in asymmetric processors; however, Eyerman and Eeckhout [14] find that fewer but larger cores can yield substantially better performance, supposing that critical sections are executed on a large
high-performance core. They have modeled the attainable speed-up as a function of
the sizes of the large and tiny cores. The square-root relationship between resources and performance (known as Pollack's law [13]) is also exploited for calculating the multi-core
performance. Their results show that for low contention rates for threads wanting to
enter the same critical section, more tiny cores yield the best speed-up. However, for
higher contention rates, fewer larger cores yield the best performance.
According to this discussion, the architecture of the constituent cores of a multi-core system strictly depends on the parallelism level of the application workload. For highly parallel workloads (e.g., packet-processing tasks for different packet flows), the throughput is highly dependent on the number of cores that can fit on a die; therefore, both the performance and the area of the cores must be considered in evaluating the overall system performance. In these cases, the attainable performance of a multi-core system is related to the product of the performance per core and the number of cores that fit in a given area budget. Thus, in a multi-core system, raw processor performance is no longer the performance metric; performance per square millimeter of each processor is. To account for workloads with lower amounts of parallelism, which are not the focus of this paper, a complex high-performance core is needed in addition [12]. Therefore, the focus of this research
is on increasing the efficiency by customizing a single embedded packet-processing
core with the objective of optimizing performance per area of each core in a multi-core
environment for processing parallel network workloads.
Current embedded systems are expected to deliver high performance under tight
area and power constraints. In today's embedded systems, the conflicting requirements of performance, area, and flexibility make so-called customized general processing cores a proper tradeoff for system designers. Designers carefully explore the target application and adjust the architecture to achieve the best tradeoffs. General-purpose processors alone often cannot profitably adapt to the strict area, performance, and power-consumption requirements of embedded applications. Hence, a common approach in the design of embedded systems is to implement the control-intensive tasks on a general-purpose processor and the computation-intensive tasks on
custom hardware. In this context, general-purpose processors augmented with custom
instructions could provide a good balance between performance and flexibility.
Augmenting a core processor with new instructions has some benefits. First, the
system still remains programmable and supports modest changes to the application.
Second, the computationally intensive portions of the applications in the same domain
often have similar structures. Thus, the customized processors can often be applicable
across a set of applications. A customized processor is composed of a general-purpose base processor augmented with application-specific instruction set extensions (ISEs) that boost performance by accommodating custom instructions (CIs), which have become a vital component in current processor customization flows. A custom instruction encapsulates a group of frequently occurring operation patterns of an application in a single instruction, implemented as a custom functional unit (CFU) embedded within the base processor. CIs improve performance and energy dissipation through parallelization and chaining of an application's hot-spot operations, reducing the number of instructions and consequently the instruction fetches, cache accesses, and instruction decodes, as well as the number of intermediate results stored in the registerfile.
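To make this concrete, the following C sketch shows a hypothetical hot-spot pattern (the operation mix is illustrative, not taken from the benchmarks) and what its encapsulation as a CI buys:

#include <stdint.h>

/* Hypothetical hot-spot pattern: four base-processor instructions
 * (xor, sll, or, add) with three intermediate values that would
 * otherwise occupy registerfile write ports. */
static uint32_t pattern_sw(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t t1 = a ^ b;      /* xor */
    uint32_t t2 = t1 << 4;    /* sll */
    uint32_t t3 = t2 | c;     /* or  */
    return t3 + b;            /* add */
}

/* Encapsulated as a (3, 1) custom instruction, the same computation is
 * one fetch, one decode, and one CFU execution: the chained operations
 * run back to back in hardware and t1..t3 never reach the registerfile. */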
One way of building a good ISE is to combine several operations into one single, complex instruction. Combining such operations becomes feasible only when
a special instruction is allowed to have many operands from the registerfile. Therefore, one major bottleneck in maximizing the performance of custom instructions is
the limitation on the data-bandwidth between the registerfile and the CFU which can
be improved by extending the registerfile to support additional read and write ports.
However, additional ports result in increased registerfile area, power consumption, and
cycle time. This implies that in multi-core platforms the decision on selecting the most appropriate custom instructions should be made carefully. Relaxing the registerfile I/O constraints for specialized custom instructions has a significant impact on the achieved speed-up, at the expense of some area and power overhead [15,16]. Although area is not currently considered a limiting factor in the fabrication of a single processor, performance per area should be optimized in multi-core designs.
It means that with a specific silicon area budget, there would be different architecture
designs with different performance gains that consume the specified area budget. This
motivates the designer to consider different processor design alternatives to gain as
much performance per die area as possible. Increasing the processor area to gain more performance at the expense of reducing the number of processors, or exploiting more but simpler processors, are the two design alternatives that the designer faces in the field of multi-core embedded processor design.
In this paper, we exploit a custom instruction exploration framework that provides
architecture-aware identification of a profitable set of custom instructions. Given the
input/output constraints, available registerfile bandwidth, and also transfer latencies
between CFU and baseline processor, our design flow identifies area, performance,
and code-size tradeoffs. We have also proposed a novel approach that increases the registerfile bandwidth without increasing the size of the registerfile. This is achieved by the developed instruction set architecture and the proposed pipeline backwarding logic, named by analogy with the forwarding concept common in pipeline architectures. The remainder of this paper is organized as follows: in the following section, we review existing research and state the main contributions of this paper.
In Sect. 3, we describe the main stages of the exploited framework. Section 4 describes
the proposed instruction set and the pipeline architecture of the developed customized
processor. Section 5 provides the evaluation method and experimental results for a set
of representative networking benchmarks. Finally, we conclude in Sect. 6.
2 Related work and motivation
Among the variety of designs for customized accelerators, the most common practice
still is the instruction-set extensible processor [17,34]. Customization of a processor through instruction set extensions (ISEs) is now a well-known technique in high-performance embedded systems. Instruction set extension can be divided into instruction identification and instruction selection phases. There is significant research on automatic identification and selection of ISEs to create application-specific processors [18–22]. Given the application code, instruction identification consists of encapsulating some basic operations into larger and more complex operations. These complex
operations are recognized by their representative subgraphs which cover the application graph. Once the subgraphs are identified and extracted from the application
flow graph, these are considered as single powerful instructions and then should pass
through a selection process. Identification and selection processes use guide and cost functions to take into account the constraints that the new instructions have to satisfy.
Partitioning the required operations of an application into base-processor instructions and CIs is done under certain constraints. First, there is a limited area allocated to
CFUs. Second, the data bandwidth between the base processor (including its registerfile) and the CFU is limited, and the data transfer costs have to be explicitly evaluated.
Next, an instruction supports a limited number of encoding bits; therefore, only a limited number of input and output operands can be encoded in a fixed-length instruction
word. Most embedded processors today have instruction words no wider than 32 bits.
Considering that many of them come with 32 or more general purpose registers, it is
not possible to encode many register operands in an instruction word.
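To see the pressure concretely, note that with 32 general-purpose registers each operand index costs $\lceil \log_2 32 \rceil = 5$ bits; assuming a MIPS-style 6-bit opcode, a custom instruction with four sources and two destinations would already require

$$6 + (4 + 2) \times 5 = 36 > 32 \text{ bits},$$

before any function field is accounted for.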
Exhaustive CI identification algorithms are not scalable and generally fail for larger applications under the assumption of unlimited registerfile read and write ports. Most earlier
techniques for ISE generation limit the number of I/O ports of the registerfile to reduce
the set of subgraphs that can be identified. Since the speed-up obtainable by custom
instructions is limited by the available data bandwidth between the base processor and
CFU, the limitation on the number of registerfile accesses from a CFU can influence
the performance of a custom instruction to a great extent. For improving performance, it is desirable to have a large data bandwidth from the registerfile to the CFUs. However, researchers have shown that the area, power, and access delay of the registerfile increase in proportion to the number of registerfile ports [23]. Since most embedded processors are designed under very tight area restrictions, modern embedded processors restrict the number of registerfile I/O ports to save area and power.
The studies in [24] and [25] have presented a technique that uses custom registers, called internal registers (IRs), inside the CFU together with separate move instructions. The move instructions transfer data between the registerfile and the IRs before and after executing a custom instruction. Similarly, the MicroBlaze processor [26] from
Xilinx Inc. provides dedicated Fast Simplex Link (FSL) channels to move operands to
the CFU. The Tensilica Xtensa [27] uses custom state registers for moving additional
input and output operands between the base processor and CFUs. Binding of base
processor registers to custom state registers at compile time reduces the amount of
data transfers. The register clustering method described in [28] exploits the clustering
techniques used in VLIW machines as a promising approach for reducing the registerfile ports in the presence of ISEs. Clustered VLIWs try to minimize the number of
registerfile ports by partitioning the registerfile into several small pieces, and generally
restricting a functional unit to access only one of these clusters [29]. Special instructions are used to move data from one cluster to another. Kim and Mudge [30] developed
two techniques for reducing the number of registerfile ports. The techniques exploit
a pre-fetch method and delayed write-back. The reviewed approaches rely on special move instructions for reading the CFU input operands, and extra move instructions are required to transfer the generated results from the CFU to the general registerfile. We refer to these techniques as the MOV model in the rest of the paper.
Pozzi and Ienne [31] propose I/O serialization or registerfile pipelining approach to
reduce custom instruction I/O constraints. They propose a custom instruction identification algorithm that serializes CFU operand reads and result writes that exceed the
registerfile I/O constraints. Their method also schedules the registerfile accesses and
minimizes the CFU latency by pipelining and multi-cycle registerfile access. Though
the mentioned method improves the registerfile data bandwidth by serialization, it does not address the architectural implementation concerns and the related problems, such as encoding multiple operands in a limited fixed-length instruction format, and it neglects the pipeline architecture and data-hazard design implications. Atasu et al. [32] have presented an algorithm to solve the limited registerfile data bandwidth problem with relaxed I/O constraints. However, for each additional I/O access they calculate a constant penalty on hardware latency. This leads to an inaccurate calculation of the speed-up, because some I/O accesses can be overlapped with the execution stage. In another work, Atasu et al. [33] introduce a design
flow to identify area, performance, and code-size tradeoffs studying the effect of I/O
constraints, registerfile ports, and compiler transformations on the application efficiency. They show that in most cases, highest performance is achieved when the I/O
constraints are relaxed. Bonzini and Pozzi [17] formalized the problem of recurrence-aware instruction-set extension. They presented a framework to explore the tradeoffs
between the achieved speed-up and the CI size. Using this tradeoff, they can obtain
the optimal solutions through balancing between the benefit of higher-gain large CIs
that are used less often and smaller CIs that are found many times in the applications.
However, these studies neither consider the effect of large registerfiles on the processor area nor the limited instruction encoding bits available for accommodating the custom instruction operands. Another approach to address the scalability
concern for exhaustive ISE generation for larger applications and registerfiles with a
large number of read and write ports is proposed in [34]. This approach considerably
prunes the list of ISE candidates and hence, runs faster.
Adding custom instructions to customized processors would increase their performance. However, die area and energy efficiency are as important as performance in embedded systems. The researchers in [35–38] merge a collection of
sub-graph datapaths for resource sharing during synthesis of ISE to increase the
reusability of custom instructions and reduce the imposed customized processor
die area. These techniques explore the design space of customized processors to
support a wider set of ISEs and increase the number of custom instructions that
can be identified within a given area budget. However, the overall area of the customized processor is not affected by custom instructions alone. Results in [15] and [16] show that a major percentage of the power and area in a processor datapath is consumed by the registerfile, so reducing the registerfile's area and power consumption has a great impact on the overall die area of the customized processor. Therefore, maximizing the area savings is strictly dependent on the registerfile area, which is not considered quantitatively in the above studies.
Figure 1 depicts the area overhead of different registerfile I/Os relative to the area of
a (2, 1) registerfile (i.e., two read ports and one write port) in the MIPS processor, evaluated by hardware synthesis with a 90 nm CMOS standard-cell library. The selected MIPS processor has a 32-bit, single-issue, in-order, 5-stage pipeline architecture as introduced in [39]. The figure reveals that each additional read (write) port imposes almost a 20 % (10 %) area overhead. It is also observed from the synthesis results (as shown in Fig. 2) that the specified (2, 1) registerfile occupies almost 45.7 % of the base MIPS processor core area.

Fig. 1 Area overhead of increasing the number of read and write ports of registerfile relative to the (2, 1) registerfile. The registerfiles are synthesized with a 90 nm CMOS library

Fig. 2 MIPS layout using a 90 nm CMOS library
Other researchers in this area, such as Park et al. [40], have proposed two different techniques for reducing the read and write ports of registerfiles: a bypass-hint technique and register banking. The introduction of shadow registers [41] and the use of the base processor's forwarding paths [42] can also improve the data bandwidth.
While such methods reduce the hardware cost of multi-port registerfiles, they make
instruction scheduling and register allocation extremely complex for the compiler. A
new architecture is introduced in [43] that is based on the MIPS processor and supports custom instructions with multiple read and multiple write operands. It uses the method proposed in [31] to relax the constraints on the number of registerfile read/write ports by pipelining the read/write accesses, thereby preserving the size of the base registerfile. The work in [54] also proposes an approach for custom instruction generation that considers area and performance at the same time. The identified CIs in [54] are
pipelined, and extra registers are inserted into the data-path to serialize the registerfile accesses. This approach supports CIs that require more operands than the base processor supports. These designs impose no additional hardware cost, at the expense of losing some cycles due to the pipelined registerfile accesses.
Though the studied approaches improve the registerfile data bandwidth, they do not discuss implementation concerns and do not address related problems such as encoding multiple operands in a limited fixed-length instruction format. The fixed-length instruction encoding of RISC processors limits the space for encoding extra opcodes and operands for new custom instructions. Supporting arbitrary custom operations with sufficient operands in a generic base processor incurs large overheads, because of the extra operand and opcode bits of custom instructions; these lead to wider instructions and hence more energy consumption in instruction fetch. The above-mentioned studies do not discuss the issue of encoding extra operands in the limited instruction format. For example, the work based on shadow register files must encode either the shadow register identifiers or the architectural register identifiers for the input operands of the custom instruction.
She et al. [53] tackle the problem of integrating custom instructions in an embedded
processor and reduce the requirement for operand encoding and registerfile ports by
exploiting a software-controlled bypass network. Their method relies on a compiler
backend for generating and scheduling special instructions. The generated instructions of [53] are limited to patterns of just two operations, and a look-up table is exploited for storing the control signals of only eight custom instructions. The table is visible to the software and can be reconfigured for new custom instructions.
Another important challenge in designing customized processors is the pipeline architecture and its hazard design implications, which are also neglected in the studied research. Data hazards in the pipeline are resolved by employing data forwarding. For a multi-operand custom instruction, these hazards can occur on any of the input operands. For multi-cycle register reads or shadow registers, it is not clearly stated how data hazards are resolved for the additional operands. Our work addresses both of these important issues. In addition, our method avoids complex forwarding and omits temporary registers for custom instructions, whether internal registers, local registerfiles, or shadow registers, using pipeline registers instead.
Towards designing our high-performance embedded packet-processing engines, we have utilized the MIPS base instruction set, enhanced it with new architectural solutions, and applied the concept of pipeline backwarding to improve the registerfile bandwidth without increasing the size of the registerfile. We have studied both MIPS and ARM architectures for packet-processing applications in [44] and [45]. The key point of our research is the simultaneous optimization of the search algorithm for finding larger custom instructions while keeping the number of I/O branches of the CI subgraphs at an application-based optimum value. In summary, the design objectives of
the proposed architecture are as follows: (1) we increase the data bandwidth between
the registerfile and the CFU without increasing the registerfile area. (2) We ensure
that the proposed custom instructions do not modify the base processor registerfile
and keep its base instruction set applicable. (3) The proposed pipeline backwarding
concept is implemented with a simple hardware and does not considerably affect the
pipeline architecture of the base processor. (4) The proposed instruction
set architecture does not complicate the base processor compiler. (5) The proposed instruction set can accommodate a large number of custom instructions while leading to more compact code in comparison to traditional methods.
3 Architecture-aware processor customization
In this section, we briefly introduce our instruction set extension framework. Our
framework is composed of custom instruction identification and selection considering
the architectural constraints.
3.1 Custom instruction identification framework
The complete flow of our proposed framework to extend instruction set of the base
processor for an extensible processor design is depicted in Fig. 3. As shown, our
framework consists of two main parts: (1) custom instruction identification, (2) custom instruction selection. The process of instruction set identification initiates with
extracting the data flow graph (DFG) from the application. DFG is a directed graph
G = (V, E) in which V is the set of finite nodes labeled with the basic operations of
the processor and E is a finite set of edges that shows the data dependencies among
these operations. In this part, the VEX [29] configurable compiler and simulation tool
is utilized to extract DFGs from applications that are written in C/C++.
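As a minimal illustration, one possible in-memory form of such a DFG is sketched below in C; the type and field names are ours, not those of the VEX toolchain or of our framework.

#include <stddef.h>

/* Illustrative DFG representation: G = (V, E) with nodes labeled by
 * base-processor operations and edges recording data dependencies. */
enum op_kind { OP_ADD, OP_SUB, OP_AND, OP_OR, OP_XOR, OP_SLL, OP_LOAD, OP_STORE };

struct dfg_node {
    enum op_kind op;   /* basic operation label                  */
    int num_preds;     /* number of incoming dependence edges    */
    int preds[4];      /* ids of producer nodes (the edges in E) */
};

struct dfg {
    struct dfg_node *nodes;  /* the finite node set V */
    size_t num_nodes;
};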
In addition, to obtain statistics on the real behavior of the desired applications, the code coverage and the iteration frequency of all basic blocks are calculated during dynamic execution of the domain-specific benchmarks by gprof and gcov in conjunction with gcc [46]. gprof is a profiler that reports the percentage of execution time spent in each function of an application. gcov acts as a coverage program that reports how many times each line of code and each basic block is executed.
The aim of the custom instruction identification framework is to find a set of sub-graphs of the DFG that can be added to the basic instruction set in hardware. Exhaustively extracting all possible sub-graphs from a DFG with n nodes is not computationally feasible, especially as the number of nodes increases, so an algorithmic generation method is proposed in [21]. We have used this algorithm to limit the search space and to generate valid sub-graphs based on the following three constraints plus an extra fourth constraint imposed by our modified algorithm:
1. The number of inputs/outputs of each sub-graph must not exceed the maximum allowed custom function input/output ports.
2. Memory operations, e.g., Load/Store, are considered forbidden nodes and have to be excluded from the generated sub-graphs.
3. All of the generated sub-graphs should be convex. This is a legality check to ensure that when all the sub-graph nodes are merged into one custom instruction, all the input operands are available at the beginning and all the outputs are produced at the end of the execution cycle of the generated custom instruction (a sketch of this check follows the list).
4. The hardware latency of a candidate sub-graph's critical path is compared and normalized to the base processor critical path.

Fig. 3 The proposed custom instruction identification framework
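The convexity check of constraint 3 can be sketched as follows; this is a minimal C reconstruction under assumed data structures (successor lists, fixed array bounds), not the framework's actual code. A candidate is convex iff no value that leaves the sub-graph flows back into it.

#include <stdbool.h>

#define MAX_NODES 256

static int num_succs[MAX_NODES];   /* successor counts, filled from the DFG */
static int succs[MAX_NODES][4];    /* successor ids                         */

/* DFS: does external node v transitively feed a node inside the candidate? */
static bool reenters(int v, const bool *in_sg, bool *seen)
{
    if (in_sg[v]) return true;
    if (seen[v])  return false;
    seen[v] = true;
    for (int i = 0; i < num_succs[v]; i++)
        if (reenters(succs[v][i], in_sg, seen))
            return true;
    return false;
}

/* Convex iff every edge leaving the candidate sub-graph never leads back in. */
static bool is_convex(int n_nodes, const bool *in_sg)
{
    for (int v = 0; v < n_nodes; v++) {
        if (!in_sg[v]) continue;
        for (int i = 0; i < num_succs[v]; i++) {
            int u = succs[v][i];
            if (in_sg[u]) continue;
            bool seen[MAX_NODES] = { false };
            if (reenters(u, in_sg, seen))
                return false;
        }
    }
    return true;
}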
The proposed algorithm in [21] is modified to extract all possible valid sub-graphs and preserve them in a structure that is used later in the custom instruction selection part. Next, all of the identified matches are classified into isomorphic classes, called template classes, exploiting the algorithm introduced in our previous work [47]. For evaluating the performance of each template, the number of cycles saved by means of custom instructions is calculated. The performance of each template is computed from the actual hardware latency of each operation, obtained after hardware synthesis (as calculated in our previous
work [49]), by the following formula:
$$\mathrm{Template_{performance}} = \sum_{\substack{\text{matches in}\\ \text{each template}}} \mathrm{Iteration} \times \left(\mathrm{SW_{Latency}} - \mathrm{HW_{Latency}}\right)$$
SWLatency represents the number of clock cycles it takes to run the match entirely in software, and HWLatency is the number of cycles its execution takes when implemented as a custom instruction in hardware. The term HWLatency
is rounded up to the nearest integer number to represent the number of execution clock
cycles in the hardware. Iteration denotes the number of times that the specified custom
instruction is executed during dynamic execution of the main application.
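In code, the per-template computation reduces to the following C sketch (structure and names are illustrative), with HWLatency rounded up as just described:

#include <math.h>

struct match {
    long   iterations;   /* dynamic execution count of the match    */
    long   sw_latency;   /* software cycles for the whole match     */
    double hw_latency;   /* CFU critical-path delay, in base cycles */
};

/* Cycles saved by a template, summed over all of its matches. */
static long template_performance(const struct match *m, int n_matches)
{
    long perf = 0;
    for (int i = 0; i < n_matches; i++)
        perf += m[i].iterations *
                (m[i].sw_latency - (long)ceil(m[i].hw_latency));
    return perf;
}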
Finally, the valid matches are enumerated by the exact algorithm [21] and the conflict graph Gc = (V, E) is built. In the conflict graph, V is the set of generated matches, called conflict nodes, and E is the set of edges indicating the conflicts between matches. An edge exists between two conflict nodes if they have at least one common operation in their representative subgraphs. The template selection part
consists of two sub-parts as introduced in our previous work [48]: (1) locally find the
maximum-weighted independent set (MWIS) in each template; (2) globally select the
MWIS for all of the templates. Due to the fact that the number of matches in each
template is limited, we use the exact MWIS for the first part. For the second part, the
GWMIN algorithm proposed in [50] is utilized to find the set of independent templates that yield the best performance. When a template is selected by this algorithm,
a label is assigned to it, and all of the adjacent templates (i.e., the templates that have a
common node with the selected one) are labeled as deleted. The algorithm iteratively
searches among all of the unselected and undeleted templates until all of the templates
are labeled selected or deleted. In each iteration, the algorithm selects template u that maximizes the associated merit introduced by the following formula:

$$\mathrm{merit}(u) = \frac{W(u)}{d_{G_i}(u) + 1}$$

where W(u) is the weight of the template (i.e., template performance) and d_{G_i}(u) is the degree of the template (i.e., number of edges originating from the template). The following code represents the algorithm in pseudo C notation.
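A sketch of that loop, with illustrative identifiers, follows.

/* Sketch of the GWMIN-based selection loop described above; the
 * structure layout and identifiers are illustrative, not the original. */
enum label { UNLABELED, SELECTED, DELETED };

struct template_t {
    double     weight;    /* W(u): template performance       */
    int        degree;    /* d_Gi(u): conflict-graph degree   */
    int        num_adj;
    const int *adj;       /* indices of conflicting templates */
    enum label label;
};

/* Repeatedly pick the unlabeled template maximizing W(u)/(d(u)+1),
 * mark it SELECTED, and mark its conflict-graph neighbors DELETED,
 * until every template is labeled. */
static void gwmin_select(struct template_t *t, int n)
{
    for (;;) {
        int best = -1;
        double best_merit = -1.0;
        for (int u = 0; u < n; u++) {
            if (t[u].label != UNLABELED) continue;
            double merit = t[u].weight / (t[u].degree + 1);
            if (merit > best_merit) { best_merit = merit; best = u; }
        }
        if (best < 0) break;   /* all templates selected or deleted */
        t[best].label = SELECTED;
        for (int i = 0; i < t[best].num_adj; i++)
            t[t[best].adj[i]].label = DELETED;
    }
}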
Fig. 4 Performance improvement of the selected applications versus different input/output constraints for custom instructions
The next part in the framework flow estimates the application performance improvement when the base processor is augmented with the selected templates, using the following formula:

$$\mathrm{Application_{Performance}} = \sum_{\text{selected templates}} \mathrm{Template_{Performance}}$$
Application performance stands for the number of cycles that will be saved with
the help of custom instruction implementation. The packet-processing applications
which were selected from PacketBench [51] include IPv4 look-up (i.e., IPv4-trie and IPv4-radix), flow classification (Flow-class), and encryption and message-digest
algorithms (i.e., IPSec and MD5) that are widely used in network processing nodes.
The representative benchmark applications are completely introduced and analyzed
in our previous work described in [44] and [45]. The proposed algorithm is applied
to the selected packet-processing benchmark applications for different input/output
(CIin, CIout) constraints for custom instructions. The performance improvement is
shown in Fig. 4. As shown, the performance gain is up to 40 %. Due to the high
rate of memory operations in some benchmarks (e.g., Flow-Class and IPv4-Radix), these applications cannot gain much performance from the proposed simple custom instructions; they could be further improved with complex custom instructions that include memory accesses, such as those proposed in [52].
It is also noticeable that for (CIin, CIout) higher than (4, 4), the performance is not
considerably improved.
3.2 I/O aware custom instruction selection
The bandwidth of the registerfile is a limitation in improving the performance of
customized processors. We extend our framework to analyze the effects of changing
the number of registerfile read/write ports on the performance of the selected custom
instruction. In [31], a solution is proposed for resolving the limitation on the number of registerfile read/write ports in customized processors. In this method, simultaneous read/write accesses that exceed the number of available read/write ports are serialized with pipeline registers. Pipeline registers make it possible to perform extra read/write operations on the base registerfile over multiple cycles. Although this architecture has little area overhead, applications lose some performance due to the extra cycles imposed by the multi-cycle reads/writes. In these architectures, the performance degradation of each template due to the multi-cycle registerfile reads/writes is evaluated. The following equation is exploited to calculate the template performance considering the multi-cycle reads/writes imposed by register pipelining:
$$\mathrm{Template_{Performance}} = \sum_{\substack{\text{matches in}\\ \text{each template}}} \mathrm{Iteration} \times \left(\mathrm{SW_{Latency}} - \mathrm{HW_{Latency}} - \left\lceil \frac{CI_{in} - RF_{in}}{RF_{in}} \right\rceil - \left\lceil \frac{CI_{out} - RF_{out}}{RF_{out}} \right\rceil\right)$$
SWLatency : Software latency of template which is calculated as the accumulated
number of the operations in the candidate subgraphs.
HWLatency : Hardware latency of the template which is calculated as the sum of the
operation delays through the critical path of the template.
CIin : Number of custom instruction input operands.
CIout : Number of custom instruction output operands.
RFin : Number of read ports of registerfile.
RFout : Number of write ports of registerfile.
When the number of CI input/output operands is greater than the available registerfile read/write ports, registerfile pipelining is used for accessing the registerfile. Hence, the terms $-\lceil (CI_{in} - RF_{in})/RF_{in} \rceil$ and $-\lceil (CI_{out} - RF_{out})/RF_{out} \rceil$ denote the numbers of cycles
that will be missed due to the pipelining overhead resulting from the limited number
of read and write ports of the registerfile. The evaluated performance is assigned to
each template, and then the remaining template selection process follows the same procedure as the previously described algorithm.
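The penalty terms translate directly into code; the following C sketch (names are ours) applies the equation above with the divisions rounded up to whole cycles:

#include <math.h>

/* Extra registerfile cycles when a CI needs more ports than exist. */
static long extra_cycles(long ci_ports, long rf_ports)
{
    if (ci_ports <= rf_ports)
        return 0;
    return (ci_ports - rf_ports + rf_ports - 1) / rf_ports;  /* ceiling */
}

/* I/O-aware cycles saved per template match, times its iteration count. */
static long io_aware_performance(long iterations, long sw_latency,
                                 double hw_latency,
                                 long ci_in, long ci_out,
                                 long rf_in, long rf_out)
{
    long saved = sw_latency - (long)ceil(hw_latency)
               - extra_cycles(ci_in, rf_in)
               - extra_cycles(ci_out, rf_out);
    return iterations * saved;
}

For example, a (4, 2) CI on a (2, 1) registerfile loses one extra read cycle and one extra write cycle per execution.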
I/O-aware custom instruction selection is applied to packet-processing benchmarks
and their performance improvements are shown in Fig. 5. As shown, when the custom
instructions are generated based on (CIin , CIout ) = (3, 2) and the registerfile constraint
is (3, 2), the performance improvement is higher than when (CIin , CIout ) = (6, 3) and
(RFin , RFout ) = (3, 2). Therefore, both of (CIin , CIout ) and (RFin , RFout ) should
be considered for achieving the best performance improvement. Another important
observation is that performance improvement is almost saturated in (CIin , CIout )
= (6, 3) and (RFin , RFout ) = (6, 3) and the performance improvement of this point
is on average about 5 % higher than that of (CIin, CIout) = (3, 2) and (RFin, RFout) = (3, 2). Note, however, that the former configuration achieves this improvement with a greater area overhead compared to the latter.

Fig. 5 Performance improvement of selected applications versus different registerfile and custom instruction constraints
3.3 Analyzing area overhead
The area overhead is another factor that should be analyzed in custom instruction selection. As mentioned in the previous part, designers have two solutions for the architecture of extensible processors: (1) increasing the registerfile bandwidth, or (2) pipelining the registerfile. The first strategy yields higher performance at the expense of the area overhead of a larger registerfile. The second has no area overhead, at the expense of some performance degradation due to register pipelining. The performance per area of different custom instructions on different registerfiles is shown in Fig. 6, where performance-per-area values are normalized to that of MIPS. The selected MIPS processor has a 32-bit, single-issue, in-order, 5-stage pipeline architecture as introduced in [39]. As shown in this figure, since the area overhead of increasing the number of registerfile read/write ports can significantly increase the processor area, we propose the architecture with the (2, 1) registerfile as the optimum in terms of performance per area. Among registerfiles that have the same performance per area, the ones that lead to higher performance or lower area are selected according to the area budget or performance requirements.
Fig. 6 Performance per area for different benchmark applications
3.4 Analyzing long latency custom instructions
Another point that should be considered in the design of customized processors is the execution of long-latency custom instructions (those with a critical path longer than the base processor cycle). We applied our modified identification algorithm to select long-latency custom instructions for the representative benchmark
applications. The results are shown in Fig. 7. Increasing the number of allowed
clock cycles improves the performance by up to 5 % at the expense of controller complexity. So, we conclude that allowing long-latency custom instructions is not profitable in terms of performance improvement versus hardware complexity. The performance of long-latency instructions can be improved with the methods proposed in [31] and [34]; however, this is not the focus of this paper, and we have used our simple method and omitted the long-latency instructions for the sake of simplicity.

Fig. 7 Performance improvements of long latency custom instruction for different benchmark applications
4 Architecture model
Our proposed architecture exploits “backwarding” logic in the processor pipeline to
resolve the hazards of the additional operands per custom instruction. In addition,
our architecture imposes minimal modifications to the regular instruction encoding
for accommodating the additional operands of custom instructions. In this section,
we describe these modifications in the processor pipeline and the instruction encoding. We assume a RISC-style in-order pipelined processor architecture with extensibility
features. For illustration purposes, we use a simple single-issue in-order MIPS-like
5-stage pipeline. However, our technique can be easily applied to other pipeline archi-
tectures. The proposed architecture supports multiple input multiple output (MIMO)
custom instructions considering the area overhead of the augmented custom instructions and also the imposed area overhead of the registerfile. As shown in the previous
part, the performance of the extensible processor can be improved when relaxing the
input/output constraints in custom instruction generation. However, the area overhead
of the registerfiles with relaxed number of read/write ports can significantly reduce
the achieved performance per area. Chen et al. have proposed an architecture for MIMO custom instructions in [43]. We first analyze this model and describe its shortcomings in terms of performance. Next, we propose an architecture that overcomes these
shortcomings.
The MIMO extension (M2E) architecture model [43], which is in the MOV category (defined previously), provides two custom instruction formats to handle the desired
custom instructions. The first one is capable of executing up to 32 custom instructions which have at most two input operands and a single output operand. This format is designed to
support some small tasks that can be handled by a single instruction. The shift amount
and the function field of the standard MIPS R-type instruction format (i.e., the last ten
bits of the R-type instruction) are exploited to specify the fields of custom instructions
and custom function units. The other instruction format provides the possibility to
compound a bundle of instructions for executing complex functions on M2E. It is
worth mentioning that executing such complex custom instructions on this machine
needs at least two extra instructions, one to read operands from registerfile and one
to get results from the CFU. The M2E model exploits this approach to provide up to 16
CFUs each supporting up to 16 custom functions. This instruction format offers up
to 256 complex custom functions that have multiple inputs and multiple outputs. The
sequence of instructions that are generated for executing a complex custom function
should be arranged in the proper order. It means that all of the instructions that put
the operands in the targeted CFU should be executed before the instructions which
provide the results.
As stated in the previous sections, due to the high area overhead of increasing the
number of read and write ports of the registerfile, our proposed architecture is based
on a registerfile with dual read ports and a single write port. A modified architecture for the
traditional registerfile pipelining is used to support the multiple input and multiple
output custom functions. In traditional register pipelining, preparing the operands of a custom function that, for example, needs four inputs and produces two outputs consumes two clock cycles for reading the operands from the registerfile, and two clock cycles are required for writing the results back into the registerfile. Thus, for each additional I/O access a constant penalty is imposed. This leads to a sub-optimal speed-up, because some I/O accesses could be done in parallel with the other pipeline stages. We resolve this problem with the proposed instruction set and pipeline architecture.
As demonstrated in Fig. 5, custom instructions with four input and two output operands [i.e., (4, 2) custom instructions] have a performance gain close to that of custom instructions with relaxed input and output ports. These results motivate us to restrict the custom function inputs and outputs to (4, 2). Based on this observation, we have proposed two distinct instruction formats for executing custom instructions on the base processor architecture. The first one, as shown in Fig. 8a, is a single 32-bit instruction that can be used for custom instructions where the total number of operands is less than
or equal to four [i.e., (2, 1), (2, 2), and (3, 1) custom instructions].

Fig. 8 The proposed instruction formats

Table 1 Classification of custom functions based on the custom instruction inputs and outputs

Number of inputs   Number of outputs   Type
2                  1                   Single
2                  2                   Single
3                  1                   Single
3                  2                   Double
4                  1                   Double
4                  2                   Double

Opcodes: 010000, 010001, 010010, 010011, 011100, 011101, 011110, 011111, 110011, 111011 (the ten unused MIPS opcodes, shared between the single and double groups)

The other custom
instructions as shown in Fig. 8b have more than four operands and are decomposed into
two consecutive single instructions. In the proposed double-word custom instructions,
the first instruction reads up to two operands and writes one of the results and the
second one is overlapped with the first one in the pipeline and can read the other two
operands and write the other output operand. Our encoding ensures that the encoding
of base processor instructions is not affected. We illustrate our encoding based on
the instruction format of MIPS processor. However, the general idea is applicable to
any RISC-style instruction format.
The proposed instruction formats are shown in Table 1 and Fig. 8. As shown in
Table 1, the custom functions are classified into six different groups based on the
number of inputs and outputs. The custom functions which have (2, 1), (2, 2), (3,
1) input and output ports fit into a single 32-bit instruction, but other configurations
need two consecutive instructions for execution. According to this classification, two
distinct instruction formats are developed for custom instructions: one belongs to the group that needs only one instruction for execution, and the other belongs to the group that needs two consecutive instructions. As we do not want to affect the
encoding of normal instructions, all the information about the operands of the custom instructions is encoded in the unused parts of the MIPS instruction format.
As shown in Fig. 8a, the joint opcode and function fields specify the custom function that should be executed on the CFU; Rs1, Rs2, and Rs3 represent the indices of the source operands in the registerfile, and Rd1 and Rd2 the destination operands. Instruction bits 20 to 16 can be the index of a source or destination operand depending on the
type of the instruction. Figure 8b demonstrates the situation where two consecutive
instructions should be executed on the processor for custom instructions which belong
to the second group.
The opcode and function fields in the first instruction just specify that three or four operands should be read from the registerfile, and the custom field in the second instruction specifies the custom function which provides the results. Rd1 and Rd2 are
used for the destination operands. As demonstrated in [39], there are ten unused
opcodes in the MIPS R-type instructions that can be used for both custom instruction
groups. The classification of these unused opcodes is shown in Table 1. With these ten unused opcodes, the total number of custom instructions that can be added to the
base instruction set is calculated with the following formula:

$$(10-n)\times 2^{\mathrm{bitwidth(Function)}} + n\times 2^{\mathrm{bitwidth(Custom)}+\mathrm{bitwidth(Function)}} = (10-n)\times 2^{6} + n\times 2^{11} \quad (0 < n < 10)$$

To meet the designer's demands, and depending on the required number of custom instructions in each group, the number of custom instructions varies from 2,624 to 18,496 for different values of n.
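The count is easy to verify; the helper below is a sketch assuming the 6-bit function and 5-bit custom field widths implied by the formula:

/* Custom instructions encodable when n of the ten unused MIPS opcodes
 * go to the double-word group and 10-n to the single group:
 * (10-n)*2^6 + n*2^11. */
static long num_custom_instructions(int n)   /* 0 < n < 10 */
{
    return (10L - n) * (1L << 6) + (long)n * (1L << 11);
}

/* num_custom_instructions(1) == 2624 and num_custom_instructions(9) == 18496,
 * matching the range quoted in the text. */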
Splitting the complex custom instructions in the mentioned manner into two consecutive 32-bit instructions leads to a new problem. If these two consecutive instructions
are executed normally on the processor pipeline, the results cannot be written into the
registerfile correctly. Figure 9 shows the pipeline execution of a custom instruction
from group 2. The first instruction reads the first and second source operand in ID1
stage (the second clock cycle). In the third clock cycle, the instruction is executed and
meanwhile the third and fourth source operands are read by the second instruction.
This means that the first instruction uses invalid values for the third and fourth source
operands. To eliminate this problem, we propose a special logic called backwarding
which is added to the base processor. Figure 9 demonstrates the use of backwarding
in our proposed architecture. The backwarding module carries the appropriate result backward from the output of the execute stage (EX) of the second instruction to the write-back stage (WB) of the first one. The calculated result is thus passed backward to the previous instruction in the pipeline and written while the first instruction is in its WB stage. Therefore, the first instruction's EX stage performs a useless operation, because its calculated results are overwritten by the backwarding logic. That is why this stage is darkened in the pipeline timing diagram in Fig. 9.
Fig. 9 Pipeline execution of a custom instruction from group 2 and the backwarding concept

Fig. 10 Datapath of the proposed single-issue 5-stage pipeline

Fig. 11 Read and write controller flow graph: a read controller, b write controller

The data-path of the proposed architecture is shown in Fig. 10. Two modules, called read controller and write controller, are added to the base processor pipeline
to control and schedule read/write accesses to the registerfile. The state diagrams of
these controllers are shown in Fig. 11. The read controller schedules the values driving Rs1, Rs2, Rs3, and Rs4 for different instructions. Normally, Rs1 and Rs2 are fed as
the input operands for normal instructions or custom instructions that have up to two
read operands. For instructions that have more than two source operands, the read
controller directs the other source operands to Rs3 and Rs4.
The write controller schedules the writes into the registerfile. According to the previous explanation of the instruction formats and the proposed backwarding concept, the write controller must select the destination value among three sources: Rd1 for normal instructions or single-word custom instructions that have one destination operand, and backwarding and Rd2 for the first and second destination values, respectively, of multi-output custom instructions.
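A behavioral sketch of this selection in C (not RTL; the signal names are invented) is:

/* Write-back source selection, as described above: Rd1 for normal or
 * single-output instructions; for multi-output CIs, the first result
 * arrives through backwarding (WB1) and the second through Rd2 (WB2). */
enum wb_source { SRC_RD1, SRC_BACKWARD, SRC_RD2 };

static enum wb_source select_wb_source(int is_custom, int num_outputs,
                                       int is_second_wb_cycle)
{
    if (!is_custom || num_outputs == 1)
        return SRC_RD1;
    return is_second_wb_cycle ? SRC_RD2 : SRC_BACKWARD;
}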
To handle the hazards that may arise in the pipeline stages during the execution of two consecutive instructions, all of the states for each combination of the previously
classified instructions are shown in Figs. 12 and 13. The pipeline stages are similar to
the five-stage pipeline of MIPS processor, except that decode and write-back stages
may be modified for reading more than two operands and writing two results in the
registerfile, respectively. Since our architecture is based on a (2, 1) registerfile, the
instruction decoding stage (ID) is extended into ID1 and ID2 for instructions that need
three or four input operands. In the same manner, the write-back stage (WB) is extended
into WB1 and WB2 for instructions that have more than one destination operand. The stages colored with a gray pattern indicate that nothing useful is done in that stage (i.e., a NOP). Checking for probable structural hazards requires analyzing the execution of two consecutive instructions.

Fig. 12 Instruction sequences which start with a single instruction

Fig. 13 Instruction sequences which start with a double instruction
Table 2 Possible hazards between two consecutive instructions and the affected pipeline stage

First Ins.   Second Ins.:
             (2, 1)      (2, 2)      (3, 1)      (3, 2)      (4, 1)      (4, 2)
(2, 1)       No / –      No / –      No / –      No / –      No / –      No / –
(2, 2)       Yes / WB2   Yes / WB2   No / –      Yes / WB2   Yes / WB2   Yes / WB2
(3, 1)       Yes / ID2   Yes / ID2   Yes / ID2   Yes / ID2   Yes / ID2   Yes / ID2
(3, 2)       No / –      No / –      No / –      No / –      No / –      No / –
(4, 1)       No / –      No / –      No / –      No / –      No / –      No / –
(4, 2)       No / –      No / –      No / –      No / –      No / –      No / –

Each cell gives whether a hazard occurs (Yes/No) and the stalled pipeline stage, if any.
Figure 12 shows the states of the pipeline for the custom instructions that are implemented with two consecutive single instructions or a single instruction followed by a double instruction. The first and second instructions are separated by a dashed line. The cloud symbol in the pipeline denotes a stall or bubble that may be required during execution to eliminate hazards. According to Fig. 12 part (a), in which the first instruction is a (2, 1) instruction, all of the combinations are hazard free no matter what the second instruction is. Figure 12 part (b) demonstrates the situation where the first instruction is (2, 2) and the following instruction is (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), or (4, 2). In this case, a stall is required in the write-back stage in most of the sequences. Upon detection of this kind of hazard, all of the pipeline stages are stalled. As shown in Fig. 12 part (c), all of the stalls occur in the decode stage.
All of the instruction sequences that start with a double-word instruction are shown in Fig. 13. As shown, these sequences never need to stall the pipeline. All of the hazards are summarized in Table 2, which demonstrates the hazards that may happen between two consecutive instructions in the pipeline stages during execution. Another important note is that we execute the double-word instructions as atomic instructions in the pipeline; therefore, interrupts, exceptions, or context switching cannot issue any operation between these two instructions.
5 Evaluation and results
In this section, we evaluate our architecture model and compare it to the MOV model.
The worst conflict between two consecutive instructions occurs when every fetched
instruction has four inputs and two outputs as operands. Figure 14 depicts the states
of the pipeline stages during the execution of such instructions on our proposed architecture and the MOV model.

Fig. 14 Pipeline stages for executing consecutive custom instructions: a MOV architecture, b proposed architecture

Our model needs six cycles to execute the first (4, 2) instruction, and thereafter another (4, 2) instruction can be committed every two clock cycles. However, the MOV model needs eight clock cycles for executing the first custom
instruction (including 4 single instructions) and after that, for every four clock cycles
an additional instruction can be committed. In the MOV model, the instruction labeled (a) puts the first two operands into the internal registers; the instruction labeled (b) puts the other two operands into two other registers; (c) gets the first result from the internal latches after execution of the desired custom function; and finally (d) gets the second result from the internal register. In other words, the minimum numbers of cycles needed to execute n instructions of type (4, 2) on our proposed and MOV architectures are calculated from the following formulas:
$$\mathrm{Total\ execution\ cycle_{proposed}} = 6 + 2\times(n-1)$$
$$\mathrm{Total\ execution\ cycle_{MOV}} = 8 + 4\times(n-1)$$
It is obvious that as the number of instructions increases, our proposed architecture can execute the custom instructions about two times faster than the MOV model, considering that all of the normal data-dependency hazards are resolved by the traditional forwarding logic in the MIPS hardware. The following formula compares the two
models in the limiting case:
$$\lim_{n\to\infty} \frac{\mathrm{Total\ execution\ cycle_{proposed}}}{\mathrm{Total\ execution\ cycle_{MOV}}} = \lim_{n\to\infty} \frac{6 + 2\times(n-1)}{8 + 4\times(n-1)} = \frac{1}{2}$$
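The cycle-count formulas and their limiting ratio can be checked with a few lines of C:

#include <stdio.h>

static long cycles_proposed(long n) { return 6 + 2 * (n - 1); }
static long cycles_mov(long n)      { return 8 + 4 * (n - 1); }

int main(void)
{
    /* The ratio falls from 0.75 at n = 1 toward 1/2 as n grows. */
    for (long n = 1; n <= 10000; n *= 10)
        printf("n=%-6ld proposed=%-7ld MOV=%-7ld ratio=%.3f\n",
               n, cycles_proposed(n), cycles_mov(n),
               (double)cycles_proposed(n) / (double)cycles_mov(n));
    return 0;
}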
Another metric that is important in comparing the proposed architecture with MOV is the code compression rate. In the MOV model, each (4, 2) instruction needs two instructions to put the appropriate operands into the internal registers and two instructions for executing the desired function and getting the results from the internal registers. However, in our proposed model, only two instructions are needed for a (4, 2) custom instruction. This means that the instruction memory usage of our proposed model is more efficient than that of MOV. The following formula is used to evaluate the code compression for both architecture models after custom instruction generation:
$$\mathrm{Code\ Compression} = 1 - \frac{\text{total number of instructions with CIs}}{\text{total number of instructions without any CI}}$$

Table 3 Number of generated instructions for each custom instruction

(input, output)   MOV   Proposed architecture
(2, 1)            1     1
(2, 2)            3     1
(3, 1)            3     1
(3, 2)            4     2
(4, 1)            3     2
(4, 2)            4     2

Fig. 15 Percentage of code compression for selected applications
This metric gives the percentage of instructions eliminated with the help of custom
instructions (CIs). Table 3 lists the number of instructions generated for each custom
instruction on the proposed and MOV architectures for different numbers of input and
output operands. In our proposed architecture, only one or two instructions are
generated for each custom function, whereas in the MOV model up to four instructions
are generated to execute a custom instruction on the processor. The code compression
achieved for selected packet-processing applications is summarized in Fig. 15.
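To make the metric concrete, the following Python sketch computes the code compression for a hypothetical application profile. The per-custom-instruction counts are those of Table 3, while the subgraph sizes and instruction mix are invented purely for illustration:

```python
# Instructions emitted per custom instruction, keyed by (inputs, outputs).
# These counts come from Table 3; the application profile below is made up.
GENERATED = {
    (2, 1): {"mov": 1, "proposed": 1},
    (2, 2): {"mov": 3, "proposed": 1},
    (3, 1): {"mov": 3, "proposed": 1},
    (3, 2): {"mov": 4, "proposed": 2},
    (4, 1): {"mov": 3, "proposed": 2},
    (4, 2): {"mov": 4, "proposed": 2},
}

def code_compression(subgraphs, model, other_instructions):
    """Code Compression = 1 - (instructions with CIs) / (instructions without CIs).

    `subgraphs` is a list of (io_signature, original_size) pairs, one per
    covered subgraph; `other_instructions` counts the uncovered rest."""
    without_cis = other_instructions + sum(size for _, size in subgraphs)
    with_cis = other_instructions + sum(
        GENERATED[sig][model] for sig, _ in subgraphs)
    return 1.0 - with_cis / without_cis

# Hypothetical profile: three covered subgraphs plus 40 ordinary instructions.
profile = [((4, 2), 7), ((3, 1), 5), ((2, 2), 4)]
for model in ("proposed", "mov"):
    print(model, f"{code_compression(profile, model, 40):.1%}")
```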
We compare the performance improvement of each benchmark on three different
architectures. The architecture called "ideal" is assumed to provide a registerfile
with four read ports and two write ports, together with an instruction format that has
enough space to encode all six operands in a single instruction; it therefore yields the
maximum performance improvement attainable with custom instructions. The second
architecture is based on our proposed instruction set and the proposed idea of data
"backwarding". The last architecture, called MOV, is based on the typical register
pipelining architecture, where custom "move" instructions are required to transfer
data between the registerfile and the internal registers of the CFU.
Fig. 16 Performance improvements of the proposed and MOV architectures normalized to the ideal performance improvement
Figure 16 shows the performance improvement obtained by our proposed architecture
and by the variant in which custom "move" instructions are inserted (i.e., MOV). The
improvement of the proposed architecture is closer to the ideal performance limit.
Thus, our technique overcomes the limitation on the number of registerfile ports with
negligible effects on instruction encoding, pipeline architecture, and performance.
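The normalization used in Fig. 16 can be stated compactly. The sketch below shows how per-benchmark improvements would be normalized to the ideal limit; all cycle counts and benchmark names are illustrative placeholders, not measurements from this paper:

```python
# Normalizing per-benchmark speedups to the ideal architecture's speedup.
# The benchmark names echo the packet-processing domain; every cycle count
# here is an invented placeholder used only to demonstrate the arithmetic.
baseline = {"ipv4_forward": 1_000_000, "classify": 800_000}
with_cis = {   # cycles after custom instruction generation, per architecture
    "ipv4_forward": {"ideal": 600_000, "proposed": 640_000, "mov": 790_000},
    "classify":     {"ideal": 500_000, "proposed": 530_000, "mov": 660_000},
}

for bench, cycles in with_cis.items():
    ideal_gain = baseline[bench] / cycles["ideal"] - 1.0
    for arch in ("proposed", "mov"):
        gain = baseline[bench] / cycles[arch] - 1.0
        # Fraction of the ideal improvement this architecture achieves.
        print(f"{bench:>12} {arch:>8}: {gain / ideal_gain:.2f} of ideal")
```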
6 Conclusion
We have proposed a custom instruction set and pipeline architecture for customizing
a single embedded packet-processing core, with the objective of optimizing the
performance per area of each core for processing parallel network workloads. The
focus of this research was on increasing application efficiency by effectively handling
larger subgraphs of an application while simultaneously overcoming the limitation on
the number of registerfile ports. The developed architecture accommodates custom
instructions with more register operands, without increasing the registerfile I/O ports,
by defining double-word custom instructions that overlap their numerous registerfile
accesses in the pipeline. The potential hazards are resolved through the proposed
pipeline backwarding concept, which improves the execution speed of custom
instructions in the processor's pipeline. In comparison with previous work, our
proposed architecture allows the definition of many more custom instructions by
accommodating both single- and double-word custom instructions with enhanced
code compression. The MIPS-based design in our case study utilizes unused
instruction opcodes and fields, so that the proposed custom instructions and pipeline
architecture neither significantly affect the base instruction set and pipeline nor
complicate the base processor compiler.
Acknowledgments The authors acknowledge the partial support received for this work under contract
149811/140 from Microelectronic Committee-Research Administration of the University of Tehran.
References
1. Swanson S, Putnam A, Mercaldi M, Michelson K, Petersen A, Schwerin A, Oskin M, Eggers SJ (2006)
Area-performance trade-offs in tiled dataflow architectures. In: Proceedings of the 33rd international
symposium on computer architecture (ISCA’06), pp. 314–326
2. Nickolls J, Dally WJ (2010) The GPU computing era. IEEE Micro 30(2):56–59
3. Lee SJ (2010) A 345 mW heterogeneous many-core processor with an intelligent inference engine for
robust object recognition. In: Proceedings of the IEEE international solid-state circuits conference,
pp. 332–334
4. Bell S, et al (2008) TILE64™ processor: a 64-core SoC with mesh interconnect. In: Proceedings
of the IEEE international solid-state circuits conference, pp. 88–90
5. Jotwani R, et al (2010) An x86-64 core implemented in 32 nm SOI CMOS. In: Proceedings of the
IEEE international solid-state circuits conference, pp. 106–107
6. Howard J, et al (2010) A 48-Core IA-32 message-passing processor with DVFS in 45 nm CMOS. In:
Proceedings of the IEEE international solid-state circuits conference, pp. 108–110
7. Shin JL, et al (2010) A 40 nm 16-core 128-thread CMT SPARC SoC processor. In: Proceedings of
the IEEE international solid-state circuits conference, pp. 98–99
8. Johnson C, et al (2010) A wire-speed POWER™ processor: 2.3 GHz, 45 nm SOI with 16 cores and
64 threads. In: Proceedings of the IEEE international solid-state circuits conference, pp. 104–106
9. Azizi O, Mahesri A, Lee BC, Patel SJ, Horowitz M (2010) Energy-performance tradeoffs in processor
architecture and circuit design: a marginal cost analysis. In: Proceedings of the 37th international
symposium on computer architecture (ISCA’10), pp. 26–36
10. Kapre N, DeHon A (2009) Performance comparison of single-precision SPICE model-evaluation on
FPGA, GPU, Cell, and multi-core processors. In: Proceedings of the international conference on field
programmable logic and applications, pp. 65–72
11. Truong DN et al (2009) A 167-processor computational platform in 65 nm CMOS. IEEE J Solid State
Circuits 44(4):1130–1144
12. Hill MD, Marty MR (2008) Amdahl’s law in the multicore era. IEEE Comput 41(7):33–38
13. Borkar S (2007) Thousand core chips—a technology perspective. In: Proceedings of the design automation conference (DAC), pp. 746–749
14. Eyerman S, Eeckhout L (2010) Modeling critical sections in Amdahl’s Law and its implications
for multicore design. In: Proceedings of the 37th international symposium on computer architecture
(ISCA’10), pp. 362–370
15. Park S, Shrivastava A, Dutt N, Nicolau A, Paek Y, Earlie E (2008) Register file power reduction using
bypass sensitive compiler. IEEE Trans Comput Aided Des Integr Circuits Syst 27(6):1155–1159
16. Nalluri R, Garg R, Panda PR (2007) Customization of register file banking architecture for low power.
In: Proceedings of the 20th international conference on VLSI design (VLSID’07), pp. 239–244
17. Bonzini P, Pozzi L (2008) Recurrence-aware instruction set selection for extensible embedded processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 16(10):1259–1267
18. Atasu K, Pozzi L, Ienne P (2003) Automatic application-specific instruction-set extensions under
microarchitectural constraints. In: Proceedings of the design automation conference (DAC), pp. 256–
261
19. Clark N, Zhong H, Mahlke S (2003) Processor acceleration through automated instruction set customization. In: Proceedings of the 36th annual IEEE/ACM international symposium on microarchitecture (MICRO), pp. 129–140
20. Yu P, Mitra T (2004) Scalable custom instructions identification for instruction-set extensible processors. In: Proceedings of the CASES, pp. 69–78
21. Pozzi L, Atasu K, Ienne P (2006) Exact and approximate algorithms for the extension of embedded
processor instruction sets. IEEE Trans Comput Aided Des Integr Circuits Syst 25:1209–1229
22. Chen X, Maskell DL, Sun Y (2007) Fast identification of custom instructions for extensible processors.
IEEE Trans Comput Aided Des Integr Circuits Syst 26(2):359–368
23. Zyuban VV, Kogge PM (1998) The energy complexity of register files. In: Proceedings of the international symposium on low power electronic design, pp. 305–310
24. Leupers R, Karuri K, Kraemer S, Pandey M (2006) A design flow for configurable embedded processors
based on optimized instruction set extension synthesis. In: Proceedings of the design, automation &
test in Europe (DATE)
25. Altera Corp., Nios processor reference handbook
26. Xilinx Inc., MicroBlaze soft processor core
27. Gonzalez RE (2000) XTENSA: a configurable and extensible processor. IEEE Micro 20:60–70
28. Karuri K, Chattopadhyay A, Hohenauer M, Leupers R, Ascheid G, Meyr H (2007) Increasing databandwidth to instruction-set extensions through register clustering. In: Proceedings of the international
conference on computer aided design, pp. 166–177
29. Fisher JA, Faraboschi P, Young C (2005) Embedded computing: a VLIW approach to architecture,
compilers and tools. Elsevier, Amsterdam
30. Kim NS, Mudge T (2003) Reducing register ports using delayed write-back queues and operand prefetch. In: Proceedings of the 17th annual international conference on Supercomputing, pp. 172–182
31. Pozzi L, Ienne P (2005) Exploiting pipelining to relax register-file port constraints of instruction set
extensions. In: Proceedings of the international conference on compilers, architectures and synthesis
for embedded systems, pp. 2–10
32. Atasu K, Dimond R, Mencer O, Luk W, Özturan C, Dündar G (2007) Optimizing instruction-set
extensible processors under data bandwidth constraints. In: Proceedings of the design, automation and
test in Europe (DATE), pp. 588–593
33. Atasu K, Ozturan C, Dundar G, Mencer O, Luk W (2008) CHIPS: custom hardware instruction
processor synthesis. IEEE Trans Comput Aided Des Integr Circuits Syst 27(3):528–541
34. Verma AK, Brisk P, Ienne P (2010) Fast, nearly optimal ISE identification with I/O
serialization through maximal clique enumeration. IEEE Trans Comput Aided Des Integr Circuits
Syst 29(3):341–354
35. Brisk P, Kaplan A, Sarrafzadeh M (2004) Area-efficient instruction set synthesis for reconfigurable
system-on-chip designs. In: Proceedings of the design automation conference (DAC), pp. 395–400
36. Moreano N, Borin E, de Souza C, Araujo G (2005) Efficient datapath merging for partially reconfigurable architectures. IEEE Trans Comput Aided Des Integr Circuits Syst 24(7):969–980
37. Dinh Q, Chen D, Wong MDF (2008) Efficient ASIP design for configurable processors with fine-grained resource sharing. In: Proceedings of the ACM/SIGDA 16th international symposium on FPGA,
pp. 99–106
38. Zuluaga M, Topham N (2009) Design-space exploration of resource-sharing solutions for custom
instruction set extensions. IEEE Trans Comput Aided Des Integr Circuits Syst 28(12):1788–1801
39. Patterson DA, Hennessy JL (2005) Computer organization and design: the hardware/software interface,
Morgan Kaufmann series in computer architecture and design, 3rd edn. Elsevier, Amsterdam
40. Park I, Powell MD, Vijaykumar TN (2002) Reducing register ports for higher speed and lower energy. In:
Proceedings of the 35th annual IEEE/ACM international symposium on microarchitecture, pp. 171–
182
41. Cong J, et al (2005) Instruction set extension with shadow registers for configurable processors. In:
Proceedings of the FPGA, pp. 99–106
42. Liu H, Jayaseelan R, Mitra T (2006) Exploiting forwarding to improve data bandwidth of instruction-set
extensions. In: Proceedings of the design automation conference (DAC), pp. 43–48
43. Chen X, Maskell DL (2007) Supporting multiple-input, multiple-output custom functions in configurable processors. J Syst Architect 53:263–271
44. Salehi ME, Fakhraie SM (2009) Quantitative analysis of packet-processing applications regarding
architectural guidelines for network-processing-engine development. J Syst Architect 55:373–386
45. Salehi ME, Fakhraie SM, Yazdanbakhsh A (2012) Instruction set architectural guidelines for embedded
packet-processing engines. J Syst Architect 58:112–125
46. The GNU operating system, available online: http://www.gnu.org
47. Yazdanbakhsh A, Salehi ME, Fakhraie SM (2010) Architecture-aware graph-covering algorithm for
custom instruction selection. In: Proceedings of the international conference on future information
technology (FutureTech), pp. 1–6
48. Yazdanbakhsh A, Salehi ME, Fakhraie SM (2010) Locality considerations in exploring custom instruction selection algorithms. In: Proceedings of the ASQED
49. Yazdanbakhsh A, Kamal M, Salehi ME, Noori H, Fakhraie SM (2010) Energy-aware design space
exploration of registerfile for extensible processors. In: Proceedings of the SAMOS
50. Sakai S, Togasaki M, Yamazaki K (2003) A note on greedy algorithms for the maximum weighted
independent set problem. Discret Appl Math 126:313–322
51. Ramaswamy R, Weng N, Wolf T (2009) Analysis of network processing workloads. J Syst Architect
55(10–12):421–433
52. Biswas P, Atasu K, Choudhary V, Pozzi L, Dutt N, Ienne P (2004) Introduction of local memory
elements in instruction set extensions. In: Proceedings of the 41st design automation conference, June
2004, pp. 729–734
53. She D, He Y, Corporaal H (2012) Energy efficient special instruction support in an embedded processor
with compact ISA. In: Proceedings of the CASES, pp. 131–140
54. Wu D, Ahn J, Lee I, Choi K (2012) Resource-shared custom instruction generation under performance/area constraints. In: Proceedings of the international symposium on system on chip (SoC), pp. 1–6