Journal of Systems Architecture 58 (2012) 112–125
Instruction set architectural guidelines for embedded packet-processing engines
Mostafa E. Salehi *, Sied Mehdi Fakhraie, Amir Yazdanbakhsh
Nano Electronics Center of Excellence, School of Electrical and Computer Engineering, Faculty of Engineering, University of Tehran, Tehran 14395-515, Iran
ARTICLE INFO
Article history:
Received 7 September 2009
Received in revised form 16 January 2012
Accepted 25 February 2012
Available online 5 March 2012
Keywords:
Packet-processing engine
Application-specific processor
Benchmark profiling
Architectural guideline
ABSTRACT
This paper presents instruction set architectural guidelines for improving general-purpose embedded processors to optimally accommodate packet-processing applications. Like other embedded processors such as media processors, packet-processing engines are deployed in embedded applications, where cost and power are as important as performance. In this domain, the growing demand for higher bandwidth and performance, together with the ongoing development of new networking protocols and applications, calls for flexible power- and performance-optimized engines.

The instruction set architectural guidelines are extracted from an exhaustive simulation-based, profile-driven quantitative analysis of different packet-processing workloads on 32-bit versions of two well-known general-purpose processors, ARM and MIPS. This extensive study has revealed the main performance challenges and tradeoffs in developing an evolution path for such general-purpose processors toward optimum accommodation of packet-processing functions in future switching-intensive applications. The architectural guidelines cover types of instructions, branch offset size, displacement and immediate addressing modes for memory access along with the effective size of these fields, data types of memory operations, and new branch instructions.

The effectiveness of the proposed guidelines is evaluated with the development of a retargetable compilation and simulation framework. Developing the HDL model of the optimized base processor for networking applications and using a logic synthesis tool, we show that improved area, power, delay, and performance-per-watt measures are achieved.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction
High-performance and flexible network processors are expected to meet user demands for improved networking services and packet-processing tasks at different line speeds. With the ever-increasing demand for higher bandwidth, the performance bottleneck of networks has shifted to the processing elements (PEs), and consequently there has been a tremendous effort in speeding up these modules. Traditional PEs are based either on custom hardware blocks or on general-purpose processors (GPPs). Custom ASIC designs have better performance but higher manufacturing costs and lower flexibility; GPPs are more flexible but are not speed-, power-, and area-optimized for networking applications. Given the varied performance requirements of network workloads, there is plenty of work on the design of network processor architectures and instruction sets. Some designs exploit the flexibility of GPPs and use as many GPPs as required to satisfy the performance requirements. For example, Niemann et al. [1] exploit a massively parallel processing structure of simple processing
* Corresponding author.
E-mail addresses: mersali@ut.ac.ir, mostafa.salehi@gmail.com (M.E. Salehi),
fakhraie@ut.ac.ir (S.M. Fakhraie), a.yazdanbakhsh@ece.ut.ac.ir (A. Yazdanbakhsh).
1383-7621/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
doi:10.1016/j.sysarc.2012.02.004
elements; due to its regularity, the architecture can be scaled to accommodate various performance and throughput requirements. As an alternative to employing a large number of simple GPPs, Vlachos et al. [2] introduce a high-performance packet-processing engine (PPE) with a three-stage pipeline consisting of three special-purpose processors (SPPs). The proposed SPPs are microprogrammed processors optimized for header field extraction, header field modification, packet verification, and bit and byte processing, and leave only some generic software execution to the central processing core.
SPPs are also used in many commercial network processors (NPs) to improve packet-processing performance. Sixteen programmable processing units are used in the Motorola C-5 [3] in a parallel configuration. IBM PowerNP [4] introduces a multi-processor NP architecture with an embedded processor complex (EPC). The EPC has a PowerPC core and 16 programmable protocol processors. Intel IXP1250 [5] uses six micro-engines (MEs). Each ME is a 32-bit RISC processor that performs the majority of the network-processing tasks such as packet header inspection and modification, classification, routing, metering, etc. The ME instruction set is a mix of conventional RISC instructions with additional features specifically tailored for network processing. Another NP, called FlexPathNP [6,7], exploits the diverse processing requirements of packet flows and sends the
packets with relatively simple processing requirements directly to the traffic-manager unit. With this technique, the computing capacity of the CPU cluster is used optimally and the processing performance is increased. A cache-based network processor architecture is proposed in [8] that has a special process-learning cache mechanism to memorize every packet-processing activity, with all table-lookup results, for packets that share the same information, such as the same pair of source and destination addresses. The memorized results are then applied to subsequent packets that have the same information in their headers.
Most of the previously introduced architectures and instruction sets are based on a refined version of a well-known architecture and instruction set. To have a reproducible analysis, we focus on typical NP workloads and benchmarks and exploit a powerful simulator and profiler [9] to obtain useful details of network-processing benchmarks on two well-known GPPs. The results indicate the performance bottlenecks of representative packet-processing applications when GPPs are used as the sole processing engine. We then use these results, together with a quantitative study of different network applications, to extract helpful architectural guidelines for designing an optimized instruction set for packet-processing engines.
To keep up with the demands of increasing performance and evolving network applications, programmable network-specific PEs should support application changes and meet their heavy processing workloads. Therefore, considering the flexibility needed for short time to market, it is necessary to build on existing application-development environments and users' background in using general-purpose processors (GPPs). On the other hand, means should be provided for catching up with the increasing demand for higher performance at affordable power and area. In this paper, we provide a solution for finding the minimum required instructions of the two most frequently used GPPs for packet-processing applications. In addition, a retargetable compilation and simulation framework is developed, based upon which the proposed instruction set architectural guidelines are evaluated and compared to the base architectures.
The proposed guidelines address a wide variety of design alternatives available to instruction set architects, including memory addressing, addressing modes, type and size of operands, operands for packet processing, operations for packet processing, and control-flow instructions, and also propose special-purpose instructions for packet-processing applications. These guidelines demonstrate what future general-purpose processors need in order to respond, in a power- and speed-optimal way, to the growing number of embedded applications with switching demands. Based on the introduced architectural guidelines, an embedded packet-processing engine useful for a wide range of packet-processing applications can be developed and used in massively parallel processing architectures for cost-sensitive, demanding embedded network applications. The proposed guidelines would also be applicable to the processing nodes of embedded applications that are responsible for packet-processing tasks among others.
2. Analysis of packet-processing applications
Hennessy and Patterson [10] present a quantitative analysis of instruction set architectures aimed at processors for desktops, servers, and embedded media processors, and introduce a wide variety of design alternatives available to instruction set architects. In this paper we present comparative results for the development of embedded engines customized for packet-processing
tasks in different network applications. According to the IETF (Internet Engineering Task Force), operations of network applications can be functionally categorized into data-plane and control-plane functions [11]. The data plane performs packet-processing tasks such as packet fragmentation, encapsulation, editing, classification, forwarding, lookup, and encryption, while the control plane performs congestion control, flow management, signaling, handling of higher-level protocols, and other control tasks. There is a large variety of NP applications containing a wide range of different data-plane and control-plane processing tasks. To properly evaluate network-specific processing tasks, it is necessary to specify a workload that is typical of that environment.
CommBench [12] is composed of eight data-plane programs, categorized into four packet-header processing and four packet-payload processing tasks. In a similar work, NetBench [13] contains nine applications that are representative of commercial applications for network processors and cover small low-level code fragments as well as large application-level programs. CommBench
and NetBench both introduce data-plane applications. NpBench [14] targets both control-plane and data-plane workloads.
A tool called PacketBench is presented in [15], which provides a framework for implementing network-processing applications and extracting workload characteristics. For statistics collection, PacketBench reports a number of micro-architectural and networking-related metrics for four different networking applications, ranging from simple packet forwarding to complex packet-payload encryption. The profiling results of PacketBench are obtained from ARM-based SimpleScalar [16]. The Embedded Microprocessor Benchmarking Consortium (EEMBC) [17] has also developed a networking benchmark suite to reflect the performance of client and server systems (TCPmark), and of functions mostly carried out in infrastructure equipment (IPmark). The IPmark is intended for developers of infrastructure equipment, while the TCPmark, which includes the TCP benchmark, focuses on client- and server-based network hardware.
Considering representative benchmark applications for header and payload processing for the IPv4 protocol, we have presented a simulation-based, profile-driven quantitative analysis of packet-processing applications. The selected applications are IPv4-radix and IPv4-trie as lookup and forwarding algorithms, a packet-classification algorithm called Flow-Class, and Internet Protocol Security (IPSec) and Message-Digest algorithm 5 (MD5) as payload-processing applications. To develop efficient network-processing engines, it is important to have a detailed understanding of the associated workloads. PacketBench provides a framework for developing network applications and extracting a set of workload characteristics on the ARM-based SimpleScalar [16] simulator. To obtain an architecture-independent analysis, we have identified and modified the PacketBench profiling capabilities added to ARM-based SimpleScalar, and have also developed a compound simulation and profiling environment for MIPS-based SimpleScalar, yielding a MIPS-based profiling platform. MIPS-based SimpleScalar, augmented with the PacketBench profiling capabilities, reproduces the PacketBench profiling observations on the MIPS processor. The measurements, which are indicative of network applications, reveal the performance challenges of the different programs. The presented measurements are also dynamic, in that the frequency of a measured event is weighted by the number of times the event occurs during execution of the application.
We present the experimental results of profiling representative network applications on 32-bit versions of ARM and MIPS processors, using ARM9 and MIPS R3000 as examples. The ARM and MIPS families are two widely used processors in network-processor products: the Intel IXP series of network processors uses StrongARM processors based on the ARM architecture [18], and Broadcom BCM products [19] have used MIPS processors in communication-processor products. The comparative results of both ARM and MIPS platforms then yield architectural guidelines for developing application-specific processing engines for network applications. To have a realistic analysis, we use packet traces of real networks. An excellent repository of traces collected at several monitoring points is maintained by the National Laboratory for Applied Network Research (NLANR/PMA) [20]. We have selected many traces from this repository as our input packet traces, and the average values of the results are presented here. For each application, the extracted properties are the frequencies of load, store, and branch instructions, instruction distribution, instruction patterns, frequent instruction sequences, offset size of branches, rate of taken branches, execution cycles, and performance bottlenecks.
2.1. Execution time analysis
We use the execution time analysis to evaluate and compare the performance of the MIPS and ARM processors when running each of the selected applications. To present comparative results, we observe the reproducibility principle in reporting performance measurements, such that another researcher could duplicate the results on different platforms. The execution time of an application is calculated according to the following formula [21], in terms of Instruction Count (IC), Clocks Per Instruction (CPI), and Clock Period (CP):

Execution time of application = IC × CPI × CP
In our previous work [22], we employed this reproducible analysis for packet-processing applications in terms of the number of clock cycles required for processing a packet (IC × CPI), thereby capturing the processor's architectural dependencies. IC × CPI is part of the formula above and the remaining parameter is CP; therefore, knowing the clock frequency of a processor, the reported number of clock cycles of an application directly yields the application's execution time. This makes different architectures universally comparable when running the target applications. To find the instruction count and clock count of packet-processing tasks and to enable a comparative analysis, we have profiled the benchmark applications on both the ARM- and MIPS-based simulators. The results in Table 1 show the number of instructions and the clock cycles required for processing a packet in each of the specified applications, using both the MIPS- and ARM-based SimpleScalar environments.
As shown in Table 1, the computational properties of the selected applications vary when they are executed on different processors. In contrast with the simple instruction set of MIPS, ARM has a more powerful instruction set. In the ARM processor, each operation can be performed conditionally according to the results of previous instructions [23]. Furthermore, ARM supports complex instructions that perform a shift as well as an arithmetic or logic operation in a single instruction. These instructions can lead to a lower instruction count in loops and in code with complicated logic/arithmetic operations. However, these complex instructions complicate the pipeline architecture and may reduce the ARM clock frequency, which in turn can lead to longer execution time and lower performance. A smart compiler can take advantage of the complex ARM instructions and produce code optimized for a lower instruction count. When compiling an application with the ARM compiler, general instructions are used wherever the complex instructions are not applicable, so the number of instructions in the generated code is expected to be less than or equal to that of the code compiled with the MIPS compiler. According to the results of compiling the selected applications with the ARM and MIPS compilers, payload-processing applications have about 18% fewer instructions when compiled with the ARM compiler. However, for header-processing applications the ARM results are worse than those of MIPS. Both of the selected cross compilers are based on gcc 2.95.2. With the same cross compiler, instruction counts of the header-processing applications on ARM are 10–80% higher than on MIPS (Table 1). Multiplying the number of clock cycles for running an application on a processor by the clock period of a specific implementation of that processor gives the total execution (elapsed) time, which makes the results universally comparable among different implementations of various processors.
2.2. The role of compiler
As shown in Table 1, the instruction count of IPV4-radix is 80% higher when compiled with the ARM cross compiler. The instruction count of IPV4-radix, broken down by its constituent functions, is summarized in Table 2. As shown in this table, the instruction counts of the validate_packet and inet_ntoa functions are similar on both processors, but the instruction count of lookup_tree on ARM is about six times that on MIPS.
Table 3 shows the instruction counts of the sub-functions of lookup_tree. The instruction count of the inet_aton function on ARM is about 10 times that on MIPS. This is due to the strtoul function, whose code is compiled much less optimally by the ARM compiler than by the MIPS cross compiler. This observation shows the effect of the cross compiler on the number of generated instructions even when the same gcc version is used.
To reveal the effect of compiler version on the instruction count of the compiled application codes and to compare different compilers, we obtained results with another version of gcc for the ARM cross compiler, calculating the instruction counts of the representative applications with both the 2.95.2 and 3.4.3 versions. According to the results (Table 4), despite the 80% difference in the IPV4-radix results between the ARM and MIPS 2.95.2 cross compilers, compiling IPV4-radix with the MIPS gcc 2.95.2 and ARM gcc 3.4.3 cross compilers yields almost equal instruction counts; for the other applications, different versions of the MIPS and ARM cross compilers have negligible effects on the instruction count of the compiled application codes. Therefore, from now on we use the optimum compiler results of each processor for further comparisons in this paper.
2.3. Instruction set operations
The operations supported by most instruction set architectures are categorized in Table 5 [10]. All processors must have some instruction support for basic system functions and generally provide a full set of the first three categories. The amount of support for the remaining categories may vary from none to an extensive set of special instructions. Floating-point instructions will be provided
Table 1
Computational complexity of the packet-processing applications based on TSH [20] traces.

Application    ARM [22]                              MIPS
               # of instructions  # of clock cycles  # of instructions  # of clock cycles
IPV4-radix     4205               5092               2376               3630
IPV4-trie      206                494                186                398
Flow-Class     152                340                113                274
IPSec          100998             113108             123394             227330
MD5            8911               14202              11043              17570
Table 2
The number of instructions for different functions of the IPV4-radix when compiled with ARM and MIPS gcc2.95 cross compilers.

Function          ARM    MIPS
validate_packet   115    96
inet_ntoa         1510   1630
lookup_tree       2354   397
Total             4096   2160
Table 3
Instruction count of the lookup_tree function based on the gcc2.95.2 ARM compiler.

Function    ARM    MIPS
bzero       22     18
inet_aton   2168   213
rn_match    137    144
Table 4
Instruction count comparison of the representative applications with ARM gcc 2.95.2 and gcc 3.4.3.

Application: IPV4-radix

Processor           ARM     ARM     MIPS
gcc version         2.95    3.4     2.95
# of instructions   4164    2358    2376
# of clock cycles   5044    2927    3630
Table 5
Categories of instruction operators [10].

Operation type           Examples
Arithmetic and logical   Integer arithmetic and logical operations: add, subtract, and, or, multiply, divide
Data transfer            Loads/stores (move instructions on computers with memory addressing)
Control                  Branch, jump, procedure call and return, traps
System                   Operating system call, virtual memory management instructions
Floating point           Floating point operations: add, multiply, divide, compare
Decimal                  Decimal add, decimal multiply, decimal-to-character conversions
String                   String move, string compare, string search
Graphics                 Pixel and vertex operations, compression/decompression operations
in any processor that is intended to be used in an application that makes much use of floating point. Decimal and string instructions are sometimes primitives, or may be synthesized by the compiler from simpler instructions. Based on five SPECint92 integer programs, it is shown in [24] that the most widely executed instructions are simple operations of an instruction set such as "load", "store", "add", "subtract", "and", register-register "move", and "shift", which account for 96% of the instructions executed on the popular Intel 8086. Hence, the architect should be sure to make these common cases fast. Multiplies and multiply-accumulates are added to this simple set of primitives for DSP applications.
Usage patterns of the top 15 most frequently used ARM instructions in the packet-processing applications are presented in Table 6. According to this table, the most frequent instructions fall into the category of primitive instructions, covering the operations of the first three categories in Table 5. Thus, we have divided the instruction set into three main categories (memory and logic/arithmetic instructions, control-flow instructions, and special-purpose instructions) and compare the profiling results of representative sample codes for both ARM and MIPS processors in the following sections.
3. Quantitative analysis of network-specific instructions
To analyze the results and extract architectural guidelines for the instruction set of an optimized packet-processing engine, we have divided the instruction set into three main categories and compared the profiling results of representative sample codes for both ARM and MIPS processors. We have also considered the effects of the compiler on the generated code. The first category is memory instructions. Before analyzing memory instructions, we must define how memory addresses are interpreted and how they are specified. Addressing modes can significantly reduce the instruction count of an application; they also add to the complexity of the hardware and may increase the average CPI of the processor. Therefore, the usage of the different addressing modes is quite important in helping the architect choose what to support. The old VAX architecture has the richest set of addressing modes, with 10 different modes and the fewest restrictions on memory addressing. Ref. [10] presents the results of measuring addressing-mode usage patterns in three benchmark programs on the VAX architecture and concludes that immediate and displacement modes dominate memory-addressing-mode usage. As network applications migrate toward larger programs and hence rely on compilers, addressing modes must match the abilities of the compilers developed for embedded processors. As packet-processing applications head toward relying on compiled code, we expect increasing emphasis on simpler addressing modes. Therefore, unlike Ref. [10], we do not profile network applications on the VAX, and we select the displacement and immediate addressing modes for network applications. Other addressing modes such as register indirect, indexed, direct, memory indirect, and scaled can easily be synthesized with displacement mode.
3.1. Memory and logic/arithmetic instructions
To show the effect of different memory and logic/arithmetic instructions on the instruction count of the generated code, Table 7 presents two sample codes compiled with both the ARM and MIPS cross compilers. The selected codes are representative of two distinct categories: high memory access and excessive logic/arithmetic operations. According to the results, the former has a lower instruction count on MIPS (eight instructions compared to 13 on ARM), while the latter is better optimized when compiled with ARM (eight instructions compared to 10 on MIPS). The reason is that MIPS supports byte ("lb", "sb"), 2-byte ("lh", "sh"), and 4-byte ("lw", "sw") loads and stores, so 8-bit, 16-bit, and 32-bit data types are read from or written to memory with a single instruction. ARM, however, only supports byte ("ldrb", "strb") and 4-byte ("ldr", "str") memory accesses. Therefore, 2-byte loads/stores must be emulated with a sequence of "ldrb", "strb", arithmetic, and logic instructions on ARM, as shown in the assembly code of the Fibonacci example; this is why that code has more instructions when compiled with the ARM cross compiler. Moreover, the conditional and combined arithmetic/logic instructions of ARM lead to lower instruction counts for the shift-and-add multiplier code, which needs more logic/arithmetic instructions than the Fibonacci code; in this case, the MIPS code has more instructions.
According to this observation, and given that variables such as the checksum, IP packet type, and source/destination port numbers are all 16-bit values, 2-byte loads/stores can considerably affect the instruction count of packet-processing applications. The distribution of 2-byte loads/stores and of logic and arithmetic instructions for the selected applications, based on the MIPS compiler, is shown in Table 8. As shown in this table, the usage of 2-byte loads/stores in IPV4-lctrie and Flow-Class is higher
Table 6
Usage pattern of the top 15 frequently-used instructions based on ARM.

Instruction  IPV4-radix (%)  IPV4-trie (%)  Flow-Class (%)  IPSec (%)  MD5 (%)  Average (%)
Ldr          8.7             7.7            28.4            33.6       6.0      16.9
Add          2.6             13.9           9.9             0.5        33.7     12.1
Mov          10.8            12.9           12.1            16.3       2.4      10.9
Cmp          18.5            10.8           9.1             0.1        7.9      9.3
Orr          0.6             7.2            1.2             14.7       6.9      6.1
Ldrb         3.7             12.9           7.8             0.4        5.6      6.1
And          0.4             4.6            0.0             16.5       2.7      4.8
Sub          2.8             7.7            1.8             0.3        8.9      4.3
Str          3.8             0.0            10.5            0.6        2.8      3.5
Strb         0.6             1.5            5.4             0.4        6.6      2.9
Eor          0.0             0.0            0.0             10.0       3.4      2.7
Bne          3.9             2.1            6.0             0.1        0.2      2.5
Bcc          2.1             0.0            0.0             0.0        7.6      1.9
Beq          4.0             3.6            1.8             0.0        0.1      1.9
Bic          0.0             0.0            1.2             3.6        2.0      1.4
than the other applications. Since the ARM compiler does not compile 2-byte loads/stores as optimally as the MIPS compiler does, the instruction counts of such memory-intensive applications are higher when compiled with ARM. However, as shown in Table 8, logic and shift operations are more frequent in the MD5 and IPSec applications, which are therefore good candidates for annotation with ARM combined instructions to produce code with fewer instructions.
Another important measurement for instruction set design is the size of the displacement field in memory instructions and the size of the immediate value in instructions in which one of the operands is an immediate. Since these sizes affect the instruction length, a decision should be made to choose the optimum size for these fields. Based on the representative network benchmark measurements, we expect the size of the displacement field to be at least 9 bits. As shown in Fig. 1, this size captures about 88% and 95% of the displacements in the benchmark programs on MIPS and ARM, respectively.

Fig. 2 presents the usage patterns of the immediate sizes used in different instructions based on the profiling results of the network benchmarks on ARM and MIPS. According to these results, we suggest 9 bits for the size of the immediate value, which covers about 88% of the immediate values on ARM and MIPS.
3.2. Control flow instructions
An instruction that changes the flow of control in a program is called a transfer, branch, or jump. Throughout this paper we use jump for unconditional and branch for conditional changes in control flow. There are four common types of control-flow instructions: conditional branches, jumps, subroutine calls, and returns from subroutines. According to the frequencies of these control-flow instructions extracted from running the packet-processing benchmarks in the ARM and MIPS profiling environments, conditional branches are dominant. There are three common implementations of conditional branches in recent processors. One implements the conditional branch with a single instruction that performs both the comparison and the decision, for example the "beq" and "bne" instructions in the MIPS instruction set [21]. The other methods need two instructions: the first performs the comparison and the second makes a decision based on the comparison result. The comparison result is either saved in a register (as with the "slt" instruction in MIPS) or modifies the processor status flags (as with the "cmp" instruction in the ARM instruction set [23]).
We have developed simple C codes representing a wide variety of conditional constructs, including "case", "if", and "else" statements, and have compiled them with both the ARM and MIPS cross compilers. According to the results, the conditional assignments of ARM are implemented with a sequence of compare, branch, and assignment instructions in MIPS; these complex ARM instructions may therefore yield fewer instructions in the code generated with the ARM cross compiler. On the other hand, the "beq" and "bne" instructions of MIPS compare two registers and jump to the branch target in a single instruction, which takes at least two instructions in ARM: one for the compare and another for the branch. The extracted results show the effectiveness of each of these respective advantages of ARM and MIPS. To compare the profiling results of MIPS and ARM in a practical example, we have extracted the conditional statement of the packet-validation function, which is used in all of the benchmark applications. The conditional statement of this function is a combination of simple if statements joined with logical "or" and logical "and" operators. According to the results presented in Table 9, the code compiles to an equal number of instructions on both ARM and MIPS. This means that although simple conditional codes may lead to different instruction counts when compiled for MIPS or ARM, compiling practical conditional code with the simple instructions of MIPS yields instruction counts comparable to compiling it with the more powerful instructions of ARM.
The most common way to specify the destination of a branch is to supply an immediate value, called the offset, that is added to the program counter (PC). The size of the branch-offset field affects the encoding density of instructions and therefore restricts the operand variety in terms of the number of operands as well as the operand size. Branch offset size is thus another important measurement for an instruction set. According to [10], short offset fields often suffice for branches, and the offset sizes of the most frequent branches in integer programs can be encoded in 4–8 bits. Fig. 3 shows the usage pattern of branch offset sizes for both the conditional and unconditional branches used in the selected packet-processing applications. According to the results, 95% of conditional branches need offset sizes of 5–10 bits, and the offset sizes of 91% of unconditional branches range from 5 to 13 bits.
3.3. Special-purpose instructions

Packet header and payload are read from a non-cacheable memory located on the system bus. Following the PacketBench terminology, this component is called the packet memory [15]; the other local memories are called non-packet memories. According to the profiling results of layer-two switching [25], packet-memory accesses constitute about 2% of the total instructions, and since the packet memory is on the system bus, each access to it takes about 15 processor clock cycles [25]. It is also observed that packet-memory accesses consume 26% of the total execution time [25]. The percentages of packet-memory accesses of the applications are summarized in Table 10. According to this table, packet-memory accesses range from 3.7% to 45.2%, with IPV4-trie and Flow-Class having the highest shares among the applications.

A good solution for reducing the latency of packet-memory accesses is to reduce the bus-access overhead with burst load and store instructions. Alternatively, an IO controller can exploit a direct-memory access (DMA) device to transfer the packet data to the local memory of the processor and reduce the bus-access overhead [2]. The maximum improvement achievable by reducing bus latencies with burst memory instructions is evaluated in Table 11. According to these results, proper use of burst memory instructions can significantly improve the execution time of an application. The results show the maximum achievable performance improvements; however, the DMA transfers or burst memory transfers are inserted into the code manually. For automatic burst insertion, an algorithm such as the one proposed by Biswas et al. [26] can be used.
M.E. Salehi et al. / Journal of Systems Architecture 58 (2012) 112–125
117
Table 7
The compiled code for MIPS and ARM: (a) Fibonacci series, where one hundred numbers are generated and written to memory; (b) computationally-intensive shift-and-add multiplier.

(a) Fibonacci series (main function C code)

short int A[100]; int i;
A[0]=1; A[1]=1;
for (i=2;i<100;i++)
A[i]=A[i-1]+A[i-2];

Compiled code for ARM
00008554 <main+0x28> ldrb r0, [ip, #-201]
00008558 <main+0x2c> ldrb r1, [ip, #-203]
0000855c <main+0x30> ldrb r2, [ip, #-202]
00008560 <main+0x34> ldrb r3, [ip, #-204]
00008564 <main+0x38> orr r2, r2, r0, lsl #8
00008568 <main+0x3c> orr r3, r3, r1, lsl #8
0000856c <main+0x40> add r0, r2, r3
00008570 <main+0x44> mov r1, r0, asr #8
00008574 <main+0x48> subs lr, lr, #1 ; 0x1
00008578 <main+0x4c> strb r1, [ip, #-199]
0000857c <main+0x50> strb r0, [ip, #-200]
00008580 <main+0x54> add ip, ip, #2 ; 0x2
00008584 <main+0x58> bpl 00008554

Compiled code for MIPS
00400230 <main+0x40> lhu $2,-2($4)
00400238 <main+0x48> lhu $3,-4($4)
00400240 <main+0x50> addiu $5,$5,1
00400248 <main+0x58> addu $2,$2,$3
00400250 <main+0x60> sh $2,0($4)
00400258 <main+0x68> addiu $4,$4,2
00400260 <main+0x70> slti $2,$5,100
00400268 <main+0x78> bne $2,$0,00400230

(b) Multiply with add and shift (main function C code)

p=0; p |= q;
for (i=0;i<8;i++) {
if (p & 0x1) p=((p >> 8)+m) << 8;
p = (p & 0x8000) | (p>>1);}

Compiled code for ARM
020001cc <main+0x14> tst r3, #1
020001d0 <main+0x18> addne r3, r0, r3, asr #8
020001d4 <main+0x1c> movne r3, r3, lsl #8
020001d8 <main+0x20> and r2, r3, #32768
020001dc <main+0x24> mov r3, r3, asr #1
020001e0 <main+0x28> orr r3, r2, r3
020001e4 <main+0x2c> subs r1, r1, #1
020001e8 <main+0x30> bpl 020001cc

Compiled code for MIPS
00400240 <main+0x50> andi $2,$4,1
00400248 <main+0x58> beq $2,$0,00400268
00400250 <main+0x60> sra $2,$4,0x8
00400258 <main+0x68> addu $2,$2,$6
00400260 <main+0x70> sll $4,$2,0x8
00400268 <main+0x78> andi $3,$4,32768
00400270 <main+0x80> sra $2,$4,0x1
00400278 <main+0x88> or $4,$3,$2
00400280 <main+0x90> addiu $5,$5,-1
00400288 <main+0x98> bgez $5,00400240

Table 8
Distribution of 2-byte load/store, arithmetic, and logic instructions for the selected applications when compiled with the MIPS gcc 2.95.2 cross compiler.

                  IPV4-radix  IPV4-trie  Flow-Class  IPSec  MD5
2-byte store (%)  0.1         0.5        0.5         0.0    0.0
2-byte load (%)   1.6         7.0        6.6         0.0    0.1
Arithmetic (%)    32.7        33.1       27.0        19.9   33.2
Logic (%)         16.3        29.4       13.3        57.6   38.3

4. Proposed instruction set for embedded packet-processing engines

In the field of network processor design, several packet-processing-specific instructions have been proposed [27–30]. There is also considerable research on synthesizing instruction sets for embedded application-specific processors, proposing complex instructions for high-performance extensible processors [31–38]. All of these studies start with primitive instructions and refine them to boost target performance. In this section, we propose optimized primitive instruction sets for flexible and low-power packet-processing engines. The proposed general-purpose instructions provide the flexibility needed for any further changes in the packet-processing flow, while special-purpose instructions can be used to accelerate the execution of packet-processing tasks and therefore increase performance.

As network applications migrate towards larger programs and hence rely more heavily on compilers, they have adopted the compiler technology developed for desktop and embedded computers. Traditional compilers have difficulty in taking high-level language code and producing special-purpose instructions. However, new retargetable compiler technology deployed for extensible processors (e.g., Tensilica [39] and CoWare [40], along with the CoSy compiler [41]) might be used for optimum code generation using special-purpose instructions.
The proposed instruction set is designed based on the requirements of different packet-processing applications, quantified in terms of the required micro-operations and their frequencies. We propose the general-purpose instructions according to the distribution of different instructions in the representative benchmark applications, including both header- and payload-processing algorithms. Table 12 presents the instruction distribution of the packet-processing applications on the ARM and MIPS processors. The results are sorted according to the maximum values of the instruction occurrences among the selected applications, and the top 25 instructions are listed. The required basic general-purpose instructions can be extracted from this table.

The reduced general-purpose instruction set trades performance for power. We select the minimum number of instructions that yields the lowest power consumption while still ensuring acceptable performance. Therefore, every instruction with a high occurrence in the selected applications on both the ARM and MIPS processors contributes to the proposed list, whereas we skip instructions that are rarely used and can be synthesized with other instructions. This leads to the lowest required instruction count.

As shown in Table 12, the selected applications have similar arithmetic, logic, memory, and branch instruction distributions on both the ARM and MIPS processors. As shown, the frequent arithmetic instructions are "Add" and "Sub" with both register and
Fig. 1. Usage patterns of the displacement size in memory instructions based on the profiling results of network benchmark on (a) MIPS and (b) ARM processors.
Fig. 2. Usage patterns of immediate sizes of instructions based on the profiling results of network benchmark on (a) MIPS and (b) ARM processors.
Table 9
Representative code for condition checking. The code is extracted from the validate-packet function of the IPV4 lookup applications.

if ((ll_length > MIN_IP_DATAGRAM) &&
    (in_checksum) &&
    (ip_v == 4) &&
    (ip_hl >= MIN_IP_DATAGRAM/4) &&
    (ip_len >= ip_hl))
  return 1;
else return 0;

Compiled code for ARM (gcc 3.4)
0000866c <validate_packet> cmp r1, #0
00008670 <validate_packet+0x4> cmpne r0, #20
00008674 <validate_packet+0x8> mov r1, r3
00008678 <validate_packet+0xc> ble 00008684
0000867c <validate_packet+0x10> cmp r2, #4
00008680 <validate_packet+0x14> beq 0000868c
00008684 <validate_packet+0x18> mov r0, #0
00008688 <validate_packet+0x1c> mov pc, lr
0000868c <validate_packet+0x20> ldr r3, [sp]
00008690 <validate_packet+0x24> cmp r3, r1
00008694 <validate_packet+0x28> cmpge r1, #4
00008698 <validate_packet+0x2c> mov r0, #1
0000869c <validate_packet+0x30> movgt pc, lr
000086a0 <validate_packet+0x34> b 00008684

Compiled code for MIPS (gcc 2.95)
00400400 <validate_packet> addu $2,$0,$0
00400408 <validate_packet+0x8> lw $8,16($29)
00400410 <validate_packet+0x10> addiu $3,$0,20
00400418 <validate_packet+0x18> slt $3,$3,$4
00400420 <validate_packet+0x20> beq $3,$0,00400468
00400428 <validate_packet+0x28> beq $5,$0,00400468
00400430 <validate_packet+0x30> addiu $3,$0,4
00400438 <validate_packet+0x38> bne $6,$3,00400468
00400440 <validate_packet+0x40> slti $3,$7,5
00400448 <validate_packet+0x48> bne $3,$0,00400468
00400450 <validate_packet+0x50> slt $3,$8,$7
00400458 <validate_packet+0x58> bne $3,$0,00400468
00400460 <validate_packet+0x60> addiu $2,$0,1
00400468 <validate_packet+0x68> jr $31
immediate operands. The "Slt" instructions of MIPS, which are used for LESS THAN and LESS THAN OR EQUAL comparisons, can be omitted and substituted with the "Bl" and "Ble" branch instructions. MIPS and ARM follow different approaches for implementing branches. MIPS compares two registers and, based on the comparison result, jumps to the target address in a single instruction such as "Beq" or "Bne". ARM implements the branch with two separate instructions: the first performs the comparison, and the second checks the processor flags and jumps to the target address if the branch condition is satisfied. Since branch instructions have a high contribution in the selected applications, we propose the single-instruction comparison-and-jump approach. Therefore, the "Slt" and "Cmp" instructions are not included among the arithmetic instructions, and "Beq", "Bne", "Bl", and "Ble" are proposed as the branch instructions. The other branch instructions can be implemented with these instructions.
The frequent logic instructions are "And", "Or", and "Xor". "Nor" and the immediate modes of the logic instructions can be synthesized with the corresponding register modes. Table 8 shows that 2-byte loads/stores can considerably affect the instruction count of an application. Since 2-byte loads/stores are not compiled optimally by the ARM compiler, the instruction counts of the IPV4-trie and Flow-Class applications are higher when compiled with ARM. Therefore, the proposed memory-access instructions support 8-bit, 16-bit, and 32-bit loads and stores. The frequent shift instructions are "Sll", "Srl", and "Sra". As shown in Table 8, the shares of logic and shift operations are higher in the MD5 and IPSec applications, which makes them good candidates for synthesis with ARM complex instructions to produce code with fewer instructions. Usage patterns of ARM complex instructions in the selected benchmark applications are summarized in Table 13.
According to Table 13, 4.7% and 3.6% of the IPSec and MD5 instructions, respectively, are of these types. Each ARM complex instruction would be synthesized with at least two MIPS instructions. Overall, the sum of the average usage values of complex instructions across all representative applications is about 10%. Therefore, properly employing these instructions would improve the instruction count by about 10%. However, because of their complexity, these instructions would complicate the pipeline design and hence might elongate the overall clock period of the processor. Therefore, one should decide on using such instructions considering compiler potentials as well as architecture design issues.

IPSec and MD5 are two large applications that contain about 100,000 and 9000 instructions, respectively, when compiled for ARM. According to the results, the instruction counts of these applications are about 23% higher when compiled for MIPS. One reason for this difference is the usage of ARM complex instructions (Table 13). Another source of the difference is the effect of the burst load and store instructions of ARM ("stmdb" and "ldmdb"), which perform a block transfer to/from memory. These instructions are widely used in function calls and returns for saving and restoring function parameters on the stack, and more generally for any registers that must be pushed to or popped from the stack. Since the investigated payload-processing applications are composed of many small functions, these instructions improve the instruction counts in function calls and returns. According to our profiling results, the burst memory-access instructions of ARM improve the instruction counts of IPSec and MD5 by about 3.5% and 9%, respectively. Some of the widely used instructions of MIPS and ARM (according to the profiling results), together with the proposed optimum instruction set, are summarized in Table 14.
5. Retargetable instruction set compilation and simulation framework

To evaluate the proposed instruction set, we have customized the gcc compiler [42] and developed a retargetable compilation framework for exploring the instruction-set space of packet-processing applications. The GNU compiler collection (usually shortened to gcc) is an open-source compiler produced by the GNU project that supports a variety of programming languages. GCC has two distinct parts, machine-dependent and machine-independent. The machine-dependent part is responsible for the final compilation step: the machine-independent intermediate output is compiled to the target machine code using the machine definition (MD) file. The general structure of the GCC compiler is shown in Fig. 4 (based on [42]).

The MD codes state all the microarchitectural specifications, such as the number of registers, the supported instructions, the instruction set architecture, and the execution flow of each instruction. The machine-dependent code consists of two basic files: the MD file, which contains the instruction patterns of the target processor, and a C file containing some macro definitions. The MD file defines the patterns of the target processor instructions using a register transfer language (RTL), an intermediate representation similar to the final assembly code.

Fig. 3. Usage patterns of offset sizes for conditional and unconditional branches in network benchmarks for ARM and MIPS processors.
Our proposed retargetable compilation and simulation framework is shown in Fig. 5. The exploration starts with the required
modifications to the MD or C files to specify the instruction set
and microarchitecture of the target processor. After that, the modified GCC code is compiled to generate the target compiler. With the help of the generated compiler, the application source codes are compiled for the new processor. SimpleScalar is used as a retargetable simulator: its machine definition file (DEF) is modified according to the architecture of the target processor to support simulation of the generated binary codes. The binary codes are then executed by SimpleScalar to obtain the application profiling information (cycle count and instruction count). This flow can iteratively explore all the proposed modifications to the instruction set and compiler and investigate their effects on the performance of the processor.

Table 10
Percentage of packet and non-packet memory accesses out of total instructions in the selected applications.

                       IPV4-radix  IPV4-trie  Flow-Class  MD5   IPSec
Packet memory (%)      5.0         45.2       43.1        27.6  3.7
Non-packet memory (%)  29.8        6.9        23.7        15.4  21.5

Table 11
Maximum achieved performance improvement by reducing the bus overhead for packet-memory accesses.

                 IPV4-radix  IPV4-trie  Flow-Class  MD5    IPSec
Clock count      3630        398        274         17570  227330
New clock count  3470        238        169         13257  219897
Improvement (%)  4.4         40.2       38.4        24.5   3.3

Table 13
Usage patterns of ARM complex instructions in the selected benchmark applications.

Instruction  IPV4-radix (%)  IPV4-trie (%)  Flow-Class (%)  IPSec (%)  MD5 (%)  Average (%)
subcs        10.4            0.0            0.0             0.0        0.0      2.1
Bic          0.0             0.0            1.2             3.6        2.0      1.4
orrcs        5.5             0.0            0.0             0.0        0.0      1.1
movcc        4.7             0.0            0.0             0.0        0.0      0.9
cmpcc        3.1             0.0            0.0             0.0        0.0      0.6
Mvn          0.6             1.0            0.0             0.0        1.3      0.6
Subs         1.4             0.0            0.0             1.0        0.2      0.5
movne        2.1             0.0            0.0             0.0        0.0      0.4
addcs        1.6             0.5            0.0             0.0        0.0      0.4
movs         0.0             2.1            0.0             0.0        0.0      0.4
Tsts         1.6             0.0            0.0             0.0        0.0      0.3
moveq        1.5             0.0            0.0             0.0        0.0      0.3
movnes       1.4             0.0            0.0             0.0        0.0      0.3
ldreq        0.5             0.0            0.0             0.0        0.0      0.1
cmpne        0.4             0.0            0.0             0.0        0.0      0.1
ldrne        0.3             0.0            0.0             0.1        0.0      0.1
6. Experimental results

In the previous sections we have quantitatively compared the effectiveness of the MIPS and ARM instruction sets for the networking benchmarks and proposed some architectural guidelines for the instruction set of an optimized packet-processing engine. In this section we use our compilation and simulation framework to obtain post-compilation quantitative comparisons. We evaluate the effectiveness of the proposed instruction set by comparing its performance on the packet-processing benchmarks with that of the MIPS instruction set. Interested readers can generalize these comparisons to ARM using the results presented in the previous sections.
Table 12
Distribution of different instructions in the representative applications.

MIPS
Category    Instruction  IPV4-radix (%)  IPV4-trie (%)  Flow-Class (%)  IPSec (%)  MD5 (%)  MAX (%)
Memory      Lw           7.6             7.8            25.5            17.6       3.6      25.5
            Sw           5.3             1.6            11.9            1.0        3.0      11.9
            Lhu          1.0             7.1            5.0             0.0        0.1      7.1
            Sb           1.7             0.5            1.9             0.3        5.6      5.6
            Lbu          5.1             0.5            5.0             0.5        4.7      5.1
            Lb           2.3             0.0            0.0             0.0        0.0      2.3
Arithmetic  Addu         15.7            13.6           11.4            13.9       24.0     24.0
            Addiu        13.8            18.0           11.3            6.0        8.9      18.0
            Slti         0.7             7.1            1.5             0.8        0.0      7.1
            Sltu         2.5             0.5            0.0             0.0        6.4      6.4
            Lui          1.1             1.1            0.0             1.7        3.6      3.6
            Subu         1.8             1.3            1.5             0.0        0.2      1.8
Logic       Or           1.4             2.2            0.0             13.9       8.8      13.9
            Andi         3.1             5.5            1.5             13.5       0.2      13.5
            Srl          1.1             3.8            0.0             13.0       3.7      13.0
            Xor          0.0             0.0            0.0             8.2        2.6      8.2
            Sll          2.4             3.7            0.8             6.4        7.2      7.2
            And          0.6             1.1            0.8             0.8        3.6      3.6
            Srav         0.0             0.0            3.0             0.0        0.0      3.0
            Srlv         0.7             1.3            3.0             0.0        0.0      3.0
            Nor          0.0             0.5            0.0             0.0        2.5      2.5
            Ori          0.6             1.6            0.8             0.8        3.6      3.6
Branch      Beq          10.9            9.0            2.7             0.1        0.3      10.9
            Bne          7.5             6.3            8.8             0.9        6.6      8.8
            J            2.4             1.1            1.4             0.1        0.1      2.4

ARM
Category    Instruction  IPV4-radix (%)  IPV4-trie (%)  Flow-Class (%)  IPSec (%)  MD5 (%)  MAX (%)
Memory      Ldr          8.7             7.7            28.4            33.6       6.0      33.6
            Ldrb         3.7             12.9           7.8             0.4        5.6      12.9
            Str          3.8             0.0            10.5            0.6        2.8      10.5
            Strb         0.6             1.5            5.4             0.4        6.6      6.6
Arithmetic  Add          2.6             13.9           9.9             0.5        33.7     33.7
            Cmp          18.5            10.8           9.1             0.1        7.9      18.5
            Sub          2.8             7.7            1.8             0.3        8.9      8.9
            Subcs        6.4             0.0            0.0             0.0        0.0      6.4
            Cmpcc        1.8             0.0            0.0             0.0        0.0      1.8
Logic       And          0.4             4.6            0.0             16.5       2.7      16.5
            Mov          10.8            12.9           12.1            16.3       2.4      16.3
            Orr          0.6             7.2            1.2             14.7       6.9      14.7
            Eor          0.0             0.0            0.0             10.0       3.4      10.0
            Orrcs        5.6             0.0            0.0             0.0        0.0      5.6
            Movcc        3.9             0.0            0.0             0.0        0.0      3.9
            Movs         0.1             2.1            0.0             0.0        0.0      2.1
            Movne        1.8             0.0            0.0             0.0        0.0      1.8
            Movnes       1.6             0.0            0.0             0.0        0.0      1.6
            Bic          0.0             0.0            1.2             3.6        2.0      3.6
Branch      Bcc          2.1             0.0            0.0             0.0        7.6      7.6
            Bne          3.9             2.1            6.0             0.1        0.2      6.0
            Bgt          0.4             5.2            0.0             0.0        0.0      5.2
            Beq          4.0             3.6            1.8             0.0        0.1      4.0
            Bl           1.1             1.5            1.2             0.2        0.4      1.5
            Ble          0.9             1.5            1.2             0.0        0.0      1.5
Table 14
Proposed instruction set for embedded packet-processing engines.

Arithmetic
  MIPS: Addu, Addiu, Slti, Sltu, Subu
  ARM: Add, Sub, Subcs, Cmp, Cmpcc
  Proposed: Add (add two registers); Addi (add register and immediate); Sub (subtract two registers); Subi (subtract immediate from register)

Logic
  MIPS: Or, Ori, And, Andi, Xor, Nor, Srl, Sll, Srav, Srlv
  ARM: Or, Orcs, And, Eor, Mov, Movcc, Movs, Movne, Movnes, Bic
  Proposed: And, Andi (and two operands); Or, Ori (or two operands); Xor, Xori (xor two operands); Sll, Sllv (shift left logical); Srl, Srlv (shift right logical); Sra, Srav (shift right arithmetic); Bic (bit clear)

Memory
  MIPS: Lw, Lhu, Lbu, Lb, Sw, Sb
  ARM: Ldr, Ldrb, Str, Strb, Ldm, Stm
  Proposed: Ldw (load 32-bit from memory); Ldh (load 16-bit from memory); Ldb (load 8-bit from memory); Stw (store 32-bit to memory); Sth (store 16-bit to memory); Stb (store 8-bit to memory); Ldm (load a block from memory); Stm (store a block to memory)

Branch
  MIPS: Beq, Bne, J, Jal
  ARM: Beq, Bne, Bgt, Bl, Ble, Bcc
  Proposed: Beq (branch if registers are equal); Bne (branch if registers are not equal); Blt (branch if a register is less than a register); Ble (branch if a register is less than or equal to a register); B (unconditional branch); Br (branch to the address in a register); Bal (branch and link)
Fig. 4. General structure of the GCC.
Exploiting the proposed exploration framework, we have taken the MIPS instruction set [21] as the starting point and modified it to converge to the optimized instruction set. We have evaluated the effectiveness of each modification to the instruction set in terms of execution cycles and instruction count for each of the representative benchmark applications. According to the results, some instructions are rarely or never used when compiling the selected applications; multiply and divide are among them. We have excluded these instructions from the instruction list with negligible performance degradation. Some of the least-frequently used instructions of MIPS in the selected benchmark applications are shown in Table 15.
Fig. 5. Our retargetable compilation and simulation framework.
Table 15
Some of the least-frequently used instructions of MIPS in the selected benchmark applications.

Instruction  IPV4-radix (%)  IPV4-trie (%)  Flow-Class (%)  IPSec (%)  MD5 (%)  Average (%)
mult         0.73            0.00           0.00            0.00       0.00     0.15
mfhi         0.59            0.00           0.00            0.00       0.00     0.12
divu         0.59            0.00           0.00            0.00       0.00     0.12
blez         0.30            0.00           0.00            0.00       0.01     0.06
bgtz         0.29            0.00           0.00            0.00       0.00     0.06
bgez         0.29            0.00           0.00            0.00       0.00     0.06
bltz         0.25            0.00           0.00            0.00       0.00     0.05
xori         0.17            0.00           0.00            0.00       0.00     0.03
slt          0.15            0.00           0.00            0.00       0.02     0.03
dsw          0.15            0.00           0.00            0.00       0.00     0.03
dlw          0.15            0.00           0.00            0.00       0.00     0.03
Decisions on whether to exclude instructions from or include instructions in the instruction set are made based on the profiling results of the exploration framework. Some examples of the effect of excluding (e.g., immediate logical and shift instructions) or including (e.g., branch instructions, and byte and half-word loads and stores) instructions on the performance and code size of the selected applications are shown in Fig. 6. As shown, excluding the immediate logical instructions (i.e., "andi", "ori", and "xori") and the immediate shifts (i.e., "sll", "srl", and "sra") increases the execution cycles of the representative applications by about 7% and 11% on average, respectively. Therefore, it is not recommended to exclude these instructions from the MIPS instruction set. Including the proposed branch instructions (i.e., "blt" and "ble") reduces the execution cycles of the selected applications by 8% on average. Excluding the half-word loads ("lh", "lhu") and stores ("sh") increases the execution cycles by about 3% and 1% on average, respectively. In addition, excluding the byte loads ("lb", "lbu") and stores ("sb") reduces the performance by about 5.4% and 6.9% on average, respectively. Furthermore, when all half-word and byte loads and stores are excluded, the maximum performance degradation is 23% for MD5 and 27% for Flow-Class. Compared to instruction sets that only support 32-bit loads/stores, the proposed instruction set can provide considerable
Fig. 6. Performance comparison of the proposed architecture with MIPS: effect of omitting immediate logical instructions and some types of loads/stores, and of adding new branch instructions, on performance and code size for the selected applications; (a) improved performance, (b) code size.
Table 16
Area, power, and delay improvements on MIPS for the proposed instruction set.

                 Total cell area  Power (mW)  Delay (ns)
MIPS             96425.9          8.7         3.4
Proposed         60270.2          7.4         2.8
Improvement (%)  37               15          17
performance improvements: when the compiler supports only 32-bit loads/stores, each 8-bit or 16-bit load/store must be synthesized with a "lw"/"sw" followed by a sequence of logical and shift operations that extracts or modifies the required part of the 32-bit value.
We have also modeled the MIPS processor in synthesizable Verilog HDL and verified this model against the PISA model utilized in SimpleScalar [9,16]. The modeled processor has a 5-stage single-issue pipeline that handles data hazards by forwarding and interlock techniques. The Verilog model of the proposed processor is developed based on the proposed instructions and is consistent with the machine definition files of the compiler and simulator as well. The Verilog model is synthesized with a digital logic synthesizer using a CMOS 90 nm standard-cell library, and the effects of the proposed instruction additions and omissions on the area, frequency, and power consumption of the implied processor are evaluated. Since both the clock period and the clock-cycle count are improved, the performance is enhanced with our proposed instruction set. Furthermore, the power consumption of the proposed processor is also reduced, so the performance per watt is improved. Based on the results of Table 16 and the 8% performance improvement of the new branch instructions (Fig. 6), about 48% improvement in performance per watt is achieved in our approach (normalized execution time divided by normalized power, i.e., 1.08 × 1.17/0.85 ≈ 1.48).
7. Conclusion
We have presented a quantitative analysis of networking benchmarks to extract architectural guidelines for designing optimized embedded packet-processing engines. An exhaustive quantitative analysis of the MIPS and ARM instruction sets for the selected benchmarks has been carried out. The SimpleScalar simulation and profiling environments were deployed to obtain comparative results, based upon which the instruction set architectural guidelines were developed. The reproducible profile-driven results are based on representative header- and payload-processing tasks. The experiments recommend a load-store architecture with displacement and immediate addressing modes supporting 8-, 16-, and 32-bit memory operations for packet-processing tasks. We also recommend new compare-and-branch instructions for conditional branches that support registers as operands.
To validate the proposed instruction set guidelines by considering the mutual interaction of architecture and compiler in an integrated environment, a retargetable compilation and simulation framework has been developed. This framework utilizes the machine definition capabilities of GCC and SimpleScalar. It is shown that the proposed basic set of networking instructions provides low-power and cost-sensitive operation for embedded packet-processing engines. These optimized engines can be employed in massively parallel NP architectures or in embedded processors customized for packet processing in the future packetized world. Furthermore, the proposed instructions can also be used as the base set for accommodating application-specific custom instructions in more complex processors.
References
[1] J.C. Niemann, C. Puttmann, et al., Resource efficiency of the GigaNetIC chip
multiprocessor architecture, Journal of Systems Architecture 53 (2007) 285–
299.
[2] K. Vlachos, T. Orphanoudakis, et al., Design and performance evaluation of a
programmable packet processing engine (PPE) suitable for high-speed network
processors units, Microprocessors & Microsystems 31 (3) (2007) 188–199.
[3] Patrick Crowley, Mark A. Franklin, Haldun Hadimioglu, Peter Z. Onufryk,
Network Processor Design, Issues and Practices, The Morgan Kaufmann Series
in Computer Architecture and Design, vol. 1, Elsevier Inc., 2005.
[4] J. Allen, B. Bass, et al., IBM PowerNP network processor: hardware, software,
and applications, IBM Journal of Research and Development 47 (2/3) (2003)
177–194.
[5] Panos C. Lekkas, Network Processors: Architectures, Protocols, and Platforms,
McGraw-Hill Professional Publishing, 2003.
[6] R. Ohlendorf, A. Herkersdorf, T. Wild, FlexPath NP – a network processor
concept with application-driven flexible processing paths, in: Third IEEE/ACM/
IFIP International Conference on Hardware/Software Codesign and System
Synthesis (CODES+ISSS), September 2005, pp. 279–284.
[7] R. Ohlendorf, T. Wild, M. Meitinger, H. Rauchfuss, A. Herkersdorf, Simulated
and measured performance evaluation of RISC-based SoC platforms in network
processing applications, Journal of Systems Architecture 53 (2007) 703–718.
[8] M. Okuno, S. Nishimura, S. Ishida, H. Nishi, Cache-based network processor
architecture: evaluation with real network traffic, IEICE Transaction on
Electron, E89–C (11) (2006) 1620–1628.
[9] D. Burger, T. Austin, The SimpleScalar tool set version 2.0, Computer
Architecture News 25 (3) (1997) 13–25.
[10] J.L. Hennessy, D.A. Patterson, Computer architecture: a quantitative approach,
fourth ed., The Morgan Kaufmann Series in Computer Architecture and Design,
Elsevier Inc., 2007.
[11] IETF RFCs, available from: <http://www.ietf.org/>.
[12] T. Wolf, M.A. Franklin, CommBench – a telecommunications benchmark for
network processors, in: Proc. of IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS), April 2000, pp. 154–162.
[13] G. Memik, W.H. Mangione-Smith, W. Hu, NetBench: a benchmarking suite for
network processors, in: Proc. of IEEE/ACM International Conference on
Computer-Aided Design, November 2001, pp. 39–42.
[14] B.K. Lee, L.K. John, NpBench: a benchmark suite for control plane and data
plane applications for network processors, in: Proc. of IEEE International
Conference on Computer Design (ICCD 03), October 2003, pp. 226–233.
[15] R. Ramaswamy, T. Wolf, PacketBench: a tool for workload characterization of
network processing, in: Proc. of IEEE International Workshop on Workload
Characterization, October 2003, pp. 42–50.
[16] SimpleScalar LLC, available from: <http://www.simplescalar.com>.
[17] EEMBC, The Embedded Microprocessor Benchmark Consortium, Available
from: <http://www.eembc.org/home.php>.
[18] IntelÒ IXP4XX Product Line of Network Processors, Available from: <http://
www.intel.com/design/network/products/npfamily/ixp4xx.htm>.
[19] Broadcom corporation, Communications Processors, available from: <http://
www.broadcom.com/products/Data-Telecom-Networks/CommunicationsProcessors#tab=products-tab>.
[20] National Laboratory for Applied Network Research – Passive Measurement and
Analysis. Passive Measurement and Analysis, available from: <http://
pma.nlanr.net/PMA/>.
[21] J.L. Hennessy, D.A. Patterson, Computer organization and design: the
hardware/software interface, third ed., The Morgan Kaufmann Series in
Computer Architecture and Design, Elsevier Inc., 2005.
[22] M.E. Salehi, S.M. Fakhraie, Quantitative analysis of packet-processing
applications regarding architectural guidelines for network-processingengine development, Journal of Systems Architecture 55 (2009) 373–386.
[23] ARM Processor Instruction Set Architecture, available from: <http://
www.arm.com/products/CPUs/architecture.html>.
[24] Jurij Silc, Borut Robic, Th. Ungerer, Processor Architecture: From Dataflow to
Superscalar and Beyond, Springer-Verlag, 1999.
[25] M.E. Salehi, R. Rafati, F. Baharvand, S.M. Fakhraie, A quantitative study on layer-2 packet processing on a general-purpose processor, in: Proc. of International Conference on Microelectronics (ICM 06), December 2006, pp. 218–221.
[26] P. Biswas, K. Atasu, V. Choudhary, L. Pozzi, N. Dutt, P. Ienne, Introduction of local memory elements in instruction set extensions, in: Proc. of the 41st Design Automation Conference (DAC 04), San Diego, CA, June 2004, pp. 729–734.
[27] H. Mohammadi, N. Yazdani, A genetic-driven instruction set for high speed
network processors, in: Proc. of IEEE International Conference on Computer
Systems and Applications (ICCSA 06), March 2006, pp. 1066–1073.
[28] G. Jones, E. Stipidis, Architecture and instruction set design of an ATM network processor, Microprocessors and Microsystems 27 (2003) 367–379.
[29] N.T. Clark, H. Zhong, S.A. Mahlke, Automated custom instruction generation for
domain-specific processor acceleration, IEEE Transaction on Computers 54
(10) (2005) 1258–1270.
[30] M. Grünewald, D. Khoi Le, et al., Network application driven instruction set extensions for embedded processing clusters, in: Proc. of International Conference on Parallel Computing in Electrical Engineering, September 2004, pp. 209–214.
[31] M.O. Cheema, O. Hammami, Application-specific SIMD synthesis for reconfigurable architectures, Microprocessors and Microsystems 30 (2006) 398–412.
[32] P. Yu, T. Mitra, Scalable custom instructions identification for instruction set extensible processors, in: Proc. of International Conference on Compilers, Architecture and Synthesis for Embedded Systems, September 2004, pp. 69–78.
[33] J. Cong, Y. Fan, G. Han, Z. Zhang, Application-specific instruction generation for configurable processor architectures, in: Proc. of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2004, pp. 183–189.
[34] S.K. Lam, T. Srikanthan, Rapid design of area-efficient custom instructions for reconfigurable embedded processing, Journal of Systems Architecture 55 (2009) 1–14.
[35] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, W. Luk, CHIPS: custom hardware
instruction processor synthesis, IEEE Transactions on Computer Aided Design
of Integrated Circuits and Systems 27 (2008) 528–541.
[36] L. Pozzi, K. Atasu, P. Ienne, Exact and approximate algorithms for the extension
of embedded processor instruction sets, IEEE Transaction on Computer-Aided
Design of Integrated Circuits and Systems 25 (2006) 1209–1229.
[37] F. Sun, S. Ravi, A. Raghunathan, N.K. Jha, A synthesis methodology for hybrid
custom instruction and co-processor generation for extensible processors, IEEE
Transactions on Computer-Aided Design 26 (11) (2007) 2035–2045.
[38] P. Brisk, A. Kaplan, M. Sarrafzadeh, Area-efficient instruction set synthesis for reconfigurable system-on-chip designs, in: Proc. of the 41st Design Automation Conference (DAC 04), 2004, pp. 395–400.
[39] Tensilica: Customizable Processor Cores for the Dataplane, available from:
<http://www.tensilica.com/>.
[40] K. Van Rompaey, D. Verkest, I. Bolsens, H. De Man, CoWare – a design environment for heterogeneous hardware/software systems, in: Proc. of European Design Automation Conference, 1996, pp. 252–257.
[41] ACE CoSy compiler development system, available from: <http://www.ace.nl/
compiler/cosy.html>.
[42] GCC, the GNU Compiler Collection, available from: <http://gcc.gnu.org/>.
Mostafa Ersali Salehi Nasab was born in Kerman, Iran,
in 1978. He received the B.Sc. degree in computer engineering from the University of Tehran, Tehran, Iran, and the M.Sc. degree in computer architecture from Amirkabir University of Technology, Tehran, Iran, in 2001 and 2003, respectively. He received his Ph.D. degree from the School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran, in 2010. He is now an Assistant Professor at the University of Tehran. From 2004 to 2008, he
was a senior digital designer working on ASIC design
projects with SINA Microelectronics Inc., Technology
Park of University of Tehran, Tehran, Iran. His research
interests include novel techniques for high-speed digital design, low-power logic
design, and system integration of networking devices.
Sied Mehdi Fakhraie was born in Dezfoul, Iran, in 1960.
He received the M.Sc. degree in electronics from the
University of Tehran, Tehran, Iran, in 1989, and the Ph.D.
degree in electrical and computer engineering from the
University of Toronto, Toronto, ON, Canada in 1995.
Since 1995, he has been with the School of Electrical and
Computer Engineering, University of Tehran, where he
is now an Associate Professor. He is also the Director of
Silicon Intelligence and the VLSI Signal Processing Laboratory. From September 2000 to April 2003, he was with Valence Semiconductor Inc., working in the company's Dubai, UAE, and Markham, Canada offices as Director of application-specific integrated circuit and system-on-chip (ASIC/SoC) design and as technical lead of the Integrated Broadband Gateway and Family Radio
System baseband processors. During the summers of 1998, 1999, and 2000, he was
a Visiting Professor at the University of Toronto, where he continued his work on
efficient implementation of artificial neural networks. He is coauthor of the book
VLSI-Compatible Implementation of Artificial Neural Networks (Boston, MA: Kluwer,
1997). He has also published more than 200 reviewed conference and journal
papers. He has worked on many industrial IC design projects including design of
network processors and home gateway access devices, digital subscriber line (DSL)
modems, pagers, and one- and two-way wireless messaging systems, and digital signal processors for personal and mobile communication devices. His research
interests include system design and ASIC implementation of integrated systems,
novel techniques for high-speed digital circuit design, and system-integration and
efficient VLSI implementation of intelligent systems.
Amir Yazdanbakhsh was born in Shiraz, Iran, in 1984.
He received the B.Sc. degree in computer engineering
from Shiraz University, Shiraz, Iran, and the M.Sc.
degree in computer architecture from the University of Tehran, Tehran, Iran, in 2007 and 2010, respectively. His research interests include novel high-performance and low-power architecture models for microprocessors and embedded systems, and their customization for specific application domains.