2013 IEEE Computer Society Annual Symposium on VLSI
LImbiC: An Adaptable Architecture Description
Language Model for Developing an
Application-Specific Image Processor
Carsten Tradowsky∗ , Tanja Harbaum∗ , Shaver Deyerle† and Jürgen Becker∗
∗ Institute for Information Processing Technology (ITIV), Karlsruhe Institute of Technology (KIT)
Email: {tradowsky, harbaum, becker}@kit.edu
† Bradley Department of Electrical & Computer Engineering, Virginia Polytechnic Institute and State University
Email: sdeyerle@vt.edu
Abstract—Due to their ease of integration and widespread
adoption, General Purpose Processors (GPP) are presently used
in a wide range of applications. However, the highly flexible
nature of a GPP leads to overhead in terms of power, performance
and area for a specific application. Another approach, proposed
by this paper, is to use Application-Specific Instruction-set Processors (ASIP) that are specifically adapted to a given application.
To decrease development time and effort and consequently
time-to-market, a model-based development process is used. The
high-level model allows for automated generation of software
development tools, simulation models and RTL models from a
single source. An adaptable LISA model representing a simplified
ARM Cortex-M1 processor is used as a base, which is then
supplemented by application-specific features requested by the
software developer or system architect.
This paper presents a working example of this concept, in
which a state-of-the-art processor model, which we call LImbiC, is
extended to meet the requirements of a specific application.
Specifically, custom instructions are added to the LImbiC processor to improve its performance in the particular task of
image processing. In addition, infrequently used or obsolete instructions
can be removed during the development process, which allows for separate
versions of LImbiC to meet varying design goals within the design space exploration.
I. INTRODUCTION
Historically, development of application-specific processors has
been cost-prohibitive for all but the largest semiconductor companies
[1]. The Architecture Description Language (ADL) Language for
Instruction-Set Architectures (LISA), used in combination with Synopsys Processor Designer, is advertised to change this situation. The
border between hardware and software is softened by the automated
generation of development tools, simulators and hardware description
language (HDL) code from a single LISA source model.
We published this approach to application-specific processor development in [2]. However, very limited documentation exists on
the application of high-level model-based development approaches.
This raises several questions that as yet remain unanswered: Is
it possible to generate a custom processor out of a single high-level
model? Can automatically generated simulators, compilers and hardware
be used for a complete processor development cycle
or even for commercial production of a processor? If this is the case,
could a processor be expanded with additional functionality that is
tailored to meet the exact requirements of a given application?
Complex software tasks could be migrated to a basic hardware
processor architecture to enable more efficient computation in a
specific application such as image processing. This would enable
new possibilities that until now have not been realized in commercial processors due to the costly development cycle and resulting
prohibitively long time-to-market.
This paper provides an approach and gives initial answers to the
questions regarding the applicability of an ADL-based development
flow proposed above. To begin, an overview of the state-of-the-art is
given in section II. In section III, a concept for application-specific
extensions of the used instruction-set is developed. The realization
of the suggested concept using LISA is presented in section IV.
Subsequently in section V, the LISA model is evaluated regarding
the introduced extensions and the modeling overhead in relation to
power, performance and area trade-offs. The resulting model is then
prototyped on an FPGA platform exploring the design space in each
stage of the flow. Finally in section VI, the approach of this paper
is discussed and a conclusion with ideas for future work is given in
section VII.
II. RELATED WORK
In this section, a summary of the state of the art in the field
of application-specific processors and reconfigurable architectures
is presented. Afterwards, an overview of Architecture Description
Languages (ADL) and a brief introduction to the Language for
Instruction-set Architectures (LISA) is given.
A. Application-Specific Instruction-set Processors (ASIP) and Reconfigurable Architectures
The development of Application-Specific Instruction-set Processors (ASIP) has historically been reserved for processor developers
in the semiconductor industry [1]. To be able to develop an ASIP, one
needs to have in-depth knowledge in several domains, e.g., algorithm
analysis, system modeling, simulation, synthesis, and verification.
Thus, qualified specialists are required to use the multitude of needed
models, tools and tasks leading to a development cycle that is costly
and time-consuming [1].
Recent developments in the field of processor architectures are
in the direction of optimized application-specific processors. By
utilizing this approach, extended by reconfigurable instruction-sets
and hardware extensions, optimizations for measures such as performance per watt and performance per mm² can be achieved [3]. In
[4], the authors present the Proteus processor. This reconfigurable
approach extends the processor through the use of a tightly coupled
fabric. Lysecky et al. present the WARP processor [5], which detects
run-time-critical tasks and deploys them to hardware. The Rotating
Instruction-Set Processing Platform (RISPP), which is introduced
by Bauer et al. [6], adds new special instructions together with
a runtime system that supports them. In [7], Thoma et al. present
MORPHEUS, which enables different sized flexible platforms for
heterogeneous hardware/software co-design. König et al. present a
new architectural approach [8], containing coarse-grained and fine-grained runtime reconfigurable processor arrays, which can be used
to accelerate complex algorithms.
B. Architecture Description Languages
ADLs should be able to represent the software and hardware view
of a processor in a single source model. There are several ADLs that
only support a software view, e.g., Not a Machine Language (nML)
[9], Instruction-set Description Language (ISDL) [10], and Machine
Independent Microprogramming Language (MIMOLA) [11]. Their
syntax represents only the programmers’ view of the architecture.
Therefore, these languages do not support a cycle-accurate model
and mainly describe the instruction coding and assembly syntax.
On the other hand, the LISP-like ADL EXPRESSION includes
such a cycle-accurate description [12]. This mixed instruction-set
and architecture specific language is particularly useful throughout
the entire ASIP design process. The language supports everything from
design space exploration and tool generation all the way to architecture
implementation.
LISA, which is used for the work presented in this paper, was conceptually developed at RWTH Aachen and combines the perspectives
of both structure and behavior. In contrast to behavioral models, like
open virtual platform (OVP), cycle-accurate models in LISA provide
a more detailed representation of the architecture. The tool chain, the
simulator as well as the hardware description can be generated from
this source model [13].
There are a few promising projects where an ASIP was developed
using LISA. At the National Institute of Technology, a small processor with 19 instructions and a three-stage pipeline was developed
[14]. The processor was extended by an FIR filter and constrained to
eight instructions. In the evaluation of the examples, the ASIP was
only compared to the basic processor and no comment was made
on the quality of the code using the LISA development approach.
Synopsys has developed a processor for use in video compression
applications [15]. This project is focused on cost-efficiency during the
development using Processor Generator, instead of the optimization
of an existing processor. In this paper an extension of the instructionset is utilized to adapt the processor’s microarchitecture as described
in the following sections.
III. DESIGN OF LIMBIC
This section describes the design of the LISA model implementing the base functionality of LImbiC. LImbiC is based on the
ARM Cortex-M1 and its ARMv6-M architecture [16]. LImbiC is
extended by two special instructions, whose syntax and coding are defined as an extension of the existing instruction-set architecture (ISA).
This design approach enables the comparison of three processors in
section V:
• Basic-LImbiC - without extensions
• LImbiC - with extensions
• Small-LImbiC - reduced variant for edge detection with
Convolution and Canny filter
A. LImbiC
LImbiC is designed using a 32-bit Harvard architecture. One
read-only memory is used to store the program and one read-write
memory is dedicated to data. The data memory is optimized for image
processing applications by supporting two-byte accesses to address
individual pixels.
The instruction-set of the ARM Cortex-M1 is used as a guideline in
the design of LImbiC’s ISA. In contrast to Cortex-M1, LImbiC does
not need to support interrupts and exceptions for the particular target
application. LImbiC therefore comprises the following classes of
instructions according to [16]:
• Data transport and branch instructions
• Arithmetic, logical and bit-manipulation instructions
Since the Cortex-M1 and therefore LImbiC use the ARM Thumb
instruction-set, LImbiC supports 16-bit as well as 32-bit instructions.
All instructions are supported with the pre-UAL (Unified Assembler
Language) Thumb syntax to be able to use existing ARM tool-chains
for the software development. The goal is to be 100 % machine-code
compatible with the Cortex-M1.
LImbiC has 13 32-bit general-purpose registers, as well as a 32-bit
stack pointer, link register and program counter. The program status
register was scaled down to an application-specific status register just
representing the ALU flags and the processor mode. No additional
registers are added to maintain machine-code compatibility with the
Thumb ISA.
The LImbiC processor leverages a 3-stage pipeline comprised of
fetch, decode and execute stages.

Fig. 1: UML models of the algorithms — (a) Convolution, (b) Canny filter gradient ϕ, (c) Canny filter non-maximum suppression

There are two pipeline registers,
which pass the program counter and machine code between the stages
of the pipeline. In the fetch stage, 32-bit words are fetched from
the instruction memory. If the fetched word consists of two 16-bit
instructions, the individual 16-bit instructions are forwarded one after
the other to the decode stage and are then executed during the execute
stage.
B. Algorithms
This subsection describes the composition of the extension algorithms, which are used for image processing in this paper. Firstly,
an algorithm for image filtering using Convolution is presented.
Secondly, the individual steps of a Canny filter algorithm are
developed.
1) Convolution: Figure 1a illustrates the convolution algorithm of
a 3x3 matrix. Firstly, the variables that are needed to execute the
convolution are initialized. Secondly, the memory address and the size
of the image have to be determined. From these, the loop variables
are derived. The convolution for each point is realized in three
parts. Three pixels are loaded from memory, computed and stored
into registers. This is done because most of the instructions can only
access the low registers r0 to r7 leaving only three registers available
for use in calculations. The number of cycles of the algorithm’s
execution is thus minimized taking the limitations of the hardware
into account. Each pixel is then tested to determine if the end of a
line has been reached. If so, two pixels are skipped, otherwise the
neighboring pixels are calculated. The output image size is decreased
by one pixel in each direction. This algorithm can be applied to a
Gaussian as well as to a Sobel filter in both directions [17]; only
the algorithm parameters need to be varied.
2) Canny Filter: The algorithms developed for the Canny filter
are based on the Integrated Vision Toolkit [18]. The output images
of the convolutions with the Sobel matrices Sx and Sy need to
be available in order to be able to calculate the absolute value of
the gray value gradient M and the direction ϕ of the gradient. An
approximation of the gray scale value is computed using Equation 1
to avoid compute-intensive square-root and squaring operations:
M ≈ |Sx| + |Sy|    (1)

Fig. 3: Neighbor pixels necessary to compute index pixels — (a) Convolution computing one pixel, (b) Convolution computing five pixels, (c) Canny filter computing five pixels

TABLE I: Defined variants of the Convolution
opcode | type | convolution
00     | –    | reserved
01     | SX   | Horizontal Sobel operator
10     | SY   | Vertical Sobel operator
11     | GA   | Gaussian filter

Fig. 2: Gradient ϕ and necessary neighbor pixels

An approximation for the direction ϕ is used as well in Equation 2:

ϕ ≈ 90°  if |Sx| > 2|Sy|;
    135° if 2|Sx| > |Sy| and Sx·Sy ≥ 0;
    45°  if 2|Sx| > |Sy| and Sx·Sy < 0;
    0°   otherwise.    (2)
The multiplication by two can be realized by a left-shift within one
cycle. Every approximation exploits that arctan(1/2) = 26.57° ≈ 22.5° is always valid. However, only neighboring pixels need to
be taken into account such that an inaccuracy of 3 % is acceptable
without distorting the image. Figure 2 visualizes the four possible
gradient values in connection with the neighbor pixels as stated in
Equation 2. The current pixel (black), whose gradient values are
calculated, is in the center. The grey pixels show possible edges and
thus the pixels that are tested with the non-maximum suppression.
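As a sketch of how Equations 1 and 2 map to shift-based integer arithmetic, the two helper functions below follow the cases given above; the function names are only illustrative and not part of the LImbiC model.

#include <stdlib.h>

/* Approximate gradient magnitude (Equation 1) and quantized direction
 * (Equation 2). sx and sy are the Sobel responses at the current pixel;
 * the left-shift realizes the multiplication by two. */
static int gradient_magnitude(int sx, int sy)
{
    return abs(sx) + abs(sy);           /* M ≈ |Sx| + |Sy| */
}

static int gradient_direction(int sx, int sy)
{
    int ax = abs(sx), ay = abs(sy);
    if (ax > (ay << 1))                 /* |Sx| > 2|Sy|  -> 90 degrees */
        return 90;
    if ((ax << 1) > ay)                 /* 2|Sx| > |Sy|                */
        return (sx * sy >= 0) ? 135 : 45;
    return 0;                           /* else          ->  0 degrees */
}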
Figure 1b shows the algorithm of the gradient computation explained above. The pixels of the Sx and Sy convolution are loaded
at the current position of the image. M is tested to determine if it
is larger than the threshold T1. The pixel can be a detected edge only
if this is the case; otherwise, the next pixel is tested.
However, if the tested pixel could be a detected edge, the gradient ϕ
is computed. The algorithm aborts execution after testing all pixels
and the resulting data is filtered using the non-maximum suppression.
This algorithm is shown in Figure 1c. The current pixel is loaded
in the first step. If M is greater than threshold T1 , the variables j
and k are set according to ϕ. j and k represent the direction of the
gradient according to Equation 2. The width of the image is needed
to determine the exact position. The current pixel is a local maximum
if the gray value gradient of these pixels is smaller than the grey
value gradient of the current pixel. The current pixel is only marked
as an edge if the grey value gradient is greater than threshold T2 .
This computation is done for each pixel. Finally, the whole image is
run through to set the marked pixels white and all the others to black.
The result is a monochrome image, in which the detected lines are
marked white.
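The non-maximum suppression step could look roughly like the C sketch below. The neighbor offsets j and k follow the quantized directions ϕ of Equation 2 as described above; the output marker value and the omitted final black/white pass are simplifications, and the function name is illustrative.

/* Sketch of the non-maximum suppression described above. m[] holds the
 * gradient magnitudes, phi[] the quantized directions (0/45/90/135),
 * out[] the resulting edge markers. t1 and t2 are the two thresholds. */
static void non_maximum_suppression(const int *m, const int *phi,
                                    unsigned char *out,
                                    int width, int height, int t1, int t2)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int i = y * width + x;
            int j, k;
            out[i] = 0;
            if (m[i] < t1)
                continue;                         /* not an edge candidate */
            switch (phi[i]) {                     /* neighbors along the gradient */
            case 0:   j = i + width;     k = i - width;     break;
            case 45:  j = i + width - 1; k = i - width + 1; break;
            case 135: j = i + width + 1; k = i - width - 1; break;
            default:  j = i + 1;         k = i - 1;         break; /* 90 degrees */
            }
            /* keep the pixel only if it is a local maximum and above T2 */
            if (m[j] <= m[i] && m[k] < m[i] && m[i] >= t2)
                out[i] = 255;
        }
    }
}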
C. Extension of LImbiC
This subsection presents the extension of LImbiC with the algorithms introduced above. The main focus is on the reduction of the
execution time and thus the cycle count needed for the computations
that are directly dependent on the number of memory accesses. The
evaluation of the different trade-offs including performance and area
is presented in section V.
1) Convolution: Each pixel needs to be loaded from memory nine
times during a convolution of an image with a convolution matrix.
Figure 3a illustrates how the grey convolution matrix moves through
the 25-pixel image. The black pixel marks the pixel that needs to
be loaded from memory nine times. Even if the algorithm were to
store the pixel into the stack, the number of cycles needed for the
calculation would not be significantly reduced. There are not enough
low registers to hold all nine new pixels from the stack that are
required for the computation. Thus, the only possibility is to put the
three pixels of one line on the stack. Two pixels are required for the
next step of the convolution as well. Because of this, push and pop
instructions are used instead of load instructions. This approach is
especially beneficial for large matrices.
By determining several pixels within one clock cycle, it is possible
to reduce the number of memory accesses significantly. Figure 3b shows
the approach for the concurrent calculation of five pixels. Only 21
instead of 45 memory accesses are necessary in this case. In general
3(n + 2) instead of 9n memory accesses are needed. Because of
memory latency, memory hierarchy and differing cache strategies, it
is assumed that the cycle count is reduced by an even greater number.
However, an overhead of 6 pixel accesses per additional cluster is
necessary if neighboring pixel clusters are calculated.
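A quick way to check the figures above is to evaluate both expressions for a given cluster size n; the small C program below does exactly that (the loop bound and output format are of course only illustrative).

#include <stdio.h>

/* Memory accesses for the convolution of n pixels:
 * 9n when every pixel is computed individually,
 * 3(n + 2) when n pixels are computed concurrently. */
int main(void)
{
    for (int n = 1; n <= 5; n++)
        printf("n=%d: individual=%d, clustered=%d\n", n, 9 * n, 3 * (n + 2));
    return 0;   /* n=5 yields 45 vs. 21, matching the numbers above */
}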
The coding of the new instruction is machine-code compatible
with ARMv6-M. An undefined coding is taken as the basis for the
new instruction coding. The lower four bits are zero according to
the architecture manual [16]. The lowest two bits are taken to define
the convolution variants. Table I shows the defined variants, their
coding and syntax. The 00 coding is reserved. There are six more
bits apart from the two type bits to define the start address of the
input image and of the output image in data memory. This results in
the instruction syntax and coding shown in Figure 4a.
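Under the field layout sketched in Figure 4a (an 8-bit fixed prefix, two 3-bit low-register fields and the 2-bit type field), an encoder for the new 16-bit instruction could look like the following hedged C sketch; the exact prefix value and the helper name are assumptions based on the reconstructed figure.

#include <stdint.h>

/* Encode the 16-bit convolution instruction C<type> <Rd>,<Rm>, assuming the
 * field layout of Figure 4a: fixed prefix in bits 15-8, Rm in bits 7-5,
 * Rd in bits 4-2 and the convolution type (01 SX, 10 SY, 11 GA) in bits 1-0. */
static uint16_t encode_conv(unsigned rm, unsigned rd, unsigned type)
{
    return (uint16_t)(0xBFu << 8)      /* 1011 1111 prefix (assumed)       */
         | ((rm & 0x7u) << 5)          /* low register with input address  */
         | ((rd & 0x7u) << 2)          /* low register with output address */
         | (type & 0x3u);              /* 01 = CSX, 10 = CSY, 11 = CGA     */
}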
Another application-specific adaptation is the automatic increment
of register values like in the LDM- or STM-instructions. In this
application the registers are incremented by n calculated pixels after
each computation step. No additional add instructions are necessary
to run the convolution due to the auto-increment implementation.
2) Canny Filter: As with the convolution example, the hardware
implementation should reduce the execution time by minimizing
memory accesses and computation cycles. Again, the instruction is
implemented such that several pixels are processed at the same time.
The Canny filter requires more than just the directly neighboring
pixels as shown in Figure 3c. To compute the pixels, information
about the direction ϕ of the gradients needs to be computed. For this
computation, the grey value gradients M of neighboring dark grey
pixels are determined. In this case, however, the neighboring light
grey pixels are necessary. They need to be additionally loaded such
that 5(n + 4) pixels are needed to compute n pixels of the image.
If every pixel were computed individually, 25 pixels would need to be
loaded from memory, which results in 25n pixel loads. Consequently, as
shown in Figure 3c, following the argumentation above the number
of memory accesses is reduced from 125 to 45 for the simultaneous
computation of five pixels. As in the case of the convolution, there
are 15 pixel accesses overhead.
An instruction coding is chosen out of the unused codings of
the ARMv6-M architecture. In addition to the two registers, which
hold the memory addresses of the input and output image, two
thresholds T1 and T2 need to be represented in the instruction. Even
if registers were used to hold the threshold values, twelve bits
would be necessary for the four register addresses. There is no coding left
with twelve available bits in the ARMv6-M. Therefore, the Canny
filter instruction is coded as a 32-bit instruction.
In the ARMv6-M ISA, 32-bit instructions always start with a
111-prefix [16]. The following bits are not allowed to be 00 to
avoid the instruction being wrongly interpreted as a branch instruction. Figure 4b shows
the adaptation of the instruction coding for the Canny filter with
coding 01. Each threshold coding needs to hold values from 0 to
255 necessitating an 8-bit immediate. The two address registers are
high registers in contrast to the convolution instruction.
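Analogously, the 32-bit CAN instruction could be assembled as in the hedged sketch below; the 4-bit high-register fields and the exact bit positions are assumptions derived from the reconstructed coding in Figure 4b, not a verified definition.

#include <stdint.h>

/* Encode CAN <Rd>,<Rm>,#<imm8-L>,#<imm8-H> assuming the reconstructed layout
 * of Figure 4b: a 1110 1000 prefix (111 marks a 32-bit Thumb instruction and
 * the following bits must not be 00), 4-bit high-register fields Rm and Rd,
 * and the two 8-bit threshold immediates. */
static uint32_t encode_can(unsigned rm, unsigned rd,
                           unsigned imm8_h, unsigned imm8_l)
{
    return ((uint32_t)0xE8u << 24)      /* 1110 1000 prefix (assumed)    */
         | ((rm & 0xFu) << 20)          /* input image address register  */
         | ((rd & 0xFu) << 16)          /* output image address register */
         | ((imm8_h & 0xFFu) << 8)      /* threshold immediate imm8-H    */
         |  (imm8_l & 0xFFu);           /* threshold immediate imm8-L    */
}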
Fig. 4: Instruction syntax and coding
(a) Convolution (16-bit): 1 0 1 1 1 1 1 1 | Rm | Rd | type — syntax: C<type> <Rd>,<Rm>
(b) Canny filter (32-bit): 1 1 1 0 1 0 0 0 | Rm | Rd | imm8-H | imm8-L — syntax: CAN <Rd>,<Rm>,#<imm8-L>,#<imm8-H>
IV. REALIZATION OF LIMBIC
LISA is used to model LImbiC and this model is extended to
improve the performance of image processing computations using
Convolution and Canny-filter operations as presented in Section
III-B. The evaluation of this extension is presented in section V.
A. Convolution
Initially, the Gaussian and the Sobel filter are realized in hardware to support the implementation of a Canny filter. The realized
operation expects a 256x256 pixel input image. From this information
and the content of the register holding the start address, all necessary
pixels for the convolution are indexed and loaded from memory. For
the Sobel filter, a 254x254 pixel input image is expected. According
to the line size the necessary pixels of the input image are loaded
from memory. After loading all necessary pixels from memory,
the convolution is started with a particular convolution matrix. The
absolute value is computed and divided by four for the Sobel filter.
This is done to have an 8-bit grey scale image as a result.
Afterwards, the four computed pixels are written into memory.
The starting address in data memory is determined by the second
register value. Finally, the registers Rd and Rm are incremented by
four to compute the next four pixels of the output image. All image
data processing is done on a one-dimensional array in LISA code
because Processor Generator does not support two-dimensional arrays
for HDL-code generation.
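To summarize this behavior, the C fragment below models what one execution of the extended convolution instruction roughly does; the buffer handling, index arithmetic and helper names are simplifications and not taken from the LISA model itself.

#include <stdint.h>
#include <stdlib.h>

/* Behavioral model of one CSX/CSY/CGA execution as described above:
 * starting at *rm, enough input pixels for four output pixels are read,
 * convolved with the selected 3x3 matrix, scaled to an 8-bit grey value
 * (absolute value divided by four for the Sobel variants) and written at
 * *rd. Both address registers are then auto-incremented by four. */
static void conv_step(const uint8_t *image, uint8_t *result,
                      uint32_t *rm, uint32_t *rd,
                      const int k[3][3], int line_width, int sobel)
{
    for (int p = 0; p < 4; p++) {                 /* four pixels per step */
        int acc = 0;
        for (int dy = 0; dy < 3; dy++)
            for (int dx = 0; dx < 3; dx++)
                acc += k[dy][dx] * image[*rm + p + dy * line_width + dx];
        if (sobel)
            acc = abs(acc) / 4;                   /* keep 8-bit grey range */
        if (acc > 255) acc = 255;
        result[*rd + p] = (uint8_t)acc;
    }
    *rm += 4;                                     /* auto-increment, no   */
    *rd += 4;                                     /* extra ADDs needed    */
}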
B. Canny-Filter
Two high registers and two 8-bit immediates are declared. The
memory addresses of the input and output image are loaded. The
Canny filter reduces the image size by two pixels in each dimension.
Consequently, two more pixels in each direction need to be loaded
but not stored again. There are arrays for the input image, the results
of the Sobel filter in x and y direction, the absolute value of the grey
value gradient M , the direction ϕ of the gradient and the calculated
output pixels. An image is loaded and the Sobel filter is applied. The
absolute result is stored as well as the sum of the absolute values and
the gray value gradient. With this information, the direction of the
gradient is computed and stored. Non-maximum suppression is then
used to reduce the detected edges to a line width of one. Finally,
the calculated pixels are stored to memory and the registers are
incremented by five to calculate the next pixels.
V. EVALUATION OF LIMBIC
This section presents the evaluation of LImbiC. First, the functionality is verified by executing the implemented algorithms. Afterwards,
HDL-code is generated for the different versions of LImbiC using
Processor Generator, as introduced in section III. The three different LImbiC variants are synthesized for different target platforms
including Altera and Xilinx FPGAs. An evaluation of the application-specific extensions, based on the synthesized processors, is accomplished. Additionally, the results of the synthesis are compared to give
a conclusion on the code quality and usability of this development
approach. Finally, the processor model is modified enabling it to run
on an FPGA prototyping platform.
A. Execution of the edge detection
The instruction-set of LImbiC is tested by using armcc to compile
several C and assembly programs for the ARMv6-M architecture.
These are executed in the automatically compiled cycle-accurate
simulator called Processor Debugger. Thus, the syntax, coding and
cycle time of instructions, functions and programs are verified.
Fig. 5: Lego input image and output images after CSX, CSY and CAN instruction with different thresholds — (a) Lego, (b) CSX, (c) CSY, (d) CAN 10-20, (e) CAN 20-30, (f) CAN 20-50

Figure 5 shows the output images of the Sobel operators in x and
y direction of the Lego input image (Fig. 5a). To get this result, the
instructions "CSX" (Fig. 5b) and "CSY" (Fig. 5c) are executed. The
detected edges of the input image are represented by bright pixels.
The pixel’s brightness is based on the probability that it is part of
an edge. The contours of the Lego are well detected in both images.
However, the edges are blurry due to the detection algorithm.
The Canny filter extracts the edges and provides a clear representation of the Lego. Figures 5d-5f show the output images that
are computed using the ”CAN ” instruction for the Canny filter.
The parameters for the threshold values are varied to determine the
influence of the threshold value on the quality of the produced output.
The input image is the same Lego as used for the convolution. On
the left side, the threshold T1 is set to 10 and T2 is set to 20 (Fig.
5d). The center image is computed with T1 =20 and T2 =30 (Fig. 5e).
The image on the right side is computed with T1 =20 and T2 =50 (Fig.
5f). As expected, the increase of threshold T2 reduces the number of
detected edges. The output images have more details with the choice
of a lower T2 .
The additional instructions work well in conjunction with the
ARMv6-M architecture and are able to successfully detect edges in
images.
B. Generation of LImbiC
The three different variants of LImbiC are synthesized using
Synplify Pro and the respective vendor tool for place and route for
different FPGA target platforms. The synthesizable Verilog HDL is
automatically generated from the LISA model description. In doing
so, a design space exploration is carried out to evaluate the quality
of the different versions of the HDL code. The goal is to compare
the versions of LImbiC with performance and area trade-offs.
1) Basic-LImbiC and generation settings: Basic-LImbiC is synthesized for five different FPGA platforms: Xilinx Virtex5 (XUPV5),
Virtex6 (ML605) and Virtex7 (VC707), as well as Altera StratixV (SVGX) and
CycloneV (CVGX). Here, the different settings of the Processor
Generator script are explored. The standard setting is best code
readability. Other options are area-optimized, timing-optimized and
power-optimized, which should generate optimized code for each
constraint.
Figure 6 shows the results of the implementations. The maximum
frequency (Fig. 6a) and the absolute number of used logic elements
(Fig. 6b) are plotted for the four different optimization settings
and the five FPGA platforms. The maximum frequency does not
vary much between the different optimization options. However, the
Xilinx Virtex6 is the only FPGA that shows significant variation in
maximum frequency. Virtex6 is the last generation of Xilinx FPGAs
that use the ISE place and route engine. It is assumed that the variance
in frequency is due to the usage of old algorithms that struggle with
consistent performance on very large modern FPGAs. The results
have considerably less variance in maximum frequency for Virtex7
using the newly released Vivado tool chain.
The settings area- and time-optimized appear to always deliver
the same result. Only the power-optimized setting varies: achieving
slightly better performance per area on Xilinx platforms and slightly
worse performance per area on Altera FPGAs. Note that the impact
of this setting on power consumption was not investigated in the
scope of this paper. The best code readability setting achieves the
best implementation results based on the performance per area trade-off. Only the Altera StratixV consumes significantly more LUTs for
this implementation setting while the area of the other FPGAs drops
by 10 %. For this reason, the best code readability setting is chosen
for the presentation of the results in the next subsections.
Fig. 6: Basic-LImbiC synthesized for area, timing, power optimization and best code readability — (a) Maximum frequency, (b) Logic utilization

2) LImbiC - Basic-LImbiC with extensions: LImbiC consists of the Basic-LImbiC with extensions for convolution and a Canny filter. Figure 7 shows the maximum frequency and the number of LUTs in relation to the Basic-LImbiC. For most of the FPGAs the new instructions result in decreased performance per area. This comes with a cost of up to 40 % more LUTs. However, it is important to note that this area overhead also includes the new instructions. These instructions significantly reduce the number of instructions required to execute the algorithm. A 254x254 pixel image has 64,516 pixels, which need to be processed inside a loop of 35 instructions. This results in 2.26 million instructions for a convolution in software that needs to be run twice. In contrast, the hardware solution just needs 5 instructions per loop computing four pixels at the same time, resulting in 81,650 instructions. This is a 98 % reduction in instructions. This also reduces the total execution time; however, performance gains are limited by the memory bottleneck and the use of load/store multiple instructions. The trade-off is discussed in more detail with the introduction of Small-LImbiC in the next subsection.

Fig. 7: LImbiC: frequency and logic relative to Basic-LImbiC

3) Small-LImbiC - An image processor: The development of LImbiC presented above shows the potential of introducing new instructions to the ISA. However, doing so makes other standard instructions obsolete in this particular application. Thus, the instruction-set can be reduced to the minimum number of instructions necessary to execute the image processing. This results in the development of Small-LImbiC, whose ISA consists of the new instructions CGA, CSX, CSY, CAN and of the branch and arithmetic instructions ADD, LSL and CMP. As discussed in the previous subsection, the new instructions come with an area overhead that is now reduced by reducing the instruction set. Figure 8 presents the synthesis results of Small-LImbiC. The maximum achievable frequency more than doubles due to the reduction of complexity in the processor. The implementation area is reduced significantly by 75 %, which is just slightly more than a single implementation of each extension instruction alone. The performance per area trade-off, already addressed above, is between 10 and 20 times higher compared to the Basic-LImbiC. Additionally, the reduction of the overall instruction count still applies, leading to a significantly shorter execution time on top of this enhanced performance per area trade-off.

Fig. 8: Small-LImbiC: frequency and logic relative to Basic-LImbiC
C. Prototyping of LImbiC
The LISA language includes many features that are specifically
targeted towards use in models that will be synthesized. For models
that are only needed for simulation and software tool generation,
it is not necessary to implement the design using these additional
features. However, if the design is to be synthesized, Synopsys
recommends several guidelines and features that can improve the
design significantly. In particular, resource sharing in the LISA model
must be explicitly stated for optimal HDL output. If this is not done,
ALU functionality that could be shared will likely end up being
duplicated within many operations. In addition, in the case of register
bank accesses, this can lead to a significant number of muxes being
generated to accommodate accesses from many functional blocks
[19]. This can also lead to the design requiring multi-ported memory
when the tool is not capable of properly detecting that accesses to
memory from separate operations are mutually exclusive.
The LImbiC processor was developed primarily for use as a
simulation model. For this reason, a comparison was performed
between LImbiC and a modified synthesis-optimized version of the
LImbiC processor. Both of these models are based on the BasicLImbiC design. This was done because, on the original model, the
tool was not able to properly generate an HDL design capable of
handling multiple accesses to an external memory, as required by the
added custom instructions.
The synthesis-optimized version of the LImbiC processor was
significantly modified to follow Synopsys’s recommendations for a
synthesizable design. All register accesses were moved to a single
operation to enable resource sharing among all instructions as well as
create a centralized method for register forwarding. ALU operations
were combined into shared operations where possible and operation
coding formats were logically grouped to simplify the required
decoding logic. In addition, a secondary memory access pipeline was
added to the model to ensure only a single port memory would be
required and better tolerate variable latency out of the memory.
Generated HDL code from both the original Basic-LImbiC model
as well as the optimized version was synthesized targeting a Virtex5 LX110T. Both LImbiC processors implement the same ISA and
are synthesized without a debug unit. For comparison, the ARM
Cortex-M1 numbers are included from the data in ARM's marketing
material [20]. Table II shows the results from these runs. Note that,
for better comparability, these numbers only account for the processor
core itself and do not include external memories or bus interfaces.

TABLE II: Comparison of Basic-LImbiC with a synthesis-optimized version and the ARM Cortex-M1
                         | Basic-LImbiC | Basic-LImbiC (Synthesis-optimized) | ARM (Cortex-M1)
Maximum frequency in MHz | 87.5         | 87.3                               | 200
Utilized Registers       | 612          | 968                                | not published
Utilized logic in LUT    | 6587         | 3211                               | 2900
As can be seen from these results, optimizing the LISA model
for synthesis was able to reduce its required LUT count by 50 %,
although the register count was increased. This occurred due to the
increased use of pipeline registers within the processor. The changes
were also found to only have a negligible impact on the maximum
operating frequency of the final synthesized processor.
Compared to the ARM Cortex-M1, both models were
found to have a larger area and a slower maximum speed, as could be
expected. The synthesis-optimized LImbiC design was found to be
just 10 % larger than the ARM design, although the maximum clock
frequency is less than half of the ARM supplied core.
VI. LIMITATIONS OF THE LISA-BASED DEVELOPMENT
There are some notable limitations when developing a custom
processor with LISA. The LISA language itself has some limitations
that decrease productivity. For instance, because LISA operations
have no concept of input parameters, global variables must be used
to pass data to an operation. This increases code size and complicates debugging. In addition, LISA does not support typical C-style
functions, so these must be implemented using C-style preprocessor
macros instead. This has a detrimental effect on code readability and
can also create bugs that are difficult to locate.
Another problem is that coding styles between Processor Generator
and Processor Debugger differ, limiting model portability between
use cases. As was shown in subsection V-B, a significant amount of
code had to be re-written to create an improved HDL implementation,
even though the design was already successfully simulating with the
generated tool chain. This also occurs in some unique cases such as
in the coding section of an operation. In contrast to the Processor
Designer, Processor Generator does not support arithmetic functions
inside the coding section although this is explicitly defined in the
LISA reference [21].
Finally, the results have shown that the timing- and area-optimized
settings of the Processor Generator result in the same LUT count
and maximum achievable frequency, which limits the design space
exploration that can be performed without significant modification to
the source code.
VII. CONCLUSION AND FUTURE WORK
A LISA model in Processor Designer can be used to produce three
different types of outputs: software development tools, simulation
models, and synthesizable HDL models. In many cases, only the
first two outputs may be necessary, for instance, when the tool
is used to enable a pre-silicon software development environment.
However, in many cases, it may be desirable to generate a functioning
design directly from the high level model. By doing so, the generated
software tools are guaranteed to work with the generated design, and
total design time is drastically reduced.
By using the Processor Designer, a processor can be modeled
comparably fast with minimal knowledge of HDL-based microarchitecture design as is typically done today. The ADL LISA is
well suited to develop an application-specific processor with a small
number of instructions. The development of a complete processor
is constrained due to the overhead of the automatically generated
HDL code. This approach is very beneficial for evaluating new
ideas in a comparably short time, especially when adding
new instructions. The LISA model and the generated HDL code
are platform-independent and the HDL code is then synthesized,
including place and route for the target platform, resulting in different
area vs. frequency trade-off solutions as presented in section V.
Additionally, this modeling approach is very well suited for development of an ISA and the necessary software development tool chain,
e.g., the assembler, linker and simulator to test the new design.
The model with its tools as well as the HDL model can be used
as a base for the development of an optimized HDL-model, where
additional knowledge in the domain of processor development in RTL
is necessary.
For future work, the execution time of the software on the different
versions of LImbiC should be investigated in more detail. Additionally,
the impact on power consumption and the consumed energy for the
processing of an image will be studied. It seems especially interesting
to compare these numbers to an implementation on an Actel FPGA,
as the flash logic should consume significantly less energy compared
to the FPGAs chosen in this paper.
Another very interesting idea is to build a tool kit for using
different versions of LImbiC with LISA. The Integrated Vision
Toolkit (IVT) contains a number of image processing algorithms. Each
algorithm could be modeled with one or more instructions in order
to build an even more comprehensive flexible processor model. The
user can then choose which algorithms should be included in each
version of the processor and whether it can be a stripped-down version
especially for the particular application.
ACKNOWLEDGMENT
This work was supported by the German Research Foundation
(DFG) as part of the Transregional Collaborative Research Center
”Invasive Computing” (SFB/TR 89) and by the BMBF as part of the
joint project CONDOR.
REFERENCES
[1] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr, “A
methodology for the design of application specific instruction set processors
(asip) using the machine description language lisa,” in Proceedings of the
IEEE/ACM international conference on Computer-aided design, 2001.
[2] C. Tradowsky, F. Thoma, M. Hubner, and J. Becker, “Lisparc: Using an
architecture description language approach for modelling an adaptive processor
microarchitecture,” in 7th IEEE International Symposium on Industrial Embedded Systems (SIES), 2012.
[3] J. Henkel, “Closing the soc design gap,” Computer, vol. 36, pp. 119 – 121, 2003.
[4] M. Dales, “Managing a reconfigurable processor in a general purpose workstation
environment,” in Design, Automation and Test in Europe Conference and
Exhibition, 2003.
[5] R. Lysecky, G. Stitt, and F. Vahid, “Warp processors,” in ACM Transactions on
Design Automation of Electronic Systems (TODAES), 2004.
[6] L. Bauer, M. Shafique, S. Kramer, and J. Henkel, “Rispp: Rotating instruction
set processing platform,” in DAC ’07. 44th ACM/IEEE Design Automation
Conference, 2007.
[7] F. Thoma, M. Kuhnle, P. Bonnot, E. Panainte, K. Bertels, S. Goller, A. Schneider,
S. Guyetant, E. Schuler, K. Muller-Glaser, and J. Becker, “Morpheus: Heterogeneous reconfigurable computing,” in International Conference on Field
Programmable Logic and Applications, 2007.
[8] R. Koenig, L. Bauer, T. Stripf, M. Shafique, W. Ahmed, J. Becker, and J. Henkel,
“Kahrisma: A novel hypermorphic reconfigurable-instruction-set multi-grainedarray architecture,” in Design, Automation Test in Europe Conference Exhibition
(DATE), 2010.
[9] A. Fauth, J. Van Praet, and M. Freericks, “Describing instruction set processors
using nml,” in European Design and Test Conference, 1995.
[10] G. Hadjiyiannis, S. Hanono, and S. Devadas, “Isdl: An instruction set description
language for retargetability,” in Proceedings of the 34th Design Automation
Conference, 1997.
[11] S. Bashford, U. Bieker, B. Harking, R. Leupers, P. Marwedel, A. Neumann, and
D. Voggenauer, “The mimola language v 4.1,” Forschungsbericht, Universität
Dortmund, FB Informatik, 1994.
[12] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau, “Expression:
a language for architecture exploration through compiler/simulator retargetability,” in Design, Automation and Test in Europe Conference and Exhibition, 1999.
[13] A. Hoffmann, H. Meyr, and R. Leupers, Architecture Exploration for Embedded
Processors with Lisa. Kluwer Academic Publishers, 2002.
[14] V. Dodani, N. Kumar, U. Nanda, and K. Mahapatra, “Optimization of an application specific instruction set processor using application description language,” in
International Conference on Industrial and Information Systems (ICIIS), 2010.
[15] A. Nohl, F. Schirrmeister, and D. Taussig, “Application specific processor design
architectures, design methods and tools,” in Proceedings of the International
Conference on Computer-Aided Design, 2010.
[16] ARMv6-M Architecture Reference Manual, ARM Limited, 2010.
[17] R. Dillmann, “Vorlesung kognitive systeme.”
[18] P. Azad, T. Gockel, and R. Dillmann, Computer Vision: principles and practice.
Elektor Electronics Publishing, 2008.
[19] Synopsys, Inc., Processor Designer Product Family: Processor Design Guide,
2010.
[20] ARM. Cortex-m1 processor - performance. [Online]. Available: http://www.
arm.com/products/processors/cortex-m/cortex-m1.php
[21] Synopsys, Inc., Processor Designer Product Family: LISA Modeling Fundamentals, 2010.