2013 IEEE Computer Society Annual Symposium on VLSI

LImbiC: An Adaptable Architecture Description Language Model for Developing an Application-Specific Image Processor

Carsten Tradowsky∗, Tanja Harbaum∗, Shaver Deyerle† and Jürgen Becker∗
∗ Institute for Information Processing Technology (ITIV), Karlsruhe Institute of Technology (KIT), Email: {tradowsky, harbaum, becker}@kit.edu
† Bradley Department of Electrical & Computer Engineering, Virginia Polytechnic Institute and State University, Email: sdeyerle@vt.edu

Abstract—Due to their ease of integration and widespread adoption, General Purpose Processors (GPP) are presently used in a wide range of applications. However, the highly flexible nature of a GPP leads to overhead in terms of power, performance and area for a specific application. Another approach, proposed by this paper, is to use Application-Specific Instruction-set Processors (ASIP) that are specifically adapted to a given application. To decrease development time and effort, and consequently time-to-market, a model-based development process is used. The high-level model allows for automated generation of software development tools, simulation models and RTL models from a single source. An adaptable LISA model representing a simplified ARM Cortex-M1 processor is used as a base, which is then supplemented by application-specific features requested by the software developer or system architect. This paper presents a working example of this concept, in which a state-of-the-art processor model, which we call LImbiC, is extended to meet the requirements of a specific application. Specifically, custom instructions are added to the LImbiC processor to improve its performance in the particular task of image processing. In addition, infrequently used or obsolete instructions can be removed during the development process, which allows separate versions of LImbiC to meet varying design goals within the design space exploration.

I. INTRODUCTION

Historically, development of application-specific processors has been cost-prohibitive for all but the largest semiconductor companies [1]. The Architecture Description Language (ADL) Language for Instruction-Set Architectures (LISA), used in combination with Synopsys Processor Designer, is advertised to change this situation. The border between hardware and software is softened by the automated generation of development tools, simulators and hardware description language (HDL) code out of a single LISA source model. We published this approach to application-specific processor development in [2]. However, very limited documentation exists on the application of high-level model-based development approaches. This raises several questions that as of yet remain unanswered: Is it possible to generate a custom processor out of a single high-level model? Can automatically generated tools like simulators, compilers and hardware be used for a complete processor development cycle or even for commercial production of a processor? If this is the case, could a processor be expanded with additional functionality that is tailored to meet the exact requirements of a given application? Complex software tasks could be migrated to a basic hardware processor architecture to enable more efficient computation in a specific application such as image processing. This would enable new possibilities that until now have not been realized in commercial processors due to the costly development cycle and resulting prohibitively long time-to-market.
This paper provides an approach and gives initial answers to the questions regarding the applicability of an ADL-based development flow raised above. To begin, an overview of the state of the art is given in section II. In section III, a concept for application-specific extensions of the used instruction-set is developed. The realization of the suggested concept using LISA is presented in section IV. Subsequently, in section V, the LISA model is evaluated regarding the introduced extensions and the modeling overhead in relation to power, performance and area trade-offs. The resulting model is then prototyped on an FPGA platform, exploring the design space in each stage of the flow. Finally, the approach of this paper is discussed in section VI and a conclusion with ideas for future work is given in section VII.

II. RELATED WORK

In this section, a summary of the state of the art in the field of application-specific processors and reconfigurable architectures is presented. Afterwards, an overview of Architecture Description Languages (ADL) and a brief introduction to the Language for Instruction-set Architectures (LISA) is given.

A. Application-Specific Instruction-set Processors (ASIP) and Reconfigurable Architectures

The development of Application-Specific Instruction-set Processors (ASIP) has historically been reserved for processor developers in the semiconductor industry [1]. To be able to develop an ASIP, one needs to have in-depth knowledge in several domains, e.g., algorithm analysis, system modeling, simulation, synthesis, and verification. Thus, qualified specialists are required to handle the multitude of needed models, tools and tasks, leading to a development cycle that is costly and time-consuming [1]. Recent developments in the field of processor architectures move in the direction of optimized application-specific processors. By utilizing this approach, extended by reconfigurable instruction-sets and hardware extensions, optimizations for measures such as performance per watt and performance per mm² can be achieved [3]. In [4], the author presents the Proteus processor. This reconfigurable approach extends the processor through the use of a tightly coupled fabric. Lysecky et al. present the WARP processor [5], which detects run-time-critical tasks and deploys them to hardware. The Rotating Instruction-Set Processing Platform (RISPP), which is introduced by Bauer et al. [6], adds new special instructions together with a runtime system that supports them. In [7], Thoma et al. present MORPHEUS, which enables differently sized flexible platforms for heterogeneous hardware/software co-design. König et al. present a new architectural approach [8], containing coarse-grained and fine-grained runtime reconfigurable processor arrays, which can be used to accelerate complex algorithms.

B. Architecture Description Languages

ADLs should be able to represent the software and hardware view of a processor in a single source model. There are several ADLs that only support a software view, e.g., Not a Machine Language (nML) [9], the Instruction-set Description Language (ISDL) [10], and the Machine Independent Microprogramming Language (MIMOLA) [11]. Their syntax represents only the programmers' view of the architecture. Therefore, these languages do not support a cycle-accurate model and mainly describe the instruction coding and assembly syntax. On the other hand, the LISP-like ADL EXPRESSION includes such a cycle-accurate description [12].
This mixed instruction-set and architecture-specific language is particularly useful throughout the entire ASIP design process. The language is able to support design space exploration and tool generation, all the way to architecture implementation. LISA, which is used for the work presented in this paper, was conceptually developed at RWTH Aachen and combines the perspectives of both structure and behavior. In contrast to behavioral models, like Open Virtual Platforms (OVP), cycle-accurate models in LISA provide a more detailed representation of the architecture. The tool chain, the simulator as well as the hardware description can be generated from this source model [13].

There are a few promising projects in which an ASIP was developed using LISA. At the National Institute of Technology, a small processor with 19 instructions and a three-stage pipeline was developed [14]. The processor was extended by an FIR filter and constrained to eight instructions. In the evaluation of the examples, the ASIP was only compared to the basic processor and no comment was made on the quality of the code resulting from the LISA development approach. Synopsys has developed a processor for use in video compression applications [15]. This project focuses on cost-efficiency during the development using Processor Generator, instead of the optimization of an existing processor. In this paper, an extension of the instruction-set is utilized to adapt the processor's microarchitecture as described in the following sections.

III. DESIGN OF LIMBIC

This section describes the design of the LISA model implementing the base functionality of LImbiC. LImbiC is based on the ARM Cortex-M1 and its ARMv6-M architecture [16]. LImbiC is extended by two special instructions, whose syntax and coding are defined as an extension of the existing instruction-set architecture (ISA). This design approach enables the comparison of three processors in section V:
• Basic-LImbiC - without extensions
• LImbiC - with extensions
• Small-LImbiC - reduced variant for edge detection with Convolution and Canny filter

A. LImbiC

LImbiC is designed using a 32-bit Harvard architecture. One read-only memory is used to store the program and one read-write memory is dedicated to data. The data memory is optimized for image processing applications by supporting two-byte accesses to address individual pixels. The instruction-set of the ARM Cortex-M1 is used as a guideline in the design of LImbiC's ISA. In contrast to the Cortex-M1, LImbiC does not need to support interrupts and exceptions for the particular target application. LImbiC therefore comprises the following classes of instructions according to [16]:
• Data transport and branch instructions
• Arithmetic, logical and bit-manipulation instructions

Since the Cortex-M1 and therefore LImbiC use the ARM Thumb instruction-set, LImbiC supports 16-bit as well as 32-bit instructions. All instructions are supported with the pre-UAL Thumb assembler syntax to be able to use existing ARM tool-chains for the software development. The goal is to be 100 % machine-code compatible with the Cortex-M1. LImbiC has thirteen 32-bit general-purpose registers, as well as a 32-bit stack pointer, link register and program counter. The program status register was scaled down to an application-specific status register representing just the ALU flags and the processor mode. No additional registers are added to maintain machine-code compatibility with the Thumb ISA.
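For orientation, the programmer-visible state described in this subsection can be summarized in a small C sketch. The struct and its field names are illustrative only and are not taken from the LISA model; the ALU flags are assumed to be the usual ARM N, Z, C and V flags.

#include <stdint.h>

/* Programmer-visible state of LImbiC as described above: thirteen 32-bit
 * general-purpose registers, a 32-bit stack pointer, link register and
 * program counter, and a reduced status register holding only the ALU
 * flags and the processor mode. All names are illustrative. */
typedef struct {
    uint32_t r[13];         /* r0-r12, general-purpose registers        */
    uint32_t sp;            /* stack pointer                            */
    uint32_t lr;            /* link register                            */
    uint32_t pc;            /* program counter                          */
    struct {
        unsigned n : 1;     /* negative flag                            */
        unsigned z : 1;     /* zero flag                                */
        unsigned c : 1;     /* carry flag                               */
        unsigned v : 1;     /* overflow flag                            */
        unsigned mode : 1;  /* processor mode (width is an assumption)  */
    } status;               /* application-specific status register     */
} limbic_state_t;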
The LImbiC processor leverages a 3-stage pipeline comprised of fetch, decode and execute stages. There are two pipeline registers, which pass the program counter and the machine code between the stages of the pipeline. In the fetch stage, 32-bit words are fetched from the instruction memory. If the fetched word consists of two 16-bit instructions, the individual 16-bit instructions are forwarded one after the other to the decode stage and are then executed during the execute stage.

Fig. 1: UML models of the algorithms — (a) Convolution, (b) Canny filter gradient ϕ, (c) Canny filter non-maximum suppression

B. Algorithms

This subsection describes the composition of the extension algorithms, which are used for image processing in this paper. First, an algorithm for image filtering using convolution is presented. Second, the individual steps of a Canny filter algorithm are developed.

1) Convolution: Figure 1a illustrates the convolution algorithm for a 3x3 matrix. First, the variables that are needed to execute the convolution are initialized. Second, the memory address and the size of the image have to be determined. From these, the loop variables are derived. The convolution for each point is realized in three parts. Three pixels are loaded from memory, processed and stored into registers. This is done because most of the instructions can only access the low registers r0 to r7, leaving only three registers available for use in calculations. The number of cycles of the algorithm's execution is thus minimized while taking the limitations of the hardware into account. Each pixel is then tested to determine whether the end of a line has been reached. If so, two pixels are skipped; otherwise, the neighboring pixels are calculated. The output image size is decreased by one pixel in each direction. This algorithm can be applied to a Gaussian as well as to a Sobel filter in both directions [17]; only the algorithm parameters need to be varied.

2) Canny Filter: The algorithms developed for the Canny filter are based on the Integrated Vision Toolkit [18]. The output images of the convolutions with the Sobel matrices Sx and Sy need to be available in order to be able to calculate the absolute value of the gray value gradient M and the direction ϕ of the gradient. An approximation of the gray value gradient is computed using Equation 1 to avoid compute-intensive square and square-root operations:

M ≈ |Sx| + |Sy|   (1)

Fig. 3: Neighbor pixels necessary to compute index pixels — (a) convolution computing one pixel, (b) convolution computing five pixels, (c) Canny filter computing five pixels

TABLE I: Defined variants of the Convolution
opcode | type | convolution
00     | -    | reserved
01     | SX   | Horizontal Sobel operator
10     | SY   | Vertical Sobel operator
11     | GA   | Gaussian filter

An approximation for the direction ϕ is used as well, given in Equation 2:

ϕ ≈ 90°,  if |Sx| > 2|Sy|
    135°, if 2|Sx| > |Sy| and Sx·Sy ≥ 0
    45°,  if 2|Sx| > |Sy| and Sx·Sy < 0
    0°,   else   (2)
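Both approximations map directly onto the integer operations available on LImbiC. The following C sketch (ours, not taken from the paper's LISA model; the names are illustrative) evaluates Equations 1 and 2 using only absolute values, one left-shift for the multiplication by two, and comparisons:

#include <stdlib.h>

/* Approximate gradient magnitude (Equation 1) and quantized direction
 * (Equation 2) from the horizontal and vertical Sobel responses sx and sy.
 * The direction in degrees (0, 45, 90 or 135) is returned via *phi. */
int gradient_approx(int sx, int sy, int *phi)
{
    int ax = abs(sx);
    int ay = abs(sy);
    int m  = ax + ay;                /* M ~ |Sx| + |Sy|              (1) */

    if (ax > (ay << 1))              /* |Sx| > 2|Sy|                     */
        *phi = 90;
    else if ((ax << 1) > ay)         /* 2|Sx| > |Sy|                     */
        *phi = (sx * sy >= 0) ? 135 : 45;
    else
        *phi = 0;
    return m;
}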
The multiplication by two can be realized by a left-shift within one cycle. Every approximation exploits the fact that arctan(1/2) = 26.57° ≈ 22.5° holds. Since only neighboring pixels need to be taken into account, an inaccuracy of 3 % is acceptable and does not distort the image.

Fig. 2: Gradient ϕ and necessary neighbor pixels

Figure 2 visualizes the four possible gradient values in connection with the neighbor pixels as stated in Equation 2. The current pixel (black), whose gradient values are calculated, is in the center. The grey pixels show possible edges and thus the pixels that are tested with the non-maximum suppression.

Figure 1b shows the algorithm of the gradient computation explained above. The pixels of the Sx and Sy convolution are loaded at the current position of the image. M is tested to determine whether it is larger than the threshold T1. Only if this is the case can the pixel be a detected edge; otherwise, the next pixel is tested. If the tested pixel could be a detected edge, the gradient direction ϕ is computed. After all pixels have been tested, the algorithm terminates and the resulting data is filtered using the non-maximum suppression. This algorithm is shown in Figure 1c. The current pixel is loaded in the first step. If M is greater than threshold T1, the variables j and k are set according to ϕ. j and k represent the direction of the gradient according to Equation 2. The width of the image is needed to determine their exact positions. The current pixel is a local maximum if the gray value gradients of these pixels are smaller than the gray value gradient of the current pixel. The current pixel is only marked as an edge if its gray value gradient is also greater than threshold T2. This computation is done for each pixel. Finally, the whole image is run through to set the marked pixels white and all the others to black. The result is a monochrome image, in which the detected lines are marked white.
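The non-maximum suppression step just described can be summarized in a short C sketch that follows the flow of Figure 1c. The function and variable names are ours, and border handling is assumed to be done by the caller:

/* Non-maximum suppression for one pixel i, following Figure 1c.
 * m[] holds the gradient magnitudes M, phi[] the quantized directions,
 * out[] the marked result; t1 and t2 are the two thresholds and width
 * is the line length of the image. */
void nms_pixel(const int *m, const int *phi, unsigned char *out,
               int i, int width, int t1, int t2)
{
    int j, k;

    out[i] = 1;                        /* default: not an edge            */
    if (m[i] < t1)
        return;

    switch (phi[i]) {                  /* offsets taken from Figure 1c    */
    case 90:  j = i + 1;         k = i - 1;         break;
    case 0:   j = i + width;     k = i - width;     break;
    case 45:  j = i + width - 1; k = i - width + 1; break;
    default:  j = i + width + 1; k = i - width - 1; break;  /* 135 */
    }

    /* local maximum along the gradient direction and above T2 -> edge    */
    if (m[j] <= m[i] && m[k] < m[i] && m[i] >= t2)
        out[i] = 254;                  /* marked, later set to white       */
}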
C. Extension of LImbiC

This subsection presents the extension of LImbiC with the algorithms introduced above. The main focus is on reducing the execution time, and thus the cycle count needed for the computations, which is directly dependent on the number of memory accesses. The evaluation of the different trade-offs, including performance and area, is presented in section V.

1) Convolution: Each pixel needs to be loaded from memory nine times during a convolution of an image with a 3x3 convolution matrix. Figure 3a illustrates how the grey convolution matrix moves through the 25-pixel image. The black pixel marks the pixel that needs to be loaded from memory nine times. Even if the algorithm were to store the pixel on the stack, the number of cycles needed for the calculation would not be significantly reduced. There are not enough low registers to hold all nine pixels from the stack that are required for the computation. Thus, the only possibility is to put the three pixels of one line on the stack. Two of these pixels are required for the next step of the convolution as well. Because of this, push and pop instructions are used instead of load instructions. This approach is especially beneficial for large matrices. By determining several pixels within one clock cycle, it is possible to reduce the number of memory accesses significantly. Figure 3b shows the approach for the concurrent calculation of five pixels. Only 21 instead of 45 memory accesses are necessary in this case. In general, 3(n + 2) instead of 9n memory accesses are needed. Because of memory latency, memory hierarchy and differing cache strategies, it is assumed that the cycle count is reduced by an even greater amount. However, an overhead of 6 pixel accesses per additional cluster is necessary if neighboring pixel clusters are calculated.

The coding of the new instruction is machine-code compatible with ARMv6-M. An undefined coding is taken as the basis for the new instruction coding. The lower four bits of this coding are zero according to the architecture manual [16]. The lowest two bits are taken to define the convolution variants. Table I shows the defined variants, their coding and syntax. The 00 coding is reserved. Apart from the two type bits, there are six more bits to define the start addresses of the input image and of the output image in data memory. This results in the instruction syntax and coding shown in Figure 4a. Another application-specific adaptation is the automatic increment of register values, as in the LDM and STM instructions. In this application, the registers are incremented by the n calculated pixels after each computation step. No additional add instructions are necessary to run the convolution due to this auto-increment implementation.

2) Canny Filter: As with the convolution example, the hardware implementation should reduce the execution time by minimizing memory accesses and computation cycles. Again, the instruction is implemented such that several pixels are processed at the same time. The Canny filter requires more than just the directly neighboring pixels, as shown in Figure 3c. To compute the pixels, information about the direction ϕ of the gradients needs to be computed. For this computation, the gray value gradients M of the neighboring dark grey pixels are determined. In this case, however, the neighboring light grey pixels are necessary as well. They need to be additionally loaded, such that 5(n + 4) pixels are needed to compute n pixels of the image. If every pixel were computed individually, 25 pixels would need to be loaded from memory for each of them, which results in 25n pixel accesses. Consequently, as shown in Figure 3c and following the argumentation above, the number of memory accesses is reduced from 125 to 45 for the simultaneous computation of five pixels. As in the case of the convolution, there is an overhead of 15 pixel accesses.

An instruction coding is chosen out of the unused codings of the ARMv6-M architecture. In addition to the two registers, which hold the memory addresses of the input and output image, two thresholds T1 and T2 need to be represented in the instruction. Even if registers were used to hold the threshold values, twelve bits would be necessary for the four register addresses. There is no coding left with twelve available bits in the ARMv6-M. Therefore, the Canny filter instruction is coded as a 32-bit instruction. In the ARMv6-M ISA, 32-bit instructions always start with a 111 prefix [16]. The following two bits are not allowed to be 00, to avoid the word being wrongly interpreted as a branch instruction. Figure 4b shows the adaptation of the instruction coding for the Canny filter with the coding 01. Each threshold needs to hold values from 0 to 255, necessitating an 8-bit immediate. The two address registers are high registers, in contrast to the convolution instruction.
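To make the described codings concrete, the following C helpers pack the fields into instruction words. They are a sketch for illustration only: the field widths follow the text (two type bits and two 3-bit register numbers for the convolution; the 111 prefix, the 01 marker, two register numbers and two 8-bit thresholds for CAN), but the exact bit positions, the fixed opcode pattern and the assignment of T1 and T2 to imm8-L and imm8-H are assumptions read from Figure 4.

#include <stdint.h>

enum conv_type { CONV_RESERVED = 0, CONV_SX = 1, CONV_SY = 2, CONV_GA = 3 };

/* 16-bit convolution instruction C<type> <Rd>,<Rm>.
 * Assumed layout (cf. Fig. 4a): [15:8] fixed opcode pattern taken from an
 * undefined ARMv6-M coding, [7:5] Rm, [4:2] Rd, [1:0] type. */
uint16_t encode_conv(unsigned rd, unsigned rm, enum conv_type type)
{
    const uint16_t opcode = 0xBFu << 8;            /* assumed pattern       */
    return opcode | ((rm & 0x7u) << 5) | ((rd & 0x7u) << 2) | (type & 0x3u);
}

/* 32-bit Canny instruction CAN <Rd>,<Rm>,#<imm8-L>,#<imm8-H>.
 * Assumed layout (cf. Fig. 4b): [31:29] = 111 prefix, [28:27] = 01 marker,
 * [23:20] Rm, [19:16] Rd, [15:8] imm8-H, [7:0] imm8-L. */
uint32_t encode_can(unsigned rd, unsigned rm,
                    uint8_t t1 /* imm8-L, assumed T1 */,
                    uint8_t t2 /* imm8-H, assumed T2 */)
{
    uint32_t word = (0x7u << 29) | (0x1u << 27);   /* 111 prefix, 01 marker */
    word |= (rm & 0xFu) << 20;                     /* high register Rm      */
    word |= (rd & 0xFu) << 16;                     /* high register Rd      */
    word |= (uint32_t)t2 << 8;                     /* threshold T2          */
    word |= t1;                                    /* threshold T1          */
    return word;
}

In the actual tool flow these words are of course emitted by the assembler generated from the LISA model; the helpers above only illustrate how compactly the extensions fit into the unused ARMv6-M codings.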
Fig. 4: Instruction syntax and coding — (a) Convolution: C<type> <Rd>,<Rm>; (b) Canny filter: CAN <Rd>,<Rm>,#<imm8-L>,#<imm8-H>

IV. REALIZATION OF LIMBIC

LISA is used to model LImbiC, and this model is extended to improve the performance of image processing computations using the Convolution and Canny-filter operations presented in Section III-B. The evaluation of this extension is presented in section V.

A. Convolution

Initially, the Gaussian and the Sobel filter are realized in hardware to support the implementation of a Canny filter. The realized operation expects a 256x256 pixel input image. From this information and the content of the register holding the start address, all pixels necessary for the convolution are indexed and loaded from memory. For the Sobel filter, a 254x254 pixel input image is expected. According to the line size, the necessary pixels of the input image are loaded from memory. After loading all necessary pixels from memory, the convolution is started with the particular convolution matrix. For the Sobel filter, the absolute value is computed and divided by four. This is done to obtain an 8-bit grey scale image as a result. Afterwards, the four computed pixels are written into memory. The starting address in data memory is determined by the second register value. Finally, the registers Rd and Rn are incremented by four to compute the next four pixels of the output image. All image data processing is done on a one-dimensional array in the LISA code because Processor Generator does not support two-dimensional arrays for HDL-code generation.

B. Canny-Filter

Two high registers and two 8-bit immediates are declared. The memory addresses of the input and output image are loaded. The Canny filter reduces the image size by two pixels in each dimension. Consequently, two more pixels in each direction need to be loaded but are not stored again. There are arrays for the input image, the results of the Sobel filter in x and y direction, the absolute value of the gray value gradient M, the direction ϕ of the gradient and the calculated output pixels. An image is loaded and the Sobel filter is applied. The absolute results are stored, as well as their sum, the gray value gradient M. With this information, the direction of the gradient is computed and stored. Non-maximum suppression is then used to reduce the detected edges to a line width of one. Finally, the calculated pixels are stored to memory and the registers are incremented by five to calculate the next pixels.
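The data flow of one CAN invocation described above can be sketched in C as follows. The sketch reuses the gradient_approx() and nms_pixel() helpers from the earlier sketches, computes N = 5 output pixels per call from a local window of 5 x (N+4) input pixels, and leaves the advancement of the image addresses by five pixels to the caller; all names and the window layout are illustrative and not taken from the LISA model.

#define N 5                            /* output pixels per invocation    */

int  gradient_approx(int sx, int sy, int *phi);
void nms_pixel(const int *m, const int *phi, unsigned char *out,
               int i, int width, int t1, int t2);

/* in points to the 5 x (N+4) input window needed for this step, stored
 * row by row; out receives the N marked pixels of the current row.       */
void can_step(const unsigned char *in, unsigned char *out, int t1, int t2)
{
    const int w = N + 4;               /* line length of the local window */
    int m[3 * (N + 2)], phi[3 * (N + 2)];
    unsigned char marked[3 * (N + 2)];

    /* Sobel in x and y, then gradient magnitude M and direction phi for
     * the 3 x (N+2) neighborhood around the output pixels                */
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < N + 2; c++) {
            const unsigned char *p = in + (r + 1) * w + (c + 1);
            int sx = (p[-w+1] + 2*p[1] + p[w+1]) - (p[-w-1] + 2*p[-1] + p[w-1]);
            int sy = (p[w-1] + 2*p[w] + p[w+1]) - (p[-w-1] + 2*p[-w] + p[-w+1]);
            m[r * (N + 2) + c] = gradient_approx(sx, sy, &phi[r * (N + 2) + c]);
        }

    /* non-maximum suppression for the N center pixels of the middle row  */
    for (int c = 0; c < N; c++)
        nms_pixel(m, phi, marked, (N + 2) + (c + 1), N + 2, t1, t2);

    for (int c = 0; c < N; c++)        /* store the computed pixels        */
        out[c] = marked[(N + 2) + (c + 1)];
}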
V. EVALUATION OF LIMBIC

This section presents the evaluation of LImbiC. First, the functionality is verified by executing the implemented algorithms. Afterwards, HDL code is generated for the different versions of LImbiC using Processor Generator, as introduced in section III. The three different LImbiC variants are synthesized for different target platforms, including Altera and Xilinx FPGAs. An evaluation of the application-specific extensions, based on the synthesized processors, is carried out. Additionally, the results of the synthesis are compared to give a conclusion on the code quality and usability of this development approach. Finally, the processor model is modified to enable it to run on an FPGA prototyping platform.

A. Execution of the edge detection

The instruction-set of LImbiC is tested by using armcc to compile several C and assembly programs for the ARMv6-M architecture. These are executed in the automatically generated cycle-accurate simulator called Processor Debugger. Thus, the syntax, coding and cycle time of instructions, functions and programs are verified. Figure 5 shows the output images of the Sobel operators in x and y direction for the Lego input image (Fig. 5a). To get this result, the instructions CSX (Fig. 5b) and CSY (Fig. 5c) are executed.

Fig. 5: Lego input image and output images after the CSX, CSY and CAN instructions with different thresholds — (a) Lego, (b) CSX, (c) CSY, (d) CAN 10-20, (e) CAN 20-30, (f) CAN 20-50

The detected edges of the input image are represented by bright pixels. A pixel's brightness is based on the probability that it is part of an edge. The contours of the Lego are well detected in both images. However, the edges are blurry due to the detection algorithm. The Canny filter extracts the edges and provides a clear representation of the Lego. Figures 5d-5f show the output images that are computed using the CAN instruction for the Canny filter. The parameters for the threshold values are varied to determine the influence of the threshold values on the quality of the produced output. The input image is the same Lego image as used for the convolution. On the left side, the threshold T1 is set to 10 and T2 is set to 20 (Fig. 5d). The center image is computed with T1=20 and T2=30 (Fig. 5e). The image on the right side is computed with T1=20 and T2=50 (Fig. 5f). As expected, increasing threshold T2 reduces the number of detected edges. The output images show more details with the choice of a lower T2. The additional instructions work well in conjunction with the ARMv6-M architecture and are able to successfully detect edges in images.

B. Generation of LImbiC

The three different variants of LImbiC are synthesized using Synplify Pro and the respective vendor tool for place and route for different FPGA target platforms. The synthesizable Verilog HDL is automatically generated out of the LISA model description. In doing so, a design space exploration is carried out to evaluate the quality of the different versions of the HDL code. The goal is to compare the versions of LImbiC in terms of performance and area trade-offs.

1) Basic-LImbiC and generation settings: Basic-LImbiC is synthesized for five different FPGA platforms: Xilinx Virtex5 (XUPV5), Virtex6 (ML605) and Virtex7 (VC707), as well as Altera StratixV (SVGX) and CycloneV (CVGX). Here, the different settings of the Processor Generator script are explored. The standard setting is best code readability. Other options are area-optimized, timing-optimized and power-optimized, which should generate optimized code for each constraint. Figure 6 shows the results of the implementations. The maximum frequency (Fig. 6a) and the absolute number of used logic elements (Fig. 6b) are plotted for the four different optimization settings and the five FPGA platforms. The maximum frequency does not vary much between the different optimization options; the Xilinx Virtex6 is the only FPGA that shows significant variation in maximum frequency. Virtex6 is the last generation of Xilinx FPGAs that uses the ISE place and route engine. It is assumed that the variance in frequency is due to the usage of old algorithms that struggle to deliver consistent performance on very large modern FPGAs. The results show considerably less variance in maximum frequency for Virtex7 using the newly released Vivado tool chain.
The settings area-optimized and timing-optimized appear to always deliver the same result. Only the power-optimized setting varies: it achieves slightly better performance per area on the Xilinx platforms and slightly worse performance per area on the Altera FPGAs. Note that the impact of this setting on power consumption was not investigated in the scope of this paper. The best code readability setting achieves the best implementation results based on the performance per area trade-off. Only the Altera StratixV consumes significantly more LUTs with this setting, while the area of the other FPGAs drops by 10 %. For this reason, the best code readability setting is chosen for the presentation of the results in the next subsections.

Fig. 6: Basic-LImbiC synthesized for area, timing, power optimization and best code readability — (a) Maximum frequency, (b) Logic utilization

2) LImbiC - Basic-LImbiC with extensions: LImbiC consists of the Basic-LImbiC with the extensions for convolution and the Canny filter. Figure 7 shows the maximum frequency and the number of LUTs in relation to the Basic-LImbiC. For most of the FPGAs the new instructions result in decreased performance per area. This comes at a cost of up to 40 % more LUTs. However, it is important to note that this area overhead also includes the new instructions. These instructions significantly reduce the number of instructions required to execute the algorithm. A 254x254 pixel image has 64,516 pixels, which need to be processed inside a loop of 35 instructions. This results in 2.26 million instructions for a convolution in software, which needs to be run twice. In contrast, the hardware solution needs just 5 instructions per loop while computing four pixels at the same time, resulting in 81,650 instructions. This amounts to a 98 % reduction in instructions. This also reduces the total execution time; however, the performance gains are limited by the memory bottleneck and the use of load/store multiple instructions. The trade-off is discussed in more detail with the introduction of Small-LImbiC in the next subsection.

Fig. 7: LImbiC: frequency and logic relative to Basic-LImbiC

3) Small-LImbiC - An image processor: The development of LImbiC presented above shows the potential of introducing new instructions to the ISA. However, doing so makes other standard instructions obsolete in this particular application. Thus, the instruction-set can be reduced to the minimum number of instructions necessary to execute the image processing. This results in the development of Small-LImbiC, whose ISA consists of the new instructions CGA, CSX, CSY and CAN and of the branch and arithmetic instructions ADD, LSL and CMP. As discussed in the previous subsection, the new instructions come with an area overhead, which is now reduced by shrinking the instruction set. Figure 8 presents the synthesis results of Small-LImbiC. The maximum achievable frequency more than doubles due to the reduction of complexity in the processor. The implementation area is reduced significantly by 75 %, which is just slightly more than a single implementation of each extension instruction alone. The performance per area trade-off, already addressed above, is between 10 and 20 times higher compared to the Basic-LImbiC. Additionally, the reduced overall instruction count still applies, leading to a significantly shorter execution time on top of this enhanced performance per area trade-off.

Fig. 8: Small-LImbiC: frequency and logic relative to Basic-LImbiC

C. Prototyping of LImbiC

The LISA language includes many features that are specifically targeted towards use in models that will be synthesized. For models that are only needed for simulation and software tool generation, it is not necessary to implement the design using these additional features. However, if the design is to be synthesized, Synopsys recommends several guidelines and features that can improve the design significantly. In particular, resource sharing in the LISA model must be explicitly stated for optimal HDL output. If this is not done, ALU functionality that could be shared will likely end up being duplicated within many operations. In addition, in the case of register bank accesses, this can lead to a significant number of muxes being generated to accommodate accesses from many functional blocks [19]. It can also lead to the design requiring a multi-ported memory when the tool is not capable of properly detecting that accesses to memory from separate operations are mutually exclusive.

The LImbiC processor was developed primarily for use as a simulation model. For this reason, a comparison was performed between LImbiC and a modified, synthesis-optimized version of the LImbiC processor. Both of these models are based on the Basic-LImbiC design. This was done because, on the original model, the tool was not capable of properly generating an HDL design that could handle multiple accesses to an external memory, as was required by the added custom instructions. The synthesis-optimized version of the LImbiC processor was significantly modified to follow Synopsys's recommendations for a synthesizable design. All register accesses were moved to a single operation to enable resource sharing among all instructions as well as to create a centralized method for register forwarding. ALU operations were combined into shared operations where possible, and operation coding formats were logically grouped to simplify the required decoding logic. In addition, a secondary memory access pipeline was added to the model to ensure that only a single-port memory would be required and to better tolerate variable latency out of the memory.

Generated HDL code from both the original Basic-LImbiC model and the optimized version was synthesized targeting a Virtex5 LX110T. Both LImbiC processors implement the same ISA and are synthesized without a debug unit. For comparison, the ARM Cortex-M1 numbers are included from the data in ARM's marketing material [20]. Table II shows the results from these runs.

TABLE II: Comparison of Basic-LImbiC with a synthesis-optimized version and the ARM Cortex-M1
                            Basic-LImbiC   Basic-LImbiC (Synthesis-optimized)   ARM (Cortex-M1)
Maximum frequency in MHz    87.5           87.3                                 200
Utilized registers          612            968                                  not published
Utilized logic in LUT       6587           3211                                 2900
Note that for better comparability these numbers only account for the processor core itself and do not include external memories or bus interfaces. As can be seen from these results, optimizing the LISA model for synthesis was able to reduce the required LUT count by 50 %, although the register count was increased. This occurred due to the increased use of pipeline registers within the processor. The changes were also found to have only a negligible impact on the maximum operating frequency of the final synthesized processor. In comparison to the ARM Cortex-M1, both models were found to have a larger area and a slower maximum speed, as could be expected. The synthesis-optimized LImbiC design was found to be just 10 % larger than the ARM design, although its maximum clock frequency is less than half that of the ARM-supplied core.

VI. LIMITATIONS OF THE LISA-BASED DEVELOPMENT

There are some notable limitations when developing a custom processor with LISA. The LISA language itself has some limitations that decrease productivity. For instance, because LISA operations have no concept of input parameters, global variables must be used to pass data to an operation. This increases code size and complicates debugging. In addition, LISA does not support typical C-style functions, so these must be implemented using C-style preprocessor macros instead. This has a detrimental effect on code readability and can also create bugs that are difficult to locate. Another problem is that the coding styles for Processor Generator and Processor Debugger differ, limiting model portability between use cases. As was shown in subsection V-C, a significant amount of code had to be re-written to create an improved HDL implementation, even though the design was already successfully simulating with the generated tool chain. This also occurs in some unique cases such as in the coding section of an operation. In contrast to Processor Designer, Processor Generator does not support arithmetic functions inside the coding section, although this is explicitly defined in the LISA reference [21]. Finally, the results have shown that the timing- and area-optimized settings of Processor Generator result in the same LUT count and maximum achievable frequency, which limits the design space exploration that can be performed without significant modification to the source code.

VII. CONCLUSION AND FUTURE WORK

A LISA model in Processor Designer can be used to produce three different types of outputs: software development tools, simulation models, and synthesizable HDL models. In many cases, only the first two outputs may be necessary, for instance, when the tool is used to enable a pre-silicon software development environment. However, in many cases it may be desirable to generate a functioning design directly from the high-level model. By doing so, the generated software tools are guaranteed to work with the generated design, and the total design time is drastically reduced. By using Processor Designer, a processor can be modeled comparably fast and with minimal knowledge of HDL-based microarchitecture design, which is typically required today. The ADL LISA is well suited to developing an application-specific processor with a small number of instructions. The development of a complete processor is constrained by the overhead of the automatically generated HDL code. This approach is very beneficial for evaluating new ideas in a comparably short time, especially when adding new instructions.
The LISA model and the generated HDL code are platform-independent; the HDL code is then synthesized, including place and route for the target platform, resulting in different area vs. frequency trade-off solutions as presented in section V. Additionally, this modeling approach is very well suited for the development of an ISA and the necessary software development tool chain, e.g., the assembler, linker and simulator to test the new design. The model with its tools, as well as the HDL model, can be used as a base for the development of an optimized HDL model, for which additional knowledge in the domain of processor development in RTL is necessary.

For future work, the execution time of the software on the different versions of LImbiC should be investigated in more detail. Additionally, the impact on power consumption and the energy consumed for the processing of an image will be studied. It seems especially interesting to compare these numbers to an implementation on an Actel FPGA, as the flash-based logic should consume significantly less energy compared to the FPGAs chosen in this paper. Another very interesting idea is to build a tool kit for using different versions of LImbiC with LISA. The Integrated Vision Toolkit (IVT) contains a number of image processing algorithms. Each algorithm could be modeled with one or more instructions in order to build an even more comprehensive, flexible processor model. The user could then choose which algorithms should be included in each version of the processor and whether a stripped-down version tailored to the particular application should be built.

ACKNOWLEDGMENT

This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Center "Invasive Computing" (SFB/TR 89) and by the BMBF as part of the joint project CONDOR.

REFERENCES

[1] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr, "A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA," in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2001.
[2] C. Tradowsky, F. Thoma, M. Hubner, and J. Becker, "Lisparc: Using an architecture description language approach for modelling an adaptive processor microarchitecture," in 7th IEEE International Symposium on Industrial Embedded Systems (SIES), 2012.
[3] J. Henkel, "Closing the SoC design gap," Computer, vol. 36, pp. 119-121, 2003.
[4] M. Dales, "Managing a reconfigurable processor in a general purpose workstation environment," in Design, Automation and Test in Europe Conference and Exhibition, 2003.
[5] R. Lysecky, G. Stitt, and F. Vahid, "Warp processors," ACM Transactions on Design Automation of Electronic Systems (TODAES), 2004.
[6] L. Bauer, M. Shafique, S. Kramer, and J. Henkel, "RISPP: Rotating instruction set processing platform," in DAC '07: 44th ACM/IEEE Design Automation Conference, 2007.
[7] F. Thoma, M. Kuhnle, P. Bonnot, E. Panainte, K. Bertels, S. Goller, A. Schneider, S. Guyetant, E. Schuler, K. Muller-Glaser, and J. Becker, "MORPHEUS: Heterogeneous reconfigurable computing," in International Conference on Field Programmable Logic and Applications, 2007.
[8] R. Koenig, L. Bauer, T. Stripf, M. Shafique, W. Ahmed, J. Becker, and J. Henkel, "KAHRISMA: A novel hypermorphic reconfigurable-instruction-set multi-grained-array architecture," in Design, Automation and Test in Europe Conference and Exhibition (DATE), 2010.
[9] A. Fauth, J. Van Praet, and M.
Freericks, "Describing instruction set processors using nML," in European Design and Test Conference, 1995.
[10] G. Hadjiyiannis, S. Hanono, and S. Devadas, "ISDL: An instruction set description language for retargetability," in Proceedings of the 34th Design Automation Conference, 1997.
[11] S. Bashford, U. Bieker, B. Harking, R. Leupers, P. Marwedel, A. Neumann, and D. Voggenauer, "The MIMOLA language v4.1," Research report, Universität Dortmund, FB Informatik, 1994.
[12] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt, and A. Nicolau, "EXPRESSION: A language for architecture exploration through compiler/simulator retargetability," in Design, Automation and Test in Europe Conference and Exhibition, 1999.
[13] A. Hoffmann, H. Meyr, and R. Leupers, Architecture Exploration for Embedded Processors with LISA. Kluwer Academic Publishers, 2002.
[14] V. Dodani, N. Kumar, U. Nanda, and K. Mahapatra, "Optimization of an application specific instruction set processor using application description language," in International Conference on Industrial and Information Systems (ICIIS), 2010.
[15] A. Nohl, F. Schirrmeister, and D. Taussig, "Application specific processor design: architectures, design methods and tools," in Proceedings of the International Conference on Computer-Aided Design, 2010.
[16] ARMv6-M Architecture Reference Manual, ARM Limited, 2010.
[17] R. Dillmann, "Vorlesung Kognitive Systeme" (lecture on cognitive systems).
[18] P. Azad, T. Gockel, and R. Dillmann, Computer Vision: Principles and Practice. Elektor Electronics Publishing, 2008.
[19] Synopsys, Inc., Processor Designer Product Family: Processor Design Guide, 2010.
[20] ARM, Cortex-M1 processor - performance. [Online]. Available: http://www.arm.com/products/processors/cortex-m/cortex-m1.php
[21] Synopsys, Inc., Processor Designer Product Family: LISA Modeling Fundamentals, 2010.