An Image Processing Coprocessor Implementation for Xilinx XC6000 Series FPGAs

K. Benkrid, K. Alotaibi, D. Crookes, A. Bouridane and A. Benkrid

School of Computer Science, The Queen’s University of Belfast, Belfast BT7 1NN, UK

ABSTRACT

This paper presents an Image Processing Coprocessor implementation for XC6000 series FPGAs. The FPGA acts as a semi-autonomous abstract coprocessor carrying out image processing operations independently. The paper outlines the main structure of the image processing coprocessor and its high level programming environment. The environment provides a library of very high level, parametrised architecture descriptions which are scaleable and general.

1. INTRODUCTION

FPGAs offer the potential of high performance in Image Processing at relatively low cost. However, the direct programming model of FPGAs is at far too low a level for application developers to use. Our approach to the problem is to provide an FPGA-based image processing coprocessor (IPC) with a very high level, extensible instruction set based on a core level containing the operations of Image Algebra [1]. A previously reported IPC [2] was based on a set of static configurations (e.g. fixed window size, fixed word length, etc.). Our new environment allows all specific details of image level operations to be implemented. This requires actual configurations to be generated at runtime.

Our implementation of the coprocessor is based on the HotWorks™ PCI board (HOT1) with an XC6216 FPGA. Our design exploits the RAM on the HotWorks board to speed up the image processing operations. Part of the XC6216 FPGA is configured as a simple controller (including address generator units); the remainder is dynamically configured for each image processing operation (e.g. a convolver) as it is required at runtime.

First, the paper outlines the programming environment at the user level.
This includes facilities for defining architectures for low level Image Processing algorithms without having to know any internal hardware implementation details. Next, the design architectures necessary to implement the IPC instruction set are presented. Following that, section 4 shows how the high level descriptions are represented and synthesised into a low level hardware description using Prolog as a hardware description language. Then, section 5 presents the runtime execution environment. Finally, some conclusions are drawn: some from the user perspective, and some on the suitability of the XC6000 hardware platform.

Email: K.Benkrid@qub.ac.uk; Telephone: +44 (0) 1232 27 4616; Fax: +44 (0) 1232 683890

2. THE USER PROGRAMMING ENVIRONMENT

The goal of our environment is to provide the user with the ability to dynamically create and use, at a very high level, FPGA configurations for a wide range of image processing operations. At the basic level, we provide facilities for simple neighbourhood operations based on Image Algebra (IA). The programming model provides templates (static windows with user defined weights) and the set of IA operators. A neighbourhood operation is assumed to be in two stages:

1- A ‘local’ operator applied between an image pixel and the corresponding window coefficient.
2- A ‘global’ operator applied to the set of local operation results to generate a result image pixel.

The set of local operators contains the ‘Add’ (‘+’) and ‘Multiplication’ (‘*’) operators, whereas the set of global operators contains the ‘Accumulation’ (‘Σ’), ‘Maximum’ (‘Max’) and ‘Minimum’ (‘Min’) operators.
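The two-stage model can be illustrated in software. The following is a minimal Python sketch, not the coprocessor's implementation: the function name `neighbourhood`, the use of `None` for window positions that take no part in the operation, and the choice to leave border pixels at zero are all assumptions made for illustration.

```python
import operator

# Local operators act on (pixel, coefficient); global operators reduce a list.
LOCAL_OPS = {"add": operator.add, "mult": operator.mul}
GLOBAL_OPS = {"sum": sum, "maximum": max, "minimum": min}


def neighbourhood(image, local, global_, window):
    """Apply a two-stage neighbourhood operation to a 2-D list of ints.

    Window entries set to None are ignored (assumption). Pixels whose window
    would extend past the image border are left at 0 (assumption; the paper
    does not specify border handling).
    """
    local_op = LOCAL_OPS[local]
    global_op = GLOBAL_OPS[global_]
    wh, ww = len(window), len(window[0])
    oy, ox = wh // 2, ww // 2
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for y in range(oy, rows - oy):
        for x in range(ox, cols - ox):
            # Stage 1: local operator between each pixel and its coefficient.
            local_results = [
                local_op(image[y + j - oy][x + i - ox], window[j][i])
                for j in range(wh)
                for i in range(ww)
                if window[j][i] is not None
            ]
            # Stage 2: global operator reduces the local results to one pixel.
            result[y][x] = global_op(local_results)
    return result


# Hypothetical cross-shaped windows used only for this illustration.
CROSS_ONES = [[None, 1, None], [1, 1, 1], [None, 1, None]]
CROSS_ZEROS = [[None, 0, None], [0, 0, 0], [None, 0, None]]
```

Calling `neighbourhood(img, "mult", "sum", CROSS_ONES)` sums the five pixels under the cross, while `neighbourhood(img, "add", "maximum", CROSS_ZEROS)` takes their maximum.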
With these local and global operators, the following neighbourhood operations can be built:

    Neighbourhood operation    Local operation    Global operation
    Convolution                *                  Σ
    Additive maximum           +                  Max
    Additive minimum           +                  Min
    Multiplicative maximum     *                  Max
    Multiplicative minimum     *                  Min

For instance, a simple Laplace operation would be performed by doing convolution with the following template:

     ~  -1   ~
    -1   4  -1
     ~  -1   ~

A high level description of such a neighbourhood operation has the general format:

    Neighbourhood_instruction (Input_image(Size, pixel_width), Local_op, Global_op, Window_info, filename)

where ‘Window_info’ defines the static window to be used, and ‘filename’ is the name of the resulting netlist configuration file. For the Laplace operation presented above, and for a 256x256 image of 16-bit pixels, the high level description would be:

    Mylaplace = Neighbourhood_instruction(
        input_image(256,256,16), local(mult), global(sum),
        window(3,3, [ [ ~, -1,  ~],
                      [-1,  4, -1],
                      [ ~, -1,  ~] ]),
        ‘my_laplace’).

In practical image processing applications, many algorithms comprise more than a single operation. Such complex operations can be broken into a number of primitive operations. Sometimes, these operations will be concurrent, in which case they could be implemented in parallel using separate regions of the FPGA. For example, the Sobel edge detection algorithm can be performed by adding the absolute results of two separate convolutions as shown in Figure 1.

[Figure 1: Sobel complex operation. Two parallel branches, the Sobel_horizontal and Sobel_vertical neighbourhood operations, each followed by an absolute value stage.]
The whole operation is therefore equivalent to one single complex operation, which can be described as follows in our high level environment:

    Sobel_horizontal = image_operator(
        Input_image(256,256,16), local(mult), global(sum),
        window(3,3, [ [ 1,  2,  1],
                      [ 0,  0,  0],
                      [-1, -2, -1] ]));

    Sobel_vertical = image_operator(
        Input_image(256,256,16), local(mult), global(sum),
        window(3,3, [ [ 1,  0, -1],
                      [ 2,  0, -2],
                      [ 1,  0, -1] ]));

    /* Sobel Complex operation */
    Sobel = image_instruction(
        plus(abs(Sobel_horizontal), abs(Sobel_vertical)), ‘MySobel’);

Note that an image_operator is essentially the same as a neighbourhood_instruction, except that it does not cause generation of a configuration. This is because it is only a component in a compound image_instruction.

Another example of a complex neighbourhood operation is where operations are carried out in series; i.e. the operations may be cascaded to form a pipeline. A typical example of two cascaded neighbourhood operations is the ‘Open’ operation. To do an ‘Open’ operation, an ‘Erode’ neighbourhood operation is first performed, and the resulting image is fed into a ‘Dilate’ neighbourhood operation as shown in Figure 2.

[Figure 2: ‘Open’ complex operation: ERODE followed by DILATE.]

This operation is described as follows in our high level environment:

    Erode = image_operator(
        Input_image(256,256,16), local(add), global(minimum),
        window(3,3, [ [ ~, 0, ~],
                      [ 0, 0, 0],
                      [ ~, 0, ~] ]));

    /* ‘Open’ Complex operation. Note that ‘Erode’ stands for the
       input image of the ‘Dilate’ operation. */
    Open = Neighbourhood_instruction(
        Erode, local(add), global(maximum),
        window(3,3, [ [ ~, 0, ~],
                      [ 0, 0, 0],
                      [ ~, 0, ~] ]),
        ‘My_Open’);

Note that the Sobel edge detection above also used this form of pipeline, where the neighbourhood operations (Sobel_vertical and Sobel_horizontal) are followed by an Image-Scalar absolute operation.

When the user wishes to execute an IPC instruction using the FPGA, the operation Apply is invoked.
This reconfigures the chip using the specified configuration file, and passes the source image to the FPGA, storing the results in the destination image.

3. ARCHITECTURAL DESIGN ISSUES

This section outlines the development and design of the FPGA configurations necessary to implement the high level instruction set. In deriving our design, the following key guiding principles were followed:

Scaleability: The hardware design should not be inherently limited to a particular data word length or a particular template size or shape.

First-time-right place and route: A design generated for an individual high level instruction should be guaranteed to place and route first time and require no alteration from the user. This will influence the design of building block components.

Utilisation of chip functionality: Although it is important that chip resources are used efficiently, once a fit for an instruction has been obtained, there is nothing in particular to be gained from further reducing the number of gates used. This factor differs from, say, the VLSI approach.

Reusability: The instruction set should be implemented using arrangements of readily customisable building block units that may be reused in a number of instructions (e.g. multipliers, adders, etc.).

3.1. A general 2-D neighbourhood operation

As mentioned earlier, any neighbourhood image operation involves passing a 2-D window over an image, and carrying out a calculation at each window position. To allow each pixel to be supplied only once to the FPGA, internal line delays are required. This is a common approach used in many hardware realisations [3,4]. The idea is that these internal line delays are used to synchronise the supply of input values to the processing elements [5], ensuring that all the pixel values involved in a particular neighbourhood operation are processed at the same instant.
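The effect of the line delays can be modelled in software. The sketch below is an illustrative Python model, not a description of the hardware: it assumes an n-wide by m-high window, raster-order input, and a single shift register of (m - 1) * width + n entries standing in for the m - 1 line delays plus the pixel delays. The name `stream_windows` is hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, n, m):
    """Yield (row, col, window) for every full n-wide by m-high window.

    `pixels` is consumed exactly once, in raster order, mirroring the
    requirement that each pixel is supplied only once. `row` and `col`
    give the position of the window's top-left pixel.
    """
    depth = (m - 1) * width + n
    shift = deque(maxlen=depth)  # oldest entry drops out automatically
    for index, pixel in enumerate(pixels):
        shift.append(pixel)
        row, col = divmod(index, width)
        if row >= m - 1 and col >= n - 1:
            # The register is full here; shift[0] is the window's top-left
            # pixel and each window row starts `width` entries later.
            window = [
                [shift[j * width + i] for i in range(n)]
                for j in range(m)
            ]
            yield row - (m - 1), col - (n - 1), window
```

For a 4-pixel-wide image and a 3 by 3 window, the first window becomes available as soon as the third pixel of the third row arrives; every later pixel in that row produces one more window without any pixel being read twice.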
Figure 3 shows the architecture of a generic 2-D neighbourhood operation with an N by M window template, where ‘L’ is the local operation and ‘G’ is the global one. Each of the Processing Elements (PEs) stores a template weight (or coefficient) internally and performs the necessary Local and Global operations.

[Figure 3: Architecture of a generic 2-D, N by M neighbourhood operation. Line delays (Line Delay 0 to Line Delay M-2) and pixel delays feed M rows of N Processing Elements (PE 0 to PE N*M-1), each containing a local (L) and a global (G) operation.]

One principal feature of our design approach for PEs is to have a standard framework which will accommodate the full range of neighbourhood operators. Into this standard framework will be plugged the appropriate local and global sub-blocks [2].

3.2. Architecture of a Processing Element

Before defining the architecture of a Processing Element, two important strategic design decisions need to be made:

Arithmetic architecture methodology: should we use bit serial or bit parallel arithmetic units?

Arithmetic representation: which number system should be used to represent and process data?

Since a complete convolver (or similar operator) must easily fit on a single FPGA chip, parallel multipliers are not feasible. This leads to our decision to use bit serial arithmetic. Secondly, the need to pipeline bit serial Maximum and Minimum operations suggests that data should be processed Most Significant Bit First (MSBF). Following on from this choice, because of problems in doing addition MSBF in 2’s complement, it is advantageous to use an alternative number representation. Therefore, although there are several possible solutions, the one we have implemented to meet the design constraints is based on the following choices:

(i) Bit serial arithmetic
(ii) Signed Digit Number Representation (SDNR) rather than 2’s complement [2]
(iii) Most Significant Bit First processing

Because image data may occasionally have to be processed on the host processor, the basic storage format for image data is, however, 2’s complement. Therefore, processing elements first convert their incoming image data to SDNR. This also reduces the chip area required for the line buffers (in which data is held in 2’s complement). A final unit to convert an SDNR result into 2’s complement is needed before any results can be returned to the host system. With these considerations, a more detailed design of a general Processing Element (in terms of a local and a global operation) is given in Figure 4.

[Figure 4: Architecture of a standard Processing Element. Blocks: Binary to SDNR conversion, buffer unit, local operation with local constant coefficient C, and global operation.]

The constant coefficient bits are hardwired in the logic [2]. The ability to dynamically reconfigure gates means that the image template can be rapidly reconfigured an unlimited number of times. Partial reconfiguration and fast reconfiguration times make the XC6200 very suitable for this application and, using the XACT6000 tool [6], reconfiguration data can be generated easily.

4. DESIGN OF THE SOFTWARE ENVIRONMENT

As mentioned earlier, the image processing operations are created using a fixed set of primitive blocks (e.g. Multiply). These basic building blocks are described and assembled using a high level hardware description notation called HIDE [8], which is based on Prolog [9]. Primitive blocks are generated off-line and stored in EDIF format; their Prolog description includes information such as width, height and i/o ports. To assemble these primitive blocks into larger components, a small and simple set of constructors is provided, including vertical and horizontal block composition and replication.
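To give a feel for how such constructors compose block geometry, here is a small Python sketch. It is only an analogy: HIDE itself is written in Prolog and also handles ports, placement and routing, none of which is modelled here. The `Block` class, the cell-based size units and the example dimensions are assumptions; the constructor names `vertical` and `h_seq` are borrowed from the notation used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rectangular block described only by its bounding box (assumption)."""
    name: str
    width: int   # in FPGA cells (hypothetical unit)
    height: int  # in FPGA cells (hypothetical unit)


def vertical(blocks):
    """Stack blocks on top of each other: heights add, widest block wins."""
    return Block(
        name="vertical([" + ",".join(b.name for b in blocks) + "])",
        width=max(b.width for b in blocks),
        height=sum(b.height for b in blocks),
    )


def h_seq(n, block):
    """Replicate a block n times side by side: widths add, height is kept."""
    return Block(f"h_seq({n},{block.name})", n * block.width, block.height)


# Hypothetical primitive sizes, chosen only to make the arithmetic visible.
MULT = Block("MULT", width=4, height=6)
SUM = Block("SUM", width=4, height=3)
pe = vertical([SUM, MULT])   # one local-global pair
row = h_seq(3, pe)           # three of them side by side
```

Because each constructor returns another `Block`, compositions nest freely, which is the property that lets a whole architecture be written as a single expression.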
For instance, suppose the task is to describe a scaleable generic NxM neighbourhood architecture as shown in Figure 3, where L(C) represents the local operation (e.g. we could supply MULT(C) for multiplying by a C-bit coefficient), and G is the global operation (e.g. we could supply SUM). The local-global part of the PE is first defined by the object:

    vertical([G,L(C)])

Then, an N wide block of these is defined by an N horizontal sequence as follows:

    h_seq(N, vertical([G,L(C)]))

A block of M of these is described by an M horizontal sequence as follows:

    h_seq(M, h_seq(N, vertical([G,L(C)])))

Finally, the two line delays (or line buffers), each holding ‘image_height’ words of size K bits, are assembled vertically to the previous component by:

    vertical([h_seq(M, h_seq(N, vertical([G, L(C)]))), h_seq(2, BUF(K,image_height))])

For simplicity, we have omitted the Binary to SDNR conversion and the delays, but it should be clear how these could be incorporated. Some further details of the description that are necessary to achieve the complete configuration (mainly to do with block interconnection) are not discussed here; these are handled automatically by our system.

Our software tools, which are written in Prolog, take the above description and generate a data structure representing the entire circuit, including placement and routing information. This data structure is finally converted to an EDIF description of the circuit. Using the same simple set of constructors, many image processing configurations (primitive and complex operations) have been generated from a high level description similar to that presented in section 2 (e.g. Sobel and Open).

5. RUNTIME EXECUTION ENVIRONMENT

Our implementation of the coprocessor is based on the HotWorks™ PCI board (HOT1) with an XC6216 FPGA. Our design exploits the RAM on the HotWorks board as a temporary image holder.
Part of the XC6216 FPGA is configured as a simple controller (including address generator units); the remainder is dynamically configured for each image processing operation. The controller consists of the SRAM hard macro and an address generator. The SRAM macro consists of three blocks [7]:

1- SRAM_REG: consists of two 32-bit registers (input/output) interfacing to IOBs.
2- SRAM_ADDRESS: interfaces with the onboard RAM address register.
3- SRAM_RDWR: interfaces with the onboard RAM RDWR 4-bit register. The latter determines the Read/Write state of the RAM banks.

The address generator produces the desired memory addressing sequence.

The physical layout of a 3 by 3 convolution, with the necessary memory interface, has been generated automatically from the high level description of the Laplace operation given in section 2. Note that special attention had to be paid to the routing between the IP unit block and the Input/Output pins, where data is driven to/from fixed pin positions. This has forced us to dedicate logic blocks to routing, which in turn has reduced the area available for the IP unit.

Behind the scenes, when the user creates a coprocessor instruction object, the software environment uses the details of this instruction object to generate the corresponding EDIF description. The latter is then input to the XACT 6000 tool in order to generate the corresponding FPGA configuration bitstream. The EDIF description of the 3 by 3 convolution circuit was generated in less than one second. Note that the resulting configuration is stored in a library, so it will not be regenerated if exactly the same operation happens to be invoked again. The Apply routine then downloads the bitstream to the FPGA board and tailors any cells if necessary; finally, it triggers the onboard clock to process the entire image. To the user, the programming model is merely the set of algebraic operators provided by the library.
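The library lookup described above amounts to caching generated configurations by instruction description. A minimal Python sketch of that idea follows; the class name `ConfigurationLibrary`, the tuple-based description and the stand-in generator are all hypothetical, and the real system invokes the HIDE generator and XACT 6000 rather than a Python callable.

```python
class ConfigurationLibrary:
    """Cache of configurations keyed by instruction description (sketch)."""

    def __init__(self, generate):
        # `generate` maps a hashable description to configuration data.
        # It stands in for the EDIF generation and bitstream tool chain.
        self._generate = generate
        self._store = {}
        self.generated = 0  # number of times generation actually ran

    def lookup(self, description):
        """Return the configuration, generating it only on first use."""
        if description not in self._store:
            self._store[description] = self._generate(description)
            self.generated += 1
        return self._store[description]


def fake_generate(description):
    """Stand-in generator (assumption): returns placeholder bytes."""
    return repr(description).encode("ascii")


library = ConfigurationLibrary(fake_generate)
# Hypothetical description of a 3 by 3 convolution on a 256x256, 16-bit image.
laplace = ("neighbourhood", (256, 256, 16), "mult", "sum", (3, 3))
first = library.lookup(laplace)
second = library.lookup(laplace)  # served from the library, not regenerated
```

Keying on the full description means that changing any parameter, such as the window size or pixel width, produces a separate library entry.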
When the operation is complete, the application program will be able to proceed, possibly to another coprocessor instruction, at which point the above process is repeated.

The overall design of the software system that provides this high level programming model, and hides details of configurations from the user, is shown in Figure 5.

[Figure 5: Outline structure of the complete environment. Step 1, "Build an Image instruction": the user program produces a HIDE description; if a matching configuration does not already exist in the library of FPGA configuration files, the HIDE generator (HIDE system, using the library of primitive EDIF descriptions) produces EDIF, which XACT 6000 turns into an FPGA bitstream. Step 2, "Apply": the bitstream is downloaded to the FPGA board and the operation is started ("Go").]

6. CONCLUSIONS

This paper has discussed the design of an FPGA-based image processing development environment, which has the primary goal of providing a high level programming interface to a low cost, high performance computing engine. This enables the application developer to concentrate on the image processing aspects of a problem, rather than having to get to grips with FPGA technology. The programming model we have adopted is based on the image-level neighbourhood operators of Image Algebra, and an extensive range of algorithms can be implemented efficiently in this simple model.

The use of Prolog as a hardware description language has proved beneficial. Indeed, unbound variables are useful in describing blocks with uncommitted factors such as size and port positions. These unbound variables are bound at runtime to satisfy certain requirements. Rules are useful for capturing heuristics for first-time-right place and route.

Our implementation of the IP coprocessor is based on the HotWorks™ PCI board (HOT1), where the onboard RAM is exploited to speed up the image processing operation. From our experience, two advantages of the XC6200 should be noted.
It has the flexibility of rapid partial reconfiguration for changing image template weights; it is also well suited to this kind of structured, heavily pipelined application. On the other hand, we have noted that the lack of additional routing facilities between the IOBs and the logic cells in the XC6200 has forced us to dedicate logic blocks to routing. This leads to a significant drop in the design performance. Indeed, according to timing simulation using XACT6000, the maximum clock speed of the design is 45 MHz if data is fed direct to IOBs. This would enable convolution of a 256 by 256 image with a theoretical frame rate of 43 Hz (consistent with one bit of a 16-bit pixel being processed per clock cycle: 45 MHz / (256 x 256 x 16) ≈ 43 Hz). If, however, data is fed from the onboard memory, the maximum clock rate drops to 20 MHz, which gives a maximum frame rate of 19 Hz.

Recent changes in Xilinx plans for the XC6000 series have encouraged us to move towards implementing the IPC on alternative FPGA platforms. We are currently developing a similar software environment for the XC4000 series [12].

References

1. Ritter G X, Wilson J N and Davidson J L, ‘Image Algebra: an overview’, Computer Vision, Graphics and Image Processing, No 49, pp 297–331, 1990.
2. P. Donachy, D. Crookes, A. Bouridane, K. Alotaibi, and A. Benkrid, ‘Design and implementation of a high level image processing machine using reconfigurable hardware’, Proceedings SPIE’98, Vol. 3526, pp 2–13, 1998.
3. Shoup, R G, ‘Parameterised Convolution Filtering in an FPGA’, More FPGAs, W Moore and W Luk (editors), Abington EE&CS Books, pp 274, 1994.
4. Kamp, W, Kunemund, H, Soldner and Hofer, H, ‘Programmable 2D linear filter for video applications’, IEEE Journal of Solid State Circuits, pp 735–740, 1990.
5. Hecht, V, Ronner, K and Pirsch, P, ‘A defect-tolerant systolic array implementation for real-time image processing’, Journal of VLSI Signal Processing, Vol 5, pp 37–47, 1993.
6. Xilinx (1996). XACT STEP Series 6000 user guide, Xilinx Corporation, Scotland, UK.
7. Xilinx, XC6200 Development System Data Sheet, Version 1.2, August 7, 1997.
8. D Crookes, K Alotaibi, A Bouridane, P Donachy and A Benkrid, ‘An Environment for Generating FPGA Architectures for Image Algebra-based Algorithms’, ICIP’98, Chicago, October 1998.
9. Clocksin W F and Mellish C S, ‘Programming in Prolog’, Springer-Verlag.
10. Crookes D, Morrow P J and McParland P J, ‘IAL: a parallel image processing programming language’, IEE Proceedings, Part I, Vol 137 No 3, pp 176–182, June 1990.
11. Brown T J and Crookes D, ‘A high level language for image processing’, Image and Vision Computing, Vol 12 No 2, pp 67–79, March 1994.
12. K. Benkrid, D. Crookes, A. Bouridane, P. Corr and K. Alotaibi, ‘A High Level Software Environment for FPGA Based Image Processing’, to appear in Proceedings IPA’99, Manchester, July 1999.