An Image Processing Coprocessor Implementation for
Xilinx XC6000 Series FPGAs
K. Benkrid, K. Alotaibi, D. Crookes, A. Bouridane and A. Benkrid
School of Computer Science, The Queen’s University of Belfast, Belfast BT7 1NN, UK
ABSTRACT
This paper presents an Image Processing Coprocessor implementation for XC6000 series FPGAs. The FPGA acts as a
semi-autonomous abstract coprocessor carrying out image processing operations independently. This paper outlines the
main structure of the image processing coprocessor in addition to its high level programming environment. The
environment provides a library of very high level, parametrised architecture descriptions which are scaleable and
general.
1. INTRODUCTION
FPGAs offer the potential of high performance in Image Processing at relatively low cost. However, the direct
programming model of FPGAs is at far too low a level for application developers to use. Our approach to the problem is
to provide an FPGA-based image processing coprocessor (IPC) with a very high level, extensible instruction set based
on a core level containing the operations of Image Algebra [1].
A previously reported IPC [2] was based on a set of static configurations (e.g. fixed window size, fixed word length, etc.).
Our new environment allows the specific details of each image level operation (such as window size and word length) to
be chosen by the user. This requires actual configurations to be generated at runtime.
Our implementation of the coprocessor is based on the HotWorks™ PCI board (HOT1) with an XC6216 FPGA. Our
design exploits the RAM on the HotWorks board to speed up the image processing operations. Part of the XC6216
FPGA is configured as a simple controller (including address generator units); the remainder is dynamically configured
for each image-processing operation, as it is required at runtime (e.g. a convolver).
First, the paper outlines the programming environment at the user level. This includes facilities for defining
architectures for low level Image Processing algorithms without having to know any internal hardware implementation
details. Next, the design architectures necessary to implement the IPC instruction set are presented. Following that,
section 4 shows how the high level descriptions are represented and synthesised into a low level hardware description
using Prolog as a hardware description language. Then, section 5 presents the runtime execution environment. Finally,
some conclusions are drawn: some from the user perspective, and some on the suitability of the XC6000 hardware
platform.
2. THE USER PROGRAMMING ENVIRONMENT
The goal of our environment is to provide the user with the ability to dynamically create and use, at a very high level,
FPGA configurations for a wide range of image processing operations. At the basic level, we provide facilities for
simple neighbourhood operations based on Image Algebra. The programming model provides templates (static
windows with user defined weights) and the set of IA operators. A neighbourhood operation is assumed to be in two
stages:

(i) A ‘local’ operator applied between an image pixel and the corresponding window coefficient.
(ii) A ‘global’ operator applied to the set of local operation results to generate a result image pixel.

Email: K.Benkrid@qub.ac.uk; Telephone: +44 (0) 1232 27 4616; Fax: +44 (0) 1232 683890

The set of local operators contains the ‘Add’ (‘+’) and the ‘Multiplication’ (‘*’) operators, whereas the set of global
operators contains the ‘Accumulation’ (‘Σ’), the ‘Maximum’ (‘Max’) and the ‘Minimum’ (‘Min’) operators. With these local
and global operators, the following neighbourhood operations can be built:
Neighbourhood Operation    Local operation    Global operation
Convolution                *                  Σ
Additive maximum           +                  Max
Additive minimum           +                  Min
Multiplicative maximum     *                  Max
Multiplicative minimum     *                  Min
For instance, a simple Laplace operation would be performed by doing convolution with the following template:
~ -1 ~
-1 4 -1
~ -1 ~
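The two-stage model can be illustrated in software. The following is a minimal Python sketch, not the coprocessor implementation: the function name `neighbourhood`, the operator dictionaries, the use of `None` for a ‘~’ (unused) template cell and the choice to leave border pixels at 0 are all assumptions made for illustration. The template is applied without flipping, which makes no difference for the symmetric Laplace template.

```python
from functools import reduce

# Local and global operators named as in the text; the dictionary keys are illustrative.
LOCAL = {"mult": lambda p, w: p * w, "add": lambda p, w: p + w}
GLOBAL = {"sum": lambda a, b: a + b, "maximum": max, "minimum": min}


def neighbourhood(image, window, local, global_):
    """Apply a template to every interior pixel; None marks a '~' (unused) cell."""
    rows, cols = len(image), len(image[0])
    wr, wc = len(window), len(window[0])
    r0, c0 = wr // 2, wc // 2
    out = [[0] * cols for _ in range(rows)]  # border pixels stay 0 (assumption)
    for y in range(r0, rows - r0):
        for x in range(c0, cols - c0):
            # Stage 1: local operator between each pixel and its coefficient.
            terms = [
                LOCAL[local](image[y + i - r0][x + j - c0], window[i][j])
                for i in range(wr)
                for j in range(wc)
                if window[i][j] is not None
            ]
            # Stage 2: global operator reduces the local results to one pixel.
            out[y][x] = reduce(GLOBAL[global_], terms)
    return out


LAPLACE = [[None, -1, None],
           [-1, 4, -1],
           [None, -1, None]]

image = [[0, 0, 0, 0],
         [0, 5, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
result = neighbourhood(image, LAPLACE, "mult", "sum")
```

Swapping the two operator names (for example `"add"` with `"minimum"`) selects the other neighbourhood operations in the table without changing the loop structure.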
A high level description of such a neighbourhood operation has the general format:
Neighbourhood_instruction (Input_image(Size, pixel_width), Local_op, Global_op, Window_info, filename)
where ‘Window_info’ defines the static window to be used, and ‘filename’ is the name of the resulting netlist
configuration file.
For the Laplace operation presented above, and for a 256x256 image of 16-bit pixels, the high level description
would be:
Mylaplace=Neighbourhood_instruction (input_image(256,256,16), local(mult), global(sum), window(3,3,[ [ ~, -1, ~],
[-1, 4, -1],
[~, -1, ~]]),
‘my_laplace’).
In practical image processing applications, many algorithms comprise more than a single operation. Such complex
operations can be broken into a number of primitive operations. Sometimes, these operations will be concurrent, in
which case they could be implemented in parallel using separate regions of the FPGA. For example, the Sobel edge
detection algorithm can be performed by adding the absolute results of two separate convolutions as shown in Figure 1.
[Figure 1 (diagram not reproduced): two parallel branches, each a neighbourhood operation (Sobel_horizontal or
Sobel_vertical) followed by an absolute value stage; the outputs of the two branches are added.]
Figure 1: Sobel complex operation.
The whole operation is therefore equivalent to one single complex operation, which can be described as follows in our
high level environment:
Sobel_horizontal = image_operator ( Input_image(256,256,16), local(mult), global(sum), window(3,3 , [ [ 1, 2, 1 ],
[ 0, 0, 0 ],
[-1, -2, -1] ] ));
Sobel_vertical = image_operator ( Input_image(256,256,16), local(mult), global(sum), window(3,3, [ [ 1, 0, -1 ],
[ 2, 0, -2 ],
[ 1, 0, -1] ] ));
/* Sobel Complex operation */
Sobel = image_instruction(plus(abs(Sobel_horizontal), abs(Sobel_vertical)), ‘MySobel’);
Note that an image_operator is essentially the same as a neighbourhood_instruction, except that it does not cause
generation of a configuration. This is because it is only a component in a compound image_instruction.
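As a software reference for what the compound Sobel instruction computes, the following Python sketch adds the absolute values of the two convolution results. It is only an illustration under stated assumptions: the helper `correlate3x3`, the zero-valued borders and the small test image are hypothetical and are not part of the environment described here.

```python
def correlate3x3(image, window):
    """Multiply-and-sum a 3x3 template over interior pixels (borders left at 0)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            out[y][x] = sum(image[y + i - 1][x + j - 1] * window[i][j]
                            for i in range(3) for j in range(3))
    return out


# Templates copied from the high level description in the text.
SOBEL_H = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
SOBEL_V = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]


def sobel(image):
    """plus(abs(Sobel_horizontal), abs(Sobel_vertical)), pixel by pixel."""
    h = correlate3x3(image, SOBEL_H)
    v = correlate3x3(image, SOBEL_V)
    return [[abs(a) + abs(b) for a, b in zip(h_row, v_row)]
            for h_row, v_row in zip(h, v)]


# Vertical step edge: two dark columns followed by two bright columns.
image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel(image)
```

The two calls to `correlate3x3` are independent of each other, which is the property that allows the two branches to occupy separate regions of the FPGA and run concurrently.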
Another example of a complex neighbourhood operation is where operations are carried out in series; i.e. the operations
may be cascaded to form a pipeline. A typical example of two cascaded neighbourhood operations is the ‘Open’
operation. To do an ‘Open’ operation, an ‘Erode’ neighbourhood operation is first performed, and the resulting image is
fed into a ‘Dilate’ neighbourhood operation as shown in Figure 2.
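[Figure 2 (diagram not reproduced): the input image passes through an ERODE block whose output feeds a DILATE block.]
Figure 2: ‘Open’ complex operation.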
This operation is described as follows in our high level environment:
Erode = image_operator ( Input_image(256,256, 16), local(add), global(minimum), window(3,3, [ [ ~ , 0, ~],
[ 0, 0, 0 ],
[ ~ , 0, ~] ] ));
/* ‘Open’ Complex operation, note that ‘Erode’ stands for the input image of ‘Dilate’ operation. */
Open = Neighbourhood_instruction ( Erode, local(add), global(maximum), window(3,3, [ [ ~ , 0, ~ ],
[ 0, 0, 0 ],
[ ~ , 0, ~ ] ] ), ‘My_Open’);
Note that Sobel edge detection above used this form of pipeline where the neighbourhood operations (Sobel_vertical
and Sobel_horizontal) are followed by an Image-Scalar absolute operation.
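The cascaded ‘Open’ operation can be sketched in Python as an additive minimum followed by an additive maximum, both using the cross-shaped template of zeros given above. This is an illustrative model only: the function names, the use of `None` for ‘~’ cells and the zero-valued borders are assumptions.

```python
# Cross-shaped template of zeros; None stands for a '~' (unused) cell.
CROSS = [[None, 0, None],
         [0, 0, 0],
         [None, 0, None]]


def additive(image, window, global_op):
    """Local 'add' followed by the given global operator (min or max)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]  # borders stay 0 (assumption)
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            out[y][x] = global_op(
                image[y + i - 1][x + j - 1] + window[i][j]
                for i in range(3) for j in range(3)
                if window[i][j] is not None)
    return out


def open_image(image):
    """Erode, then feed the eroded image into Dilate."""
    eroded = additive(image, CROSS, min)
    return additive(eroded, CROSS, max)


# An isolated bright pixel is removed by the opening.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 9
opened_speck = open_image(speck)

# A uniformly bright region survives in the interior.
flat = [[7] * 5 for _ in range(5)]
opened_flat = open_image(flat)
```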
When the user wishes to execute an IPC instruction using the FPGA, the operation Apply is invoked. This reconfigures
the chip using the specified configuration file, and passes the source image to the FPGA, storing the results in the
destination image.
3. ARCHITECTURAL DESIGN ISSUES
This section outlines the development and design of the FPGA configurations necessary to implement the high level
instruction set. In deriving our design, there are certain key guiding principles which were followed:
• Scaleability: The hardware design should not be inherently limited to a particular data word length or a particular
template size or shape.
• First-time-right place and route: A design generated for an individual high level instruction should always be guaranteed
to place and route first time and require no alteration from the user. This will influence the design of building block
components.
• Utilisation of chip functionality: Although it is important that chip resources are used efficiently, once a fit for an
instruction has been obtained, there is nothing in particular to be gained from further reducing the number of gates
used. This factor differs from, say, the VLSI approach.
• Reusability: The instruction set should be implemented using arrangements of readily customisable building block
units that may be reused in a number of instructions (e.g. multipliers, adders, etc.).
3.1. A general 2-D neighbourhood operation
As mentioned earlier, any neighbourhood image operation involves passing a 2-D window over an image, and carrying
out a calculation at each window position. To allow each pixel to be supplied only once to the FPGA, internal line
delays are required. This is a common approach used in many hardware realisations [3,4]. The idea is that these internal
line delays are used to synchronise the supply of input values to the processing elements [5], ensuring that all the
pixel values involved in a particular neighbourhood operation are processed at the same instant. Figure 3 shows the
architecture of a generic 2-D neighbourhood operation with an N by M window template where ‘L’ is the local
operation and ‘G’ is the global one. Each of the Processing Elements (PEs) stores a template weight (or coefficient)
internally and performs the necessary Local and Global operations.
[Figure 3 (diagram not reproduced): the pixel stream passes through line delays (Line Delay 0 to Line Delay M-2) and
pixel delays into an array of Processing Elements, PE 0 to PE N*M-1, each containing a local (L) and a global (G)
operation block.]
Figure 3: Architecture of a generic 2-D, N by M neighbourhood operation.
One principal feature of our design approach for PEs is to have a standard framework which will accommodate the full
range of neighbourhood operators. Into this standard framework will be plugged the appropriate local and global
sub-blocks [2].
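The role of the line delays can be illustrated with a short Python simulation in which every pixel is read from the input stream exactly once. It is a behavioural sketch only: the function name `stream_windows`, the raster-order stream and the modelling of the (M-1) line delays plus N pixel delays as a single shift register are assumptions, not a description of the actual circuit.

```python
from collections import deque


def stream_windows(pixels, width, n=3, m=3):
    """Yield (row, col, window) for each complete n-wide, m-high window.

    `pixels` is consumed once, in raster order; `width` is the line length.
    """
    # (m - 1) line delays of `width` pixels plus n pixel delays, as one register.
    taps = deque(maxlen=(m - 1) * width + n)
    for index, pixel in enumerate(pixels):
        taps.append(pixel)
        row, col = divmod(index, width)
        if row >= m - 1 and col >= n - 1:
            newest = len(taps) - 1
            # Tap the last n pixels of each of the m most recent lines.
            window = [[taps[newest - (m - 1 - i) * width - (n - 1 - j)]
                       for j in range(n)]
                      for i in range(m)]
            yield row, col, window


# A 4x4 image whose pixel values equal their raster index.
windows = list(stream_windows(range(16), width=4))
```

Each yielded window has the most recently supplied pixel in its bottom-right corner, so all values needed for one result are available at the same time without re-reading the image.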
3.2. Architecture of a Processing Element
Before defining the architecture of a Processing Element, two important strategic design decisions need to be made:
• Arithmetic architecture methodology: should we use bit serial or bit parallel arithmetic units?
• Arithmetic representation: which number system should be used to represent and process data?
Since a complete convolver (or similar operator) must easily fit on a single FPGA chip, parallel multipliers are not
feasible. This leads to our decision to use bit serial arithmetic. Note, secondly, that the need to pipeline bit serial
Maximum and Minimum operations suggests we should process data Most Significant Bit first (MSBF). Following on
from this choice, because of problems in doing addition MSBF in 2’s complement, it is advantageous to use an
alternative number representation to 2’s complement. Therefore, although there are several possible solutions, the
solution which we have implemented to meet the design constraints is based on the following choices:
(i) Bit serial arithmetic
(ii) Signed Digit Number Representation (SDNR) rather than 2’s complement [2].
(iii) Most Significant Bit First processing
Because image data may have to be occasionally processed on the host processor, the basic storage format for image
data is, however, 2’s complement. Therefore, processing elements first convert their incoming image data to SDNR.
This also reduces the chip area required for the line buffers (in which data is held in 2’s complement). A final unit to
convert an SDNR result into 2’s complement will be needed before any results can be returned to the host system. With
these considerations, a more detailed design of a general Processing Element (in terms of a local and a global operation)
is given in Figure 4.

global
Binary to SDNR
BufferUnit
Localoperation
with local
constant
coefficient
C
Localoperation
Globaloperation
with global
Globaloperation
Figure 4: Architecture of a standard Processing Element.
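To make the number representation concrete, the Python sketch below converts a 2’s complement word into radix-2 signed digits (each digit in {-1, 0, 1}) and evaluates a digit string Most Significant Digit first. This is one simple, valid mapping chosen for illustration; it is an assumption and not necessarily the conversion performed by the Binary to SDNR unit of the Processing Element.

```python
def twos_complement_to_sd(value, bits):
    """Return signed digits (MSB first), each in {-1, 0, 1}, for a 2's complement word."""
    word = value & ((1 << bits) - 1)
    b = [(word >> i) & 1 for i in range(bits - 1, -1, -1)]  # MSB first
    # The sign bit has weight -2^(bits-1), so it becomes the digit -b[0].
    return [-b[0]] + b[1:]


def sd_to_int(digits):
    """Accumulate digits MSB first, as a serial unit would receive them."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total


digits = twos_complement_to_sd(-3, 4)  # 1101 in 2's complement
```

The representation is redundant: for example, both `[-1, 1, 0, 1]` and `[0, -1, 0, 1]` evaluate to -3. Redundancy of this kind is what makes it possible to produce the leading digits of a sum without waiting for carries from the less significant positions.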
The constant coefficient bits are hardwired in the logic [2]. The ability to dynamically reconfigure gates means that the
image template can be rapidly reconfigured an unlimited number of times. Partial reconfiguration and fast
reconfiguration times make the XC6200 very suitable for this application, and reconfiguration data can be generated
easily using the XACT6000 tool [6].
4. DESIGN OF THE SOFTWARE ENVIRONMENT
As mentioned earlier, the image processing operations are created using a fixed set of primitive blocks (e.g. Multiply,
Σ). These basic building blocks are described and assembled using a high level hardware description notation called
HIDE [8], which is based on Prolog [9]. Primitive blocks are generated off-line and stored in EDIF format; their Prolog
description includes information such as width, height and i/o ports. To assemble these primitive blocks into larger
components, a small and simple set of constructors is provided including vertical and horizontal block composition and
replication. For instance, suppose the task is to describe a scaleable generic NxM neighbourhood architecture as shown
in Figure 3, where L(C) represents the local operation (e.g. we could supply MULT(C) for multiplying by a C-bit
coefficient), and G is the global operation (e.g. we could supply SUM). The local-global part of the PE is first defined
by the object:
vertical([G,L(C)])
Then, a block N wide of these is defined by a horizontal sequence of N as follows:
h_seq(N, vertical([G,L(C)]))
M such blocks are then described by a horizontal sequence of M as follows:
h_seq(M, h_seq(N, vertical([G,L(C)])))
Finally, the two line delays (or line buffers) needed for a three-row window, each holding ‘image_height’ words of size
K bits, are assembled vertically with the previous component by:
vertical([h_seq(M, h_seq(N, vertical([G, L(C)]))), h_seq(2, BUF(K,image_height))])
For simplicity, we have omitted the Binary to SDNR converters and the delays, but it should be clear how these could be
incorporated. Also, there are some further details of the description not discussed here which are necessary to achieve
the complete configuration (mainly to do with block interconnection) but this is handled automatically by our system.
Our software tools, which are written in Prolog, take the above description and generate a data structure representing
the entire circuit including placement and routing information. This data structure is finally converted to an EDIF
description of the circuit.
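The way such constructors compose can be mimicked in a few lines of Python. This is only an analogy to the Prolog-based HIDE notation: the `Block` class, the bounding-box arithmetic and every cell dimension below are hypothetical placeholders, and the real system additionally handles ports, placement and routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    width: int   # in cells (hypothetical units)
    height: int


def vertical(blocks):
    """Stack blocks on top of one another."""
    return Block("vertical([" + ",".join(b.name for b in blocks) + "])",
                 max(b.width for b in blocks),
                 sum(b.height for b in blocks))


def h_seq(count, block):
    """Replicate a block `count` times side by side."""
    return Block(f"h_seq({count},{block.name})",
                 count * block.width, block.height)


# Placeholder primitive sizes, invented for the example.
G = Block("G", 4, 2)
L = Block("L(C)", 4, 3)
BUF = Block("BUF(K,image_height)", 10, 1)

pe = vertical([G, L])                    # local-global part of one PE
array = h_seq(3, h_seq(3, pe))           # N = M = 3
full = vertical([array, h_seq(2, BUF)])  # add the two line buffers
```

Because each constructor returns an ordinary block, the same two functions describe a single PE, a row of PEs and the complete neighbourhood architecture, which mirrors the scaleability aim stated earlier.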
Using the same simple set of constructors, many image processing configurations (primitive and complex operations)
have been generated from a high level description similar to that presented in section 2 (e.g. Sobel and Open).
5. RUNTIME EXECUTION ENVIRONMENT
Our implementation of the coprocessor is based on the HotWorks™ PCI board (HOT1) with an XC6216 FPGA. Our
design exploits the RAM on the HotWorks board as a temporary image holder. Part of the XC6216 FPGA is configured
as a simple controller (including address generator units); the remainder is dynamically configured for each
image-processing operation.
The controller consists of the SRAM hard macro and an address generator. The SRAM macro consists of three blocks [7]:
1- SRAM_REG: consists of two 32-bit registers (input/output) interfacing to IOBs.
2- SRAM_ADDRESS: interfaces with the onboard RAM address register.
3- SRAM_RDWR: interfaces with the onboard RAM RDWR 4-bit register. The latter determines the Read/Write state
of the RAM banks.
The Address generator produces the desired memory addressing sequence.
The physical layout of a 3 by 3 convolution, with the necessary memory interface, has been generated automatically
from the high level description of the Laplace operation given in section 2. Note that special attention had to be paid to
the routing between the IP unit block and the Input/Output pins where data is driven to/from fixed pin positions. This
has forced us to dedicate logic blocks for routing, which in turn has reduced the area available for the IP unit.
Behind the scenes, when the user creates a coprocessor instruction object, the software environment will use the details
of this instruction object to generate the corresponding EDIF description. The latter is then input to the XACT 6000
tool in order to generate the corresponding FPGA configuration bitstream. The EDIF description of the 3 by 3
convolution circuit has been generated in less than one second. Note that the resulting configuration is stored in a
library, so it will not be regenerated if exactly the same operation happens to be invoked again. The Apply routine will
then download the bitstream to the FPGA board, and tailor any cells if necessary; and finally it will trigger the onboard
clock to process the entire image. To the user, the programming model is merely the set of algebraic operators provided
by the library. When the operation is complete, the application program will be able to proceed, possibly to another
coprocessor instruction, at which point the above process is repeated. The overall design of the software system that
provides this high level programming model, and hides details of configurations from the user, is shown in Figure 5.
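[Figure 5 (diagram not reproduced): the User Program (1) builds an image instruction, giving a HIDE description. If the
configuration does not already exist in the library of FPGA configuration files, the HIDE system (the HIDE generator
together with the library of primitive EDIF descriptions) produces EDIF, which XACT 6000 turns into an FPGA bitstream.
(2) Apply downloads the bitstream to the FPGA board and issues “Go”.]
Figure 5: Outline structure of the complete environment.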
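The ‘already exists?’ check in this flow amounts to caching generated configurations. The Python sketch below illustrates the idea under stated assumptions: the class `ConfigurationLibrary`, the hashing of the textual description as the lookup key and the `generate` callable (a stand-in for the HIDE and XACT 6000 steps) are hypothetical and do not reflect the actual file-based library.

```python
import hashlib


class ConfigurationLibrary:
    """Keeps generated configurations so identical requests are not regenerated."""

    def __init__(self, generate):
        self._generate = generate  # stand-in for HIDE + XACT 6000 (hypothetical)
        self._store = {}
        self.generated = 0         # number of times generation actually ran

    def get(self, description):
        key = hashlib.sha256(description.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._generate(description)
            self.generated += 1
        return self._store[key]


library = ConfigurationLibrary(lambda d: b"bitstream:" + d.encode("utf-8"))
first = library.get("laplace 3x3, 256x256, 16-bit")
second = library.get("laplace 3x3, 256x256, 16-bit")  # served from the library
other = library.get("sobel 3x3, 256x256, 16-bit")     # generated afresh
```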
6. CONCLUSIONS
This paper has discussed the design of an FPGA-based image processing development environment, which has the
primary goal of providing a high level programming interface to a low cost high performance computing engine. This
enables the application developer to concentrate on the image processing aspects of a problem, rather than having to get
to grips with FPGA technology. The programming model we have adopted is based on the image-level neighbourhood
operators of Image Algebra, and an extensive range of algorithms can be implemented efficiently in this simple model.
The use of Prolog as a hardware description language has proved beneficial. Indeed, unbound variables are useful in
describing blocks with uncommitted factors such as size and port positions. These unbound variables are bound at
runtime to satisfy certain requirements. Rules are useful for capturing heuristics for first-time-right place and route.
Our implementation of the IP coprocessor is based on the HotWorks™ PCI board (HOT1) where the onboard RAM is
exploited to speed up the image processing operation. From our experience, two advantages of the XC6200 should be
noted. It has the flexibility of rapid partial reconfiguration for changing image template weights; also it is well suited
to this kind of structured, heavily pipelined application. On the other hand, we have noted that the lack of additional
routing facilities between the IOBs and the logic cells in the XC6200 has forced us to dedicate logic blocks for routing.
This leads to a significant drop in the design performance. Indeed, according to timing simulation using XACT6000,
the maximum clock speed of the design is 45 MHz if data is fed direct to IOBs. This would enable convolution of a 256
by 256 image with a theoretical frame rate of 43 Hz. On the other hand, if data is fed from the onboard memory, the
maximum clock rate drops to 20 MHz, which gives a maximum frame rate of 19 Hz.
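These frame rates are consistent with a bit serial design that consumes one bit per clock cycle. The short Python check below reproduces them; the 16-bit pixel width is an assumption taken from the 16-bit examples in section 2, and the function name is illustrative.

```python
def frame_rate_hz(clock_hz, width, height, bits_per_pixel):
    """Frames per second when one bit is consumed per clock cycle."""
    return clock_hz / (width * height * bits_per_pixel)


direct = frame_rate_hz(45e6, 256, 256, 16)    # data fed direct to IOBs
from_ram = frame_rate_hz(20e6, 256, 256, 16)  # data fed from onboard memory
```

With these figures, `direct` rounds to 43 and `from_ram` rounds to 19, matching the rates quoted above.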
Recent changes in Xilinx plans for the XC6000 series have encouraged us to move towards implementing the IPC on
alternative FPGA platforms. We are currently developing a similar software environment for the XC4000 series [12].
References
1. Ritter G X, Wilson J N and Davidson J L, ‘Image Algebra: an overview’, Computer Vision, Graphics and Image
Processing, No 49, pp 297–331, 1990.
2. P. Donachy, D. Crookes, A. Bouridane, K. Alotaibi, and A. Benkrid, ‘Design and implementation of a high level
image processing machine using reconfigurable hardware’, Proceedings SPIE’98, Vol. 3526, pp 2–13, 1998.
3. Shoup, R G, ‘Parameterised Convolution Filtering in an FPGA’, More FPGAs, W Moore and W Luk (editors),
Abington EE&CS Books, pp 274, 1994.
4. Kamp, W, Kunemund, H, Soldner and Hofer, H, ‘Programmable 2D linear filter for video applications’, IEEE
Journal of Solid State Circuits, pp 735–740, 1990.
5. Hecht, V, Ronner, K and Pirsch, P, ‘A defect-tolerant systolic array implementation for real-time image
processing’, Journal of VLSI Signal Processing, Vol 5, pp 37–47, 1993.
6. Xilinx (1996). XACT STEP Series 6000 User Guide, Xilinx Corporation, Scotland, UK.
7. Xilinx, XC6200 Development System Data Sheet, Version 1.2, August 7, 1997.
8. D Crookes, K Alotaibi, A Bouridane, P Donachy and A Benkrid, ‘An Environment for Generating FPGA
Architectures for Image Algebra-based Algorithms’, ICIP’98, Chicago, October 1998.
9. Clocksin W F and Mellish C S, ‘Programming in Prolog’, Springer-Verlag.
10. Crookes D, Morrow P J and McParland P J, ‘IAL: a parallel image processing programming language’, IEE
Proceedings, Part I, Vol 137 No 3, pp 176–182, June 1990.
11. Brown T J and Crookes D, ‘A high level language for image processing’, Image and Vision Computing, Vol 12 No
2, pp 67–79, March 1994.
12. K. Benkrid, D. Crookes, A. Bouridane, P. Corr and K. Alotaibi, ‘A High Level Software Environment for FPGA
Based Image Processing’, to appear in Proceedings IPA’99, Manchester, July 1999.