MORPHOLOGICAL IMAGE PROCESSING USING CUSTOM INSTRUCTIONS ON DISTRIBUTED NIOS PROCESSORS

Haichen Ren, David J. Jackson
Electrical and Computer Engineering
The University of Alabama
Tuscaloosa, AL 35487-0286 USA

Abstract

As a fundamental image processing block, morphological processing involves intensive computation and contributes significantly to system overhead. Because they depend only on spatially local data, several morphological operations can be implemented with parallel hardware to reduce the computation overhead. In this paper, we implement morphological image operations, including dilation, erosion, and edge detection based on a 3x3 mask, on a distributed Altera NIOS® soft core system. We also implement custom instructions to improve system performance. Compared with a non-distributed system without custom instructions, the speedup of several morphological operations on the distributed system with custom instructions reaches 11.8. The system architecture and implementation details are presented.

Keywords: image processing, morphological operation, embedded processor, programmable logic, soft core.

1. INTRODUCTION

Recently, the image processing community has become aware of the potential for massive parallelism and high computational density in hardware. Among all available hardware, embedded programmable processors give the system designer unprecedented freedom in determining which functions should be executed in software and which would benefit most from a dedicated hardware implementation in the form of custom peripherals or coprocessor elements.
This flexibility allows a designer not only to rapidly prototype new designs, easily integrate different digital components into one design, and fully realize, in hardware, several iterations of a system in a shorter amount of time, but also to explore different partitioning options to deliver the best possible combination in the product while still meeting the design's functionality and performance requirements [1].

Based upon the Altera Excalibur embedded processor solution, we built a distributed hardware implementation of several morphological operations and evaluated the custom instruction design of the Altera NIOS soft core processor on non-distributed and distributed systems.

This paper is organized as follows. Section 2 presents the Altera Excalibur™ embedded processor solution. Section 3 reviews several morphological image operations and details ORD-4C, an algorithm for morphological image processing. Section 4 details a distributed NIOS hardware implementation with custom instructions using the ORD-4C algorithm. Section 5 presents results and performance.

2. NIOS SOFT CORE PROCESSOR

One of the more popular ways to combine embedded processors with programmable logic is the soft core microprocessor. Soft core processors are a recent digital design method that combines the advantages of programmable logic devices with those of conventional hard core processors. Soft core processors function like hard core processors but are implemented on programmable logic devices (PLDs), such as Field Programmable Gate Arrays (FPGAs). Scalability and flexibility are the two main advantages that soft core processors derive from implementation on PLDs. Soft cores are flexible in that custom-defined logic can be easily integrated into the processor with minimal interfacing requirements. Some soft core processors even allow their internal architecture to be changed to suit a particular design.
This gives the designer more flexibility when interfacing the soft core processor to the rest of the embedded system. The scalability of soft core processors allows more than one processor to be implemented in a particular design, limited only by the capacity and resources of the PLD used. This scalability and flexibility make the soft core processor suitable for a variety of applications, such as communications or digital signal processing.

2.1 NIOS soft core processor

Altera Excalibur™ solutions include both soft core and hard core embedded processors. As part of the Altera Excalibur embedded processor solutions, and implemented here on an Altera APEX EP20K200EFC484-2x device, the NIOS soft core embedded processor is a configurable, general-purpose RISC microprocessor with a 16-bit instruction set, a user-selectable 32- or 16-bit datapath, and a configurable register file and barrel shifter size. It can provide up to 50 MIPS while being optimized for area in a PLD, and it easily fits into an Altera APEX device, leaving most of the logic available for peripherals and custom logic functions. Figure 1 shows the structure of the NIOS embedded processor [2].

Figure 1. The NIOS embedded programmable processor [2]. Blocks shown: PBM, SRAM, CPU, IRQ, FLASH, timer, serial port, UART, and the area of the APEX device available for customization.

2.2 Custom Instructions

The Altera NIOS processor is one of the few soft core processors that allow custom instructions. By designing custom instructions for the NIOS soft core, system designers can add up to five custom-defined functions to the NIOS processor's arithmetic logic unit (ALU) and instruction set, as shown in Figure 2. A custom instruction consists of a custom logic block and a software macro. The custom logic block is the hardware that performs the operation. The NIOS processor can include up to five user-defined custom logic blocks, which become part of the NIOS microprocessor's ALU.
The software macro is the user interface that allows the system designer to access the custom logic from software code.

Figure 2. Custom instruction logic block and interface of 32-bit NIOS processor [3].

3. MORPHOLOGICAL OPERATIONS

Morphology offers a unified and powerful approach to numerous image processing problems [4]. The goal of morphological operations is to smooth the contours of objects and to decompose an image into its fundamental geometrical shapes.

3.1 Fundamental morphological image processing

Morphological operations can be separated into two categories: 1) binary morphological filtering of binary images, and 2) grey-level morphological filtering of grey-level images. Four basic types of binary morphological filtering operations are available: erosion, dilation, opening, and closing. Each of these filters uses a mask, or structuring element, to determine the geometrical filtering process. The erosion filter reduces the geometrical size of an object, while the dilation filter enlarges it. However, these operations are generally not reversible. An opening filter is simply an erosion filter followed by a dilation filter, and a closing filter is a dilation filter followed by an erosion filter [4][5]. Most more complex morphological operations are combinations of these four. For example, edge detection can be obtained by calculating the difference between the dilated image and the original image, or between the original image and the eroded image.

Applying a morphological operation to an input image resembles a convolution between the input image and the structuring element (SE). For example, with a 3x3 all-1's SE, a block of nine pixels is covered by the SE at each step, and the maximum or minimum value among the nine pixels is taken as the dilation or erosion output for the central pixel. After the current step is complete, the SE moves one pixel to the right, or down to the beginning of the next row if the end of the current row has been reached.

3.2 ORD-4C algorithm

If we trace the movement of the SE, we find that consecutive positions cover overlapping pixels, both between steps and between rows. For example, in Figure 4, with a 3x3 all-1's SE, nine double-lined pixels are covered at each movement step. Between two adjacent steps, six bold-lined pixels always overlap, and four pixels overlap among the four movement steps shown in Figure 4.

The comparison among nine pixels is the dominant computation in the algorithm. If we implement the nine-pixel comparison in hardware, we can significantly reduce system overhead compared with a high-level language implementation. The custom instruction feature of the NIOS soft core processor allows at most two input operands and one output, plus an 11-bit control field that selects among operations on the input operands. We can build a custom instruction on the 32-bit processor, so that up to eight 8-bit pixel values can be embedded in the instruction operands.

Based on the observation above that, for an 8-bit grey-scale image with a 3x3 all-1's SE, six pixels overlap between consecutive comparison steps, we can embed six 8-bit pixel values into the two 32-bit input operands. The comparison among nine pixels can then be done in two levels: first compare the left six pixels and the right six pixels separately, and then compare the results of these two 6-pixel comparisons to get the final result for the current central pixel. This design has two drawbacks. First, the intermediate results cannot be reused between two adjacent rows. Second, and more wasteful, the final result for the current central pixel is generated by comparing two 8-bit values, namely the intermediate result L-6 of the left six pixels and the intermediate result R-6 of the right six pixels.
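The two-level six-pixel comparison described above can be modelled in software as follows. This is a minimal sketch for the dilation (maximum) case, not the hardware design from the paper: the function names `pack3`, `max6`, and `dilate_pixel_two_level` are hypothetical, and the byte layout (three pixels in the low 24 bits of each 32-bit operand) is an assumption made for illustration.

```python
def pack3(p0, p1, p2):
    """Pack three 8-bit pixels into the low 24 bits of a 32-bit operand.

    The layout is an assumption; the paper does not specify one.
    """
    return (p0 << 16) | (p1 << 8) | p2


def unpack_bytes(word):
    """Return the four 8-bit fields of a 32-bit word, most significant first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def max6(op_a, op_b):
    """Model of a hypothetical custom instruction: maximum of six packed pixels.

    The unused top byte of each operand is skipped.
    """
    return max(unpack_bytes(op_a)[1:] + unpack_bytes(op_b)[1:])


def dilate_pixel_two_level(window):
    """Dilation output for the centre of a 3x3 window (list of three rows)."""
    top, mid, bot = window
    col0 = pack3(top[0], mid[0], bot[0])
    col1 = pack3(top[1], mid[1], bot[1])
    col2 = pack3(top[2], mid[2], bot[2])
    # Level 1: left six pixels (columns 0-1) and right six pixels (columns 1-2).
    left6 = max6(col0, col1)
    right6 = max6(col1, col2)
    # Level 2: only two 8-bit intermediate values are compared here.
    return max(left6, right6)


if __name__ == "__main__":
    window = [[1, 2, 3], [4, 250, 6], [7, 8, 9]]
    # The two-level result should equal the plain nine-pixel maximum.
    print(dilate_pixel_two_level(window), max(max(row) for row in window))
```

Erosion follows the same pattern with `min` in place of `max`. The second level makes the inefficiency visible: it compares two bytes although the instruction could carry eight.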
This final comparison uses only two 8-bit fields of the two 32-bit input operands.

Figure 4. Related pixels in each step of morphological operations: (a) pixels for step (1, 1); (b) pixels for step (1, 2); (c) pixels for step (2, 1); (d) pixels for step (2, 2); (e) steps (1, 1), (1, 2), (2, 1), and (2, 2) together.

As mentioned above, four pixels always overlap among four adjacent steps in the same row and in adjacent rows. If the 32-bit custom comparison instruction uses only one input operand, which embeds four 8-bit pixels, then two levels with five 4-pixel comparisons are enough for each nine-pixel comparison. We need to create two temporary buffers. For each row, whenever a new pixel value is input, we first perform the first-level comparison: the new pixel, the pixel to its left in the same row, and the two pixels at the same columns in the previous row are combined into the input operand of the custom instruction, and the comparison result is saved in the corresponding buffer. With this new result, we can then perform the second-level comparison by combining four values from the two temporary buffers. The result is the morphological operation output for a pixel in the previous row. Thus, with two comparisons, we generate one result for the previous row. All 32 bits of the input operand are used, and the same holds for all the intermediate results, which makes this scheme much more efficient than the six-pixel comparison custom instruction. We call it the One Row Delay 4-Pixel Comparison algorithm (ORD-4C).

4. HARDWARE IMPLEMENTATION

Most image processing applications require significant computation, even for the simple morphological operations described above. For example, consider the dilation of a 256x256 8-bit grey-scale image with a 3x3 SE. At a minimum, 256x256x9 8-bit comparisons must be performed. Reducing this intensive computation is the first concern in improving system performance. The morphological operations presented in Section 3 are local operations on the image: each step operates only on the pixels covered by the SE. This implies that a parallel algorithm can be used to further improve system performance.

4.1 Architecture of distributed system

To determine a reasonable and efficient parallel distributed architecture for a problem, many cost metrics can be used to investigate cost-performance tradeoffs in the network [6]. Here we apply a message-passing parallel architecture with four NIOS processors connected via 10 Mbit/s Ethernet network cards [7], as shown in Figure 5.

4.2 Parallel morphological operation

Assume we need to perform a morphological operation on a PxQ matrix. In a non-parallel architecture with one processor used for computation, a host holds the data matrix and a processor performs the morphological operation. The total time consists of the communication time for data transmission between host and processor plus the processor's computation time on the data. In a parallel architecture, we can split the data matrix into N (here N = 4) independent parts. Compared with the non-parallel operation, the theoretical speedup of the parallel architecture can reach N.
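The four-way split can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the quadrant layout, and the `halo` parameter are hypothetical. Each part is extended by a one-pixel overlap so that every 3x3 neighbourhood is complete. With `halo=0` the parts are disjoint, and pixels along the inner cuts are computed from incomplete neighbourhoods.

```python
def dilate3x3(img):
    """Reference 3x3 dilation (all-1's SE); border pixels use what is available."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = max(
                img[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            )
    return out


def parallel_dilate(img, halo=1):
    """Dilate four quadrants independently and reassemble the result.

    Each quadrant is extended by `halo` rows/columns of neighbouring pixels
    (an assumed overlap scheme). In a distributed system each call to
    dilate3x3 below would run on a separate processor.
    """
    rows, cols = len(img), len(img[0])
    row_mid, col_mid = rows // 2, cols // 2
    out = [[0] * cols for _ in range(rows)]
    for r0, r1 in ((0, row_mid), (row_mid, rows)):
        for c0, c1 in ((0, col_mid), (col_mid, cols)):
            # Extended region: the quadrant plus its overlap, clipped to the image.
            er0, er1 = max(r0 - halo, 0), min(r1 + halo, rows)
            ec0, ec1 = max(c0 - halo, 0), min(c1 + halo, cols)
            tile = [row[ec0:ec1] for row in img[er0:er1]]
            result = dilate3x3(tile)
            # Keep only the quadrant's own pixels; discard the overlap.
            for r in range(r0, r1):
                out[r][c0:c1] = result[r - er0][c0 - ec0:c1 - ec0]
    return out


if __name__ == "__main__":
    image = [[(37 * r + 91 * c) % 256 for c in range(8)] for r in range(8)]
    print(parallel_dilate(image) == dilate3x3(image))          # overlap split
    print(parallel_dilate(image, halo=0) == dilate3x3(image))  # disjoint split
```

The overlap costs each processor one extra row and column of input data, which is small compared with a 128x128 quadrant of a 256x256 image.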
For the morphological operation implementation on the distributed NIOS processor system, if we split the image symmetrically into four parts, we encounter an inner boundary problem. As shown in Figure 6 (a), when we perform an operation on pixel P1 in part 1, the three bottom pixels of its original 8-neighborhood fall outside the range of computation. In this case, we get an incorrect result for pixel P1. For pixel P2 in part 1, the right-most three pixels and the bottom-most three pixels of its original neighborhood are out of the range of computation. To avoid this problem, we apply asymmetrical splitting to the image, as shown in Figure 6 (b).

Figure 5. Architecture of the distributed system.

Figure 6. Inner boundary of the splitting of the image: (a) symmetrical splitting; (b) asymmetrical splitting. Labels shown: Part 1 to Part 4 with their rows and columns, and pixels P1 and P2.

5. System Architecture of NIOS Implementation

Based upon NIOS development boards with the APEX EP20K200EFC484-2x programmable logic device, we reconfigured and compiled the customized 32-bit NIOS soft core with custom instructions implementing the ORD-4C algorithm for the comparison. We compared the dilation, erosion, opening, closing, and edge detection operations on the 256x256 8-bit grey-scale image "lena.png" on different systems. Figure 7 shows the morphological image operations on the 256x256 8-bit lena image, and Figure 8 shows the system performance.

According to our results, compared with the C/C++ approach on a non-distributed NIOS soft core processor system, the custom instruction approach on the non-distributed system shows a speedup of approximately 2.99 on the basic dilation and erosion operations and a speedup of 1.7 on the edge detection operation. The same relative speedups are obtained on the distributed NIOS soft core processor system. Compared with the non-distributed C/C++ approach, the customized distributed system gains a speedup of around 11.8.

Figure 7. Morphological operations on lena: (a) lena.png; (b) dilation; (c) erosion; (d) opening; (e) closing; (f) edge detection.

Figure 8. System performance of different implementations on lena. Series: dilation and edge detection. Implementations compared: C/C++ approach (non-distributed), custom instruction (non-distributed), C/C++ approach (distributed), custom instruction (distributed). Vertical axis range: 0 to 7.0E+07.

6. REFERENCES

[1] White Paper of Excalibur Backgrounder, version 1, http://www.altera.com, June 2000
[2] NIOS Embedded Processor Development Board, version 2.1, http://www.altera.com, April 2002
[3] Custom Instructions for the Nios Embedded Processor, version 1.1, http://www.altera.com, April 2002
[4] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, third edition, Addison-Wesley, 1993
[5] Harley R. Myler and Arthur R. Weeks, Computer Imaging Recipes in C, Prentice-Hall, 1993
[6] Vipin Kumar, Ananth Grama, Anshul Gupta and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994
[7] Nios Ethernet Development Kit User Guide, version 2.1, http://www.altera.com, April 2002