MORPHOLOGICAL IMAGE PROCESSING USING CUSTOM INSTRUCTIONS ON
DISTRIBUTED NIOS PROCESSORS
Haichen Ren, David J. Jackson
Electrical and Computer Engineering
The University of Alabama
Tuscaloosa, AL 35487-0286 USA
Abstract
As a fundamental image processing building block,
morphological processing involves intensive computation
and contributes significantly to system overhead. Because
they depend only on spatially local data, several
morphological operations can be implemented in
parallel hardware to reduce the computation overhead. In
this paper, we implement morphological image
operations, including dilation, erosion, and edge
detection based on a 3x3 mask, on a distributed Altera
NIOS® soft core system. We also implement custom
instructions to improve system performance.
Compared with a non-distributed system without custom
instructions, the distributed system with custom
instructions reaches a speedup of up to 11.8 on several
morphological operations. The system architecture
and implementation details are presented.
Keywords: image processing, morphological operation,
embedded processor, programmable logic, soft core.
1. INTRODUCTION
Recently, the image processing community has
become aware of the potential for massive parallelism and
high computational density in hardware. Among the
available hardware options, embedded programmable
processors give the system designer unprecedented freedom
in determining which functions should be executed in
software and which would benefit most from a
dedicated hardware implementation in the form of custom
peripherals or coprocessor elements. This flexibility
allows a designer not only to rapidly prototype new
designs, easily integrate different digital components
into one design, and fully realize, in hardware, several
iterations of a system in a shorter amount of time, but also
to explore different partitioning options to deliver the
best possible combination while still
meeting the design's functionality and performance
requirements [1].
Based upon the Altera Excalibur embedded
processor solution, we implemented several morphological
operations in distributed hardware and evaluated the
custom instruction design of the Altera NIOS soft core
processor on both non-distributed and distributed
systems.
This paper is organized as follows. Section 2
presents the Altera Excalibur™ embedded processor
solution. Section 3 reviews several morphological image
operations and details the ORD-4C algorithm for
morphological image processing. Section 4 details a
distributed NIOS hardware implementation with custom
instructions using the ORD-4C algorithm. Section
5 presents results and performance.
2. NIOS SOFT CORE PROCESSOR
One of the more popular ways of combining
embedded processors with programmable logic
is the use of soft core microprocessors. Soft core
processors are a recent digital design method that
combines the advantages of programmable logic devices
with those of conventional hard core processors. Soft core
processors function like hard core processors but are
implemented on programmable logic devices (PLDs), such
as Field Programmable Gate Arrays (FPGAs).
Scalability and flexibility are the two main
advantages that soft core processors derive from their
implementation on PLDs. Soft cores are flexible in that
custom-defined logic can be easily integrated into the
processor with minimal interfacing requirements. Some
soft core processors even allow their internal architecture
to be changed to suit a particular design. This gives the
designer more flexibility when interfacing the soft core
processor to the rest of the embedded system. The
scalability of soft core processors allows more than one
processor to be implemented in a particular design, but
this is limited by the capacity and resources of the PLD
used. Scalability and flexibility make the soft core
processor suitable for use in a variety of applications such
as communications or digital signal processing.
2.1 NIOS soft core processor
Altera Excalibur™ solutions include both soft
core and hard core embedded processors. As part of the
Altera Excalibur embedded processor solutions and based
on the Altera APEX 20K200EFC484-2x, the NIOS soft core
embedded processor is a configurable, general-purpose
RISC microprocessor with a 16-bit instruction set, a
user-selectable 32- or 16-bit datapath, and a configurable
register file and barrel shifter size. It can provide up to
50 MIPS performance while being optimized for area in a
PLD and can easily fit into an Altera APEX device, leaving
most of the logic available for peripherals and custom logic
functions. Figure 1 shows the structure of the NIOS
embedded processor [2].
Figure 1. The NIOS embedded programmable processor [2].
(Block labels: CPU, PBM, SRAM, FLASH, IRQ, Timer, Serial Port,
UART, NIOS, Area Available For Customization, APEX Device.)
2.2 Custom Instructions
The Altera NIOS processor is one of the few
soft core processors that allow custom instructions. By
designing custom instructions for the NIOS soft
core, system designers can add up to five custom-defined
functions to the NIOS processor’s arithmetic logic
unit (ALU) and instruction set, as shown in Figure 2.
A custom instruction consists of a custom logic block and
a software macro. The custom logic block is the hardware
that performs the operation. The NIOS processor can include
up to five user-defined custom logic blocks, which
become part of the NIOS microprocessor’s ALU.
The software macro is the user interface that allows the
system designer to access the custom logic from
software code.
Morphological operations can be separated into
two categories: 1) binary morphological filtering of
binary images, and 2) grey-level morphological filtering
of grey-level images. Four basic types of binary
morphological filtering operations are available: erosion,
dilation, opening, and closing. Each of these filters uses a
mask, or structuring element, to determine the geometrical
filtering process. The erosion filtering operation reduces
the geometrical size of an object, while the dilation
filtering operation enlarges it.
However, these operations are generally not reversible. An
opening filter is simply an erosion filter followed by a
dilation filter, and a closing filter is a dilation filter
followed by an erosion filter [4][5]. Most of the more
complex morphological operations are combinations of
these four operations. For example, edge detection can be
obtained by calculating the difference between
the dilated image and the original image, or between the
original image and the eroded image.
A morphological operation on an input image
proceeds much like a convolution between the input
image and the structuring element (SE). For example, with a
3x3 all 1’s SE, a block of nine pixels is covered by the SE
in each step, and the maximum or minimum value
among the nine pixels is taken as the dilation or erosion
output, respectively, for the central pixel. After the current
step is complete, the SE moves to the right by
one pixel, or moves down to the beginning of the next row
if the end of the current row has been reached.
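The sliding-SE procedure and the compositions described above
can be sketched in software as follows. This is a minimal
Python model, not the NIOS implementation: the function names
are hypothetical, and clipping the SE at the image border is an
assumption, since the text does not say how border pixels are
handled.

```python
def _slide(image, pick):
    """Slide a 3x3 all-ones SE over image; pick is max or min."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Pixels covered by the SE, clipped at the border (assumption).
            window = [image[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, rows))
                      for cc in range(max(c - 1, 0), min(c + 2, cols))]
            out[r][c] = pick(window)
    return out


def dilate(image):
    return _slide(image, max)   # maximum of the covered pixels


def erode(image):
    return _slide(image, min)   # minimum of the covered pixels


def opening(image):
    return dilate(erode(image))  # erosion followed by dilation


def closing(image):
    return erode(dilate(image))  # dilation followed by erosion


def edge(image):
    # Difference between the dilated image and the original image.
    dilated = dilate(image)
    return [[d - p for d, p in zip(d_row, p_row)]
            for d_row, p_row in zip(dilated, image)]
```

On a 5x5 image with a single bright pixel in the center,
dilate grows the pixel into a 3x3 block, erode removes it, and
edge leaves the ring around it.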
3.2 ORD-4C algorithm
If we trace the movement of the SE, we find that
some pixels overlap between consecutive steps and between
consecutive rows. For example, in Figure 4, with a 3x3 all
1’s SE, nine double-lined pixels are covered in each
movement step. Between adjacent steps, 6 bold-lined pixels
always overlap, and 4 pixels overlap among
the 4 movement steps shown in Figure 4.
Figure 2. Custom instruction logic block and interface of 32-bit NIOS
processor [3].
3. MORPHOLOGICAL OPERATIONS
Morphology offers a unified and powerful
approach to numerous image processing problems [4].
The goal of morphological operations is to smooth the
contours of the objects and to decompose an image into
its fundamental geometrical shapes.
3.1 Fundamental morphological image processing
Obviously, the comparison among nine pixels
is the most frequent computation in the
algorithm. If we implement the nine-pixel comparison
operation in hardware, we can significantly reduce the
system overhead compared with a high-level language
implementation. The custom instruction
feature of the NIOS soft core processor allows at most two
input operands and one output, plus 11 control bits
that provide more choices for the operation on the input
operands. Building a custom instruction on the
32-bit processor, we can therefore have up to
eight 8-bit pixel values embedded in the instruction operands.
Based on the observation above, for morphological
operations on an 8-bit grey-scale image with a 3x3 all 1’s
SE, six pixels overlap between
consecutive comparison steps. We can embed six 8-bit pixel
values into the two 32-bit input operands. The comparison
among nine pixels can then be done in two levels: first
compare the left six pixels and the right six pixels
separately, and then compare the results of these two
6-pixel comparisons to get the final morphological
operation result for the current central pixel. This
design has two drawbacks. First, the intermediate results
cannot be reused between two adjacent rows.
Second, and more inefficient, the
final result for the current central pixel is generated by
comparing just two 8-bit values, the intermediate
results of the left six pixels (L-6) and the right six pixels
(R-6), so only two 8-bit fields of the two 32-bit input
operands are used.
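A rough software model of this six-pixel design may help show
where the operand bits go unused. It is a sketch under
assumptions the text does not fix: three pixels are packed into
the low 24 bits of each operand, the low byte holds the first
pixel, and the names pack, max6, max2, and dilate_pixel_six are
hypothetical. Dilation (maximum) is shown; erosion would take
the minimum instead.

```python
def pack(pixels):
    """Pack up to four 8-bit pixels into a 32-bit word, first pixel in the low byte."""
    word = 0
    for i, p in enumerate(pixels):
        word |= (p & 0xFF) << (8 * i)
    return word


def unpack(word, count):
    """Extract the count lowest bytes of a 32-bit word."""
    return [(word >> (8 * i)) & 0xFF for i in range(count)]


def max6(op_a, op_b):
    """First level: six pixels, three per 32-bit operand (assumed layout)."""
    return max(unpack(op_a, 3) + unpack(op_b, 3))


def max2(op_a, op_b):
    """Second level: only the low byte of each operand carries data."""
    return max(op_a & 0xFF, op_b & 0xFF)


def column(window, j):
    return [window[i][j] for i in range(3)]


def dilate_pixel_six(window):
    """window is a 3x3 list of rows; returns the maximum of its nine pixels."""
    left6 = max6(pack(column(window, 0)), pack(column(window, 1)))   # L-6
    right6 = max6(pack(column(window, 1)), pack(column(window, 2)))  # R-6
    return max2(left6, right6)
```

When the SE moves one pixel to the right, the R-6 result of one
step can serve as the L-6 result of the next, but nothing
carries over to the next row, and max2 uses only 16 of the 64
operand bits.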
If the 32-bit custom comparison instruction uses only one
input operand, which
embeds four 8-bit pixels, two levels with five 4-pixel
comparisons are enough for each nine-pixel comparison. We
need to create two temporary buffers. For each
row, every time a new pixel value is input, we first perform
the first-level comparison: the new pixel is combined with
the pixel to its left in the same row and two more pixels
into the input operand of the custom instruction, and the
comparison result is saved in the corresponding buffer.
With this new comparison result, we can then
perform the second-level comparison by combining 4 values
from the two temporary buffers. The
result is actually the morphological operation result for a
pixel in the previous row. So, as we can see, two
comparisons generate one result for the previous row.
All 32 bits of the input operand are fully used, and the
same holds for all the intermediate results. This is much more
efficient than the six-pixel comparison custom instruction.
We call it the One Row Delay 4-Pixel Comparison
algorithm (ORD-4C).
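The two-level scheme can be modeled in software as below. This
is a sketch of one reading of ORD-4C, not the authors' hardware
or macro code. It assumes that the two extra first-level pixels
are the ones directly above the new pixel and above its left
neighbor, so that each first-level result is the maximum of a
2x2 block. The names max4, pack4, and ord4c_dilate are
hypothetical, border pixels are left undefined, and erosion
would use the minimum instead of the maximum.

```python
def max4(word):
    """Model of the custom instruction: maximum of the four bytes of one operand."""
    return max((word >> shift) & 0xFF for shift in (0, 8, 16, 24))


def pack4(a, b, c, d):
    """Embed four 8-bit values in one 32-bit operand."""
    return a | (b << 8) | (c << 16) | (d << 24)


def ord4c_dilate(image):
    """Return (out, calls): 3x3 dilation of interior pixels and instruction count."""
    rows, cols = len(image), len(image[0])
    out = [[None] * cols for _ in range(rows)]  # border stays None
    prev_buf = [None] * cols                    # first-level results, previous row
    calls = 0
    for r in range(1, rows):
        cur_buf = [None] * cols                 # first-level results, current row
        for c in range(1, cols):
            # First level: 2x2 block whose bottom-right corner is the new pixel.
            cur_buf[c] = max4(pack4(image[r - 1][c - 1], image[r - 1][c],
                                    image[r][c - 1], image[r][c]))
            calls += 1
            if r >= 2 and c >= 2:
                # Second level: four 2x2 maxima cover the 3x3 window
                # centered on pixel (r - 1, c - 1) in the previous row.
                out[r - 1][c - 1] = max4(pack4(prev_buf[c - 1], prev_buf[c],
                                               cur_buf[c - 1], cur_buf[c]))
                calls += 1
        prev_buf = cur_buf
    return out, calls
```

For a 6x7 image this model issues 30 first-level and 20
second-level calls for 20 interior results, which approaches
two instruction calls per output pixel as the image grows.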
Figure 4. Related pixels in each step of morphological operations:
(a) pixels for step (1, 1); (b) pixels for step (1, 2);
(c) pixels for step (2, 1); (d) pixels for step (2, 2);
(e) steps (1, 1), (1, 2), (2, 1), and (2, 2).
As mentioned above, 4 pixels always overlap among
the four adjacent steps in the same row
and in adjacent rows, and a single 32-bit input operand of
the custom comparison instruction can embed these four
8-bit pixels.
4. HARDWARE IMPLEMENTATION
In most image processing applications,
significant computation is required, even for the simple
morphological operations previously mentioned. For
example, consider the overhead of a dilation
on a 256x256 8-bit grey-scale image with a
3x3 SE. Minimally, 256x256x9 = 589,824 8-bit comparisons
must be performed.
Simplifying this intensive computation is the first
concern for improving system performance.
The morphological operations presented in
Section 3 are local operations on the image: each step
operates only on the
image pixels that are covered by the SE. This implies that
a parallel algorithm can be used to further improve
system performance.
4.1 Architecture of distributed system
To determine a reasonable and efficient parallel
distributed architecture for a problem, many cost
metrics can be used to investigate cost-performance
tradeoffs in the network [6]. Here we use a message-passing
parallel architecture with four NIOS processors
connected via 10M Ethernet network cards [7],
as shown in Figure 5.
4.2 Parallel morphological operation
Assume we need to perform a morphological
operation on a PxQ matrix. In a non-parallel architecture
with one processor used for computation, we have a host
that holds the data matrix and a processor that
performs the morphological operation. The time for the
operation consists of the communication time for the data
transmission between host and processor, and the
processing time of the processor on the data. In a parallel
architecture, we can split the data matrix into N (here N =
4) independent parts. Compared with the non-parallel
operation, the theoretical speedup of the parallel
architecture can be up to N times.
To avoid the inner boundary problem that arises when the
image is split symmetrically, we need to apply asymmetrical
splitting to the image, as shown in Figure 6 (b).
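One way to picture a split that avoids the inner boundary
problem is sketched below. The exact geometry of Figure 6 (b)
is not recoverable from the text, so this model assumes that
each of the four parts is sent with one extra row and column of
neighboring pixels (a parameter called halo here). All function
names are hypothetical, the four processors are simulated by a
loop, and the SE is clipped at the outer image border.

```python
def dilate(image):
    """3x3 all-ones dilation with the SE clipped at the border (assumption)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = max(image[rr][cc]
                            for rr in range(max(r - 1, 0), min(r + 2, rows))
                            for cc in range(max(c - 1, 0), min(c + 2, cols)))
    return out


def split_with_overlap(rows, cols, halo=1):
    """Return four (core, tile) pairs of (r0, r1, c0, c1) with exclusive ends."""
    mid_r, mid_c = rows // 2, cols // 2
    parts = []
    for r0, r1 in ((0, mid_r), (mid_r, rows)):
        for c0, c1 in ((0, mid_c), (mid_c, cols)):
            # The tile is the core region grown by halo pixels, kept inside the image.
            tile = (max(r0 - halo, 0), min(r1 + halo, rows),
                    max(c0 - halo, 0), min(c1 + halo, cols))
            parts.append(((r0, r1, c0, c1), tile))
    return parts


def distributed_dilate(image, halo=1):
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for (r0, r1, c0, c1), (t0, t1, u0, u1) in split_with_overlap(rows, cols, halo):
        tile = [row[u0:u1] for row in image[t0:t1]]  # data sent to one processor
        result = dilate(tile)                        # work done by that processor
        for r in range(r0, r1):                      # host keeps only the core region
            for c in range(c0, c1):
                out[r][c] = result[r - t0][c - u0]
    return out
```

With halo=1 the stitched result matches a dilation of the whole
image. With halo=0, which corresponds to the symmetrical split,
pixels next to the inner boundary can come out wrong.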
5.
System Architecture of NIOS Implementation
Based upon NIOS development boards with the
APEX EP20K200EFC484-2x programmable logic device,
we reconfigure and compile the built-in customized
32-bit soft core for the NIOS processor with custom
instructions implementing the ORD-4C algorithm for the
comparison.
We compared the dilation, erosion, opening, closing, and
edge-detection morphological operations on the 256x256 8-bit
grey-scale image “lena.png” on different systems. Figure 7
shows the morphological image operations on the
256x256 8-bit lena image, and Figure 8 shows the system
performance. According to our results, compared with the
C/C++ approach on a non-distributed NIOS soft
core processor system, the custom instruction approach
on the non-distributed system shows a speedup of
approximately 2.99 on the basic dilation and erosion
operations, and a speedup of 1.7 on the edge detection
operation. The same results are obtained on the
distributed NIOS soft core processor system. Compared
with the non-distributed C/C++ approach, the distributed
system with custom instructions gains a speedup of around 11.8.
Figure 5. Architecture of the distributed system.
Figure 6. Inner boundary of the splitting of the image:
(a) symmetrical splitting; (b) asymmetrical splitting.
(Labels: Parts 1 to 4, pixels P1 and P2, and the rows and
columns of each part.)
Figure 7. Morphological operations on lena: (a) lena.png;
(b) dilation; (c) erosion; (d) opening; (e) closing;
(f) edge detection.
For the morphological operation implementation
on the distributed NIOS processor system, if we split the
image symmetrically into four parts, we suffer from the inner
boundary problem. As shown in Figure 6 (a), when we
perform an operation on pixel P1 in partial image part 1,
three of its original 8-neighbor pixels, those at the bottom,
are cut off from the range of computation. In this case, we
get an incorrect operation result for pixel P1. For
pixel P2 in partial image part 1, the right-most three pixels
and bottom-most three pixels of its original neighbor
pixels are out of the range of computation.
Figure 8. System performance of different implementations on lena.
(Bar chart; series: Dilation, Edge Detection; categories:
C/C++ approach (non-distributed), Custom Instruction
(non-distributed), C/C++ approach (distributed), Custom
Instruction (distributed).)
6. REFERENCES
[1] White Paper of Excalibur Backgrounder, version
1, http://www.altera.com, June 2000
[2] NIOS Embedded Processor Development Board,
version 2.1, http://www.altera.com, April 2002
[3] Custom Instructions for the Nios Embedded
Processor, version 1.1, http://www.altera.com,
April 2002
[4] C. Gonzalez and Richard E. Woods, Digital
Image Processing, third edition, Addison-Wesley, 1993
[5] Harley R. Myler and Arthur R. Weeks,
Computer Imaging Recipes in C, Prentice-Hall,
1993
[6] Vipin Kumar, Ananth Grama, Anshul Gupta and
George Karypis, Introduction to Parallel
Computing: Design and Analysis of Algorithms,
Benjamin/Cummings, 1994
[7] Nios Ethernet Development Kit User Guide,
version 2.1, http://www.altera.com, April 2002