ABSTRACT OF THESIS
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR
DISCRETE CONVOLUTION
Two-dimensional discrete convolution is an essential operation in digital image
processing. A targeted performance goal is the ability to simultaneously convolve an
(i×j)-pixel input image plane with more than one Filter Coefficient Plane (FC) in a
scalable manner. Assuming k FCs, each of size (n×n), an additional goal is that the
system be able to output k convolved pixels each clock cycle. To achieve these
performance goals, an architecture that utilizes a new systolic array arrangement is
developed, and the final architecture design is captured in the VHDL hardware
description language. The architecture is shown to be scalable when convolving multiple
FCs with the same input image plane. The architecture design is functionally and
performance validated through VHDL post-synthesis and post-implementation (functional
and performance) simulation testing. In addition, the design was implemented on a Field
Programmable Gate Array (FPGA) experimental hardware prototype for further
functional and performance testing and evaluation.
KEYWORDS: Systolic Array Processor, Discrete Convolution, Hardware Prototyping,
Scalable Architecture, Parallel Architecture.
________________________
________________________
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR
DISCRETE CONVOLUTION
By
Albert Tung-Hoe Wong
____________________________
Director of Thesis
____________________________
Director of Graduate Studies
____________________________
RULES FOR THE USE OF THESIS
Unpublished theses submitted for the Master’s degree and deposited in the University of
Kentucky Library are as a rule open for inspection, but are to be used only with due
regard to the rights of the authors. Bibliographical references may be noted, but
quotations or summaries of parts may be published only with permission of the author,
and with the usual scholarly acknowledgements.
Extensive copying or publication of the thesis in whole or in part also requires the
consent of the Dean of the Graduate School of the University of Kentucky.
A library that borrows this thesis for use by its patrons is expected to secure the signature
of each user.
Name
Date
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
THESIS
Albert Tung-Hoe Wong
The Graduate School
University of Kentucky
2003
A NEW SCALABLE SYSTOLIC ARRAY PROCESSOR ARCHITECTURE FOR
DISCRETE CONVOLUTION
THESIS
A thesis submitted in partial fulfillment of the
requirements for the degree of Master of Science in Electrical
Engineering in the College of Engineering
at the University of Kentucky
By
Albert Tung-Hoe Wong
Lexington, Kentucky
Director: Dr. J. Robert Heath, Associate Professor of Electrical and Computer
Engineering
Lexington, Kentucky
2003
MASTER’S THESIS RELEASE
I authorize the University of Kentucky
Libraries to reproduce this thesis in
whole or in part for purposes of research.
Signed: _____________________________________
Date: _______________________________________
ACKNOWLEDGEMENTS
The following thesis, while an individual work, benefited from the insights and
direction of several people. First, my Thesis Chair, Dr. J. Robert Heath, exemplifies the
high-quality scholarship to which I aspire. He also provided timely and instructive
comments and evaluation at every stage of the thesis process, allowing me to complete
this project. Next, I wish to thank the complete Thesis Committee: Dr. J. Robert Heath,
Dr. Hank Dietz, and Dr. William R. Dieter. Each individual provided insights that guided
and challenged my thinking, substantially improving the finished product. I would also
like to thank Dr. Michael Lhamon of Lexmark Inc. for his technical insights, guidance,
and comments.
In addition to the technical and instrumental assistance above, I received equally
important assistance from family and friends. My wife, Sze Ying Ng, provided ongoing
support throughout the thesis process, enabling me to finish this work.
CONTENTS
Acknowledgements............................................................................................................ iii
List of Tables ..................................................................................................................... vi
List of Figures ................................................................................................................... vii
Chapter 1 Introduction .........................................................................................................1
Chapter 2 Background and Convolution Architecture Requirements .................................2
Chapter 3 Version 1 Convolution Architecture ...................................................................7
3.1. Arithmetic Unit (AU) .......................................................................................... 7
3.2. Coefficient Shifters (CSs) ................................................................................. 10
3.3. Input Data Shifters (IDSs)................................................................................. 11
3.3.1. Register Bank (RB) ............................................................................... 12
3.3.2. Pattern Generator Pointers (PGPs) ....................................................... 12
3.3.3. Delay Units (DU).................................................................................. 15
3.4. Systolic Flow of Version 1 Convolution Architecture. .................................... 15
3.5. Data Memory Interface (DM I/F) ..................................................................... 17
3.6. Output Correction Unit ..................................................................................... 19
3.7. Controller .......................................................................................................... 19
Chapter 4 Revised Architectural Requirements and Resulting Version 2 Convolution
Architecture...............................................................................................................21
4.1. Version 2 Convolution Architecture for (k = 1) ............................................... 21
4.2. Arithmetic Unit (AU) ........................................................................................ 22
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU) ...... 26
4.2.2. Delay Units (DU).................................................................................. 31
4.3. Input Data Shifters (IDS) .................................................................................. 31
4.4. Data Memory Interface (DM I/F) ..................................................................... 32
4.5. Memory Pointers Unit (MPU) .......................................................................... 32
4.6. Systolic Flow of Version 2 Convolution Architecture ..................................... 34
4.7. Controller .......................................................................................................... 35
4.8. Multiple Filter Coefficient Sets when (k > 1) ................................................... 43
Chapter 5 VHDL Description of Version 2 Convolution Architecture..............................45
Chapter 6 Version 2 Convolution Architecture Validation via Virtual Prototyping
(Post-Synthesis and Post-Implementation Simulation Experimentation).................47
6.1. Post-Synthesis Simulation ................................................................................ 48
6.1.1. Adders ................................................................................................... 48
6.1.2. Multiplication Unit................................................................................ 51
6.1.3. Version 2 Convolution Architecture (with k = 1) ................................. 52
6.2. Post-Implementation Simulation ...................................................................... 61
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture
(with k = 1)...................................................................................................... 61
6.2.2. Version 2 Convolution Architecture (with k = 1) ................................. 62
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k
= 3) .................................................................................................................. 65
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)........... 66
Chapter 7 Hardware Prototype Development and Testing ................................................72
7.1. Board Utilization Modules and Prototype Setup .............................................. 73
7.2. Hardware Prototyping Flow.............................................................................. 76
7.3. Test Cases ......................................................................................................... 80
Chapter 8 Conclusion.........................................................................................................84
Appendix A VHDL Code for Version 2 Discrete Convolution Architecture ....................86
Appendix B VHDL codes, C++ source codes and Script file for Post-Synthesis
Simulation ...............................................................................................................133
Appendix C C++ Source Codes for Programs Used During Post-Implementation
Simulation ...............................................................................................................140
Appendix D C++ Source Codes for Programs Used During Hardware Prototype
Implementation .......................................................................................................143
Appendix E VHDL Files for Modules External to the Convolution Architecture...........149
References........................................................................................................................157
Vita .................................................................................................................................159
LIST OF TABLES
Table 3.1. Filter coefficient array. .....................................................................................11
Table 3.2. 5×5 Filter size (with one output pointer). .........................................................13
Table 3.3. 5×5 Filter size (Convolution with two output pointers). ..................................14
Table 4.1. Gate count comparison between CSA and CLA................................................25
Table 4.2. A summary of the multiplication. .....................................................................26
Table 4.3. Partial Product Selection Table.........................................................................28
Table 4.4. Comparison between method I and method II..................................................30
Table 6.1. Details of FPGA on the XESS protoboard. .......................................................62
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1)........62
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3)........66
LIST OF FIGURES
Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC), and
Output Image Plane (OI).............................................................................................3
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC. .......................................................................4
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to
IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are
needed for a 3×3 filter size..........................................................................................4
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed
to be 8 in this example)...............................................................................................8
Figure 3.2. A MAU and included functional units. ..............................................................8
Figure 3.3. Systolic array structures of MAUs, where IDSs are outputs from Input Data
Shifters and CSs are the outputs from Coefficient Shifters. .......................................9
Figure 3.4. Functional units within CSs.............................................................................10
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters............11
Figure 3.6. Functional units within IDSs. ..........................................................................11
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the input
pixels)........................................................................................................................12
Figure 3.8. Additional hardware and modification for convolution of x output pixels in
parallel for (x ≤ n) (Functional units shaded in gray are additional hardware
required for processing two convolutions in parallel). .............................................14
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the figure
denotes one flip-flop. ................................................................................................15
Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel..........16
Figure 3.11. Basic functional units within Data Memory I/F............................................17
Figure 3.12. A more detailed look at DM I/F. ...................................................................18
Figure 3.13. Time line for activities, where W denotes Write or R denotes Read (from
external memory device) registers indicated in boxes directly below......................19
Figure 3.14. Output pattern for two convolutions in parallel. ...........................................19
Figure 4.1. A top level view of Version 2 of the convolution architecture for one
distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication
and Add Array and AT denotes Adder Tree). ...........................................................22
Figure 4.2. Functional units within the Multiplication and Add Array (MAA). ................23
Figure 4.3. A MAU and its functional units. ......................................................................23
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline
stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC. ..........24
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the
MAUs. .......................................................................................................................25
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each row
of the partial products denotes sign extension of that particular row of partial
product). ....................................................................................................................27
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial products....28
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree
Structure....................................................................................................................29
Figure 4.9. Illustration of multiplication technique based on Modified Booth’s
Algorithm..................................................................................................................29
Figure 4.10. Partial Product’s sign extension reduced for hardware saving......................30
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline
stages within each MAU (PL denotes a pipeline stage composed of flip-flop
registers)....................................................................................................................31
Figure 4.12. Structural view of the IDS with n = 5 and d = 8............................................32
Figure 4.13. External memory devices organization for n = 5 and d = 8. .........................33
Figure 4.14. Functional units within the Memory Pointers Unit (MPU)...........................34
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td
denotes the time delay between each MAU). ............................................................35
Figure 4.16. Top level view of the Controller Unit (CU). .................................................36
Figure 4.17. Functional Units that receive control signals from the CU. ..........................37
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller
Unit (CU). .................................................................................................................39
Figure 4.19. Modified Version 2 system flow chart. .........................................................41
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be
any number). .............................................................................................................43
Figure 5.1. Version 2 Convolution Architecture organization. .........................................46
Figure 6.1. Testing model for lower level functional components. ...................................49
Figure 6.2. Post-Synthesis simulation for 14-bit CLA. ......................................................49
Figure 6.3. A close up view of one segment of Figure 6.2. ...............................................50
Figure 6.4. Post-Synthesis simulation for 15-bit CLA. ......................................................50
Figure 6.5. Post-Synthesis simulation for 16-bit CLA. ......................................................50
Figure 6.6. Post-Synthesis simulation for 17-bit CLA. ......................................................50
Figure 6.7. Post-Synthesis simulation for 19-bit CLA. ......................................................51
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication Unit
(MU)..........................................................................................................................51
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above..........52
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven
columns of both IP and OI are shown due to report page width limit). ...................53
Figure 6.11. The source code for C++ program that generates test vectors to program
the filter coefficients into MAUs...............................................................................54
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit. .............55
Figure 6.13. First phase of operation; programming of FCs into MAUs...........................56
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown in
figure above is the beginning of the second row of the input pixels). ......................56
Figure 6.15. Second phase of operation; output pixels generated. ....................................57
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed)..........................................................................................................58
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed)..........................................................................................................58
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns)...................59
Figure 6.19. First phase of operation for test case 2. .........................................................60
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).................................................................60
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the first
six of the last row for OI (superimposed). ................................................................61
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed). .......................63
Figure 6.23. Third phase of operation for test case 1 (post-implementation simulation);
output pixels of the last row of OI (superimposed). .................................................64
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI
(superimposed)..........................................................................................................64
Figure 6.25. Third phase of operation for test case 2 (post-implementation simulation);
output pixels shown are the first six of the last row for OI (superimposed).............65
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes......................67
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first row
of the OIs for test case 1. ..........................................................................................68
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the second
row of the OIs for test case 1. ...................................................................................68
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes. .....................69
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third row
of the OIs for test case 2. ..........................................................................................70
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2. ...................................................................................70
Figure 6.32. A plot of equivalent system gates versus number of FC planes....................71
Figure 7.1. Convolution Architecture hardware implementation. .....................................72
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture
obtained from XESS Co. website, http://www.xess.com).........................................73
Figure 7.3. Top level view of the prototyping hardware. ..................................................74
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing
input image pixels for the convolution system (seed number of 1 is provided to
the program)..............................................................................................................75
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co. ...................................................................................................................77
Figure 7.6. Execution of the FCs configuration program..................................................78
Figure 7.7. Upload SRAM content using gxsload utility, the high address indicates the
upper bound of the SRAM address space whereas the low address indicates the
lower bound of the SRAM address space. .................................................................79
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There are
two segments due to the fact that the program wrote the right bank of the SRAM
(16-bit) first and the left bank of the SRAM next (16 MSB bits)..............................79
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1. .............................81
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1........................81
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1. ..........................81
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2. ...........................82
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2........................82
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2. ..........................83
Chapter 1
Introduction
Performance and cost are both important criteria for today's computing system
components, whether the component is an entire computer or a computer accessory or
peripheral such as a printer. The ever-increasing consumer demand for higher
performance has driven printer manufacturers to develop and incorporate performance
enhancements into their products at the lowest possible price. Cost is usually closely
tied to performance, yet manufacturers constantly pursue higher performance at lower
cost.
The ability to scan and print exceedingly clear images at a maximum page-per-minute
rate and at the lowest cost is a performance target printer manufacturers aim for. In
order to produce highly enhanced, clear images, the “discrete convolutional-filtering
algorithm” must be implemented within the scanner or printer. General-purpose signal
processors from various vendors are widely used to implement the convolutional-filtering
algorithm. Often, much of the functionality offered by general-purpose signal processors
is not needed by the manufacturers, so the unused functionality becomes a cost overhead.
Also, commercially available general-purpose processors frequently cannot meet the
desired performance/cost requirement of the highest performance at the lowest cost.
Thus, a special-purpose signal/image processor architecture is desired to implement the
discrete convolutional-filtering algorithm. The subject of this thesis is the development
of an efficient, high-performance, special-purpose signal/image processor architecture
which may be used to implement the discrete convolutional-filtering algorithm at the
lowest cost.
Chapter 2
Background and Convolution Architecture Requirements
Convolution is one of the essential operations in digital image processing required
for image enhancements [15,16]. It is used in linear filtering operations such as
smoothing, denoising, edge detection and so on [15,16]. In general, image processing is
carried out in a two dimensional space/array [16]. A digital image can be represented
with an array of numbers in a two dimensional space. Each number (or pixel) has an
associated row and column indicating its coordinates (position) in the two dimensional
space, and the number’s value represents the gray level at that coordinate [15]. The gray
levels are usually represented with a byte or 8-bit unsigned binary number, ranging from
0 to 255 in decimal. Equation 1 shows the two dimensional discrete convolution
algorithm, where IP is the Input Image Plane, FC is the Filter Coefficient Plane, and OI is
the Output Image Plane [16].
OI[x, y] = FC[x, y] * IP[x, y] = \sum_{I=0}^{n-1} \sum_{J=0}^{n-1} FC[I, J] · IP[x − I + (n−1)/2, y − J + (n−1)/2]        (1)
Figure 2.1 below shows the basic definitions for the Input Image Plane (IP), Filter
Coefficient Plane (FC), and Output Image Plane (OI). Assuming that the IP has a size of
i×j pixels and FC has a size of n×n pixels, then, OI would have a size of i×j pixels. In
most cases, n<i and n<j.
Figure 2.1. Pictorial view of Input Image Plane (IP), Filter Coefficient Plane (FC),
and Output Image Plane (OI).
Digital convolution can be thought of as a moving window of operations [16]. As
shown in Equation 1, one output pixel OI[x,y] can be obtained by rotating the FC
180 degrees around its center point (denoted c within FC in Figure 2.1) and placing it
over the IP with the center point on top of IP[x,y]. All the overlapping IP pixels are
multiplied by the corresponding filter coefficients of FC, and all the products are then
summed to generate the one pixel OI[x,y]. The next output pixel is obtained by sliding
the FC plane one pixel to the right and repeating the process. Figure 2.2 illustrates the
idea of the moving window of operations: FC is first centered at IP[3,4] to compute
OI[3,4] and then moves to IP[3,5] for OI[3,5].
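For reference, the moving-window operation of Equation 1 can be modeled directly in
software. The C++ fragment below is a minimal, unoptimized sketch (the function name,
the data types, and the treatment of out-of-range references as zero are illustrative
assumptions, consistent with the zero padding discussed in Section 3.5); it only fixes the
indexing convention and is not a model of the hardware.

#include <vector>
#include <cstdint>

// Minimal software model of Equation 1: OI = FC * IP with an n x n filter.
// Pixels referenced outside the IP are treated as zero (zero padding).
std::vector<std::vector<int32_t>>
convolve2d(const std::vector<std::vector<uint8_t>>& IP,   // i x j input image
           const std::vector<std::vector<int8_t>>&  FC)   // n x n filter, n odd
{
    const int i = int(IP.size()), j = int(IP[0].size());
    const int n = int(FC.size()), h = (n - 1) / 2;
    std::vector<std::vector<int32_t>> OI(i, std::vector<int32_t>(j, 0));
    for (int x = 0; x < i; ++x)
        for (int y = 0; y < j; ++y)
            for (int I = 0; I < n; ++I)
                for (int J = 0; J < n; ++J) {
                    int r = x - I + h;            // row index into IP
                    int c = y - J + h;            // column index into IP
                    if (r >= 0 && r < i && c >= 0 && c < j)
                        OI[x][y] += FC[I][J] * IP[r][c];
                }
    return OI;
}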
From Figure 2.2 below, one can deduce that when an output pixel is computed,
access to entire previous rows, or portions of previously input rows, of input pixels is
needed. Hence, previous input image pixels must be stored for this purpose. However,
not all of the previous rows of input image pixels are needed; only (n-1) rows plus n
input image pixels are required, as shown by the example in Figure 2.3 below. Another
important observation from Figure 2.2 is that between consecutive convolutions only n
input image pixels become obsolete and require updating. This observation strongly
influences the design of the convolution architecture. Figure 2.3 shows an example for a
3×3 filter size where the shaded areas of
the IP and the area under the FC plane are the input image pixels that need to be stored.
Hence, these pixels can be stored in a memory device whether it is on chip or off chip.
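The storage requirement implied by this observation can be made concrete: for an n×n
filter over an image whose rows are j pixels wide, only (n−1)·j + n pixels ever need to be
buffered at one time. For the 3×3 example of Figure 2.3 this is 2j + 3 pixels, and, as a
purely illustrative figure assuming a 5100-pixel scan line (8.5 in at 600 dpi), a 5×5 filter
would need 4·5100 + 5 = 20,405 pixels, roughly 20 KB of buffering at 8 bits per pixel.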
[Figure 2.2 consists of two annotated views of the IP, showing the 3×3 FC window
centered first at IP[3,4] and then at IP[3,5]; only the two resulting output expressions
from the figure are reproduced here.]

Current output: OI[3,4] = FC00·IP45 + FC01·IP44 + FC02·IP43
                        + FC10·IP35 + FC11·IP34 + FC12·IP33
                        + FC20·IP25 + FC21·IP24 + FC22·IP23

Next output:    OI[3,5] = FC00·IP46 + FC01·IP45 + FC02·IP44
                        + FC10·IP36 + FC11·IP35 + FC12·IP34
                        + FC20·IP26 + FC21·IP25 + FC22·IP24
Figure 2.2. Example showing how two consecutive output pixels are generated. This
example is shown with a 3×3 size FC.
Figure 2.3. Example showing that only (n-1) previous rows plus n input image pixels
need to be stored. In this example, 2 previous rows (shaded rows in addition to
IP23,..,IP25, IP33,..,IP35) plus 3 additional input image pixels (IP43,..,IP45) are needed
for a 3×3 filter size.
Convolution is a vital part of image processing and it can be done through both
software and hardware [1]. Much effort has been directed towards speeding up the
convolution process through hardware implementation [1,2,3,9,14]. This is because
convolution is a computation-intensive algorithm, as shown in Equation 1. For example,
with a 5×5 filter size, each output pixel requires 25 Multiplication and Addition
Operations (MAOPS). Thus, as the total number of pixels in an image increases, the
required MAOPS increase substantially. Bosi and Bois in [1] propose the use of FPGAs
programmed with a 2D convolver as a coprocessor to an existing digital signal processor
(DSP) to speed up the convolution process. In [2,3,9,14], special-purpose convolution
architectures are designed to meet real-time image processing requirements. Hsieh and
Kim in [2] proposed a highly pipelined VLSI convolution architecture; parallel one-
dimensional convolutions and a circular processing module were used for performance
gain, and the architecture required n×n processing elements, each being a multiplier and
adder. Both [3] and [14] propose convolution architectures based on systolic arrays
which operate on real-time images with a size of 512×512 pixels. Both architectures
perform bit-serial arithmetic. The architecture of [14] requires on-chip memory to store
the necessary input pixels.
The focus of this thesis is the development of a high-performance, real-time, special-
purpose convolution architecture intended for scanning and printing applications. The
requirement for the final “production” version of this architecture (implemented with
ASIC technology) is the capability to perform convolution with a 5×5 FC size on input
images of size 8½”×11” at a rate of 60 Pages Per Minute (PPM) at 600 dpi (dots per
inch). A total of 33.66M pixels is generated when a standard 8½”×11” page is scanned at
a resolution of 600 dpi. Adding the requirement to process 60 scanned PPM results in
1.69G MAOPS per second. The multiplication operands are each 8 bits wide, generating
16-bit products which must be summed. The method proposed in [1] is not feasible from
a cost standpoint, and some of the functionality of the DSP may not be required. The
architectures presented in [2] and [9] are special-purpose architectures for convolution,
and both require n×n processing elements, which could potentially occupy a large chip
area. The architectures of [3] and [14] are both systolic array architectures employing
bit-serial arithmetic operations and hence may not be able to meet the performance
requirements mentioned above. In [7] the authors point out the well-known fact that for
most applications bit-parallel arithmetic has a performance edge over bit-serial
arithmetic; however, the processing architecture of [7] is based on bit-serial arithmetic
since it is sufficient for their requirements and has a lower gate count.
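One way to reproduce the throughput figure quoted above: 8.5 in × 600 dpi = 5100
columns and 11 in × 600 dpi = 6600 rows, so one page contains 5100 × 6600 ≈ 33.66M
pixels. At 60 PPM this is 33.66M output pixels per second, and at 25 multiply–add pairs
per output pixel the architecture must sustain about 841.5M multiply–add pairs per
second, or roughly 1.68 × 10^9 individual multiplications and additions per second when
the two operations are counted separately, which is consistent with the 1.69G MAOPS
figure stated above.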
Two hardware architectures, version 1 and version 2, are proposed for the
implementation of the two dimensional discrete convolution algorithm shown as
Equation 1. Version 1 of the architecture, proposed in the next chapter, is based on a
linear systolic array structure. Version 2 of the architecture, an extension of Version 1,
will be presented in a later chapter. Version 2 was developed to meet functional and
performance requirements different from those of Version 1.
Version 1 is a special-purpose convolution architecture. Unlike [2] and [9], the
architecture will not use n×n processing elements, and it will be scalable in order to meet
variable performance requirements. A scalable architecture, when implemented in
programmable Field Programmable Gate Array (FPGA) technology, allows users to
implement the architecture to meet their specific performance needs. Parallel arithmetic
operations will be utilized in the Version 1 architecture for performance gain.
Chapter 3
Version 1 Convolution Architecture
Specially designed hardware to implement convolution can offer a performance
gain over a general-purpose Digital Signal Processor (DSP). Since the convolution
algorithm requires a large number of multiplications and additions, a multiprocessor
architecture is desired. Multiprocessor architectures offer the benefit of processing
multiple operations at a given instant.
A specific type of multiprocessor architecture, referred to as a systolic array
structure [6], will be utilized to implement convolution in hardware. The advantages of
this type of structure include its modularity, its regularity, and the ease of pipelining; the
systolic structure can fully and simultaneously utilize all computational units within the
architecture. A major challenge in developing systolic array architectures is providing
data simultaneously to the multiple computational units in the correct order. Figure 3.1
shows the top-level view of
version 1 of the developed hardware convolution architecture with the basic functional
units indicated. An external memory device (external to an FPGA or ASIC chip) will be
utilized to hold a portion of the scanned input image plane during the convolution
process. A more detailed description of the interface between the main system and the
external memory will be presented later. The functionality of all basic functional units of
the architecture of Figure 3.1 will now be described.
3.1. Arithmetic Unit (AU)
The Arithmetic Unit (AU) is the core of version 1 of the convolution architecture.
As shown in Figure 3.1 above, the AU consists of an Accumulator plus Multiplication
and Add Units (MAUs). As the name implies, the basic building blocks within MAUs are
multiplication units and adders. Each MAU consists of one multiplication unit and one
adder as shown in Figure 3.2 below.
Figure 3.1. Top-level view of Version 1 of the convolution architecture (d is assumed
to be 8 in this example).
Figure 3.2. A MAU and included functional units.
As depicted in Figure 3.2, the multiplication unit multiplies two 8-bit binary
numbers, and the adder then adds the product to the input from the previous MAU. The
output from the adder is used as the input to the next MAU. In order to achieve high
performance it is important to utilize high-speed adders and multiplication units within
the architecture. It is also of interest to adopt a multiplication technique that is suitable
for pipelining for further performance enhancement; for example, the Wallace Tree and
array multiplication techniques can easily be pipelined into multiple stages. The
implementation platform influences performance as well: for instance, it is now common
to find high-performance built-in core adders and multiplication units within FPGA
technology chips.
Figure 3.3. Systolic array structures of MAUs, where IDSs are outputs from Input
Data Shifters and CSs are the outputs from Coefficient Shifters.
The MAUs are arranged in such a way that they create the systolic array structure.
Figure 3.3 shows the systolic array structure of the MAUs. The total number of MAUs
used is determined by the size of the coefficient filter. For an n×n filter size, n MAUs
would be utilized.
Use of the systolic array structure will require the outputs from the Input Data
Shifter (IDS) and Coefficient Shifter (CS) functional units to be skewed. This ensures
that the sequence of products is summed correctly. An accumulator is needed
at the end of the structure to add the necessary partial results generated by the MAUs to
form an output pixel. For example, for an n×n filter size, the accumulator must
accumulate n partial results generated by the MAUs in the generation of one output pixel.
This requires n clock cycles if each MAU takes one clock cycle to complete its operation.
The registers to the left of each MAU serve as pipeline stage registers. If necessary,
additional pipeline stages can be implemented within each MAU to increase performance
[8].
3.2. Coefficient Shifters (CSs)
Coefficient Shifters are a group of parallel register shifters that can be
programmed to retain the values of the filter coefficients. The CSs are responsible for
generating a skewed stream of filter coefficients for the MAU inputs. Figure 3.4 shows a
structural view of the CSs. The number of shifters within the CSs is also dependent on
the coefficient filter size: for an n×n filter size, there are n coefficient shifters, and each
CS stores n coefficients, as seen in Figure 3.4. Once programmed, the CSs retain the
filter coefficients throughout the convolution process. To provide the MAUs with
skewed input from the CSs, CS0 begins shifting after the first clock cycle of the
convolution process, CS1 and CS0 shift on the next clock cycle, CS2, CS1, and CS0 shift
on the following clock cycle, and so on until all the CSs are shifting every clock cycle.
This ensures that the MAUs receive the required skewed input. Figure 3.5 shows the
arrangement of the filter coefficients within the CSs corresponding to the filter
coefficient array shown in Table 3.1.
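The staggered start of the CS shifting described above can be summarized by a one-line
condition; the C++ fragment below is only an illustrative sketch of that enable pattern
(the function and its arguments are assumptions, not part of the design):

// Illustrative sketch of the staggered shift-enable pattern: with clock cycles
// numbered from 1, CS0 shifts from cycle 1 onward and each higher-numbered
// CSm begins shifting one clock cycle later (from cycle m+1 onward).
bool cs_shift_enable(int m, int cycles_since_start)   // m = index of CSm
{
    return cycles_since_start > m;
}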
Figure 3.4. Functional units within CSs.
Table 3.1. Filter coefficient array.
FC(0, 0)     FC(0, 1)     ...   FC(0, n-2)     FC(0, n-1)
FC(1, 0)       ...                             FC(1, n-1)
  ...
FC(n-1, 0)   FC(n-1, 1)   ...   FC(n-1, n-2)   FC(n-1, n-1)
Figure 3.5. Arrangement of the filter coefficients within the Coefficient Shifters.
3.3. Input Data Shifters (IDSs)
The main function of the Input Data Shifters (IDSs) is to generate a proper
sequence of input image pixels for the MAUs. Figure 3.6 shows the basic functional units
within the IDSs of Figure 3.1.
Figure 3.6. Functional units within IDSs.
3.3.1. Register Bank (RB)
Due to the structure of the convolution algorithm, each successive output pixel
requires access to the previous ((n×(n-1))+n-1) input image pixels. Hence, the RB of
Figure 3.6 is used to provide the correct input image pixels for successive convolutions.
Figure 3.7 shows the detail of the RB. The RB consists of n registers, and each register
has a length of n input image pixels, or (n×d) bits, assuming each input image pixel is d
bits in length. Thus, the RB has the capacity to hold the n² input image pixels that are
needed for each convolution. This functional unit and its structure also improve the
scalability of the architecture.
Figure 3.7. Generalized RB for n×n filter size. (d denotes number of bits for the
input pixels).
3.3.2. Pattern Generator Pointers (PGPs)
In order to provide the MAUs with the correct sequence of input image pixels for
each convolution, the Pattern Generator Pointers (PGPs) of Figure 3.6 are utilized. Once
the RB is full, only one register needs to be updated with new input image pixels for
each convolution. Thus, the update sequence for the RB (input image pixels coming
from the Data Memory Interface) repeatedly goes from top to bottom (cycling from zero
to (n-1)). As for the output sequence from the RB, each convolution requires all
registers' contents to be fetched to the Delay Units. Hence, all units except the Data
Memory Interface (DM I/F) run at a frequency n times faster than the input image pixel
rate (for an n×n filter size); the DM I/F operates at the same frequency as the input
image pixel rate. Table 3.2 below shows the output sequence for one output pointer. For
n = 5 the output sequence of the first convolution is 0, 1, 2, 3, 4; the example in
Table 3.2 is based on a 5×5 filter size, and the output pattern repeats itself every five
convolutions.
Table 3.2. 5×5 Filter size (with one output pointer).

                            Convolutions
                     1st    2nd    3rd    4th    5th
Reading order         0      1      2      3      4
from the RB           1      2      3      4      0
                      2      3      4      0      1
                      3      4      0      1      2
                      4      0      1      2      3
The architecture can be scaled up to process up to x convolutions in parallel,
where (x ≤ n). This is made possible by adding (x-1) additional output pointer(s), Delay
Unit(s) and AU(s) to the existing architecture. As in the example above, the output
sequence for each pointer can be predetermined, and the sequences repeat after every
five convolutions. Table 3.3 below shows an example for a 5×5 filter size with two
output pointers. Figure 3.8 shows the additional hardware required if two convolutions
are to be done in parallel, and the figure also indicates the additional hardware required
to convolve x = n pixels in parallel. In addition, an Output Correction Unit (OCU) is
needed for convolution of two or more output pixels in parallel (see OCU in Figure 3.1).
The function of the OCU is explained in a following section.
As the architecture is scaled up to process more than one convolution in parallel,
all functional units within the architecture except the DM I/F (which runs at the same
frequency as the input image pixel rate) can operate at a lower frequency. If the
architecture is scaled up to process n convolutions in parallel, then the whole architecture
operates at the same clock rate as the input image pixel rate, so on average one
convolution is completed every clock cycle. However, the RB needs modification in
order to process n convolutions in parallel: on every clock cycle all n registers within the
RB are read at once, so the current input from the DM I/F, which updates one of the
registers within the RB, must be fetched to the MAUs at the same instant. For this case,
n pointers are utilized and each pointer has only one sequence instead of the five shown
in Table 3.2 and Table 3.3.
Table 3.3. 5×5 Filter size (Convolution with two output pointers).

                            Convolutions
                     1st    2nd    3rd    4th    5th
Reading order         0      2      4      1      3
from the RB           1      3      0      2      4
(pointer 0)           2      4      1      3      0
                      3      0      2      4      1
                      4      1      3      0      2

                            Convolutions
                     1st    2nd    3rd    4th    5th
Reading order         1      3      0      2      4
from the RB           2      4      1      3      0
(pointer 1)           3      0      2      4      1
                      4      1      3      0      2
                      0      2      4      1      3
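The read sequences in Tables 3.2 and 3.3 follow a simple modular pattern, and one
software-style way to express the PGP behavior is sketched below in C++ (this is an
illustrative assumption about how the sequences could be generated, not the finite state
machine or RAM-based realizations mentioned later in this section):

#include <vector>

// For an n x n filter and x output pointers working in parallel, pointer p
// begins its c-th read of the RB at register (c*x + p) mod n and then reads
// all n registers in cyclic order.
std::vector<int> rb_read_sequence(int n, int x, int p, int c)
{
    std::vector<int> seq(n);
    int start = (c * x + p) % n;        // first register read by pointer p
    for (int k = 0; k < n; ++k)
        seq[k] = (start + k) % n;       // cyclic reading order
    return seq;
}

// With n = 5, x = 1, p = 0 and c = 0..4 this reproduces the columns of
// Table 3.2; with n = 5 and x = 2 it reproduces the pointer 0 and pointer 1
// columns of Table 3.3.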
Figure 3.8. Additional hardware and modification for convolution of x output pixels
in parallel for (x ≤ n) (Functional units shaded in gray are additional hardware
required for processing two convolutions in parallel).
The PGPs can be synthesized using a finite state machine model. Another
modeling possibility is to store the predetermined sequences in a RAM and read them
out sequentially as needed.
3.3.3. Delay Units (DU)
Output from the Register Bank (RB of Figure 3.7 and Figure 3.8) goes through
the Delay Units (DU) of Figure 3.8 before being fetched into the MAUs within the AUs
of Figure 3.8. The Delay Units consist of a series of flip-flops arranged so as to generate
a skewed input to the MAUs. This is necessary for the AU to generate the correct
outputs. Figure 3.9 below shows the internal structure of a DU.
Figure 3.9. Organization of Flip-flops within the Delay Unit (DU). R within the
figure denotes one flip-flop.
3.4. Systolic Flow of Version 1 Convolution Architecture.
In order to further demonstrate how the data flow within the MAUs occurs, Figure
3.10 may be used. As the input image pixels pass through the DU from the RB, skewed
input image pixels are generated and fed to the MAUs of the AU. At the same time,
skewed filter coefficients are also input into MAUs by the CSs. Figure 3.10 below shows
an example of how an output pixel is obtained as it flows through the MAUs with a 5×5
filter coefficient size.
[Figure 3.10 shows the 5×5 FC centered at IP[3,4] and tabulates, clock cycle by clock
cycle (t0 through t0 + 8), the partial results held in MAU0 through MAU4 as the
products flow through the systolic array; only the resulting output expression from the
figure is reproduced here.]

OI[3,4] = FC00·IP56 + FC01·IP55 + FC02·IP54 + FC03·IP53 + FC04·IP52
        + FC10·IP46 + FC11·IP45 + FC12·IP44 + FC13·IP43 + FC14·IP42
        + FC20·IP36 + FC21·IP35 + FC22·IP34 + FC23·IP33 + FC24·IP32
        + FC30·IP26 + FC31·IP25 + FC32·IP24 + FC33·IP23 + FC34·IP22
        + FC40·IP16 + FC41·IP15 + FC42·IP14 + FC43·IP13 + FC44·IP12
Figure 3.10. Pictorial view of the data flow within the MAUs for one output pixel.
As shown in Figure 3.10 above, the convolution starts at clock cycle (cc) t0, when
the first input image pixel (IP12) is multiplied by filter coefficient FC44 in MAU0.
During the next clock cycle, (t0 + 1), the previous product from MAU0 is added to the
product of IP22 and FC34 in MAU1, while a new product (FC43 × IP13) is generated in
MAU0. The sum of the two products in MAU1 is propagated into MAU2 on the next
clock cycle, (t0 + 2), and is then summed with the product generated within MAU2.
The process continues as shown in Figure 3.10 above. An output pixel is generated
during the (t0 + 9) cc, when the partial results output by MAU4 at the (t0 + 4), (t0 + 5),
(t0 + 6), (t0 + 7), and (t0 + 8) cc are summed by the accumulator. Once the first output
pixel is generated on the 9th cc, a new output pixel is generated every five cc's
thereafter.
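The clock-by-clock behavior illustrated in Figure 3.10 can also be checked with a small
software model. The C++ fragment below is a simplified functional sketch of the MAU
chain under the one-clock-cycle-per-MAU assumption used above; the skewing of the
coefficient and pixel inputs (performed by the CSs and DUs) is assumed to happen
outside this fragment, and the member names are illustrative, not taken from the design.

#include <cstdint>
#include <vector>

// Simplified model of the Version 1 MAU chain of Figure 3.3: on every clock,
// MAU m multiplies the (already skewed) coefficient/pixel pair presented to it
// and adds the registered output of MAU m-1.
struct MauChain {
    int n;
    std::vector<int32_t> reg;             // pipeline register after each MAU
    explicit MauChain(int n_) : n(n_), reg(n_, 0) {}

    // One clock tick; fc[m] and ip[m] are the inputs seen by MAU m this cycle.
    // Returns the value leaving MAU n-1 on this clock.
    int32_t tick(const std::vector<int16_t>& fc, const std::vector<int16_t>& ip) {
        for (int m = n - 1; m >= 0; --m) {                 // update back to front
            int32_t from_prev = (m == 0) ? 0 : reg[m - 1]; // previous cycle's value
            reg[m] = from_prev + int32_t(fc[m]) * ip[m];
        }
        return reg[n - 1];
    }
};

Summing n consecutive return values of tick() plays the role of the accumulator, which
is why, for n = 5, the first output pixel appears only after the pipeline fills (the 9th clock
cycle in Figure 3.10) and a new output pixel follows every five clock cycles thereafter.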
3.5. Data Memory Interface (DM I/F)
It is anticipated that external memory devices will be utilized for IP pixel storage,
since the cost of on-chip memory within the single-chip convolution architecture of
Figure 3.1 is high for any implementation platform. (The following assessment assumes
a 5×5 filter size.) The bus for data transfer between the external memory device and the
DM I/F is 40 bits wide, ensuring that each access to the external memory device yields
five input image pixels. However, since memory devices such as the SRAM devices on
the market only come in widths of 8, 16, and 32 bits, two memory devices are used: one
8-bit and one 32-bit. Because only five input image pixels need to be updated between
consecutive convolutions, only one access to an external memory device is required per
convolution. Figure 3.11 below shows the basic functional units within the DM I/F.
Figure 3.11. Basic functional units within Data Memory I/F.
A cache unit is utilized to reduce the penalty of accessing external memory
devices. Figure 3.12 shows a more detailed view of the DM I/F. Register Files A and B
each consist of four shift registers, namely Registers b, c, d, and e. Each register is 40
bits in size and holds five input image pixels. In order to prevent data starvation,
Register Files A and B are used alternately: while Register File A is providing input
image pixels to the IDSs, Register File B is being filled with input image data from the
external memory devices, and vice versa. As either one of the Register Files outputs
input image pixels to the IDSs, each of its internal registers shifts an input image pixel
out by shifting right.
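The alternating use of Register Files A and B is a form of double (ping-pong) buffering;
the short C++ sketch below is only a behavioral illustration of that idea, with sizes
following the n = 5, d = 8 case (four registers of five 8-bit pixels per register file) and
with member names that are assumptions rather than the actual design:

#include <array>
#include <cstdint>

// Behavioral sketch of the DM I/F ping-pong cache: one register file feeds the
// IDSs while the other is refilled from the external memory, and the two swap
// roles once the active one has been drained.
struct RegisterFile {
    std::array<std::array<uint8_t, 5>, 4> regs;   // Registers b, c, d, e
};

struct DmCache {
    RegisterFile file[2];
    int active = 0;                               // file currently feeding the IDSs

    RegisterFile& feeding()   { return file[active]; }      // read by the IDSs
    RegisterFile& refilling() { return file[1 - active]; }  // written from memory
    void swap_when_drained()  { active = 1 - active; }
};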
Figure 3.12. A more detailed look at DM I/F.
Observing Equation 1, there are instances where references are made outside the
range of the input image. For these accesses zero pixel values are used. To address these
boundary conditions, zero-padding hardware is incorporated into the DM I/F: whenever
the end of a row of input image pixels is reached, (n − 1)/2 zero pixels are attached.
Thus, a column counter is needed within the main controller of the architecture.
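As a small illustration of this boundary handling (the function and its interface are
assumptions, not the zero-padding hardware itself), appending the zero pixels at the end
of a row could be written as:

#include <vector>
#include <cstdint>

// Append the (n-1)/2 zero-valued pixels described above to the end of a row.
void pad_row_end(std::vector<uint8_t>& row, int n)
{
    row.insert(row.end(), (n - 1) / 2, uint8_t{0});
}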
Register a of Figure 3.12 is a register that can hold up to five input image pixels.
As Register a is filled, the contents will be stored into External Memory Devices. There
are also five address pointers needed for addressing the External Memory Devices for
storage of input image pixels. Each addressable location of the two External Memory
Devices can hold up to five input image pixels (40 bits). Figure 3.13 shows the time line
for activities within the DM I/F. For every four reads from External Memory Devices
(read from each pointer once) and one write to store input image pixels stored in Register
a, five output pixels (OI) of Figure 3.1 will be produced.
Figure 3.13. Time line for activities, where W denotes a Write and R denotes a Read
(from the external memory device) of the registers indicated in the boxes directly below.
3.6. Output Correction Unit
This unit is responsible for correcting the output sequence when the architecture is
scaled to process two or more convolutions in parallel. Figure 3.14 below shows an
example of the output pixel sequence when two convolutions are processed in parallel.
Instead of one output on each output clock cycle, two output pixels are generated in a
single clock cycle every two output clock cycles. Thus, the Output Correction Unit may
be needed to restore the output sequence to one output pixel per output clock cycle.
Whether the OCU is needed can be addressed at a later time.
Figure 3.14. Output pattern for two convolutions in parallel.
3.7. Controller
This is the functional unit of Figure 3.1 that coordinates all the other functional
units within the architecture. The main controller of the architecture will be
implemented in finite state machine form. Within the main controller there will be a row
counter and a column counter to keep track of the row and column counts so that the
controller knows when the end of a row is encountered. There can be two separate
controlling units within the main controller: one responsible for the DM I/F (controller
DM) and the other responsible for the rest of the architecture (controller R). The DM I/F
runs at the same rate as the input image pixels, while the rest of the architecture runs at
least n times faster, assuming convolution of a single pixel at a time. However, as the
architecture is scaled up to handle x convolutions at the same time (see Figure 3.8),
controller R can be run at a correspondingly lower frequency.
Basically the controller can be divided into three main stages. The first stage is
mainly devoted to storing the first few rows of input image pixels and waiting until there
are enough input image pixels for convolution to start. The second stage is responsible
for filling up the pipeline and making sure that the convolution starts in the correct
manner. The last stage deals with shutting down the system.
Chapter 4
Revised Architectural Requirements and Resulting Version 2 Convolution
Architecture
The convolution architecture proposed in the previous chapter is scalable and
suitable for applications that require scalable performance and hardware. In this chapter
a more stringent performance requirement is addressed, for which a convolved OI pixel
is expected on each clock cycle of 7.3 ns (for a final “production” model based on ASIC
technology). In addition, k distinct n×n FCs are required to be simultaneously convolved
with each Input Image Plane (IP), resulting in a performance requirement of k OI pixels
on each 7.3 ns clock cycle. The performance requirement of k convolved OI pixels on
each 7.3 ns clock cycle can only be expected from final high-speed production
technologies. In Version 1 of the architecture, filter coefficients (FCs) were assumed to
be 8 bits in length; filter coefficients will now be 6 bits in length. Even though the
convolution architecture proposed in the previous chapter could be scaled up, and the
number of pipeline stages within the MAUs increased, to meet all the above
requirements, a specially tailored architecture can save hardware and reduce the
architecture's controller complexity. For example, as shown in Figure 3.1, within each
AU there is an accumulator in front of the MAUs. As the architecture is scaled up to
process n convolutions in parallel, n accumulators will be required, which can be costly
from a hardware standpoint. Furthermore, a simplified controller for the IDSs can also
contribute to a hardware saving. Hence, in this chapter a modified and specially tailored
convolution architecture is presented, referred to as Version 2 of the convolution
architecture.
4.1. Version 2 Convolution Architecture for (k = 1)
Since the desired output rate is one convoluted OI pixel per clock cycle, for an n×n
FC size a total of n² MAUs are needed for one distinct filter coefficient set. Figure 4.1
below shows a top level view of Version 2 of the architecture, where n and d (the width of
the input image pixels) are assumed to be 5 and 8 respectively. The 40-bit-wide buses
shown in Figure 4.1 result from the product n×d. Each functional unit in this architecture
implements the required functionality, as addressed below.
[Figure 4.1 shows the Input Image Plane (IP) entering the Data Memory Interface (DM I/F), which connects to the External Memory Device (Data Memory), the Controller Unit (CU), the Input Data Shifters (IDS0–IDSn-1), and the Arithmetic Unit (AU) containing MAA0–MAAn-1 and the Adder Tree (AT) producing the outputs OI; the filter coefficients (FC) are supplied to the MAAs.]
Figure 4.1. A top level view of Version 2 of the convolution architecture for one
distinct filter coefficient set with n = 5 and d = 8 (MAA denotes Multiplication and
Add Array and AT denotes Adder Tree).
4.2. Arithmetic Unit (AU)
As shown in Figure 4.1 above, the Arithmetic Unit (AU) consists of n
Multiplication and Add Arrays (MAAs) plus an Adder Tree Structure (AT) at the end of
the MAAs. Within each MAA there are n Multiplication and Add Units (MAUs) arranged
in a systolic array structure. Figure 4.2 shows the arrangement of the n MAUs within each
MAA.
The basic functional units within each MAU remain the same as in the previous
chapter. In Version 2 of the architecture, the filter coefficients are held within the MAUs;
therefore, an additional register is needed to hold the filter coefficient value assigned to a
specific MAU. Since Version 2 of the modified convolution architecture features n²
MAUs, the Coefficient Shifters (CSs) shown in Version 1 of the architecture (see Figure
3.1, Figure 3.4 and Figure 3.5) can be eliminated. Hence, each of the n² filter coefficients
is assigned to a specific MAU. Figure 4.3 shows the functional units within each MAU.
[Figure 4.2 shows the IDS output feeding Delay Units DU0–DUn-1, which in turn feed MAU0–MAUn-1; each MAU has an associated filter coefficient register (Register0–Registern-1), and the partial result is passed to the Adder Tree.]
Figure 4.2. Functional units within the Multiplication and Add Array (MAA).
[Figure 4.3 shows a Multiplication Unit fed by the 8-bit DU output and a 6-bit Filter Coefficient register, with an Adder combining the 14-bit product with the output of the previous MAU and passing the result to the next MAU.]
Figure 4.3. A MAU and its functional units.
In order to achieve the desired performance, it will be necessary to pipeline all
MAUs beyond the minimum pipeline stages shown in Figure 4.2 (the register to the right
of each MAU represents a pipeline stage). Thus it is important to employ multiplication
techniques that can easily be pipelined into multiple stages. It is possible to combine the
multiplication unit and the adder shown in Figure 4.3 into one unit. A multiplication
unit typically consists of an adder tree that adds all the generated partial
products. As shown in Figure 4.3 an adder is required to sum the previous MAU output
with the product generated by the multiplier. It is possible to use a Carry Save Adder and
generate the output as two separate outputs (a sum output and a carry output). This will
eliminate the need for another high speed adder at the end of each MAU.
The Adder Tree (AT) within the AU is responsible for adding all the n partial
results from the MAAs to form the output image pixel. The AT can be constructed with
Carry Save Adders (CSAs) and a Carry Look Ahead Adder (CLA). In addition, the AT
will be pipelined into multiple stages as well for performance. Figure 4.4 shows a
possible arrangement of CSAs and CLA within the AT. This example is based on a 5×5
FC size.
Figure 4.4. A possible arrangement of the AT (R denotes a single flip-flop pipeline
stage; a pipeline stage is included within each CSA and CLA) for a 5×5 FC.
Other basic functional units within the AU are the Delay Units (DU), which are
responsible for generating the skewed input image pixels for the MAUs. The DUs need
to be pipelined with the same number of pipeline stages as the MAAs.
Upon further investigation, even though replacing a high speed adder with a CSA
within a MAU can save a small amount of hardware, the replacement is not as beneficial
when the architecture is viewed at the highest level. Table 4.1 shows a direct comparison
of the number of gates required for a 14-bit CSA and a 14-bit CLA (an EX-OR gate is
counted as five gates). The amount of hardware saved is not as significant as the resulting
increase in hardware for the AT. Figure 4.5 shows a possible arrangement of the AT if a
CLA is utilized within the MAUs.
Table 4.1. Gate count comparison between CSA and CLA.
                 CSA     CLA
Gate Count       182     210
Figure 4.5. One possible arrangement of the AT when CLA is utilized within the
MAUs.
A comparison between Figure 4.4 and Figure 4.5 shows that a number of CSAs are
saved and that a number of pipeline stages are saved as well, resulting in a large hardware
saving. If a CSA is used within each MAU, the number of bits (or bus lines) running from
one MAU to another is doubled. Hence, when implemented, the CSA approach will require
more real estate within the chip (especially when implemented as an ASIC) than the CLA
approach, reinforcing the need to reduce the number of CSA units. Another important
reduction is that the number of flip-flops required for the pipeline stages is halved, since
only one bus (one output from each MAA) is required.
In conclusion, the adder within each MAU of Figure 4.3 will be a CLA type and
the Adder Tree (AT) of Figure 4.1 will be implemented as shown in Figure 4.5 for the
case of n = 5.
4.2.1. Multiplication Unit (MU) of Multiplication and Add Unit (MAU)
The Multiplication Unit (MU) of Figure 4.3 is one of the most important
arithmetic components within the proposed convolution architecture. Thus, it is important
that a high speed and area efficient multiplication technique be derived and implemented
since the architecture requires 25 MUs for one convolution set. For each MU, an 8-bit
unsigned binary number (IP) is to be multiplied by a 6-bit signed binary number (FC)
and a 14-bit signed binary output (OI) is generated. Table 4.2 below shows a summary of
all the elements involved in the multiplication. All signed binary numbers will be
represented as 2’s complement numbers.
Table 4.2. A summary of the multiplication.
Description     Representation                                   Range (Decimal)
Multiplicand    8-bit unsigned binary number                     0 to 255
Multiplier      6-bit signed binary number (2's complement)      -32 to 31
Product         14-bit signed binary number (2's complement)     -8192 to 8191
Multiplication in binary can be done using the same technique as with the
commonly used paper and pencil method. Partial products are generated based on each
bit of the multiplier and then all the partial products are summed to generate the product.
The number of partial products required is dependent on the number of bits of the
multiplier. Hence, as shown in Table 4.2, the 6-bit signed binary number is used as the
multiplier rather than the 8-bit unsigned binary number, since using fewer bits for the
multiplier produces fewer partial products. However, because the multiplier in this case is
a signed binary number, for the regular paper and pencil method to work when the
multiplier is negative, both the multiplicand and the multiplier must be complemented
before the multiplication. Otherwise all the partial products would be positive and the
result generated would be positive as well, which is incorrect, since a negative result
should be obtained when the multiplier is negative. By complementing both the
multiplicand and the multiplier, the signs of the two operands are exchanged but the result
remains the same, a negative value. In addition, all the partial products must be sign
extended for the multiplication to be correct. Figure 4.6 illustrates the multiplication
concept described above. A copy of the multiplicand is placed into the partial product,
with sign extension(s), if the respective multiplier bit is one; otherwise all zeros are placed.
Figure 4.6. Illustration of the paper and pencil multiplication technique (s on each
row of the partial products denotes sign extension of that particular row of partial
product).
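As a sanity check of the sign-handling rule just described (complement both operands when the 6-bit multiplier is negative, then sum the sign-extended partial products), the following short program is an illustrative software model only, not the hardware design: it multiplies an 8-bit unsigned multiplicand by a 6-bit signed multiplier by shift-and-add and compares every result with a direct multiplication.

#include <cassert>
#include <cstdint>
#include <cstdio>

// Software model of the paper-and-pencil technique: if the 6-bit multiplier
// is negative, both operands are complemented first so that the partial
// products can be formed as if the multiplier were positive.
int16_t paper_and_pencil(uint8_t multiplicand, int8_t multiplier6) {
    int32_t a = multiplicand;                 // 0..255
    int32_t b = multiplier6;                  // -32..31 (6-bit signed range)
    if (b < 0) { a = -a; b = -b; }            // complement both operands
    int32_t product = 0;
    int32_t partial = a;                      // multiplicand, shifted per bit position
    for (int bit = 0; bit < 6; ++bit) {
        if (b & (1 << bit))
            product += partial;               // add the (sign-extended) partial product
        partial *= 2;                         // shift the multiplicand left one position
    }
    return static_cast<int16_t>(product);     // result fits in 14 bits (-8160..7905)
}

int main() {
    for (int m = 0; m <= 255; ++m)
        for (int c = -32; c <= 31; ++c)
            assert(paper_and_pencil(uint8_t(m), int8_t(c)) == m * c);
    std::printf("all 256 x 64 products verified\n");
    return 0;
}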
Figure 4.6 above shows that the most expensive operation is summing all the
partial products into the final result. It is difficult to design an adder that adds six
operands at the same time, and such an adder may not be speed efficient. However, a
method such as the Multilevel Carry Save Adder (CSA) Tree [10] can be employed to add
all the partial products into a final result. The Multilevel CSA Tree uses multiple stages of
CSAs to reduce the operands to two operands, and a final adder stage sums these two
operands to generate the final result. Depending on the speed requirement, the Multilevel
CSA Tree can easily be pipelined into multiple stages to increase throughput. In addition,
the final stage adder can be a fast adder such as a Carry Lookahead Adder (CLA) to
reduce latency. Figure 4.7 below shows a possible arrangement of the Multilevel CSA
Tree for adding six operands. A five stage pipeline can be implemented with this
configuration.
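To make the carry-save idea concrete, the sketch below is an illustrative software model under the assumption of ordinary integer operands (not a gate-level design): it reduces six operands with 3:2 carry-save stages, keeping each intermediate result as a (sum, carry) pair, and uses one final carry-propagate addition at the end, mirroring the structure of Figure 4.7.

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <utility>

// One 3:2 carry-save stage: three operands in, a (sum, carry) pair out.
// sum holds the bitwise XOR; carry holds the majority bits shifted left one.
std::pair<uint32_t, uint32_t> csa(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t sum   = x ^ y ^ z;
    uint32_t carry = ((x & y) | (x & z) | (y & z)) << 1;
    return {sum, carry};
}

int main() {
    // Six partial products (arbitrary sample values) reduced as in Figure 4.7.
    uint32_t pp[6] = {13, 200, 77, 5, 1023, 64};

    auto [s1, c1] = csa(pp[0], pp[1], pp[2]);   // 1st stage
    auto [s2, c2] = csa(pp[3], pp[4], pp[5]);   // 1st stage
    auto [s3, c3] = csa(s1, c1, s2);            // 2nd stage
    auto [s4, c4] = csa(c2, s3, c3);            // 2nd stage
    uint32_t result = s4 + c4;                  // final carry-propagate (CLA) stage

    uint32_t expected = 0;
    for (uint32_t v : pp) expected += v;
    assert(result == expected);
    std::printf("CSA tree result %u matches direct sum %u\n", result, expected);
    return 0;
}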
It is possible to reduce the hardware count if the number of partial products can be
reduced. This can be done through use of the Modified Booth’s Algorithm (MBA) [13].
The MBA inspects three multiplier bits at a time and generates the respective partial product
selections. In contrast, the original Booth’s Algorithm (BA) [4] inspects two
bits of the multiplier at a time, and hence the number of partial products generated
still remains proportional to the number of multiplier bits. The MBA can reduce the
number of partial products required to (x/2 + 1), where x is the number of bits of the
multiplier. Thus, for a 6-bit multiplier, the partial products generated will be reduced
from six to four. However, the Partial Products Generator’s (PPG) complexity is
increased due to the different possible outputs for each partial product. Table 4.3 below
gives a summary of the possible outputs for a partial product based on the three multiplier
bits examined.
Figure 4.7. One possible arrangement of Multilevel CSA Tree for six partial
products.
Table 4.3. Partial Product Selection Table.
Multiplier Bits     Selection
000                 0
001                 + Multiplicand
010                 + Multiplicand
011                 + 2×Multiplicand
100                 - 2×Multiplicand
101                 - Multiplicand
110                 - Multiplicand
111                 0
As shown in Table 4.3, each partial product generated can take one of several
different values, and thus the hardware complexity of the PPG is increased. However, each
partial product output is easily obtained: shifting the multiplicand left by one position
produces the 2× values, and complementing and adding one produces the negative values.
Figure 4.8 below illustrates the changes to the Multilevel CSA (Wallace) Tree when the
MBA is employed; compared to Figure 4.7, two CSAs are saved.
Figure 4.8. Multiplier based on Modified Booth’s Algorithm and Wallace Tree
Structure.
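The radix-4 selection rule of Table 4.3 can be checked in software. The sketch below is an illustrative model and not the PPG hardware: it examines overlapping groups of three multiplier bits, selects 0, ±multiplicand, or ±2×multiplicand for each group, weights each selection by a power of four, and verifies that the sum equals the direct product. In hardware a fourth row carries the sign corrections (as in Figure 4.9); in this integer model the corrections are folded into exact negation.

#include <cassert>
#include <cstdio>

// Radix-4 (Modified Booth) recoding of a 6-bit two's complement multiplier.
// Each overlapping group of bits (b[2i+1], b[2i], b[2i-1]), with b[-1] = 0,
// selects 0, +/-M or +/-2M per Table 4.3; the selection is weighted by 4^i.
int booth_multiply(int multiplicand, int multiplier6) {
    static const int select[8] = {0, +1, +1, +2, -2, -1, -1, 0};   // Table 4.3
    int bits = multiplier6 & 0x3F;      // 6-bit two's complement pattern
    int prev = 0;                       // implicit bit to the right of bit 0
    int product = 0;
    for (int i = 0; i < 3; ++i) {       // three overlapping 3-bit groups
        int b0 = (bits >> (2 * i)) & 1;
        int b1 = (bits >> (2 * i + 1)) & 1;
        int code = (b1 << 2) | (b0 << 1) | prev;
        product += select[code] * multiplicand * (1 << (2 * i));
        prev = b1;                      // this group's MSB overlaps the next group
    }
    return product;
}

int main() {
    for (int m = 0; m <= 255; ++m)
        for (int c = -32; c <= 31; ++c)
            assert(booth_multiply(m, c) == m * c);
    std::printf("Modified Booth model verified for all 6-bit multipliers\n");
    return 0;
}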
Figure 4.9 below depicts the detailed multiplication technique when the MBA is
employed to reduce the number of partial products required. Some hardware saving can
be achieved wherever a Full Adder (FA) can be replaced with a Half Adder (HA) within the
MU. Figure 4.10 shows the reduced sign extension within the partial products, which in
turn contributes to hardware savings [12].
Figure 4.9. Illustration of multiplication technique based on Modified Booth’s
Algorithm.
Figure 4.10. Partial Product’s sign extension reduced for hardware saving.
Table 4.4 below shows the hardware comparison between multiplication
techniques shown in Figure 4.7 and Figure 4.8. For simplicity, the multiplication
technique shown in Figure 4.7 is denoted as method I and method II denotes the
multiplication technique shown in Figure 4.8. Also, Exclusive-OR gates (EX-OR) within
both methods are counted as five gates. FF in Table 4.4 denotes Flip-Flops.
Table 4.4. Comparison between method I and method II.
            # FA    # HA    # FF    # CLA    Gate Count (excluding FF)
Method I     37       5      78       1      691
Method II     9      15      72       1      641
From Table 4.4, the total gate counts, excluding FFs, are almost identical; however,
in order to achieve the speed requirement, multiple pipeline stages are included. The
maximum gate delay for both methods is identical and is due to the CLA (10 gate delays)
[11]. Even though the hardware saving between the two methods appears modest at a
glance, the number of times the unit is replicated can be a significant factor. Another
important note is that, in order for method I to handle two’s complement operands, both
the multiplicand and the multiplier must be complemented (complement each bit and add
one to the result) whenever the multiplier is negative; hence extra overhead is required for
method I to function correctly. A workaround that avoids adding hardware for
complementing both operands in method I is to swap the roles of the multiplicand and the
multiplier. This ensures that the multiplier always has a positive value and thus avoids the
two’s complement hardware, but in return more partial products must be generated, since
the multiplier then consists of an 8-bit unsigned number.
The multiplication technique shown in Figure 4.8 will be employed when the
Version 2 architecture is implemented, because it requires less hardware than the
technique shown in Figure 4.7.
4.2.2. Delay Units (DU)
The Delay Units (DU) are responsible for generating skewed input image pixels
for the MAA. The number of stages within each DU is directly proportional to n and to the
number of pipeline stages employed within the MAA. Figure 4.11 below shows the
organization of the flip-flops within the DU. The number of stages shown in Figure 4.11
assumes n = 5 and two pipeline stages within each MAU (one pipeline stage after the
Multiplication Unit and another after the Adder), excluding the first MAU, which has only
a Multiplication Unit and hence one pipeline stage, as shown in Figure 4.1.
Figure 4.11. Functional units within the DU for the case of n = 5 and two pipeline
stages within each MAU (PL denotes a pipeline stage composed of flip-flop
registers).
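A simple way to picture a DU is as a delay line of flip-flop stages. The model below is an informal sketch in which each of five pixel streams is given an assumed depth of one extra stage per stream; it is not the exact register organization of Figure 4.11, but it shows how different delay depths skew the operands arriving at the MAUs.

#include <cstdio>
#include <deque>
#include <vector>

// Software model of a delay line: 'depth' flip-flop stages between input
// and output, advanced once per clock tick.
struct DelayLine {
    std::deque<int> stages;
    explicit DelayLine(int depth) : stages(depth, 0) {}
    int tick(int in) {                   // shift in a new value, shift out the oldest
        stages.push_back(in);
        int out = stages.front();
        stages.pop_front();
        return out;
    }
};

int main() {
    // Assumed skew: stream j is delayed by j + 1 pipeline stages (n = 5).
    std::vector<DelayLine> du;
    for (int j = 0; j < 5; ++j) du.emplace_back(j + 1);

    for (int t = 0; t < 8; ++t) {            // feed pixel value t on every stream
        std::printf("t=%d:", t);
        for (int j = 0; j < 5; ++j)
            std::printf(" %d", du[j].tick(t));
        std::printf("\n");                   // the outputs show the increasing skew
    }
    return 0;
}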
4.3. Input Data Shifters (IDS)
This functional unit (see Figure 4.1) is responsible for providing the AU with the
correct input image pixel sequence. Figure 4.12 below shows the structural view of the
Input Data Shifters (IDS). There are n shift registers (S Registers) within the IDS with
each register capable of holding n input image pixels and with parallel load capability.
Each input image pixel is d bits wide. All shift registers also need to be able to shift all n
input image pixels in parallel.
Figure 4.12. Structural view of the IDS with n = 5 and d = 8.
As shown in Figure 4.12 above, all data within the structure are shifted in parallel
(in this example, it is 40 bits) from one shift register to another shift register. For
example, S Register0 is loaded with 40 bits in parallel from the output of DM I/F and
shifts 40 bits in parallel into S Register1.
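The IDS therefore behaves like a chain of n wide registers, each handing its full 40-bit word to the next on every shift. The sketch below is an illustrative software model of that parallel-load-then-shift behaviour for n = 5, with the 40-bit words held in 64-bit integers and made-up input words from the DM I/F; it is not the register-transfer design itself.

#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
    const int n = 5;
    uint64_t s_reg[n] = {0};                 // five 40-bit S Registers (in 64-bit words)

    // Each cycle: S Register0 is parallel-loaded from the DM I/F and every
    // register hands its whole 40-bit word to the next one in the chain.
    for (int cycle = 1; cycle <= 6; ++cycle) {
        uint64_t from_dm_if = 0x1111111111ULL * cycle;   // assumed 40-bit input word
        for (int k = n - 1; k > 0; --k)
            s_reg[k] = s_reg[k - 1];          // parallel shift, 40 bits at a time
        s_reg[0] = from_dm_if & 0xFFFFFFFFFFULL;

        std::printf("cycle %d:", cycle);
        for (int k = 0; k < n; ++k)
            std::printf(" %010" PRIX64, s_reg[k]);
        std::printf("\n");
    }
    return 0;
}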
4.4. Data Memory Interface (DM I/F)
The Data Memory Interface (DM I/F) of Figure 4.1 will remain unchanged from
section 3.5. See Figure 3.11 and Figure 3.12.
4.5. Memory Pointers Unit (MPU)
The external memory devices (see Figure 4.1) that are required by the architecture
are read and written through several memory pointers within the Memory Pointer Unit
(MPU). In order to achieve a minimum number of writes to the external memory devices,
MPU receives and stores n (five) input image pixels (for a 5×5 filter size) before it writes
all five input image pixels to the memory location pointed to by one of the memory
pointers. Thus, the bus width for the interconnection between the memory devices and
the architecture is 40 bits (n×d). If the memory accesses cannot keep up with the main
system clock, then the memory bandwidth can be increased to reduce the number of
accesses to the external memory devices.
[Figure 4.13 shows two external memory devices (a and b), each divided into five 1024-location segments with base addresses 0 (ptr_a / ptr_0), 1024 (ptr_b / ptr_1), 2048 (ptr_c), 3072 (ptr_d), and 4096 (ptr_e / ptr_n-1).]
Figure 4.13. External memory devices organization for n = 5 and d = 8.
Figure 4.13 above shows the memory organization of the external memory
devices for the case of n = 5 and d = 8, with a different segment of the memory designated
for each memory pointer, while Figure 4.14 below shows the functional units within the
MPU, again for the case of n = 5 and d = 8. Each of the n memory segments is capable of
storing one row of input image pixels. For example, for an n×d (40) bit memory bus width
and a paper width of 5100 pixels, each memory segment must provide at least
5100/5 = 1020 locations. As shown in Figure 4.13 above, 1024 locations are allocated to
each memory pointer. By allocating 1024 locations to each memory segment, the three
most significant bits of each memory pointer can be used to differentiate the memory
segments. Also, for every five output pixels generated, all memory pointers share a
common ten least significant bits, except for the memory pointer that is used to write the
current pixels into the external memory devices; this is because the other four memory
pointers must pre-fetch all the input image pixels needed for the next convolution
iteration into the cache memory. Thus, two 10-bit counters (col_cntr #1 and col_cntr #2)
are needed to generate memory addresses, as shown in Figure 4.14.
Figure 4.14. Functional units within the Memory Pointers Unit (MPU).
As shown in Figure 4.14 above, the three most significant bits of each memory
pointer are stored in registers. One important note is that, when the architecture is
initialized, these registers are initialized with different values. For example, Memory
Pointer b (ptr_b) is initialized with 001, Memory Pointer c (ptr_c) with 010, Memory
Pointer d (ptr_d) with 011, Memory Pointer e (ptr_e) with 100, and Memory Pointer a
(ptr_a) with 000. This ensures that each memory pointer is designated to a specific
memory segment. In addition, the memory pointers are shifted (rotated) as indicated in
Figure 4.14 above whenever a row of the input image pixels is completed, because the
least recent row of input image pixels stored within the external memory devices is no
longer needed and can be overwritten with the current input image pixels.
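The address scheme just described (a 3-bit segment number concatenated with a 10-bit column count, and a rotation of the segment assignments at the end of each row) can be modelled compactly. The code below is a behavioural sketch under those assumptions only; the rotation direction and the small column counts are illustrative, not the MPU's register-level design.

#include <cstdio>

int main() {
    // Segment numbers initially assigned to ptr_a .. ptr_e (3 MSBs of each pointer).
    int segment[5] = {0, 1, 2, 3, 4};      // ptr_a = 000, ptr_b = 001, ..., ptr_e = 100
    int col_cntr = 0;                      // 10-bit column counter (0..1023)

    // 13-bit address = 3-bit segment followed by the 10-bit column count.
    auto address = [&](int ptr) { return (segment[ptr] << 10) | (col_cntr & 0x3FF); };

    for (int row = 0; row < 3; ++row) {
        for (col_cntr = 0; col_cntr < 4; ++col_cntr)     // a few columns per row
            std::printf("row %d col %d  ptr_a addr=0x%04X  ptr_e addr=0x%04X\n",
                        row, col_cntr, address(0), address(4));

        // End of a row: rotate the segment assignments so the oldest stored row
        // (no longer needed) is overwritten by the incoming row.
        // The rotation direction here is chosen purely for illustration.
        int oldest = segment[4];
        for (int p = 4; p > 0; --p) segment[p] = segment[p - 1];
        segment[0] = oldest;
    }
    return 0;
}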
4.6. Systolic Flow of Version 2 Convolution Architecture
This section shows how the input image data flows through the AU of Figure 4.1
for the case of (n = 5) and how each output pixel (OI) is generated in Version 2 of the
Convolution Architecture. Figure 4.15 below shows the systolic flow of the data for
Version 2 of the Convolution Architecture. In order to simplify the figure, all pipeline
stages within the MAAs are ignored and the figure corresponds to Figure 3.10 with the
same convolution point and input image pixels. As can be seen in Figure 4.15 below, at
time t0 every MAA multiplies the input image pixel it receives with the filter coefficient
stored within its first MAU. At the next time instant, t0 + 1td, where td denotes the pipeline
delay between two MAUs within each MAA, the previous product from MAU0 (in each
MAA) is summed with the product of MAU1. This process continues until time instant
t0 + 4td, when all input image pixels (for one convolution point) have flowed through
MAU4 within each MAA; the n partial results are then summed by the AT on the next
cycle to generate one output pixel.
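The accumulation pattern of Figure 4.15 can be imitated in plain software: each MAA passes a running partial sum from MAU to MAU, adding one FC×IP product per step, and the AT sums the n partial results. The sketch below is a purely behavioural model of that data flow for n = 5, using small made-up FC and IP values and a simplified index pairing rather than the exact coefficient/pixel pairing shown in the figure.

#include <cstdio>

int main() {
    const int n = 5;
    int FC[n][n];                       // filter coefficients held in the MAUs
    int IP[n][n];                       // the 5x5 input-image window for one OI pixel
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            FC[r][c] = r - c;           // made-up coefficient values
            IP[r][c] = r * n + c + 1;   // made-up pixel values
        }

    // Each MAA accumulates its column: MAU0 multiplies, and MAU1..MAU4 each add
    // one more product to the partial result flowing down the array.
    int output_pixel = 0;
    for (int maa = 0; maa < n; ++maa) {
        int partial = 0;
        for (int mau = 0; mau < n; ++mau)
            partial += FC[mau][maa] * IP[mau][maa];   // one product per time step
        output_pixel += partial;        // the Adder Tree (AT) sums the n partial results
    }

    // Direct 2-D sum of products for comparison.
    int expected = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            expected += FC[r][c] * IP[r][c];
    std::printf("systolic result %d, direct result %d\n", output_pixel, expected);
    return 0;
}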
4.7. Controller
The controller for version 2 of the architecture, shown in Figure 4.1, is only
responsible for controlling the DM I/F of the architecture and the described Memory
Pointers Unit (MPU). This is due to the fact that the AU and the IDS need no controller to
regulate their activities: the AU and IDS are clocked with the main clock, and all the
necessary input image pixels propagate through the pipeline stages as required, so no
separate controller is necessary for either unit.
[Figure 4.15 tabulates the partial sums accumulated within MAA0 through MAA4 at clock cycles t0, t0 + 1td, t0 + 2td, t0 + 3td, and t0 + 4td.]
Figure 4.15. Pictorial view of the data flow within the MAAs for one output pixel (td
denotes the time delay between each MAU).
Figure 4.16 below shows the top level view of the Controller Unit (CU) with the
input and output control signals shown. The CU is responsible for generating control
signals to functional units within the DM I/F and the MPU.
[Figure 4.16 shows the CU inputs (clock clk, reset rst, row greater than rgt, shut down signal sds, end of column eoc, and beginning of a row bor) and outputs (f_sel cache banks select, z_pad zero padding, reg_sel registers select, en_w cache write enable, en_sf cache shift enable, z_input zero input, c_inc column counter increment, rot rotate memory pointers, r_inc row counter increment, r_w memory read/write line, and sd_inc shut down counter increment).]
Figure 4.16. Top level view of the Controller Unit (CU).
Figure 4.17 below shows the functional units for which the CU generates control
signals, for the case of n = 5 and d = 8. The functional units labeled C_BANK1 and
REG_A are contained within the DM I/F, whereas the functional unit labeled MEMPTRS
is the MPU referred to above. The MPU is responsible for generating memory addresses
for the external memory devices, which store all the necessary input image pixels (IP) for
each convolution. C_BANK1 supplies input image pixels to the Input Data Shifters (IDS)
and pre-fetches the necessary input image pixels from the external memory devices for the
next iteration of convolutions; in other words, it serves as a cache memory for the
convolution system. The functional unit labeled REG_A stores the most recent input
image pixels received from the external scanning device and later writes them to the
external memory device when its register is full. In addition, REG_A also supplies the
most current input image pixels to the IDS.
Figure 4.17. Functional Units that receive control signals from the CU.
The convolution system is pipelined into multiple stages requiring synchronized
operation. Thus, the CU is modeled as a finite state machine. Figure 4.18 below shows
the system flow chart for the CU. The system flow chart describes micro-operations of
the system on a clock-cycle by clock-cycle basis and it also indicates values that must be
assigned to appropriate control signals of the architecture on each clock cycle of
operation. Operation of the system flow chart shown below can be divided into three
segments. The first segment starts at the beginning of the flow chart and runs until the
row counter (row_cntr) reaches a count greater than one. This segment is operational
when the input image pixels of a scanned page begin to arrive; the convolution process is
started only after the first two rows of input image pixels have been received. In this first
segment, the two received rows of input image pixels are stored in the external memory
device. The purpose of the tog signal within the flow chart is to alternate writing between
the two Regfiles within C_BANK1 in order to avoid data starvation from the external
memory device. There are two column counters (col_cntr #1 and col_cntr #2) within the
MEMPTRS functional unit: the first column counter is used by the ptr_a address
generator to write to the external memory device, while the second column counter runs
one count ahead of the first to pre-fetch the rest of the required rows (ptr_b, ptr_c, ptr_d,
and ptr_e) from the external memory device.
After the first two rows of input image pixels have been stored, the convolution
process can begin as the third row of input image pixels is received. The second
operational segment of the flow chart, which starts at the decision box for row greater than
one (row_cntr > 1) and ends at connector A in the figure, is active while the convolution
process runs. This segment of the flow chart continues until all input image pixels for the
entire scanned page have been received.
The last segment of the flow chart starts at connector A and continues until the end
of the flow chart. This segment is mainly responsible for supplying the system with zeros
as input until the last two rows of output pixels have been completely generated. As can
be seen from the flow chart, a dedicated counter (sd_cntr) counts to two to indicate the end
of the convolution process.
Control signals such as tog, en_w and en_sf are expected to retain their latest
value as the system transitions from one micro-operation to another. Thus, some memory
elements such as latches are required. If this compromises the CU’s speed and
performance, then a modification to the system flow chart, such as the one shown in
Figure 4.19 below, would be desirable.
The system flow chart shown in Figure 4.19 contains extra states added to
eliminate the shared states (after each decision branch) shown in Figure 4.18. This
modification removes the memory elements for the signals tog, en_w and en_sf, which
would otherwise need to be toggled after each branch following the decision-making
states, and thereby reduces the control signal generation delay.
Figure 4.18. System flow chart for Version 2 convolution architecture’s Controller
Unit (CU).
Figure 4.18. (Continued) System flow chart for Version 2 convolution architecture’s
Controller Unit (CU).
Figure 4.19. Modified Version 2 system flow chart.
Figure 4.19. (Continued) Modified Version 2 system flow chart.
4.8. Multiple Filter Coefficient Sets when (k > 1)
To address the need to simultaneously convolute k different sets of Filter
Coefficients (FC) with a single Input Image Plane (IP), such as when scanning and
printing color images, the version 2 architecture will require some hardware to be
replicated. Figure 4.20 below shows a high level view of the arrangement of the
additional required replicated hardware. For each additional FC set, one additional AU
will need to be added. However, not all the functional units within the AU need to be
replicated. A common DU (within the MAAs) can be shared among all the AUs for
additional FC sets (see Figure 4.2 for detail within a MAA). For example, the MAA0s
within all the AUs can share a common DU rather than each MAA0 having its own DU.
Figure 4.20. Version 2 architecture for k (n×n) filter coefficient sets (where k can be
any number).
The CU, DM I/F, and IDS functional units of Figure 4.20 are functionally and
operationally identical to the same units of Figure 4.1 for a given n and d and only one
instantiation of these units is required when k filter coefficient sets are used. This
enhances the scalability of the convolution architecture when expanded to handle
multiple FC planes. Moreover, the CU does not have to control any of the AUs of Figure
4.20; it only has to control the DM I/F and IDS units.
The version 2 convolution architecture of Figure 4.20, from a functional and
performance standpoint, can now simultaneously convolute a single IP with k (n×n) FCs
resulting in k convoluted OI pixels (OI1, OI2, … OIk) on each system clock cycle. This
functionality and performance of the version 2 architecture will first be validated via
HDL post-synthesis and post-implementation simulation in a later chapter of the thesis.
The functionality and performance will then be further validated in a later chapter via
development and experimental testing of an FPGA-based hardware prototype.
Chapter 5
VHDL Description of Version 2 Convolution Architecture
This chapter describes the VHDL coding style and approach used to capture the
Version 2 convolution architecture. After the design is captured through VHDL, it is
synthesized and implemented to a targeted FPGA. Before a hardware prototype is built,
functional and performance level simulation will be done to validate its proper
functionality and determine its performance.
Modular and bottom up hierarchical design approaches were employed during the
VHDL design capture process. The modular design approach partitions the entire system
into smaller modules or functional units that can be independently designed and
described in VHDL. Identical modules (with the same functionality) can share the same
VHDL code or reuse a previously designed module. In addition, the bottom up
hierarchical design approach allows a multiple level view of the entire system for design
ease. Hence, by employing these approaches the smaller modules or functional units can
be tested and validated before they are combined together as the entire system.
For prototype purposes the Version 2 convolution architecture is captured with
three AUs instantiated (k = 3), no pipeline stages are built into the multiplication units,
and the architecture is tailored to an input image plane of size 5×60 pixels. The VHDL
described system has a total of 13 pipeline stages within each AU. In addition, the
external memory device as shown in Figure 4.1 is described in VHDL and incorporated
into the overall system and will thus, for the experimental hardware prototype, be
implemented within the FPGA chip containing the other functional units of the
convolution architecture.
Figure 5.1 below shows the organization of functional units within the convolution
architecture. For simplicity of the chart, only the main functional units are shown;
sub-modules within the main functional units are omitted. Both behavioral and structural
level coding styles were used during the VHDL coding process. Behavioral level coding
has the advantage that only the behavior of each module is described in the
code and the CAD software must infer the internal logic blocks. However, this may
present inconsistency since different CAD software may infer different logic blocks for
the same code. For this thesis behavioral level coding style was employed for most of the
functional units, however all the various sized adders and multiplication units were coded
at the structural level. This was to validate the correctness of the multiply and addition
techniques proposed in the previous chapter.
Figure 5.1. Version 2 Convolution Architecture organization.
After the system is captured through VHDL, post-synthesis and post-implementation HDL
software simulation can determine whether the system is functioning and performing as it
should. The next chapter presents post-synthesis and post-implementation simulations of
the convolution architecture. All VHDL code for Version
2 of the Discrete Convolution Architecture with three AUs (k = 3, see Figure 4.20) is
included in Appendix A. The code is appropriately commented such that one should be
able to identify the VHDL code describing all functional units of the convolution
architecture system.
Chapter 6
Version 2 Convolution Architecture Validation via Virtual Prototyping (Post-Synthesis and Post-Implementation Simulation Experimentation)
Hardware Description Language (HDL) simulation of an architecture design,
sometimes known as virtual prototyping, is an important step in the design flow for fine
tuning and detecting potential problem areas before the design is implemented or
manufactured. In this section, Post-Synthesis simulation results and Post-Implementation
simulation results of version 2 of the convolution architecture will be presented. Both
Post-Synthesis simulation and Post-Implementation simulation are utilities contained
within the Xilinx Foundation 4.1i CAD software packages utilized during this project
[18]. During the process of validating version 2 of the convolution architecture, the
computer system used to run this software had the following configuration: an Intel
Pentium III 450 MHz processor, 128 MB of memory, and the Windows 98 Second Edition
operating system.
After a design has been captured either through schematic capture or via HDLs
such as VHDL or Verilog, software HDL simulation of the design is the next step in the
design flow for functional and timing validation. Software HDL simulation has the
advantage of identifying potential problem areas before a design is implemented (for
FPGA) or manufactured (for ASIC) and hence correction or modification can be made.
The usage of both Post-Synthesis simulation and Post-Implementation simulation within
the design flow for design prototyping via FPGA technology can be attributed to the fact
that Post-Synthesis simulation is utilized for functional validation of the design whereas
Post-Implementation simulation is utilized for both functional and timing (performance)
validation of the design. Both utilities are therefore important tools for obtaining a better
understanding of the characteristics of a particular design.
The testing methodology employed in this project uses a bottom up approach:
lower level functional components, such as the various types of adders and multipliers,
were tested before being combined into higher level functional units. The bottom up
approach is desirable because, once the lower level components have been validated and
are combined into higher level functional units, they can reasonably be ruled out as the
source of any errors that are detected.
6.1. Post-Synthesis Simulation
In order to confirm that version 2 of the convolution architecture functions as
intended in the previous chapters, Post-Synthesis simulation was utilized for functional
level validation. Post-Synthesis HDL simulation simulates the system as synthesized to
netlist (gate-level) form, with zero propagation delay assumed through the gates. To
determine the correctness of the functional unit under test, all possible input vectors
should be applied and checked against known correct or expected outputs. Thus,
testbenches need to be developed for this purpose. However, if the number of inputs to
the functional unit under test is large, fully testing all possible input stimuli can be quite
complex; hence, automated generation of the testbenches is preferred. To achieve this
objective, a C++ program was written that is capable of generating testbenches in the
required format. For ease of re-running the simulation process, the script file editor, a
feature of the Xilinx Foundation Simulator, was used to eliminate the process of entering
test vectors after each simulation run.
Figure 6.1 below shows the testing model that was used for verifying the functionality of
lower level functional components.
The testing model shown in Figure 6.1 below was captured through VHDL as an
entity, with the functional unit under test instantiated; its output is compared to the
expected (theoretically correct) result from the testbench.
6.1.1. Adders
Different types of Carry Lookahead Adders (CLA) were employed within the
convolution system. The main difference between them lies in the length of the operands
they operate on; depending on that length, they are referred to as 14-bit, 15-bit, 16-bit,
17-bit and 19-bit CLAs. To check that these lower level functional components operate as
intended, Post-Synthesis simulation was used. One of the most heavily used is the 14-bit
CLA, which is duplicated within each Multiplication Unit (MU) contained in the
convolution system.
[Figure 6.1 shows the testbench stimulus (test vectors) driving the Functional Unit Under Test, whose output is compared by a Comparator against the expected (theoretically correct) result from the testbench; the comparator output err is zero when the two results are identical and one otherwise.]
Figure 6.1. Testing model for lower level functional components.
The testing methodology described in Figure 6.1 was used. The VHDL file that
contains the testbench entity and C++ program source code that generates the
theoretically correct outputs (in a file format that is acceptable to the script editor for
software simulation) can be found in Appendix B. Figure 6.2 shows the Post-Synthesis
simulation output of the testbench for the 14-bit CLA. However, due to the length of the
simulation, a selected set of test vectors was used instead of an exhaustive (all possible
inputs) set. As can be seen from Figure 6.2, the signal err remains low throughout the simulation
and indicates that the outputs from the unit under test agree with the theoretically correct
outputs generated from the C++ program.
Figure 6.2. Post-Synthesis simulation for 14-bit CLA.
Figure 6.3 below shows a close up view for one segment of the testbench
simulation shown in Figure 6.2 above. Buses vec_a and vec_b are the two input operands,
while ans_ut is the output from the unit under test (14-bit CLA) and ans is the
theoretically correct output. For instance, at the left-most position of bus vec_a one sees a
hexadecimal value of 0008 (8 in decimal) and vec_b shows a hexadecimal value of 2000
(-8192 in decimal), thus the sum should be 2008 (-8184 in decimal) which is the same
value shown on both buses ans and ans_ut.
Figure 6.3. A close up view of one segment of Figure 6.2.
The procedure for testing all the other CLAs with different operand lengths is the
same as shown above. Figure 6.4, Figure 6.5, Figure 6.6, and Figure 6.7 show
Post-Synthesis simulation results of the testbenches for each CLA. As can be seen from
the figures, the err signal stays low throughout, indicating that the outputs from the units
under test agree with the predicted correct results generated by the C++ program.
Figure 6.4. Post-Synthesis simulation for 15-bit CLA.
Figure 6.5. Post-Synthesis simulation for 16-bit CLA.
Figure 6.6. Post-Synthesis simulation for 17-bit CLA.
Figure 6.7. Post-Synthesis simulation for 19-bit CLA.
6.1.2. Multiplication Unit
The Multiplication Unit (MU) is a lower level component that is replicated 25
times in version 2 of the convolution architecture within each AU for the case of n = 5.
Hence, it is important to determine that MU is functioning correctly. In order to test MU
with all possible inputs, a C++ program was written to generate the required testbench;
the program can be found in Appendix B. In addition, an entity was created in a VHDL
file with MU being instantiated and a comparator was also instantiated to compare the
output generated by MU with the theoretically correct output from the program (as an
input to the entity). This VHDL file can also be found in Appendix B.
Figure 6.8 below shows the complete run of the testbench. Bus coef is a 6-bit wide
signed filter coefficient bus, bus mag is an unsigned 8-bit magnitude input, bus product is
the output generated by the unit under test (the MU in this case), and bus t_ans is the
theoretically correct output. Signal err is the output of the comparator; it is driven to one
whenever t_ans and product do not match. As shown in Figure 6.8 below, the bus values
are packed too closely to be distinguished due to the length of the simulation; however,
the err signal remains low for the entire simulation, so the buses t_ans and product are
identical throughout.
Figure 6.8. Post-Synthesis simulation for all possible inputs for the Multiplication
Unit (MU).
Figure 6.9 below shows a close up view of one segment of the simulation. For
instance, in the first part of Figure 6.9, bus coef has a value of 21 (-31 in decimal) and bus
mag has a value of FB (251 in decimal), thus the product of the multiplication should
have a value of 219B (14-bit signed value) in hexadecimal (-7781 in decimal). Both
buses product and t_ans have the same value and thus the err signal has a value of zero
indicating that both the values agree with one another.
Figure 6.9. A close up view of one segment of the simulation in Figure 6.8 above.
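The hexadecimal value in this example can be checked by hand: −31 × 251 = −7781, and in 14-bit two's complement −7781 is represented as 2^14 − 7781 = 8603 = 0x219B. The one-line check below confirms this encoding; it is only a convenience check, not part of the testbench itself.

#include <cassert>
#include <cstdio>

int main() {
    int product = -31 * 251;                    // coef = 0x21 (-31), mag = 0xFB (251)
    int encoded = product & 0x3FFF;             // 14-bit two's complement bit pattern
    assert(product == -7781 && encoded == 0x219B);
    std::printf("-31 * 251 = %d -> 14-bit pattern 0x%04X\n", product, encoded);
    return 0;
}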
6.1.3. Version 2 Convolution Architecture (with k = 1)
In the process of testing Version 2 of the convolution architecture as a whole unit
for the case of k = 1 (see Figure 4.20), a few minor modifications were made to the
system such that the Post-Synthesis simulation can be completed within a reasonable time
frame. However, these modifications do not affect the system’s intended characteristics.
For instance, the intended Input Image Plane (IP) has a size of 5100×6600 pixels; for
simulation purposes the IP size was reduced to 5×60 pixels. This reduction in no way
compromises the system’s functional characteristics. The Filter Coefficient Plane (FC)
remains the same 5×5 size as in the previous chapters. Figure 6.10 below shows a test
case used to verify the functional correctness of Version 2 of the convolution architecture.
The IP has a size of 5×60, but due to the page width limit of this thesis only the first seven
columns of the IP can be shown; the same applies to the Output Image Plane (OI) in
Figure 6.10.
A C++ program was written to generate the test vectors required to program all
MAUs with the correct filter coefficients. This C++ program reads a text file containing
the filter coefficients and then generates waveform vectors for the script editor to use.
Figure 6.11 below shows the source code for the program.
[Figure 6.10 lists the Input Image Plane (decimal), the Filter Coefficient Plane (decimal), and the expected Output Image Plane (hexadecimal) for test case 1.]
Figure 6.10. Test case 1 with IP and OI of size 5×60 (however, only the first seven
columns of both IP and OI are shown due to report page width limit).
The operation of the convolution architecture can be divided into three phases.
The first phase lasts from the time the system commences operation until the first two
rows of the IP have been stored into the external memory devices; no OI is generated
during this phase. The second phase starts when the system has enough IP pixels to
commence generating OI and ends when the IP has been completely received. Finally, the
last phase begins when the system is provided with zeros as input and continues until all
the OI pixels have been generated.
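For completeness, expected OI values for test cases like these can be produced by a plain software reference model of the 2-D convolution. The sketch below assumes zero padding at the image boundaries and a centered 5×5 window, and it uses made-up IP and FC values; the exact boundary alignment and coefficient orientation used to generate Figures 6.10 and 6.18 are not restated here, so this is an illustrative checker rather than the thesis' own generator.

#include <cstdio>
#include <vector>

// Reference 2-D convolution: each OI pixel is the sum over a 5x5 window of
// FC x IP, with IP treated as zero outside its boundaries (zero padding and a
// centered window are assumed here).
std::vector<std::vector<int>> convolve(const std::vector<std::vector<int>>& ip,
                                       const std::vector<std::vector<int>>& fc) {
    int rows = static_cast<int>(ip.size());
    int cols = static_cast<int>(ip[0].size());
    std::vector<std::vector<int>> oi(rows, std::vector<int>(cols, 0));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int acc = 0;
            for (int i = 0; i < 5; ++i)
                for (int j = 0; j < 5; ++j) {
                    int rr = r + i - 2, cc = c + j - 2;   // centered window (assumed)
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        acc += fc[i][j] * ip[rr][cc];
                }
            oi[r][c] = acc;
        }
    return oi;
}

int main() {
    // Small made-up IP (5x8) and FC (5x5) purely to exercise the reference model.
    std::vector<std::vector<int>> ip(5, std::vector<int>(8));
    std::vector<std::vector<int>> fc(5, std::vector<int>(5));
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 8; ++c) ip[r][c] = r * 8 + c;
    for (int i = 0; i < 5; ++i)
        for (int j = 0; j < 5; ++j) fc[i][j] = (i + j) % 3 - 1;   // values in {-1, 0, 1}

    std::vector<std::vector<int>> oi = convolve(ip, fc);
    for (int c = 0; c < 8; ++c) std::printf("%05X ", oi[0][c] & 0xFFFFF);
    std::printf("\n");
    return 0;
}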
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>

int main()
{
    // Read the 5x5 filter coefficients from coef.txt and emit two waveform
    // vector files (v_coef.dat and v_c_reg.dat) in the script-editor format.
    ifstream in_file1;
    ofstream out_file1, out_file2;

    in_file1.open("coef.txt");
    out_file1.open("v_coef.dat");
    out_file2.open("v_c_reg.dat");

    int array[5][5];
    int time, count, temp, a, b;

    time = 40;
    count = 1;

    // Load the coefficients: each group of five values from the file fills one
    // column of array, starting with the last column.
    for (a = 4; a >= 0; a--)
    {
        for (b = 0; b < 5; b++)
        {
            in_file1 >> temp;
            cout << temp << endl;
            array[b][a] = temp;
        }
    }

    out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
    out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;

    // Emit one coefficient value (v_coef) and the matching register index
    // (v_c_reg) every 20 ns, starting at 40 ns.
    for (a = 0; a < 5; a++)
    {
        for (b = 0; b < 5; b++)
        {
            out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
                      << hex << array[a][b] << "\\H +" << endl;
            out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
                      << hex << count << "\\H +" << endl;
            time += 20;
            count++;
        }
    }

    out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
    out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;

    in_file1.close();
    out_file1.close();
    out_file2.close();

    return 0;
}
Figure 6.11. The source code for the C++ program that generates test vectors to
program the filter coefficients into MAUs.
Figure 6.12 below shows the arrangement of the Filter Coefficients (FCs) within
each MAU contained in the Arithmetic Unit (AU). The FCs appear rotated by 90 degrees
(clockwise) because the input image pixels are rotated by 90 degrees (counter-clockwise)
before flowing through the AU.
[Figure 6.12 maps MAUs 1 through 25 to filter coefficients in the order FC40, FC30, FC20, FC10, FC00, FC41, FC31, FC21, FC11, FC01, FC42, FC32, FC22, FC12, FC02, FC43, FC33, FC23, FC13, FC03, FC44, FC34, FC24, FC14, FC04, alongside the natural Arithmetic Unit ordering FC00 through FC44.]
Figure 6.12. Arrangement of the Filter Coefficients within the Arithmetic Unit.
Figure 6.13 and Figure 6.14 below show the Post-Synthesis simulation output of
the first phase of operation for the version 2 convolution architecture based on test case 1
shown in Figure 6.10. Figure 6.13 shows the programming of the FCs into respective
MAUs. As shown in the figure, the coef_regs bus acts as a write enable for each MAU
within the Arithmetic Unit (AU), and the coef bus carries the FC value given as input to
each MAU.
Figure 6.13. First phase of operation; programming of FCs into MAUs.
Figure 6.14. First phase of operation; receiving the first two rows of the IP (shown
in figure above is the beginning of the second row of the input pixels).
In Figure 6.14, the input image pixel values are shown in hexadecimal instead of
the decimal values in test case 1 shown in Figure 6.10. As can be seen in Figure 6.14, in
this phase of operation no output pixels are generated, since the system must wait until the
first two rows of the IP have been received. Figure 6.14 also shows that the system
receives five input image pixels and then writes them to one memory location.
The second phase of the system operation is shown in the following figures.
Figure 6.15 below shows that the system is generating the first three output pixels for the
second row of the OI as compared to the test case 1 shown in Figure 6.10. Under normal
operation, the output pixels will be generated after multiple stages of pipeline delays
contained within the AU as shown in Figure 6.15. Figure 6.16 shows superimposed
(timing delay not included) output pixels with their corresponding input pixels for ease of
comparison. The output pixels shown in Figure 6.16 are the first six output pixels from
the second row of the OI. Buses o1, o2, o3, o4 and o5 are the output buses from the IDS
functional unit to the AU (all bus values are shown in hexadecimal), and each bus is 40
bits wide (five input image pixels). In Figure 6.16 below, the 25 input image
pixels that correspond to each output pixel are highlighted. As can be seen from Figure
6.16, all the output pixels are as predicted in Figure 6.10.
Figure 6.15. Second phase of operation; output pixels generated.
Figure 6.16. Second phase of operation; output pixels of the second row of OI
(superimposed).
The third phase of system operation is shown in Figure 6.17 below, in which the
system generates the first six output pixels for the last row of the OI. In this phase of
operation the input image pixels have been completely received and zeros are inserted
into the system. The six output pixels shown in Figure 6.17 were compared with, and
validated against, the output pixels predicted for test case 1 shown in Figure 6.10.
Figure 6.17. Third phase of operation; output pixels of the last row of OI
(superimposed).
A second test (test case 2) was also done to further investigate the correctness of
operation of the system. Figure 6.18 shows the FCs, IP (the first seven columns) and the
OI (the first seven columns) for test case 2. As in the previous test case (Figure 6.10), due
to the page width limit, only the first seven of the 60 columns of the IP and OI are
displayed.
[Figure 6.18 lists the Input Image Plane (decimal), the Filter Coefficient Plane (decimal), and the expected Output Image Plane (hexadecimal) for test case 2.]
Figure 6.18. Test case 2; IP, FCs and expected OI (the first seven columns).
Figure 6.19, Figure 6.20 and Figure 6.21 show the post-synthesis simulation
results of all three phases of operation for Version 2 of the convolution architecture with
IP and FCs as intended in Figure 6.18. All the results from Figure 6.19 and Figure 6.20
agree with the expected results shown in Figure 6.18 above.
Figure 6.19. First phase of operation for test case 2.
Figure 6.20. Second phase of operation for test case 2; output pixels shown are the
first six of row one of OI (superimposed).
Figure 6.21. Third phase of operation for test case 2; output pixels shown are the
first six of the last row for OI (superimposed).
6.2. Post-Implementation Simulation
After version 2 of the convolution architecture, for the case of (k = 1), was
functionally validated via post-synthesis HDL simulation, its functional and timing
characteristics were studied and validated through post-implementation simulation. The
following sections will describe and depict synthesis and implementation of the system to
a particular Field Programmable Gate Array (FPGA) chip and the post-implementation
simulations that have been done to validate the version 2 convolution architecture.
6.2.1. Synthesis and Implementation of Version 2 Convolution Architecture (with k
= 1)
In general, when a system described in a HDL is synthesized to a specific FPGA
chip, the CAD packages (Xilinx Foundation Series in this case) invoke a process that
translates the system described in HDL to a specific gate level netlist. The gate level
netlist may consist of any gate level elements or functional units that are specific to a
certain family of FPGA; hence, a targeted (specific) FPGA must be specified before the
process begins. Following the synthesis process is the implementation process of the
desired system, which targets the specific FPGA chip. This process includes mapping,
placing and routing the netlist within the specific FPGA chip. Within each FPGA chip
there are a certain number of Configurable Logic Blocks (CLBs), and within each of these
CLBs there are a number of Lookup Tables (LUTs) and memory elements such as
Flip-Flops (FFs). The mapping process implements the gate level netlist on the FPGA chip
using all
the available resources. Then, the place and route process determines the best placement
and routing of all the resources used for the mapped system such that all the components
(resources) are connected according to the netlist.
For this project, a prototyping board (XSV800) manufactured by XESS Co. is
used. This protoboard features the Virtex family FPGA chip (XCV800) from Xilinx. Table
6.1 below shows a summary of the resources available within the FPGA chip on the
protoboard. There are 4704 CLBs in this specific FPGA chip and within each CLB there
are four 4-Input LUTs and four FFs. Table 6.2 below shows the resource utilization on
the XCV800 chip as version 2 of the convolution architecture (with k = 1) is implemented.
Table 6.1. Details of FPGA on the XESS protoboard.
FPGA            XCV800 (Virtex FPGA family)
System Gates    888,439
CLB Array       56×84
FF              18,816
4-Input LUT     18,816
Table 6.2. Resource utilization of Version 2 Convolution Architecture (with k = 1).
CLBs                       1,878
FF                         2,620
4-Input LUT                5,955
Equivalent System Gates    96,210
6.2.2. Version 2 Convolution Architecture (with k = 1)
The post-implementation simulations of version 2 of the convolution architecture
with k = 1 were conducted with the same test cases run in the post-synthesis simulations
in the previous section. The script file and C++ programs used in the post-synthesis
simulations were reused in the post-implementation simulation testing and validation
processes described here.
Figure 6.22 and Figure 6.23 below show the results of the second phase and third
phase of operation for post-implementation simulation of test case 1 (see Figure 6.10). As
can be seen from both of the figures, the highlighted output image pixels were as
predicted in Figure 6.10. Figure 6.24 and Figure 6.25 show the second and third phase of
operation of post-implementation simulation for test case 2 (see Figure 6.18). All the
output image pixels highlighted within these figures were in agreement with the predicted
output image pixels as shown in Figure 6.18.
Figure 6.22 and Figure 6.24 show the second phase of operation, after the first two
rows of the IP have been stored and the convolution architecture has started the
convolution process. Figure 6.23 and Figure 6.25 show the third phase of operation, in
which all of the IP has been received and zeros are inserted as input so the convolution
system can process the last two rows of the OI.
A clock frequency (clk in all figures) of 12.5 MHz has been used in all the post-implementation
simulations (Figure 6.22, Figure 6.23, Figure 6.24 and Figure 6.25) conducted thus far.
The main objective of the simulation testing described in this section was to validate the
system functionality and performance with respect to being able to generate one OI pixel
on each system clock cycle with a 5×5 FC. For the case of k = 1, the convolution
architecture met these functional and performance goals.
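To put the 12.5 MHz simulation clock in perspective, one OI pixel per clock cycle corresponds to 12.5 million output pixels per second for k = 1. The small C++ fragment below works out that rate and the steady-state time to cover one 5×60-pixel plane (the size used in the k = 3 tests later in this chapter); it is a rough estimate that ignores the few cycles of pipeline fill.

#include <cstdio>

int main() {
    const double f_clk = 12.5e6;        // simulation clock frequency (Hz)
    const int    k     = 1;             // one FC plane in this test
    const long   rows  = 5, cols = 60;  // illustrative plane size from the k = 3 tests

    const double pixels_per_sec = f_clk * k;                     // one OI pixel per clock per AU
    const double plane_time_us  = (rows * cols) / f_clk * 1e6;   // steady-state estimate

    std::printf("throughput: %.1f Mpixels/s\n", pixels_per_sec / 1e6);                 // 12.5
    std::printf("one 5x60 OI plane: ~%.0f us (plus pipeline fill)\n", plane_time_us);  // ~24 us
    return 0;
}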
Figure 6.22. Second phase of operation for test case 1 (post-implementation
simulation); output pixels of the second row of OI (superimposed).
Figure 6.23. Third phase of operation for test case 1 (post-implementation
simulation); output pixels of the last row of OI (superimposed).
Figure 6.24. Second phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of row one of OI (superimposed).
Figure 6.25. Third phase of operation for test case 2 (post-implementation
simulation); output pixels shown are the first six of the last row for OI
(superimposed).
6.2.3. Synthesis and Implementation of Version 2 Convolution Architecture (k = 3)
As shown in Figure 4.20, the architecture can be scaled up to perform k
convolutions in parallel. To validate this scalability, the version 2 convolution architecture
with three AUs instantiated was synthesized and implemented to the XCV800 FPGA chip.
Table 6.3 below shows the XCV800 chip resource utilization for this implementation. As
the system is scaled up to process three convolutions in parallel, the total number of
system gates does not increase proportionally: the equivalent system gate count for k = 3
is 173,170, compared to 96,210 for k = 1 (Table 6.2), an increase of only 80 percent
rather than a factor of three. This is because when the system is scaled up only the AU
needs to be replicated, and, as discussed earlier, it does not need to be replicated in its
entirety. Comparing the total number of CLBs utilized between the two implementations
will not yield a good measurement, since not all the elements within each CLB are
utilized.
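One simple way to quantify this observation is to fit a two-point linear model, gates(k) ≈ base + k × per-AU cost, to the k = 1 and k = 3 data points from Tables 6.2 and 6.3. The C++ sketch below does exactly that; the resulting base and per-AU figures are estimates derived only from the two tables, not values reported by the synthesis tools.

#include <cstdio>

int main() {
    // Equivalent system gates reported for k = 1 and k = 3 (Tables 6.2 and 6.3)
    const double g1 = 96210.0;
    const double g3 = 173170.0;

    // Two-point linear fit: gates(k) = base + k * per_au
    const double per_au = (g3 - g1) / (3 - 1);   // ~38,480 gates per additional AU
    const double base   = g1 - per_au;           // ~57,730 gates of shared (non-replicated) logic

    std::printf("per-AU cost : %.0f gates\n", per_au);
    std::printf("shared base : %.0f gates\n", base);
    for (int k = 1; k <= 5; ++k)                 // rough extrapolation for larger k
        std::printf("k = %d -> ~%.0f gates\n", k, base + k * per_au);
    return 0;
}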
Table 6.3. Resource utilization of Version 2 Convolution Architecture (with k = 3)
CLBs: 4,613
FF: 5,226
4-Input LUT: 15,307
Equivalent System Gates: 173,170
6.2.4. Validation of Version 2 Convolution Architecture (with k = 3)
In order to validate that version 2 of the convolution architecture can be scaled up
to include more than one AU (k > 1 in Figure 4.20) and continue to operate correctly from
a functional and performance standpoint, this section presents post-implementation
simulation results of version 2 convolution architecture operating with three instantiated
AUs. All VHDL code for version 2 of the convolution architecture with three AUs
instantiated can be found in Appendix A.
To validate the output image planes (OI) generated by version 2 convolution
architecture for k = 3, a C++ program with the ability to generate different sets of input
image planes of size 5×60 pixels, depending on the seed number given, has been written
and used. The program uses the rand function to generate random numbers from the
given seed, and the generated numbers are limited to the range 0 to 255. In addition,
the program also generates the three expected output image planes based on the three
filter coefficient planes that it reads in. The source code of this program can be found in
Appendix C. Another program, which generates all the test vectors necessary to program
each individual MAU with the filter coefficients, was also written and used. This program
reads in three FC planes contained in a text file and then generates the test vectors
according to the script editor format (source code for this program can also be found in
Appendix C).
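The actual generator source is listed in Appendix C. The fragment below is only a minimal C++ sketch of the approach described above (a seeded call to rand, pixel values limited to 0 through 255, a 5×60 plane); the output file name and layout are chosen for illustration and are not taken from the real program.

#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    // Seed taken from the command line so different IPs can be reproduced on demand.
    unsigned seed = (argc > 1) ? static_cast<unsigned>(std::atoi(argv[1])) : 1;
    std::srand(seed);

    const int rows = 5, cols = 60;                // IP plane size used in the k = 3 tests

    FILE* fp = std::fopen("ip_plane.txt", "w");   // illustrative output file name
    if (!fp) return 1;

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c)
            std::fprintf(fp, "%d ", std::rand() % 256);   // pixel values limited to 0..255
        std::fprintf(fp, "\n");
    }
    std::fclose(fp);
    return 0;
}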
Two test cases were simulated post-implementation; each test case was run with a
different single IP (generated by supplying a different seed number) and three distinct FC
sets. This was done to further validate correct operation and performance of the version 2
convolution architecture with k = 3. Figure 6.26 below shows test case number 1 with its
inputs and expected outputs (generated by the C++ program mentioned in the paragraph
above). Due to page width limitations, the figure shows only part of the IP and the
predicted OIs.
Figure 6.26. Test Case 1: FC planes, IP plane and the predicted OI planes.
Figure 6.27 and Figure 6.28 below show the results of the post-implementation
HDL simulation with the inputs of test case 1 (see Figure 6.26). Figure 6.27 shows the
output image pixels for the first row of the three OIs starting from the third output image
pixel (signals out_pxl1, out_pxl2, and out_pxl3 were output image pixels for the first OI,
second OI and third OI respectively). Figure 6.28 shows the second row of output image
pixels (starting from the third pixel) for all three OIs. All output pixels generated by
post-implementation simulation of the version 2 convolution architecture system agreed
with the expected results shown in Figure 6.26.
As can be seen from Figure 6.27 and Figure 6.28, all the input image pixels
highlighted within each rectangle correspond to the 25 input image pixels required for all
three convolutions (one output image pixel per FC plane). Again, the clock frequency
used in the post-implementation simulation runs of test case 1 in Figure 6.27 and Figure
6.28 is 12.5 MHz.
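Each of those rectangles covers the n×n = 25 input pixels that feed one output pixel position; with k = 3 AUs working on the same window, the architecture performs k × n² multiply-accumulate operations per clock cycle. The tiny computation below, using only quantities stated in this thesis (25 MAUs per AU, three AUs, 12.5 MHz), makes that figure explicit.

#include <cstdio>

int main() {
    const int n = 5;                 // FC plane size (n x n), 25 MAUs per AU
    const int k = 3;                 // number of AUs / FC planes processed in parallel
    const double f_clk = 12.5e6;     // post-implementation simulation clock (Hz)

    const int macs_per_cycle = k * n * n;                 // 75 multiply-accumulates per cycle
    const double macs_per_sec = macs_per_cycle * f_clk;   // sustained rate at this clock

    std::printf("multiply-accumulates per cycle : %d\n", macs_per_cycle);                 // 75
    std::printf("multiply-accumulates per second: %.1f million\n", macs_per_sec / 1e6);   // 937.5
    return 0;
}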
Figure 6.27. Superimposed output image pixels (start from the 3rd pixel) for first
row of the OIs for test case 1.
Figure 6.28. Superimposed output image pixels (from 3rd pixel onward) of the
second row of the OIs for test case 1.
For test case number 2, the IP generator program was given a seed number of 2
and hence a different IP plane was produced as shown in Figure 6.29. Figure 6.30 (OIs
result for third row) and Figure 6.31 (OIs result for fourth row) show the
post-implementation simulation results for test case 2. The output results from both
figures agreed with the predicted results shown in Figure 6.29. The clock frequency for
test case 2 is the same as in test case 1. All the highlighted input image pixels within each
of the rectangles correspond to the 25 input image pixels required for each output
pixel generated. Each individual OI pixel is generated within a single system clock cycle.
Figure 6.29. Test case 2: FC planes, IP plane and the predicted OI planes.
Validation of version 2 of the convolution architecture has been accomplished
through post-synthesis and post-implementation HDL simulation utilizing the Xilinx
Foundation CAD software packages. All the simulations were done with the system
implemented to a Xilinx Virtex FPGA (XCV800). As the system is scaled up to process k
convolutions in parallel, the hardware grows approximately linearly with k, since only the
AUs are replicated. A graph of the equivalent system gate count versus the number of FC
planes is shown in Figure 6.32 below. Since all the simulation results are correct and as
desired, the version 2 convolution architecture is functionally and performance validated
in that it can correctly generate three OI pixels (OI1, OI2, and OI3) within one system
clock cycle (with k = 3).
Figure 6.30. Superimposed output image pixels (start from the 3rd pixel) for third
row of the OIs for test case 2.
Figure 6.31. Superimposed output image pixels (from 3rd pixel onward) of the fourth
row of the OIs for test case 2.
(Plot: Equivalent System Gates versus Number of FC planes; vertical axis 0 to 200,000
gates, horizontal axis 0 to 3.5 FC planes.)
Figure 6.32. A plot of equivalent system gates versus number of FC planes.
Chapter 7
Hardware Prototype Development and Testing
Hardware prototype development and testing were done to experimentally
validate the functionality and performance of the convolution architecture. Ideally, the
convolution architecture would be implemented in ASIC technology with external SRAM
(Data Memory) as shown in Figure 7.1 below. In the figure, b and l denote the bus
widths of the external SRAM address bus and of an output image pixel, respectively. CE,
OE, and WE are the chip enable, output enable, and write enable control signals for the
external SRAM. For example, to implement the convolution architecture with three 5×5
FC planes, a total of 113 IO (Input/Output) pins are needed on the FPGA or ASIC.
(Diagram: external SRAMs connect to the FPGA or ASIC implementation of the
convolution architecture through the CE, OE, and WE control signals, an address bus of
width b, and a data bus of width n×d; the chip drives k output image planes OI1 through
OIk, each l bits wide.)
Figure 7.1. Convolution Architecture hardware implementation.
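The 113-pin figure quoted above can be reproduced by totaling the buses in Figure 7.1 for k = 3, n = 5, and 8-bit pixels (d = 8). The C++ fragment below performs this bookkeeping under the assumption of a 13-bit address bus b, which is the width that makes the total come out to exactly 113; the text does not state b explicitly, so treat that value as illustrative.

#include <cstdio>

int main() {
    const int k = 3;        // number of FC planes (and OI outputs)
    const int n = 5;        // FC plane size (n x n)
    const int d = 8;        // input pixel width in bits
    const int l = 19;       // output image pixel width in bits
    const int b = 13;       // assumed external SRAM address bus width (not stated in the text)

    const int data_pins    = n * d;   // n x d data bus from the external SRAM
    const int output_pins  = k * l;   // k output image pixel buses
    const int control_pins = 3;       // CE, OE, WE
    const int total        = data_pins + output_pins + control_pins + b;

    std::printf("data %d + outputs %d + control %d + address %d = %d IO pins\n",
                data_pins, output_pins, control_pins, b, total);   // 40 + 57 + 3 + 13 = 113
    return 0;
}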
To further validate the functionality and performance correctness of the
convolution architecture, hardware emulation of version 2 of the convolution architecture
is done through the development and testing of an FPGA-based prototype. A hardware
prototyping board manufactured by the XESS Corporation [17], which features Xilinx
Virtex FPGA (XCV800) technology, was available and utilized. Figure 7.2 below shows a
picture of the XSV-800 prototype board. Even though the XCV800 FPGA has enough IO
pins to handle the convolution architecture configuration shown in Figure 7.1 above, the
SRAM on the prototyping board has a lower data bandwidth than desired. Because of
this, the Data Memory was inferred (emulated) within the FPGA.
Figure 7.2. XSV-800 prototype board featuring Xilinx Virtex 800 FPGA (picture
obtained from XESS Co. website, http://www.xess.com).
The hardware emulation is carried out by programming the FPGA with the
convolution architecture through the parallel port of a computer. The Xilinx Foundation
Series CAD software package [18] was utilized to generate the bit stream file (the FPGA
configuration bit stream) necessary to program the FPGA with the desired convolution
architecture hardware description. A software utility package, XSTOOLs, is provided by
the XESS Corporation for use with the prototyping board. The package includes programs
for bit stream download (FPGA programming), clock frequency setting, and on-board
SRAM content retrieval or initialization.
7.1. Board Utilization Modules and Prototype Setup
As shown in Figure 7.2, the prototyping board contains many auxiliary parts such
as push buttons, LEDs, SRAMs, and a parallel port. To utilize these parts, the FPGA must
be programmed with the appropriate driver or module; these drivers or modules are
implemented within the FPGA since all of the parts are connected to it. Thus, for the
purpose of hardware emulation of the convolution architecture system, a means of
supplying the input image plane (IP) and storing the output image plane (OI) is necessary
and had to be developed. An internal Block RAM within the FPGA is used to provide the
convolution system with the input image pixels. The internal RAM is initialized with input
image pixels when the system is synthesized and implemented. Figure 7.3 below shows a
pictorial view of the prototyping hardware. All functional blocks or modules within the
FPGA were implemented in VHDL.
(Block diagram: the FPGA contains the Data Memory, Convolution System, FC
Programming, Block RAM, and SRAM driver modules; the parallel port supplies filter
coefficients over 6 lines, input pixels are 8 bits wide, and 19 output pixel/data lines, 21
address lines, and 6 control signals go to the on-board SRAM; an external clock and an
SRAM clock are also shown.)
Figure 7.3. Top level view of the prototyping hardware.
In Figure 7.3, there is an FC Programming module. This module, as the name
implies, is responsible for initializing all of the filter coefficients within the convolution
system. The filter coefficients are supplied from the computer through the parallel port.
A C++ program (which can be found in Appendix D) was written to read in the filter
coefficients for each FC plane from a file and send them through the parallel port to the
module. The FC Programming module receives two bytes of data from the parallel port
to program one filter coefficient register. In addition, the FC Programming module also
sends the output image plane to the external SRAM for storage and later analysis. Due to
the limited width of the external SRAM data bus, only one OI can be retained from each
run.
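The actual download program is listed in Appendix D. The C++ fragment below is only a sketch of the two-byte-per-coefficient protocol described above; the write_lpt_byte helper is a stand-in for the real parallel-port access code, and the sequential position numbering is purely illustrative (the real program encodes the target MAU position as shown in Figure 6.12).

#include <cstdio>

// Stand-in for the real parallel-port write; here it just logs the byte.
// The actual program in Appendix D drives the PC parallel port directly.
void write_lpt_byte(unsigned char value) {
    std::printf("LPT <- 0x%02x\n", value);
}

// Two bytes per coefficient: the first selects the target register within the AU,
// the second carries the coefficient value itself.
void program_coefficient(unsigned char position, unsigned char coefficient) {
    write_lpt_byte(position);
    write_lpt_byte(coefficient);
}

int main() {
    FILE* fp = std::fopen("coef.txt", "r");   // filter coefficient planes, as in Figure 6.26
    if (!fp) return 1;

    int value;
    unsigned char position = 0;
    while (std::fscanf(fp, "%d", &value) == 1)
        program_coefficient(position++, static_cast<unsigned char>(value));

    std::fclose(fp);
    return 0;
}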
Also shown in the figure is the Block RAM module within the FPGA chip. This
module is responsible for providing the convolution system with the input image pixels
and is implemented in VHDL; it makes use of the internal RAM within the FPGA to store
the pixels. To make the VHDL file for this module easy to generate, a C++ program was
written to produce the file from a given random number seed. By using the same seed
numbers as in the previous chapter, the same input image pixels can be regenerated. An
example of the generated VHDL file is shown in Figure 7.4 below. The C++ program can
be found in Appendix D.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity IN_RAM is
port( clk: in std_logic;
rst: in std_logic;
req: in std_logic;
dout: out std_logic_vector(7 downto 0) );
end entity IN_RAM;
architecture STRUCT of IN_RAM is
component RAMB4_S8 is
port(
DI: in std_logic_vector(7 downto 0);
EN: in std_logic;
WE: in std_logic;
RST: in std_logic;
CLK: in std_logic;
ADDR: in std_logic_vector(8 downto 0);
DO: out std_logic_vector(7 downto 0) );
end component RAMB4_S8;
attribute INIT_00: string;
attribute INIT_00 of IRM: label is
"9fb6add75e5a290b8713f55f888c72ce37e06d31362b091dd779a254c5b24a30";
attribute INIT_01: string;
attribute INIT_01 of IRM: label is
"80b20000c7212be922035da80f3273676826daf8fd91c35f0b99e093f209c61c";
attribute INIT_02: string;
attribute INIT_02 of IRM: label is
"2f2d2f198cec76303797682ed5553d18eb5345050260bccdc4eed36fc92e4910";
attribute INIT_03: string;
attribute INIT_03 of IRM: label is
"e4392f650000e90330e84c421110aa1db0d9d34e544884c1b5a7ce5aeaff060d";
attribute INIT_04: string;
attribute INIT_04 of IRM: label is
"29af02af2cf91bfc241bd2ada5d75262f228d437b5d0fc6e8f18bb82b5216f9e";
attribute INIT_05: string;
attribute INIT_05 of IRM: label is
"8f552a661c42000095af0ea4bb1cfdb88c34cdf122f8ae8904447e84657c6a00";
attribute INIT_06: string;
attribute INIT_06 of IRM: label is
"e1ab005cf16e41b0d174d275a95fa85c177eb6d6ec1b2ef67cd891e33f88cfb7";
attribute INIT_07: string;
attribute INIT_07 of IRM: label is
"679780f0f91cba3e000007b707bc25aba634015c6ab3c55053130fd44d7f9ee8";
attribute INIT_08: string;
attribute INIT_08 of IRM: label is
"53bb6d2e0a67fd7071d852926a66ce6a617485308dc35ad9177c391f8f32a8b3";
attribute INIT_09: string;
attribute INIT_09 of IRM: label is
"00000000000000000000000066896a4dc61c8c7b1434242dcc20e505cb7aa7a9";
signal din : std_logic_vector(7 downto 0);
signal addr: unsigned(8 downto 0);
signal adr : std_logic_vector(8 downto 0);
signal en : std_logic;
signal we : std_logic;
begin
L1: din <= (others=>'0');
L2: en  <= '1';
L3: we  <= '0';
L4: adr <= std_logic_vector(addr);
P1: process(clk, rst, req) is
begin
if (rst = '1') then
addr <= (others=>'0');
elsif (clk'event and clk = '1') then
if (req = '1') then
addr <= addr + 1;
end if;
end if;
end process P1;
IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk,
ADDR=>adr, DO=>dout);
end architecture STRUCT;
Figure 7.4. Example of a VHDL file for creating an internal Block RAM containing
input image pixels for the convolution system (a seed number of 1 is provided to the
program).
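The generator itself is listed in Appendix D. The C++ sketch below shows the general idea behind the INIT_xx strings seen in Figure 7.4: 32 random pixel bytes are packed into each 64-hex-digit attribute string. The byte ordering within the string (lowest address at the right-hand end) and the exact output format are assumptions for illustration; the real program emits the complete entity and architecture of Figure 7.4.

#include <cstdio>
#include <cstdlib>

int main(int argc, char* argv[]) {
    unsigned seed = (argc > 1) ? static_cast<unsigned>(std::atoi(argv[1])) : 1;
    std::srand(seed);

    const int rows = 5, cols = 60;          // 300 input image pixels
    unsigned char pixels[rows * cols];
    for (int i = 0; i < rows * cols; ++i)
        pixels[i] = static_cast<unsigned char>(std::rand() % 256);

    // Each INIT_xx attribute of a RAMB4_S8 block holds 32 bytes (64 hex digits).
    const int total_bytes = rows * cols;
    for (int init = 0; init * 32 < total_bytes; ++init) {
        std::printf("attribute INIT_%02X: string;\n", init);
        std::printf("attribute INIT_%02X of IRM: label is\n\"", init);
        for (int b = 31; b >= 0; --b) {                     // assumed: lowest address at the right
            int idx = init * 32 + b;
            std::printf("%02x", idx < total_bytes ? pixels[idx] : 0);
        }
        std::printf("\";\n");
    }
    return 0;
}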
In addition to the two modules mentioned above, there is another module
responsible for controlling the external SRAM (a part that is external to the FPGA). This
module generates progressively increasing addresses so that the output image pixels are
stored in ascending order, along with the other signals, such as cen (chip enable) and wen
(write enable), required for proper operation of the SRAM.
All modules mentioned in this section were developed and implemented as VHDL
descriptions. The VHDL files for all of these modules can be found in Appendix E.
7.2. Hardware Prototyping Flow
After a design is synthesized and implemented with the CAD packages, a
bit stream file (FPGA configuration bit file) for a specific FPGA chip is generated. In this
case, the bit stream file contains the configuration bits for the convolution architecture as
well as the auxiliary modules, generated for a Xilinx XCV800 FPGA chip. Next, the bit
stream file is programmed into the FPGA through the parallel port of a computer. For this
particular XESS Co. prototyping board, an FPGA configuration (download) program,
gxsload, is provided. Figure 7.5 shows the graphical interface of the gxsload program
once it is executed.
Figure 7.5. FPGA configuration and bit stream download program, gxsload from
XESS Co.
After the FPGA chip has been configured with the convolution architecture, it is
ready for experimental hardware testing and validation of the convolution architecture.
Since the input image pixels are stored within the Block RAM module inside the FPGA,
the only time the system requires input external to the FPGA is for filter coefficient
programming. This is done through hardware (the FC Programming module) and
software. The software program (found in Appendix D) was written in C++ to read in
the filter coefficients from a text file, coef.txt (the same file as shown in Figure 6.26), and
send all the data through the parallel port to the convolution architecture. This file also
specifies which OI plane the external SRAM stores during each experimental run. Figure
7.6 below shows a segment of the verbose output from an execution of the FC
configuration program. The program enters the filter coefficients in the order shown in
Figure 6.12. For each filter coefficient, two bytes of data are sent through the parallel
port: the first byte indicates the position of the filter coefficient within the AU, and the
following byte is the filter coefficient itself.
Figure 7.6. Execution of the FC configuration program.
Next, the convolution process commences when the start push button on the
prototyping board is pressed (one of the push buttons on the prototyping board is mapped
as the start signal for the convolution system). Since the execution of the convolution
architecture is transparent to the user, an LED on the prototyping board is mapped to the
inverse of the SRAM's write enable signal (the inverted wen signal in the highest level of
the VHDL description). Consequently, once the convolution architecture finishes its
execution, the SRAM's write enable line is pulled low and the LED lights.
Then, the output image pixels stored in the external SRAM are retrieved using the
gxsload program, the same program used to download the FPGA configuration file.
Figure 7.7 shows the graphical interface of the gxsload program when it is used to upload
SRAM contents to a file; the uploaded SRAM content is stored in Intel hex file format.
Figure 7.8 below shows the uploaded SRAM contents in a file. There are two banks of
SRAM on the prototyping board, a left bank and a right bank. Each bank has a 16-bit data
bus and a 19-bit address bus. Since the output image pixels are 19 bits wide, both banks
of the SRAM are utilized.
As is evident from Figure 7.8 below, it is tedious to trace and compare the
uploaded SRAM contents against the expected output results. As mentioned in the
previous section, a program was written to generate the theoretically correct output
image pixels (shown in Figure 6.26). In order to compare the uploaded results with the
known correct results efficiently, a C++ program was written to parse the Intel hex file
and check it against the theoretically correct output. The source code for this program
can be found in Appendix D.
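That comparison program is listed in Appendix D. The fragment below sketches only the Intel hex parsing step it depends on: each record begins with a colon, followed by a two-digit byte count, a four-digit address, a two-digit record type, the data bytes, and a checksum. Reassembling 19-bit OI pixels from the two SRAM banks and checking them against the expected OI plane is omitted here.

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

// Converts the two hex characters starting at position p of line into one byte value.
static unsigned byte_at(const std::string& line, size_t p) {
    return std::strtoul(line.substr(p, 2).c_str(), nullptr, 16);
}

int main(int argc, char* argv[]) {
    std::ifstream hex(argc > 1 ? argv[1] : "sram_dump.hex");   // illustrative file name
    std::vector<unsigned char> data;

    std::string line;
    while (std::getline(hex, line)) {
        if (line.empty() || line[0] != ':') continue;          // skip non-record lines
        unsigned count = byte_at(line, 1);                     // number of data bytes in record
        unsigned type  = byte_at(line, 7);                     // 00 = data record
        if (type != 0x00) continue;                            // ignore EOF and other records
        for (unsigned i = 0; i < count; ++i)
            data.push_back(static_cast<unsigned char>(byte_at(line, 9 + 2 * i)));
    }

    std::printf("extracted %zu data bytes from the SRAM dump\n", data.size());
    // The real program reassembles 19-bit OI pixels from these bytes and
    // compares them against the expected OI plane (Figure 6.26 or 6.29).
    return 0;
}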
Figure 7.7. Upload SRAM content using gxsload utility, the high address indicates
the upper bound of the SRAM address space whereas the low address indicates the
lower bound of the SRAM address space.
Figure 7.8. Uploaded SRAM contents stored in a file (Intel hex file format). There
are two segments because the program wrote the right bank of the SRAM (16 bits)
first and the left bank of the SRAM (the 16 MSB bits) next.
7.3. Test Cases
To validate correct functional and performance operation of the FPGA-based
hardware prototype of the convolution architecture, two test cases were run. The
convolution architecture was run at a 2 kHz clock frequency for these test cases;
maximum clock rate hardware prototype performance was not a goal for these two tests.
The performance metric of interest is whether the prototype can simultaneously convolute
one IP with k (n×n) FCs and generate k OI pixels on each system clock cycle. The two
test cases are those shown in Figure 6.26 and Figure 6.29 of the previous section. Since
each test case has three different Filter Coefficient (FC) planes, three experimental runs
must be carried out. Three runs are needed, even though all three OI planes are generated
in each run, because of the SRAM data bus bandwidth limitation: each OI pixel requires
19 bits and the SRAM data bus is only 32 bits wide.
Figure 7.9 (first OI plane), Figure 7.10 (second OI plane) and Figure 7.11 (third
OI plane) show the results obtained from the SRAM after each experimental run. The
grayed areas of the figures are the Intel hex file header and checksum, and the highlighted
boxes with arrows projected to the bottom of each figure mark the first OI pixel of the
respective OI plane. To obtain the second OI pixel, slide the window to the next column
as marked. Comparison of all three obtained OI planes with the results shown in Figure
6.26 reveals that they match. The comparison program, executed on each of the
experimental runs, confirmed that the obtained results are identical to the expected results.
Figure 7.9. SRAM contents retrieved for first OI plane for test case 1.
Figure 7.10. SRAM contents retrieved for second OI plane for test case 1.
Figure 7.11. SRAM contents retrieved for third OI plane for test case 1.
Figure 7.12 (first OI plane), Figure 7.13 (second OI plane) and Figure 7.14 (third
OI plane) below show the experimental runs for test case 2. Again, the OI planes retrieved
from each experimental run were compared, using the comparison program, to the
expected results shown in Figure 6.29 of the previous section. All three OI planes matched
the expected results, which again validates the correctness of the convolution architecture.
From the results obtained from testing of the hardware prototype, the functionality
and performance correctness of version 2 of the convolution architecture is thus further
validated.
Figure 7.12. SRAM contents retrieved for first OI plane for test case 2.
Figure 7.13. SRAM contents retrieved for second OI plane for test case 2.
Figure 7.14. SRAM contents retrieved for third OI plane for test case 2.
Chapter 8
Conclusions
In summary, the main objective of this thesis research project was to develop the
architecture for, and to design, validate, and build a hardware prototype of, a convolution
architecture capable of processing an input image plane such that an output image pixel is
obtained every clock cycle when convoluting with one FC plane. In addition, the
convolution architecture needed to be scalable in both the filter coefficient plane size
(kernel size) and the number of filter coefficient planes that can be simultaneously
processed. The motivations behind this scalability were, first, that the convolution
architecture can be tailored to any kernel size and still produce one output image pixel
per clock cycle, and second, that k kernels of any size can be placed within the
architecture, which then has the functional and performance capability to output k output
image pixels on each system clock cycle.
The developed convolution architecture was captured using the VHDL hardware
description language. Xilinx Foundation Series CAD software packages were used to
synthesize and implement the architecture to an FPGA chip. Before the architecture was
prototyped on the prototyping board for experimental testing, it was functionally and
performance validated through post-synthesis and post-implementation HDL software
simulations. Experimental testing of the architecture was done on a prototyping board
that featured a Virtex family FPGA.
Post-synthesis and post-implementation HDL software simulation, together with
experimental testing of the hardware prototype, showed that the implemented prototype
of the convolution architecture was indeed functionally correct as intended. It is felt that
if the convolution architecture were implemented in high speed ASIC "production"
technology with a high speed external SRAM, the goal of convoluting a single IP pixel
with k (n×n) FCs and generating k OI pixels within a clock cycle of 7.3 ns could be
achieved. If required, and as indicated earlier, pipelining the multiply unit within the
MAUs of the AUs would further increase overall system performance.
As a side note, a convolution program with three 5×5 FC planes and a 5100×6600
IP plane was run on a general purpose processor (an AMD Athlon at 650 MHz) for a
"loose" comparison with the performance of the new convolution architecture. The
processor used on average 0.4 second of system time to convolute the one IP plane with
the three FC planes, which indicates that a processor running at around 260 MHz would
be able to meet the "production" requirements for the new convolution architecture
system. However, the cost/performance ratio of the general purpose processor will be
higher than that of the version 2 convolution architecture implemented in ASIC
technology, considering the die size of both architectures (the convolution architecture
has less than ten percent of the general purpose processor's transistor count).
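As a back-of-the-envelope check on the ASIC side of this comparison, the estimate below applies the targeted 7.3 ns cycle time to the same 5100×6600 IP: with k = 3 OI pixels produced in parallel each cycle, roughly one cycle is needed per output pixel position, which works out to about a quarter of a second, in the same range as the 0.4 s measured on the Athlon. Pipeline fill and any memory stalls are ignored.

#include <cstdio>

int main() {
    const double rows = 5100.0, cols = 6600.0;   // IP size used in the software comparison
    const double t_cycle = 7.3e-9;               // targeted ASIC clock cycle time (s)

    // One cycle per output pixel position; k = 3 OI pixels emerge in parallel each cycle.
    const double cycles  = rows * cols;
    const double seconds = cycles * t_cycle;

    std::printf("output pixel positions : %.0f\n", cycles);     // ~33.7 million
    std::printf("estimated time         : %.3f s\n", seconds);  // ~0.246 s vs 0.4 s on the Athlon
    return 0;
}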
In conclusion, the best cost/performance ratio can be obtained by implementing
the new convolution architecture in "production" ASIC technology, which should allow
the system clock of the convolution architecture to reach the desired cycle time of 7.3 ns
or less. The primary factors that determine the performance of the new convolution
architecture are the speed of the implementation technology, the optimization of the
layout/placement of the implementation to reduce longest-path delays, and the degree of
pipelining one chooses to design into the system.
Appendix A
VHDL Code for Version 2 Discrete Convolution Architecture (Figure 4.20 for k = 3)
1. Version 2 Convolution Architecture
-- sys.vhd (Top Level System of Version 2 Convolution Architecture)
library IEEE;
use IEEE.std_logic_1164.all;
entity SYS is
port( clk, rst, str: in std_logic;
d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF)
coef: in std_logic_vector(5 downto 0); --(FCs from parallel port)
ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp)
au_sel: in std_logic_vector(1 downto 0); --(AU select from pp)
o_sel: in std_logic;                     --(Output config pin from pp)
req: out std_logic;                      --(Controller -> FIFO)
sram_w: out std_logic;                   --(SYS -> SRAM)
d_out: out std_logic_vector(18 downto 0) );
end entity SYS;
architecture STRUCT of SYS is
component CTR is
port( clk, rst, str, bor, eoc, sds, rgt: in std_logic;
f_sel: out std_logic;
z_pad: out std_logic;
reg_sel: out std_logic_vector(2 downto 0);
en_w: out std_logic_vector(1 downto 0);
en_sf: out std_logic_vector(1 downto 0);
z_input: out std_logic;
c_inc: out std_logic_vector(1 downto 0);
rot: out std_logic;
r_inc: out std_logic;
req: out std_logic;
r_w: out std_logic;
s_inc: out std_logic;
sd_inc: out std_logic;
ans: out std_logic );
end component CTR;
component RCNT is
port( clk, rst, r_inc, sd_inc: in std_logic;
eoc, sds, rgt: out std_logic );
end component RCNT;
component REG_A is
generic( n: integer := 8; -- denotes the data width
d: integer := 5 );-- denotes the number of registers
port( clk, rst, z_pad, z_input: in std_logic;
reg_sel: in std_logic_vector(2 downto 0);
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector((n*d)-1 downto 0);
ids_out: out std_logic_vector(n-1 downto 0) );
end component REG_A;
component C_BANK1 is
port( clk, rst, f_sel, z_pad: in std_logic;
en_sf, en_w: in std_logic_vector(1 downto 0);
ld_reg: in std_logic_vector(2 downto 0);
bseq: in std_logic_vector(3 downto 0);
d_in: in std_logic_vector(39 downto 0);
d_out: out std_logic_vector(31 downto 0));
end component C_BANK1;
component RAM is
port( wclk, r_w: in std_logic;
d_in: in std_logic_vector(39 downto 0);
addr: in std_logic_vector(6 downto 0);
d_out: out std_logic_vector(39 downto 0) );
end component RAM;
component MEMPTR is
port( clk, rst, rot, s_inc: in std_logic;
inc: in std_logic_vector(1 downto 0);
reg_sel: in std_logic_vector(2 downto 0);
bor: out std_logic;
-- beginning of a new row
bseq: out std_logic_vector(3 downto 0);
add_out: out std_logic_vector(6 downto 0) );
end component MEMPTR;
component IDS is
port( clk, rst, ans: in std_logic;
ids_in: in std_logic_vector(39 downto 0);
o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0));
end component IDS;
component DU is
port( clk, rst: in std_logic;
ids_in: in std_logic_vector(39 downto 0);
du_out: out std_logic_vector(39 downto 0));
end component DU;
component AU is
port(
clk, rst: in std_logic;
du0, du1, du2, du3, du4: in std_logic_vector(39 downto 0);
p_en: in std_logic;
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
out_pxl: out std_logic_vector(18 downto 0);
ovf: out std_logic);
end component AU;
-- Internal signals to connect components
signal r_inc, sd_inc, eoc, sds, rgt: std_logic; --(Row counter <-> Controller)
-- signal req                   : std_logic;
signal z_pad, z_input           : std_logic; --(Controller -> REG_A)
signal rot, s_inc, bor          : std_logic; --(Controller -> MEMPTR)
signal r_w                      : std_logic; --(Controller -> RAM)
signal f_sel                    : std_logic; --(Controller -> BANK_1)
signal ans                      : std_logic; --(Controller -> IDS)
signal a1, a2, a3               : std_logic; --(Controller->a1->a2->a3->ans)
signal ovf1, ovf2, ovf3         : std_logic; --(Overflow from AUs)
--(Controller -> MEMPTR)
signal c_inc                    : std_logic_vector(1 downto 0);
--(Controller -> BANK_1)
signal en_w, en_sf              : std_logic_vector(1 downto 0);
--(Controller -> REG_A, MEMPTR)
signal reg_sel                  : std_logic_vector(2 downto 0);
--(REG_A -> RAM)
signal rega_ram                 : std_logic_vector(39 downto 0);
--(REG_A -> IDS)
signal rega_ids                 : std_logic_vector(7 downto 0);
--(MEMPTR -> C_BANK1)
signal bseq                     : std_logic_vector(3 downto 0);
--(C_BANK1 -> IDS)
signal cbank_ids                : std_logic_vector(31 downto 0);
--(RAM -> C_BANK1)
signal ram_cbank                : std_logic_vector(39 downto 0);
--(MEMPTR -> RAM) Ram address
signal memptr_ram               : std_logic_vector(6 downto 0);
--(IDS -> DUs)
signal o1, o2, o3, o4, o5       : std_logic_vector(39 downto 0);
--(Combined output from REG_A and C_BANK1 into ids_in)
signal ids_in                   : std_logic_vector(39 downto 0);
--(DUs -> AUs)
signal du_au1, du_au2, du_au3   : std_logic_vector(39 downto 0);
signal du_au4, du_au5           : std_logic_vector(39 downto 0);
--(AUs -> Output Pixels)
signal out_pxl1                 : std_logic_vector(18 downto 0);
signal out_pxl2                 : std_logic_vector(18 downto 0);
signal out_pxl3                 : std_logic_vector(18 downto 0);
--(AU's select line for programming)
signal a_sel                    : std_logic_vector(2 downto 0);
--(Output select register for holding output selection from parallel port)
signal op_sel_reg               : std_logic_vector(1 downto 0);
--(ans delays signals)
signal ds                       : std_logic_vector(13 downto 0);
begin
-- Main Controller of Version 2 Convolution Architecture
U0: CTR port map(clk=>clk, rst=>rst, str=>str, bor=>bor, eoc=>eoc, sds=>sds,
rgt=>rgt, f_sel=>f_sel, z_pad=>z_pad, reg_sel=>reg_sel, en_w=>en_w,
en_sf=>en_sf, z_input=>z_input, c_inc=>c_inc, rot=>rot,
r_inc=>r_inc, req=>req, r_w=>r_w, s_inc=>s_inc, sd_inc=>sd_inc,
ans=>a1);
-- Row counter for the main controller
U1: RCNT port map(clk=>clk, rst=>rst, r_inc=>r_inc, sd_inc=>sd_inc, eoc=>eoc,
sds=>sds, rgt=>rgt);
-- Register A of the DM_IF
U2: REG_A port map(clk=>clk, rst=>rst, z_pad=>z_pad, z_input=>z_input,
reg_sel=>reg_sel, d_in=>d_in, d_out=>rega_ram, ids_out=>rega_ids);
-- C_BANK1 of the DM_IF
U3: C_BANK1 port map(clk=>clk, rst=>rst, f_sel=>f_sel, z_pad=>z_pad, en_sf=>en_sf,
en_w=>en_w, ld_reg=>reg_sel, bseq=>bseq, d_in=>ram_cbank,
d_out=>cbank_ids);
-- RAM
U4: RAM port map(wclk=>clk, r_w=>r_w, d_in=>rega_ram, addr=>memptr_ram,
d_out=>ram_cbank);
-- MEMPTR (memory pointer)
U5: MEMPTR port map(clk=>clk, rst=>rst, rot=>rot, s_inc=>s_inc, inc=>c_inc,
reg_sel=>reg_sel, bor=>bor, bseq=>bseq, add_out=>memptr_ram);
-- IDS
L1: ids_in <= rega_ids & cbank_ids; -- Combine signals output from REG_A and CBANK
u6: IDS port map(clk=>clk, rst=>rst, ans=>ans, ids_in=>ids_in, o1=>o1, o2=>o2,
o3=>o3, o4=>o4, o5=>o5);
-- This process creates delays so that the outputs from the IDS appear at
-- the same time. It also takes care of the boundary outputs from line to line.
D2: process (clk, rst, a1, a2) is
begin
if (rst = '1') then
a2 <= '0';
a3 <= '0';
ans <= '0';
elsif (clk'event and clk = '1') then
a2 <= a1;
a3 <= a2;
ans <= a3;
end if;
end process D2;
-- This process propagates the ans signal through all the pipeline stages to
-- the SRAM writer interface so it can start "recording".
D3: process (clk, rst, ans, ds) is
begin
if (rst = '1') then
ds <= (others => '0');
elsif (clk'event and clk = '1') then
ds(0) <= ans;
ds(1) <= ds(0);
ds(2) <= ds(1);
ds(3) <= ds(2);
ds(4) <= ds(3);
ds(5) <= ds(4);
ds(6) <= ds(5);
ds(7) <= ds(6);
ds(8) <= ds(7);
ds(9) <= ds(8);
ds(10) <= ds(9);
ds(11) <= ds(10);
ds(12) <= ds(11);
ds(13) <= ds(12);
end if;
end process D3;
L2: sram_w <= ds(13) or ds(12) or ds(11) or ds(10);
-- DUs
u7: DU port map(clk=>clk, rst=>rst, ids_in=>o1, du_out=>du_au1);
u8: DU port map(clk=>clk, rst=>rst, ids_in=>o2, du_out=>du_au2);
u9: DU port map(clk=>clk, rst=>rst, ids_in=>o3, du_out=>du_au3);
u10: DU port map(clk=>clk, rst=>rst, ids_in=>o4, du_out=>du_au4);
u11: DU port map(clk=>clk, rst=>rst, ids_in=>o5, du_out=>du_au5);
-- AUs
u12: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3,
du3=>du_au4, du4=>du_au5, p_en=>a_sel(0), ld_reg=>ld_reg,
coef=>coef, out_pxl=>out_pxl1, ovf=>ovf1);
u13: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3,
du3=>du_au4, du4=>du_au5, p_en=>a_sel(1), ld_reg=>ld_reg,
coef=>coef, out_pxl=>out_pxl2, ovf=>ovf2);
u14: AU port map(clk=>clk, rst=>rst, du0=>du_au1, du1=>du_au2, du2=>du_au3,
du3=>du_au4, du4=>du_au5, p_en=>a_sel(2), ld_reg=>ld_reg,
coef=>coef, out_pxl=>out_pxl3, ovf=>ovf3);
AUP_SEL: process (au_sel) is
begin
case (au_sel) is
when "01" => a_sel <= "001";
when "10" => a_sel <= "010";
when "11" => a_sel <= "100";
when others => a_sel <= "000";
end case;
end process AUP_SEL;
-- Testing Purposes
-- Output selection logic
OP_SEL: process (clk, rst, o_sel) is
begin
if (rst = '1') then
op_sel_reg <= (others => '0');
elsif (clk'event and clk = '1') then
if (o_sel = '1') then
op_sel_reg <= coef(1 downto 0);
else
op_sel_reg <= op_sel_reg;
end if;
end if;
end process OP_SEL;
-- Output selection mux
OP_D: process (op_sel_reg, out_pxl1, out_pxl2, out_pxl3) is
begin
case (op_sel_reg) is
when "00" => d_out <= out_pxl1;
when "01" => d_out <= out_pxl2;
when "10" => d_out <= out_pxl3;
when others => d_out <= out_pxl1;
end case;
end process OP_D;
end architecture STRUCT;
2. Controller Unit (CU)
-- ctr.vhd (Controller)
library IEEE;
use IEEE.std_logic_1164.all;
entity CTR is
port( clk, rst, str, bor, eoc, sds, rgt: in std_logic;
f_sel: out std_logic;
z_pad: out std_logic;
reg_sel: out std_logic_vector(2 downto 0);
en_w: out std_logic_vector(1 downto 0);
en_sf: out std_logic_vector(1 downto 0);
z_input: out std_logic;
c_inc: out std_logic_vector(1 downto 0);
rot: out std_logic;
r_inc: out std_logic;
req: out std_logic;
r_w: out std_logic;
s_inc: out std_logic;
sd_inc: out std_logic;
ans: out std_logic );
end entity CTR;
architecture BEHAVIORAL of CTR is
type statetype is (st0, st1, st2, st3, st4, st5, st6, st7, st8, st9, st10,
st11, st12, st13, st14, st15, st16, st17, st18, st19, st20,
st21, st22, st23, st24, st25, st26, st27, st28, st29, st30,
st31, st32, st33, st34, st35, st36, st37, st38, st39, st40,
st41, st42, st43, st44);
signal c_st, n_st: statetype;
signal tog        : std_logic;
begin
NXTSTPROC: process (c_st, str, bor, eoc, rgt, tog, sds) is
begin
case c_st is
when st0 => if (str = '1') then
n_st <= st1;
else
n_st <= st0;
end if;
when st1 => if (rgt = '0' and tog = '0') then
n_st <= st9;
elsif (rgt = '0' and tog = '1') then
n_st <= st2;
elsif (rgt = '1' and tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
when st2 => n_st <= st3;
when st3 => n_st <= st4;
when st4 => n_st <= st5;
when st5 => n_st <= st6;
when st6 => if (bor = '1') then
n_st <= st7;
else
if (rgt = '0' and tog = '0') then
n_st <= st9;
elsif (rgt = '0' and tog = '1') then
n_st <= st2;
elsif (rgt = '1' and tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
end if;
when st7 => n_st <= st8;
when st8 => if (rgt = '0' and tog = '0') then
n_st <= st9;
elsif (rgt = '0' and tog = '1') then
n_st <= st2;
elsif (rgt = '1' and tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
when st9 => n_st <= st10;
when st10 => n_st <= st11;
when st11 => n_st <= st12;
when st12 => n_st <= st13;
when st13 => if (bor = '1') then
n_st <= st14;
else
if (rgt = '0' and tog = '0') then
n_st <= st9;
elsif (rgt = '0' and tog = '1') then
n_st <= st2;
elsif (rgt = '1' and tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
end if;
when st14 => n_st <= st15;
when st15 => if (rgt = '0' and tog = '0') then
n_st <= st9;
elsif (rgt = '0' and tog = '1') then
n_st <= st2;
elsif (rgt = '1' and tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
when ST16 => n_st <= st17;
when st17 => n_st <= st18;
when st18 => n_st <= st19;
when st19 => n_st <= st20;
when st20 => if (bor = '1') then
n_st <= st21;
else
if (tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
end if;
when st21 => n_st <= st22;
when st22 => if (eoc = '1') then
if (tog = '0') then
n_st <= st37;
else
n_st <= st30;
end if;
else
if (tog = '0') then
n_st <= St23;
else
n_st <= st16;
end if;
end if;
when ST23 => n_st <= st24;
when st24 => n_st <= st25;
when st25 => n_st <= st26;
when st26 => n_st <= st27;
when st27 => if (bor = '1') then
n_st <= st28;
else
if (tog = '0') then
n_st <= st23;
else
n_st <= st16;
end if;
end if;
when st28 => n_st <= st29;
when st29 => if (eoc = '1') then
if (tog = '0') then
n_st <= st37;
else
n_st <= st30;
end if;
else
if (tog = '0') then
n_st <= St23;
else
n_st <= st16;
end if;
end if;
when st30 => n_st <= st31;
when st31 => n_st <= st32;
when st32 => n_st <= st33;
when st33 => n_st <= st34;
when st34 => if (bor = '1') then
n_st <= st35;
else
n_st <= st37;
end if;
when st35 => n_st <= st36;
when st36 => if (sds = '1') then
n_st <= st44;
else
n_st <= st37;
end if;
when st37 => n_st <= st38;
when st38 => n_st <= st39;
when st39 => n_st <= st40;
when st40 => n_st <= st41;
when st41 => if (bor = '1') then
n_st <= st42;
else
n_st <= st30;
end if;
when st42 => n_st <= st43;
when st43 => if (sds = '1') then
n_st <= st44;
else
n_st <= st30;
end if;
when st44 => n_st <= st44;
when others => null;
end case;
end process NXTSTPROC;
CURSTPROC: process (clk, rst) is
begin
if (rst = '1') then
c_st <= st0;
elsif (clk'event and clk = '0') then
c_st <= n_st;
end if;
end process CURSTPROC;
OUTCONPROC: process (c_st) is
begin
case c_st is
when st0 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '0';
ans <= '0';
when st1 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '0';
ans <= '0';
when st2 => reg_sel <= "001";
f_sel <= '0';
en_w <= "01";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st3 => reg_sel <= "010";
f_sel <= '0';
en_w <= "01";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st4 => reg_sel <= "011";
f_sel <= '0';
en_w <= "01";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st5 => reg_sel <= "100";
f_sel <= '0';
en_w <= "01";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st6 => reg_sel <= "101";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st7 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '1';
r_inc <= '1';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st8 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st9 => reg_sel <= "001";
f_sel <= '0';
en_w <= "10";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st10 => reg_sel <= "010";
f_sel <= '0';
en_w <= "10";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st11 => reg_sel <= "011";
f_sel <= '0';
en_w <= "10";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st12 => reg_sel <= "100";
f_sel <= '0';
en_w <= "10";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st13 => reg_sel <= "101";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st14 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '1';
r_inc <= '1';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st15 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st16 => reg_sel <= "001";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '0';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st17 => reg_sel <= "010";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st18 => reg_sel <= "011";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st19 => reg_sel <= "100";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st20 => reg_sel <= "101";
f_sel <= '1';
en_w <= "00";
en_sf <= "10";
z_pad <= '0';
z_input <= '0';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st21 => reg_sel <= "000";
f_sel <= '1';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '1';
r_inc <= '1';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st22 => reg_sel <= "000";
f_sel <= '1';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st23 => reg_sel <= "001";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '0';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st24 => reg_sel <= "010";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st25 => reg_sel <= "011";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st26 => reg_sel <= "100";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st27 => reg_sel <= "101";
f_sel <= '0';
en_w <= "00";
en_sf <= "01";
z_pad <= '0';
z_input <= '0';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st28 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '1';
r_inc <= '1';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st29 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '1';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st30 => reg_sel <= "001";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '1';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st31 => reg_sel <= "010";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st32 => reg_sel <= "011";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st33 => reg_sel <= "100";
f_sel <= '1';
en_w <= "01";
en_sf <= "10";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st34 => reg_sel <= "101";
f_sel <= '1';
en_w <= "00";
en_sf <= "10";
z_pad <= '0';
z_input <= '1';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st35 => reg_sel <= "000";
f_sel <= '1';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '1';
c_inc <= "00";
rot <= '1';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '1';
s_inc <= '1';
ans <= '0';
when st36 => reg_sel <= "000";
f_sel <= '1';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st37 => reg_sel <= "001";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '1';
c_inc <= "01";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st38 => reg_sel <= "010";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st39 => reg_sel <= "011";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st40 => reg_sel <= "100";
f_sel <= '0';
en_w <= "10";
en_sf <= "01";
z_pad <= '0';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st41 => reg_sel <= "101";
f_sel <= '0';
en_w <= "00";
en_sf <= "01";
z_pad <= '0';
z_input <= '1';
c_inc <= "10";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '1';
sd_inc <= '0';
s_inc <= '1';
ans <= '1';
when st42 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '1';
c_inc <= "00";
rot <= '1';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '1';
s_inc <= '1';
ans <= '0';
when st43 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '1';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '1';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when st44 => reg_sel <= "000";
f_sel <= '0';
en_w <= "00";
en_sf <= "00";
z_pad <= '1';
z_input <= '0';
c_inc <= "00";
rot <= '0';
r_inc <= '0';
req <= '0';
tog <= '0';
r_w <= '0';
sd_inc <= '0';
s_inc <= '1';
ans <= '0';
when others => null;
end case;
end process OUTCONPROC;
end architecture BEHAVIORAL;
3. Memory Pointers Unit (MPU)
-- mem_ptr.vhd (Memory Pointers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity MEMPTR is
port( clk, rst, rot: in std_logic;
inc: in std_logic_vector(1 downto 0);
reg_sel: in std_logic_vector(2 downto 0);
bor: out std_logic;
-- beginning of a new row
bseq: out std_logic_vector(3 downto 0);
add_out: out std_logic_vector(12 downto 0) );
end entity MEMPTR;
architecture BEHAVIORAL of MEMPTR is
signal count1, count2                    : unsigned(9 downto 0);
signal ptr_a, ptr_b, ptr_c, ptr_d, ptr_e : unsigned(2 downto 0);
signal b_seq                             : unsigned(3 downto 0);
signal eor                               : std_logic;
begin
CNTR1: process (clk, rst, inc) is
begin
if (rst = '1') then
count1 <= to_unsigned(0, 10);
elsif (clk'event and clk = '1') then
if (inc(0) = '1') then
if (count1 = to_unsigned(1020,10)) then
count1 <= to_unsigned(0, 10);
else
count1 <= count1 + 1;
end if;
end if;
end if;
end process CNTR1;
CNTR2: process (clk, rst, inc) is
begin
if (rst = '1') then
count2 <= to_unsigned(1, 10);
elsif (clk'event and clk = '1') then
if (inc(1) = '1') then
if (count2 = to_unsigned(1020,10)) then
count2 <= to_unsigned(0, 10);
else
count2 <= count2 + 1;
end if;
end if;
end if;
end process CNTR2;
CODOUT: process (count1, count2) is
begin
if (count2 = to_unsigned(0, 10)) then
eor <= '1';
else
eor <= '0';
end if;
if (count1 = to_unsigned(0, 10)) then
bor <= '1';
else
bor <= '0';
end if;
end process CODOUT;
BLK: process (clk, rst, rot) is
begin
if (rst = '1') then
b_seq <= to_unsigned(0, 4);
elsif (clk'event and clk = '1') then
if (rot = '1') then
b_seq(0) <= '1';
b_seq(1) <= b_seq(0);
b_seq(2) <= b_seq(1);
b_seq(3) <= b_seq(2);
end if;
end if;
end process BLK;
L1: bseq <= std_logic_vector(b_seq);
PTRS: process (clk, rst, rot) is
begin
if (rst = '1') then
ptr_a <= to_unsigned(0, 3);
ptr_b <= to_unsigned(1, 3);
ptr_c <= to_unsigned(2, 3);
ptr_d <= to_unsigned(3, 3);
ptr_e <= to_unsigned(4, 3);
elsif (clk'event and clk = '1') then
if (rot = '1') then
ptr_b <= ptr_a;
ptr_c <= ptr_b;
ptr_d <= ptr_c;
ptr_e <= ptr_d;
ptr_a <= ptr_e;
end if;
end if;
end process PTRS;
MUX: process (reg_sel, count1, count2, ptr_a, ptr_b, ptr_c, ptr_d, ptr_e, eor) is
begin
case reg_sel is
when "001" => if (eor = '1') then
add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
else
add_out <= std_logic_vector(ptr_e) & std_logic_vector(count2);
end if;
when "010" => if (eor = '1') then
add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
else
add_out <= std_logic_vector(ptr_d) & std_logic_vector(count2);
end if;
when "011" => if (eor = '1') then
add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
else
add_out <= std_logic_vector(ptr_c) & std_logic_vector(count2);
end if;
when "100" => if (eor = '1') then
add_out <= std_logic_vector(ptr_a) & std_logic_vector(count2);
else
add_out <= std_logic_vector(ptr_b) & std_logic_vector(count2);
end if;
when "101" => add_out <= std_logic_vector(ptr_a) & std_logic_vector(count1);
when others => add_out <= std_logic_vector(to_unsigned(0, 13));
end case;
end process MUX;
end architecture BEHAVIORAL;
4. Data Memory Interface (DM I/F)
-- dm_if.vhd (Data Memory Interface)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity C_BANK1 is
port( clk, rst, f_sel, z_pad: in std_logic;
en_sf, en_w: in std_logic_vector(1 downto 0);
ld_reg: in std_logic_vector(2 downto 0);
bseq: in std_logic_vector(3 downto 0);
d_in: in std_logic_vector(39 downto 0);
d_out: out std_logic_vector(31 downto 0));
end entity C_BANK1;
architecture STRUCTURAL of C_BANK1 is
component REGFILE is
generic( n: integer := 8; -- n denotes the data width
d: integer := 5); -- d denotes number of registers
port( clk, rst, en_sf, en_w: in std_logic;
ld_reg: in std_logic_vector(2 downto 0);
bseq: in std_logic_vector(3 downto 0);
d_in: in std_logic_vector(39 downto 0);
d_out: out std_logic_vector(31 downto 0));
end component REGFILE;
signal f_a, f_b, f_mux: std_logic_vector(31 downto 0);
begin
RF1: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(0), en_w=>en_w(0), bseq=>bseq,
ld_reg=>ld_reg, d_in=>d_in, d_out=>f_a);
RF2: REGFILE port map(clk=>clk, rst=>rst, en_sf=>en_sf(1), en_w=>en_w(1), bseq=>bseq,
ld_reg=>ld_reg, d_in=>d_in, d_out=>f_b);
MUX1: f_mux <= f_a when f_sel = '0' else f_b;
Z_P: d_out <= f_mux when z_pad = '0' else std_logic_vector(to_unsigned(0, 32));
end architecture STRUCTURAL;
--- regfile.vhd
library IEEE;
use IEEE.std_logic_1164.all;
entity REGFILE is
generic( n: integer := 8; -- n denotes the data width
d: integer := 5); -- d denotes number of registers
port( clk, rst, en_sf, en_w: in std_logic;
ld_reg: in std_logic_vector(2 downto 0);
bseq: in std_logic_vector(3 downto 0);
d_in: in std_logic_vector(39 downto 0);
d_out: out std_logic_vector(31 downto 0));
end entity REGFILE;
architecture STRUCTURAL of REGFILE is
component PLS_REG is
generic( n: integer := 8; -- n denotes the data width
d: integer := 5); -- d denotes number of registers
port( clk, rst, en_ld, en_sf: in std_logic;
d_in: in std_logic_vector((n*d)-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end component PLS_REG;
signal reg_sel: std_logic_vector(3 downto 0);
begin
LF: for f in 1 to 4 generate
PR_F: PLS_REG generic map(n=>n, d=>d)
port map(clk=>clk, rst=>rst, en_ld=>reg_sel(f-1), en_sf=>en_sf,
d_in=>d_in, d_out=>d_out((f*n)-1 downto ((f-1)*n)));
end generate LF;
SEL: process (ld_reg, en_w, bseq) is
begin
if (en_w = '0') then
reg_sel <= "0000";
else
case ld_reg is
when "001" => if (bseq(0) = '1') then
reg_sel <= "0001";
else
reg_sel <="0000";
end if;
when "010" => if (bseq(1) = '1') then
reg_sel <= "0010";
else
reg_sel <= "0000";
end if;
when "011" => if (bseq(2) = '1') then
reg_sel <= "0100";
else
reg_sel <= "0000";
end if;
when "100" => if (bseq(3) = '1') then
reg_sel <= "1000";
else
reg_sel <= "0000";
end if;
when others => reg_sel <= "0000";
end case;
end if;
end process SEL;
end architecture STRUCTURAL;
-- pls_reg.vhd
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BFULIB.bfu_pckg.all;
entity PLS_REG is
generic( n: integer := 8; -- n denotes the data width
d: integer := 5); -- d denotes number of registers
port( clk, rst, en_ld, en_sf: in std_logic;
d_in: in std_logic_vector((n*d)-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end entity PLS_REG;
architecture STRUCTURAL of PLS_REG is
signal s: std_logic_vector((n*(d+1))-1 downto 0);
begin
L1: s(n-1 downto 0) <= std_logic_vector(to_unsigned(0, n));
LK: for k in 1 to d generate
REGK: S_REG generic map(n=>n)
port map(clk=>clk, rst=>rst, en_ld=>en_ld, en_sf=>en_sf, p_in=>d_in((n*k)-1 downto n*(k-1)),
d_in=>s((n*k)-1 downto n*(k-1)), d_out=>s((n*(k+1))-1 downto (n*k)));
end generate LK;
L2: d_out <= s((n*(d+1))-1 downto n*d);
end architecture STRUCTURAL;
-- reg_a.vhd
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity REG_A is
generic( n: integer := 8; -- denotes the data width
d: integer := 5 );-- denotes the number of registers
port( clk, rst, z_pad: in std_logic;
reg_sel: in std_logic_vector(2 downto 0);
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector((n*d)-1 downto 0);
ids_out: out std_logic_vector(n-1 downto 0) );
end entity REG_A;
architecture BEHAVIORAL of REG_A is
signal reg1, reg2, reg3, reg4, reg5, regt: unsigned(n-1 downto 0);
begin
-- Register write with conditions
REGSEL: process (clk, rst, reg_sel) is
begin
if (rst = '1') then
reg1 <= to_unsigned(0, n);
reg2 <= to_unsigned(0, n);
reg3 <= to_unsigned(0, n);
reg4 <= to_unsigned(0, n);
reg5 <= to_unsigned(0, n);
regt <= to_unsigned(0, n);
elsif (clk'event and clk = '1') then
case reg_sel is
when "001" => reg1 <= unsigned(d_in);
regt <= unsigned(d_in);
when "010" => reg2 <= unsigned(d_in);
regt <= unsigned(d_in);
when "011" => reg3 <= unsigned(d_in);
regt <= unsigned(d_in);
when "100" => reg4 <= unsigned(d_in);
regt <= unsigned(d_in);
when "101" => reg5 <= unsigned(d_in);
regt <= unsigned(d_in);
when others => null;
end case;
end if;
end process REGSEL;
-- Output Logic
L1: d_out <= std_logic_vector(reg5) & std_logic_vector(reg4) & std_logic_vector(reg3) &
std_logic_vector(reg2) & std_logic_vector(reg1);
L2: ids_out <= std_logic_vector(regt) when z_pad = '0' else
std_logic_vector(to_unsigned(0, n));
end architecture BEHAVIORAL;
5. Input Data Shifters (IDS)
-- ids.vhd (Input Data Shifters)
-- This is the functional unit that is responsible for providing inputs to the five MAAs.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity IDS is
port(
clk, rst: in std_logic;
ids_in: in std_logic_vector(39 downto 0);
o1, o2, o3, o4, o5: out std_logic_vector(39 downto 0));
end entity IDS;
architecture STRUCTURAL of IDS is
signal s1, s2, s3, s4: std_logic_vector(39 downto 0);
begin
R1: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>ids_in, d_out=>s1);
L1: o1 <= s1;
R2: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s1, d_out=>s2);
L2: o2 <= s2;
R3: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s2, d_out=>s3);
L3: o3 <= s3;
R4: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s3, d_out=>s4);
L4: o4 <= s4;
R5: REG_P generic map(n=>40) port map(clk=>clk, rst=>rst, d_in=>s4, d_out=>o5);
end architecture STRUCTURAL;
6. Arithmetic Unit (AU)
-- au.vhd (Arithmetic Unit)
-- This is the combination of all the arithmetic units, which includes all the
-- MAUs (25 of them).
library IEEE;
use IEEE.std_logic_1164.all;
entity AU is
port(
clk, rst: in std_logic;
ids0, ids1, ids2, ids3, ids4: in std_logic_vector(39 downto 0);
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
out_pxl: out std_logic_vector(18 downto 0);
ovf: out std_logic);
end entity AU;
architecture STRUCTURAL of AU is
component MAA is
port( clk, rst: in std_logic;
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
img_pxl: in std_logic_vector(39 downto 0);
p_rst: out std_logic_vector(16 downto 0));
end component MAA;
component AT is
port(
clk, rst: in std_logic;
maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
ovf: out std_logic;
out_pxl: out std_logic_vector(18 downto 0));
end component AT;
signal maa0, maa1, maa2, maa3, maa4: std_logic_vector(16 downto 0);
signal ld_coef: std_logic_vector(24 downto 0);
begin
MAA_0: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(4 downto 0), coef=>coef,
img_pxl=>ids0, p_rst=>maa0);
MAA_1: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(9 downto 5), coef=>coef,
img_pxl=>ids1, p_rst=>maa1);
MAA_2: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(14 downto 10), coef=>coef,
img_pxl=>ids2, p_rst=>maa2);
MAA_3: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(19 downto 15), coef=>coef,
img_pxl=>ids3, p_rst=>maa3);
MAA_4: MAA port map(clk=>clk, rst=>rst, ld_reg=>ld_coef(24 downto 20), coef=>coef,
img_pxl=>ids4, p_rst=>maa4);
U1: AT port map(clk=>clk, rst=>rst, maa0=>maa0, maa1=>maa1, maa2=>maa2, maa3=>maa3,
maa4=>maa4, ovf=>ovf, out_pxl=>out_pxl);
MUX: process (ld_reg, coef) is
begin
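-- One-hot decode: the MAU index in ld_reg (1 to 25) asserts bit ld_reg-1 of ld_coef,
-- steering the coefficient load to exactly one of the 25 MAUs.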
case (ld_reg) is
when "00001" => ld_coef <= "0000000000000000000000001"; -- 1
when "00010" => ld_coef <= "0000000000000000000000010"; -- 2
when "00011" => ld_coef <= "0000000000000000000000100"; -- 3
when "00100" => ld_coef <= "0000000000000000000001000"; -- 4
when "00101" => ld_coef <= "0000000000000000000010000"; -- 5
when "00110" => ld_coef <= "0000000000000000000100000"; -- 6
when "00111" => ld_coef <= "0000000000000000001000000"; -- 7
when "01000" => ld_coef <= "0000000000000000010000000"; -- 8
when "01001" => ld_coef <= "0000000000000000100000000"; -- 9
when "01010" => ld_coef <= "0000000000000001000000000"; -- 10
when "01011" => ld_coef <= "0000000000000010000000000"; -- 11
when "01100" => ld_coef <= "0000000000000100000000000"; -- 12
when "01101" => ld_coef <= "0000000000001000000000000"; -- 13
when "01110" => ld_coef <= "0000000000010000000000000"; -- 14
when "01111" => ld_coef <= "0000000000100000000000000"; -- 15
when "10000" => ld_coef <= "0000000001000000000000000"; -- 16
when "10001" => ld_coef <= "0000000010000000000000000"; -- 17
when "10010" => ld_coef <= "0000000100000000000000000"; -- 18
when "10011" => ld_coef <= "0000001000000000000000000"; -- 19
when "10100" => ld_coef <= "0000010000000000000000000"; -- 20
when "10101" => ld_coef <= "0000100000000000000000000"; -- 21
when "10110" => ld_coef <= "0001000000000000000000000"; -- 22
when "10111" => ld_coef <= "0010000000000000000000000"; -- 23
when "11000" => ld_coef <= "0100000000000000000000000"; -- 24
when "11001" => ld_coef <= "1000000000000000000000000"; -- 25
when others => ld_coef <= "0000000000000000000000000";
end case;
end process MUX;
end architecture STRUCTURAL;
-- at.vhd (Adding Tree)
-- This is the Adding Tree that is responsible for adding the five 17-bit words from
-- the five MAAs. The structure includes four pipeline stages.
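-- Pipeline stage 1 reduces maa0..maa2 with a carry-save adder; stage 2 folds in maa3;
-- stage 3 folds in maa4; stage 4 performs the final carry-propagate addition in CLA_19.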
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity AT is
port(
clk, rst: in std_logic;
maa0, maa1, maa2, maa3, maa4: in std_logic_vector(16 downto 0);
ovf: out std_logic;
out_pxl: out std_logic_vector(18 downto 0));
end entity AT;
architecture STRUCTURAL of AT is
signal low, ovf_r: std_logic;
signal sum1: std_logic_vector(16 downto 0);
signal sum2, sum3, carry1, carry2: std_logic_vector(17 downto 0);
signal carry3: std_logic_vector(18 downto 0);
signal sum4: std_logic_vector(19 downto 0);
signal pl1_r1: std_logic_vector(17 downto 0);
signal pl1_r2, pl1_r3, pl1_r4: std_logic_vector(16 downto 0);
signal pl2_r5, pl2_r6: std_logic_vector(17 downto 0);
signal pl2_r7: std_logic_vector(16 downto 0);
signal pl3_r8: std_logic_vector(18 downto 0);
signal pl3_r9: std_logic_vector(17 downto 0);
signal pl4_r10: std_logic_vector(19 downto 0);
begin
L1: low <= '0';
U1: CSA generic map(n=>17) port map(a=>maa0, b=>maa1, c=>maa2, sum=>sum1, carry=>carry1);
R1: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry1, d_out=>pl1_r1);
R2: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum1, d_out=>pl1_r2);
R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa3, d_out=>pl1_r3);
R4: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>maa4, d_out=>pl1_r4);
U2: CSA generic map(n=>17) port map(a=>pl1_r1(16 downto 0), b=>pl1_r2, c=>pl1_r3,
sum=>sum2(16 downto 0), carry=>carry2);
L2: sum2(17) <= pl1_r1(17); -- This is the most significant bit from carry1 above
R5: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>carry2, d_out=>pl2_r5);
R6: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum2, d_out=>pl2_r6);
R7: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>pl1_r4, d_out=>pl2_r7);
U3: CSA generic map(n=>17) port map(a=>pl2_r5(16 downto 0), b=>pl2_r6(16 downto 0),
c=>pl2_r7, sum=>sum3(16 downto 0), carry=>carry3(17 downto 0));
L3: HA port map(a=>pl2_r5(17), b=>pl2_r6(17), s=>sum3(17), cout=>carry3(18));
R8: REG_P generic map(n=>19) port map(clk=>clk, rst=>rst, d_in=>carry3, d_out=>pl3_r8);
R9: REG_P generic map(n=>18) port map(clk=>clk, rst=>rst, d_in=>sum3, d_out=>pl3_r9);
U4: CLA_19 port map(a(17 downto 0)=>pl3_r9, a(18)=>low, b=>pl3_r8, s=>sum4(18 downto 0),
ovf=>sum4(19));
R10: REG_P generic map(n=>20) port map(clk=>clk, rst=>rst, d_in=>sum4, d_out=>pl4_r10);
L4: out_pxl <= pl4_r10(18 downto 0);
L5: ovf <= pl4_r10(19);
end architecture STRUCTURAL;
-- maa.vhd (This is the systolic array of five MAUs with a DU)
library IEEE;
use IEEE.std_logic_1164.all;
entity MAA is
port( clk, rst: in std_logic;
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
img_pxl: in std_logic_vector(39 downto 0);
p_rst: out std_logic_vector(16 downto 0));
end entity MAA;
architecture STRUCTURAL of MAA is
component DU is
port( clk, rst: in std_logic;
ids_in: in std_logic_vector(39 downto 0);
du_out: out std_logic_vector(39 downto 0));
end component DU;
component MAUS is
port( clk, rst: in std_logic;
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
img_pxl: in std_logic_vector(39 downto 0);
p_rst: out std_logic_vector(16 downto 0));
end component MAUS;
signal s: std_logic_vector(39 downto 0);
begin
U1: DU port map(clk=>clk, rst=>rst, ids_in=>img_pxl, du_out=>s);
U2: MAUS port map(clk=>clk, rst=>rst, ld_reg=>ld_reg, coef=>coef, img_pxl=>s, p_rst=>p_rst);
end architecture STRUCTURAL;
-- du.vhd
-- This is the Delay Unit for the propagation of the image data
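-- As wired below, byte lane 0 of ids_in passes straight through, while byte lanes 1 to 4
-- are delayed by 1, 3, 5, and 7 clock cycles respectively, skewing the pixel stream that
-- feeds the MAU chain.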
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity DU is
port( clk, rst: in std_logic;
ids_in: in std_logic_vector(39 downto 0);
du_out: out std_logic_vector(39 downto 0));
end entity DU;
architecture STRUCTURAL of DU is
signal p1: std_logic_vector(31 downto 0);
signal p2, p3: std_logic_vector(23 downto 0);
signal p4, p5: std_logic_vector(15 downto 0);
signal p6, p7: std_logic_vector(7 downto 0);
begin
L1: du_out(7 downto 0) <= ids_in(7 downto 0);
PL1: REG_P generic map(n=>32) port map(clk=>clk, rst=>rst, d_in=>ids_in(39 downto 8), d_out=>p1);
L2: du_out(15 downto 8) <= p1(7 downto 0);
PL2: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p1(31 downto 8), d_out=>p2);
PL3: REG_P generic map(n=>24) port map(clk=>clk, rst=>rst, d_in=>p2, d_out=>p3);
L3: du_out(23 downto 16) <= p3(7 downto 0);
PL4: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p3(23 downto 8), d_out=>p4);
PL5: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p4, d_out=>p5);
L4: du_out(31 downto 24) <= p5(7 downto 0);
PL6: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p5(15 downto 8), d_out=>p6);
PL7: REG_P generic map(n=>8) port map(clk=>clk, rst=>rst, d_in=>p6, d_out=>p7);
L5: du_out(39 downto 32) <= p7;
end architecture STRUCTURAL;
-- maus.vhd
library IEEE;
use IEEE.std_logic_1164.all;
entity MAUS is
port( clk, rst: in std_logic;
ld_reg: in std_logic_vector(4 downto 0);
coef: in std_logic_vector(5 downto 0);
img_pxl: in std_logic_vector(39 downto 0);
p_rst: out std_logic_vector(16 downto 0));
end entity MAUS;
architecture STRUCTURAL of MAUS is
component MAU_0 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0);
img_pxl: in std_logic_vector(7 downto 0);
p_res: out std_logic_vector(13 downto 0));
end component MAU_0;
component MAU_1 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(13 downto 0); -- previous MAU output
p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
end component MAU_1;
component MAU_2 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(14 downto 0); -- previous MAU output
p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end component MAU_2;
component MAU_3 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(15 downto 0); -- previous MAU output
p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end component MAU_3;
component MAU_4 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(15 downto 0); -- previous MAU output
p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
end component MAU_4;
signal p_res1: std_logic_vector(13 downto 0);
signal p_res2: std_logic_vector(14 downto 0);
signal p_res3, p_res4: std_logic_vector(15 downto 0);
begin
U0: MAU_0 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(0), coef=>coef,
img_pxl=>img_pxl(7 downto 0), p_res=>p_res1);
U1: MAU_1 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(1), coef=>coef,
img_pxl=>img_pxl(15 downto 8), p_mau=>p_res1, p_res=>p_res2);
U2: MAU_2 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(2), coef=>coef,
img_pxl=>img_pxl(23 downto 16), p_mau=>p_res2, p_res=>p_res3);
U3: MAU_3 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(3), coef=>coef,
img_pxl=>img_pxl(31 downto 24), p_mau=>p_res3, p_res=>p_res4);
U4: MAU_4 port map(clk=>clk, rst=>rst, ld_reg=>ld_reg(4), coef=>coef,
img_pxl=>img_pxl(39 downto 32), p_mau=>p_res4, p_res=>p_rst);
end architecture STRUCTURAL;
-- mau_0.vhd
-- This is the first MAU of the MAUs (for one systolic array).
-- This MAU contains only a multiplication unit and no adder, since there is no previous
-- MAU output to accumulate. The multiplication result lies within
-- -8160 to 7905 (decimal), hence the output (p_res) is a 14-bit word.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity MAU_0 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- Filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- Image pixels
p_res: out std_logic_vector(13 downto 0)); -- Partial result to
end entity MAU_0;
architecture BEHAVIORAL of MAU_0 is
signal coef_reg: std_logic_vector(5 downto 0);
signal product: std_logic_vector(13 downto 0);
begin
U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>p_res);
STORE: process (clk, rst, ld_reg) is
begin
if (rst = '1') then
coef_reg <= "000000";
elsif (rising_edge(clk)) then
if (ld_reg = '1') then
coef_reg <= coef;
else
coef_reg <= coef_reg;
end if;
end if;
end process STORE;
end architecture BEHAVIORAL;
-- mau_1.vhd
-- This is the second MAU of the MAUs. The range of this MAU is between
-- -16320 and 15810 (decimal), hence a 15-bit word is used for the partial result.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity MAU_1 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(13 downto 0); -- previous MAU output
p_res: out std_logic_vector(14 downto 0)); -- partial result to next MAU
end entity MAU_1;
architecture BEHAVIORAL of MAU_1 is
signal coef_reg: std_logic_vector(5 downto 0);
signal pl1, pl2, product: std_logic_vector(13 downto 0);
signal sum: std_logic_vector(14 downto 0);
begin
R1: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2);
U2: CLA_15 port map(a=>pl1, b=>pl2, s=>sum);
R3: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);
STORE: process (clk, rst, ld_reg) is
begin
if (rst = '1') then
coef_reg <= "000000";
elsif (rising_edge(clk)) then
if (ld_reg = '1') then
coef_reg <= coef;
else
coef_reg <= coef_reg;
end if;
end if;
end process STORE;
end architecture BEHAVIORAL;
-- mau_2.vhd
-- This is the third MAU of the MAUs systolic array.
-- The range of the MAU is between -24480 and 23715 (decimal), thus
-- a 16-bit word is used for the partial result bus.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity MAU_2 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(14 downto 0); -- previous MAU output
p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_2;
architecture BEHAVIORAL of MAU_2 is
signal coef_reg: std_logic_vector(5 downto 0);
signal product: std_logic_vector(13 downto 0);
signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
L1: pl2(15 downto 14) <= "00";
L2: pl1(15) <= '0';
R1: REG_P generic map(n=>15) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1(14 downto 0));
U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);
STORE: process (clk, rst, ld_reg) is
begin
if (rst = '1') then
coef_reg <= "000000";
elsif (rising_edge(clk)) then
if (ld_reg = '1') then
coef_reg <= coef;
else
coef_reg <= coef_reg;
end if;
end if;
end process STORE;
end architecture BEHAVIORAL;
-- mau_3.vhd
-- This is the fourth MAU within the MAUs systolic array. The range of the MAU is between
-- -32640 and 31620 (decimal), thus a 16-bit word bus is used for the partial
-- result coming out from this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity MAU_3 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(15 downto 0); -- previous MAU output
p_res: out std_logic_vector(15 downto 0)); -- partial result to next MAU
end entity MAU_3;
architecture BEHAVIORAL of MAU_3 is
signal coef_reg: std_logic_vector(5 downto 0);
signal product: std_logic_vector(13 downto 0);
signal pl1, pl2, sum: std_logic_vector(15 downto 0);
begin
L1: pl2(15 downto 14) <= "00";
R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
U2: CLA_16 port map(a=>pl1, b=>pl2, s=>sum);
R3: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);
STORE: process (clk, rst, ld_reg) is
begin
if (rst = '1') then
coef_reg <= "000000";
elsif (rising_edge(clk)) then
if (ld_reg = '1') then
coef_reg <= coef;
else
coef_reg <= coef_reg;
end if;
end if;
end process STORE;
end architecture BEHAVIORAL;
-- mau_4.vhd
-- This is the last MAU within the MAUs systolic array. The range for this MAU
-- is between -40800 and 39525 (decimal), thus a 17-bit word bus is used for the
-- partial result coming out of this MAU.
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use BFULIB.bfu_pckg.all;
entity MAU_4 is
port( clk, rst, ld_reg: in std_logic;
coef: in std_logic_vector(5 downto 0); -- filter coefficient
img_pxl: in std_logic_vector(7 downto 0); -- image pixel from DU
p_mau: in std_logic_vector(15 downto 0); -- previous MAU output
p_res: out std_logic_vector(16 downto 0)); -- partial result to next MAU
end entity MAU_4;
architecture BEHAVIORAL of MAU_4 is
signal coef_reg: std_logic_vector(5 downto 0);
signal product: std_logic_vector(13 downto 0);
signal pl1, pl2: std_logic_vector(15 downto 0);
signal sum: std_logic_vector(16 downto 0);
begin
L1: pl2(15 downto 14) <= "00";
R1: REG_P generic map(n=>16) port map(clk=>clk, rst=>rst, d_in=>p_mau, d_out=>pl1);
U1: MULT port map(a=>img_pxl, b=>coef_reg, p=>product);
R2: REG_P generic map(n=>14) port map(clk=>clk, rst=>rst, d_in=>product, d_out=>pl2(13 downto 0));
U2: CLA_17 port map(a=>pl1, b=>pl2, s=>sum);
R3: REG_P generic map(n=>17) port map(clk=>clk, rst=>rst, d_in=>sum, d_out=>p_res);
STORE: process (clk, rst, ld_reg) is
begin
if (rst = '1') then
coef_reg <= "000000";
elsif (rising_edge(clk)) then
if (ld_reg = '1') then
coef_reg <= coef;
else
coef_reg <= coef_reg;
end if;
end if;
end process STORE;
end architecture BEHAVIORAL;
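The word widths quoted in the MAU_0 through MAU_4 and Adding Tree comments above follow directly from the operand ranges (8-bit unsigned pixels, 6-bit signed coefficients). The short stand-alone C++ program below is an illustrative check only (its file name and contents are an addition in the spirit of the test generators of Appendix B, not part of the original design or tool set); it recomputes those ranges and the number of two's complement bits each partial result requires.
// width_check.cpp -- illustrative only: verifies the partial-result ranges
// quoted in the MAU_0..MAU_4 and AT header comments.
#include <iostream>
// Smallest number of bits needed to hold [lo, hi] in two's complement.
int bits_needed(long lo, long hi)
{
    int n = 1;
    while (!(lo >= -(1L << (n - 1)) && hi <= (1L << (n - 1)) - 1))
        n++;
    return n;
}
int main()
{
    const long pxl_max = 255;                 // 8-bit unsigned image pixel
    const long coef_min = -32, coef_max = 31; // 6-bit signed filter coefficient
    long lo = 0, hi = 0;
    for (int k = 0; k < 5; k++)               // accumulate one product per MAU stage
    {
        lo += pxl_max * coef_min;             // -8160 per product
        hi += pxl_max * coef_max;             //  7905 per product
        std::cout << "after MAU_" << k << ": [" << lo << ", " << hi
                  << "] -> " << bits_needed(lo, hi) << " bits" << std::endl;
    }
    // The Adding Tree sums the outputs of five such MAAs.
    std::cout << "AT output: [" << 5 * lo << ", " << 5 * hi << "] -> "
              << bits_needed(5 * lo, 5 * hi) << " bits" << std::endl;
    return 0;
}
Running it reproduces the ranges -8160..7905 (14 bits) through -40800..39525 (17 bits) for the five MAU stages, and -204000..197625 (19 bits) for the Adding Tree result, matching the p_res and out_pxl bus widths used above.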
7. Multiplication and Adder Units (These functional units have been defined as a library
package)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
package bfu_pckg is
component CLA_15 is
port(
a, b: in std_logic_vector(13 downto 0);
s: out std_logic_vector(14 downto 0));
end component CLA_15;
component CLA_16 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(15 downto 0));
end component CLA_16;
component CLA_17 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(16 downto 0));
end component CLA_17;
component CLA_19 is
port(a, b: in std_logic_vector(18 downto 0);
s: out std_logic_vector(18 downto 0);
ovf: out std_logic);
end component CLA_19;
component MULT is
port(a: in std_logic_vector(7 downto 0);
b: in std_logic_vector(5 downto 0);
p: out std_logic_vector(13 downto 0));
end component MULT;
component FA is
port(a, b, cin: in std_logic;
s, cout: out std_logic);
end component FA;
component HA is
port( a, b: in std_logic;
s, cout: out std_logic);
end component HA;
component CSA is
generic(n: positive := 5);
port( a, b, c: in std_logic_vector(n-1 downto 0);
sum: out std_logic_vector(n-1 downto 0);
carry: out std_logic_vector(n downto 0));
end component CSA;
component REG_P is
generic(n: positive := 5);
port( clk, rst: in std_logic;
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end component REG_P;
component REG_N is
generic(n: positive := 5);
port( clk, rst: in std_logic;
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end component REG_N;
end package bfu_pckg;
-- mult.vhd (Multiplication Unit)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity MULT is
port(a: in std_logic_vector(7 downto 0);
b: in std_logic_vector(5 downto 0);
p: out std_logic_vector(13 downto 0));
end entity MULT;
architecture STRUCT of MULT is
component PPG is
generic(n: integer := 8);
port( a: in std_logic_vector(n-1 downto 0);
mult: in std_logic_vector(2 downto 0);
pp: out std_logic_vector(n downto 0);
spp: out std_logic);
end component PPG;
component R3_2C is
generic(n: integer := 14);
port(a, b, c: in std_logic_vector(n-1 downto 0);
sum: out std_logic_vector(n-1 downto 0);
carry: out std_logic_vector(n downto 0));
end component R3_2C;
component S3_2C is
port(pp1, pp2, pp3: in std_logic_vector(8 downto 0);
sp1, sp2, sp3: in std_logic;
sum, carry: out std_logic_vector(13 downto 0));
end component S3_2C;
component CLA_14 is
port(
a, b: in std_logic_vector(13 downto 0);
s: out std_logic_vector(13 downto 0));
end component CLA_14;
signal sp1, sp2, sp3: std_logic;
signal ls: std_logic_vector(2 downto 0);
signal pp1, pp2, pp3: std_logic_vector(8 downto 0);
signal pp4, sum1, sum2, carry1: std_logic_vector(13 downto 0);
signal carry2: std_logic_vector(14 downto 0);
begin
L1: ls <= b(1 downto 0) & '0';
U1: PPG port map(a=>a, mult=>ls, pp=>pp1, spp=>sp1);
U2: PPG port map(a=>a, mult=>b(3 downto 1), pp=>pp2, spp=>sp2);
U3: PPG port map(a=>a, mult=>b(5 downto 3), pp=>pp3, spp=>sp3);
U4: S3_2C port map(pp1=>pp1, pp2=>pp2, pp3=>pp3, sp1=>sp1, sp2=>sp2, sp3=>sp3,
sum=>sum1, carry=>carry1);
L2: pp4 <= "000000000" & sp3 & '0' & sp2 & '0' & sp1;
U5: R3_2C port map(a=>sum1, b=>carry1, c=>pp4, sum=>sum2, carry=>carry2);
U6: CLA_14 port map(a=>sum2, b=>carry2(13 downto 0), s=>p);
end architecture STRUCT;
-- ppg.vhd (Partial Product Generator)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity PPG is
generic(n: integer := 8);
port( a: in std_logic_vector(n-1 downto 0);
mult: in std_logic_vector(2 downto 0);
pp: out std_logic_vector(n downto 0);
spp: out std_logic);
end entity PPG;
architecture BEHAVIORAL of PPG is
begin
PP_PROC: process(mult, a)
begin
case mult is
when "000" => for k in n downto 0 loop
pp(k) <= '0';
end loop;
spp <= '0';
when "001" => pp <= '0' & a;
spp <= '0';
when "010" => pp <= '0' & a;
spp <= '0';
when "011" => pp <= a & '0';
spp <= '0';
when "100" => pp <= not(a & '0');
spp <= '1';
when "101" => pp <= not('0' & a);
spp <= '1';
when "110" => pp <= not('0' & a);
spp <= '1';
when "111" => for l in n downto 0 loop
pp(l) <= '0';
end loop;
spp <= '0';
when others => null;
end case;
end process PP_PROC;
end architecture BEHAVIORAL;
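The PP_PROC case table above is a radix-4 (modified Booth) recoding of the 6-bit coefficient: each overlapping 3-bit group of multiplier bits selects a partial product of 0, +/-a, or +/-2a, with spp flagging the one's-complemented (negative) cases so that the missing '+1' is added back through the pp4 correction word in mult.vhd. The following C++ fragment is an informal cross-check of that recoding only (an illustrative addition, not part of the thesis tool set); it rebuilds every product from the three recoded digits and compares it against a direct multiplication.
// booth_check.cpp -- illustrative only: radix-4 Booth recoding of a 6-bit signed
// coefficient, matching the three PPG groups wired up in mult.vhd.
#include <iostream>
// Booth digit for the bit group (b[i+1], b[i], b[i-1]): -2*b[i+1] + b[i] + b[i-1]
int booth_digit(int b2, int b1, int b0)
{
    return -2 * b2 + b1 + b0;
}
int main()
{
    int errors = 0;
    for (int coef = -32; coef < 32; coef++)        // 6-bit signed coefficient
    {
        int b = coef & 0x3F;                       // two's complement bit pattern
        for (int pxl = 0; pxl < 256; pxl++)        // 8-bit unsigned pixel
        {
            // Three overlapping groups, exactly as wired to PPG instances U1..U3.
            int d0 = booth_digit((b >> 1) & 1, b & 1, 0);
            int d1 = booth_digit((b >> 3) & 1, (b >> 2) & 1, (b >> 1) & 1);
            int d2 = booth_digit((b >> 5) & 1, (b >> 4) & 1, (b >> 3) & 1);
            if (d0 * pxl + 4 * d1 * pxl + 16 * d2 * pxl != coef * pxl)
                errors++;
        }
    }
    std::cout << "mismatches: " << errors << std::endl;   // expect 0
    return 0;
}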
-- r3_2c.vhd (Second level 3 to 2 Counter for Multiplier)
library IEEE;
use IEEE.std_logic_1164.all;
entity R3_2C is
generic(n: integer := 14);
port(a, b, c: in std_logic_vector(n-1 downto 0); -- a->sum, b->carry c->pp4
sum: out std_logic_vector(n-1 downto 0);
carry: out std_logic_vector(n downto 0));
end entity R3_2C;
architecture STRUCTURAL of R3_2C is
component HA is
port( a, b: in std_logic;
s, cout: out std_logic);
end component HA;
component FA is
port(a, b, cin: in std_logic;
s, cout: out std_logic);
end component FA;
begin
L1: carry(0) <= '0';
LK: for k in 2 downto 0 generate
HAK: HA port map(a=>a(k), b=>c(k), s=>sum(k), cout=>carry(k+1));
end generate LK;
L2: HA port map(a=>a(3), b=>b(3), s=>sum(3), cout=>carry(4));
L3: FA port map(a=>a(4), b=>b(4), cin=>c(4), s=>sum(4), cout=>carry(5));
LF: for f in n-1 downto 5 generate
HAF: HA port map(a=>a(f), b=>b(f), s=>sum(f), cout=>carry(f+1));
end generate LF;
end architecture STRUCTURAL;
-- s3_2c.vhd (Special 3 to 2 Counter)
library IEEE;
use IEEE.std_logic_1164.all;
entity S3_2C is
port(pp1, pp2, pp3: in std_logic_vector(8 downto 0);
sp1, sp2, sp3: in std_logic;
sum, carry: out std_logic_vector(13 downto 0));
end entity S3_2C;
architecture STRUCTURAL of S3_2C is
component HA is
port( a, b: in std_logic;
s, cout: out std_logic);
end component HA;
component FA is
port(a, b, cin: in std_logic;
s, cout: out std_logic);
end component FA;
signal high, n_sp1, n_sp2, n_sp3: std_logic;
begin
G0: high <= '1';
G1: n_sp1 <= not sp1;
G2: n_sp2 <= not sp2;
G3: n_sp3 <= not sp3;
L1: sum(1 downto 0) <= pp1(1 downto 0);
L2: carry(2 downto 0) <= "000";
L3: sum(13) <= n_sp3;
LK: for k in 1 downto 0 generate
HAA: HA port map(a=>pp1(k+2), b=>pp2(k), s=>sum(k+2), cout=>carry(k+3));
end generate LK;
LG: for g in 4 downto 0 generate
FAG: FA port map(a=>pp1(g+4), b=>pp2(g+2), cin=>pp3(g), s=>sum(g+4), cout=>carry(g+5));
end generate LG;
LF: for f in 6 downto 5 generate
FAF: FA port map(a=>sp1, b=>pp2(f+2), cin=>pp3(f), s=>sum(f+4), cout=>carry(f+5));
end generate LF;
FA7: FA port map(a=>n_sp1, b=>n_sp2, cin=>pp3(7), s=>sum(11), cout=>carry(12));
HA2: HA port map(a=>high, b=>pp3(8), s=>sum(12), cout=>carry(13));
end architecture STRUCTURAL;
-- cla_16.vhd (Carry Lookahead Adder ~ 16 Bits)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity CLA_16 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(15 downto 0));
end entity CLA_16;
architecture STRUCTURAL of CLA_16 is
component CLA_4_1 is
port(
a, b: in std_logic_vector(3 downto 0);
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4_1;
component CLA_4 is
port(
a, b: in std_logic_vector(3 downto 0);
cin: in std_logic;
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4;
component CLL_2L is
port(p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 1));
end component CLL_2L;
signal p, g: std_logic_vector(3 downto 0);
signal c: std_logic_vector(3 downto 1);
begin
U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0),
p_out=>p(0), g_out=>g(0));
U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4),
p_out=>p(1), g_out=>g(1));
U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8),
p_out=>p(2), g_out=>g(2));
U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12),
p_out=>p(3), g_out=>g(3));
U5: CLL_2L port map(p=>p(2 downto 0), g=>g(2 downto 0), cout=>c);
end architecture STRUCTURAL;
-- cla_19.vhd (19-bit Carry Lookahead adder)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_19 is
port(a, b: in std_logic_vector(18 downto 0);
s: out std_logic_vector(18 downto 0);
ovf: out std_logic);
end entity CLA_19;
architecture STRUCTURAL of CLA_19 is
component CLA_3S is
port(a, b: in std_logic_vector(2 downto 0);
cin: in std_logic;
s: out std_logic_vector(2 downto 0);
ovf: out std_logic);
end component CLA_3S;
component CLA_17 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(16 downto 0));
end component CLA_17;
signal cout, high, low, ovf0, ovf1: std_logic;
signal s1, s2: std_logic_vector(2 downto 0);
begin
L1: high <= '1';
L2: low <= '0';
U1: CLA_17 port map(a=>a(15 downto 0), b=>b(15 downto 0),
s(15 downto 0)=>s(15 downto 0), s(16)=>cout);
U2: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>high, s=>s1, ovf=>ovf1);
U3: CLA_3S port map(a=>a(18 downto 16), b=>b(18 downto 16), cin=>low, s=>s2, ovf=>ovf0);
L3: s(18 downto 16) <= s1 when cout = '1' else s2;
L4: ovf <= ovf1 when cout='1' else ovf0;
end architecture STRUCTURAL;
-- cla_17.vhd (Carry Lookahead Adder ~ 17 Bits)
-- This Carry Lookahead adder adds two 16-bit numbers and generates a 17-bit sum.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity CLA_17 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(16 downto 0));
end entity CLA_17;
architecture STRUCTURAL of CLA_17 is
component CLA_16_P is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(15 downto 0);
p_out, g_out: out std_logic);
end component CLA_16_P;
signal p_4, g_4: std_logic;
begin
U1: CLA_16_P port map(a=>a, b=>b, s=>s(15 downto 0), p_out=>p_4, g_out=>g_4);
L3: s(16) <= g_4;
end architecture STRUCTURAL;
-- cla_16_p.vhd (Carry Lookahead Adder ~ 16 Bits) with p and g
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity CLA_16_P is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(15 downto 0);
p_out, g_out: out std_logic);
end entity CLA_16_P;
architecture STRUCTURAL of CLA_16_P is
component CLA_4_1 is
port(
a, b: in std_logic_vector(3 downto 0);
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4_1;
component CLA_4 is
port(
a, b: in std_logic_vector(3 downto 0);
cin: in std_logic;
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4;
component CLL_2 is
port(
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(3 downto 1);
p_out, g_out: out std_logic);
end component CLL_2;
signal p, g: std_logic_vector(3 downto 0);
signal c: std_logic_vector(3 downto 1);
begin
U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0),
p_out=>p(0), g_out=>g(0));
U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4),
p_out=>p(1), g_out=>g(1));
U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8),
p_out=>p(2), g_out=>g(2));
U4: CLA_4 port map(a=>a(15 downto 12), b=>b(15 downto 12), cin=>c(3), s=>s(15 downto 12),
p_out=>p(3), g_out=>g(3));
U5: CLL_2 port map(p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out);
end architecture STRUCTURAL;
--- cll_2.vhd (2nd Level of Carry Lookahead Logic - for 4 bits) with P, G output
library IEEE;
use IEEE.std_logic_1164.all;
entity CLL_2 is
port(
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(3 downto 1);
p_out, g_out: out std_logic);
end entity CLL_2;
architecture BEHAVIORAL of CLL_2 is
begin
L1: cout(1) <= g(0);
L2: cout(2) <= g(1) or (p(1) and g(0));
L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0));
L4: p_out <= p(3) and p(2) and p(1) and p(0);
L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1))
or (p(3) and p(2) and p(1) and g(0));
end architecture BEHAVIORAL;
-- cla_15.vhd (Carry Lookahead Adder ~ 15 Bits)
-- This is an adder that adds two 14-bit numbers and produces a 15-bit sum.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity CLA_15 is
port(
a, b: in std_logic_vector(13 downto 0);
s: out std_logic_vector(14 downto 0));
end entity CLA_15;
architecture STRUCTURAL of CLA_15 is
component CLA_4_1 is
port(
a, b: in std_logic_vector(3 downto 0);
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4_1;
component CLA_4 is
port(
a, b: in std_logic_vector(3 downto 0);
cin: in std_logic;
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4;
component CLA_3 is
port(a, b: in std_logic_vector(1 downto 0);
cin: in std_logic;
s: out std_logic_vector(2 downto 0));
end component CLA_3;
component CLL_2L is
port(p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 1));
end component CLL_2L;
signal p, g: std_logic_vector(2 downto 0);
signal c: std_logic_vector(3 downto 1);
begin
U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0),
p_out=>p(0), g_out=>g(0));
U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4),
p_out=>p(1), g_out=>g(1));
U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8),
p_out=>p(2), g_out=>g(2));
U4: CLA_3 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(14 downto 12));
U5: CLL_2L port map(p=>p, g=>g, cout=>c);
end architecture STRUCTURAL;
--- cla_3.vhd (Carry Lookahead Adder ~ 3 Bits (Last 3 Bits))
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_3 is
port(a, b: in std_logic_vector(1 downto 0);
cin: in std_logic;
s: out std_logic_vector(2 downto 0));
end entity CLA_3;
architecture STRUCTURAL of CLA_3 is
component SCLL_3 is
port( cin: in std_logic;
p, g: in std_logic_vector(1 downto 0);
cout: out std_logic_vector(2 downto 0));
end component SCLL_3;
component PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end component PFA;
signal p, g: std_logic_vector(1 downto 0);
signal c: std_logic_vector(2 downto 0);
begin
U0: SCLL_3 port map(cin=>cin, p=>p, g=>g, cout=>c);
U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0));
U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1));
L1: s(2) <= c(2);
end architecture STRUCTURAL;
--- scll_3.vhd (Carry Lookahead Logic for Bit position 13 & 14)
library IEEE;
use IEEE.std_logic_1164.all;
entity SCLL_3 is
port( cin: in std_logic;
p, g: in std_logic_vector(1 downto 0);
cout: out std_logic_vector(2 downto 0));
end entity SCLL_3;
architecture BEHAVIORAL of SCLL_3 is
begin
L1: cout(0) <= cin;
L2: cout(1) <= g(0) or (p(0) and cin);
L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin);
end architecture BEHAVIORAL;
-- cla_14.vhd (Carry Lookahead Adder ~ 14 Bits)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity CLA_14 is
port(
a, b: in std_logic_vector(13 downto 0);
s: out std_logic_vector(13 downto 0));
end entity CLA_14;
architecture STRUCTURAL of CLA_14 is
component CLA_4_1 is
port(
a, b: in std_logic_vector(3 downto 0);
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4_1;
component CLA_4 is
port(
a, b: in std_logic_vector(3 downto 0);
cin: in std_logic;
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLA_4;
component CLA_2 is
port(a, b: in std_logic_vector(1 downto 0);
cin: in std_logic;
s: out std_logic_vector(1 downto 0));
end component CLA_2;
component CLL_2L is
port(p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 1));
end component CLL_2L;
signal p, g: std_logic_vector(2 downto 0);
signal c: std_logic_vector(3 downto 1);
begin
U1: CLA_4_1 port map(a=>a(3 downto 0), b=>b(3 downto 0), s=>s(3 downto 0),
p_out=>p(0), g_out=>g(0));
U2: CLA_4 port map(a=>a(7 downto 4), b=>b(7 downto 4), cin=>c(1), s=>s(7 downto 4),
p_out=>p(1), g_out=>g(1));
U3: CLA_4 port map(a=>a(11 downto 8), b=>b(11 downto 8), cin=>c(2), s=>s(11 downto 8),
p_out=>p(2), g_out=>g(2));
U4: CLA_2 port map(a=>a(13 downto 12), b=>b(13 downto 12), cin=>c(3), s=>s(13 downto 12));
U5: CLL_2L port map(p=>p, g=>g, cout=>c);
end architecture STRUCTURAL;
-- cla_3s.vhd (Carry Lookahead Adder ~ 3 Bits (For CLA_19))
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_3S is
port(a, b: in std_logic_vector(2 downto 0);
cin: in std_logic;
s: out std_logic_vector(2 downto 0);
ovf: out std_logic);
end entity CLA_3S;
architecture STRUCTURAL of CLA_3S is
component SCLL_4 is
port( cin: in std_logic;
p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 0));
end component SCLL_4;
component PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end component PFA;
signal p, g: std_logic_vector(2 downto 0);
signal c: std_logic_vector(3 downto 0);
begin
U0: SCLL_4 port map(cin=>cin, p=>p, g=>g, cout=>c);
U1: PFA port map(a=>a(0), b=>b(0), c=>c(0), s=>s(0), g=>g(0), p=>p(0));
U2: PFA port map(a=>a(1), b=>b(1), c=>c(1), s=>s(1), g=>g(1), p=>p(1));
U3: PFA port map(a=>a(2), b=>b(2), c=>c(2), s=>s(2), g=>g(2), p=>p(2));
L1: ovf <= c(2) xor c(3);
end architecture STRUCTURAL;
-- scll_4.vhd (Carry Lookahead Logic for Bit position 17, 18, & 19)
library IEEE;
use IEEE.std_logic_1164.all;
entity SCLL_4 is
port( cin: in std_logic;
p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 0));
end entity SCLL_4;
architecture BEHAVIORAL of SCLL_4 is
begin
L1: cout(0) <= cin;
L2: cout(1) <= g(0) or (p(0) and cin);
L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin);
L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0))
or (p(2) and p(1) and p(0) and cin);
end architecture BEHAVIORAL;
-- cla_4.vhd (Carry Lookahead Adder ~ 4 bits)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_4 is
port(
a, b: in std_logic_vector(3 downto 0);
cin: in std_logic;
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end entity CLA_4;
architecture STRUCTURAL of CLA_4 is
component PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end component PFA;
component CLL is
port(
cin: in std_logic;
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end component CLL;
signal c, g, p: std_logic_vector(3 downto 0);
begin
L1: CLL port map(cin=>cin, p=>p, g=>g, cout=>c, p_out=>p_out, g_out=>g_out);
LK: for k in 3 downto 0 generate
PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k));
end generate LK;
end architecture STRUCTURAL;
-- cla_4_1.vhd (Carry Lookahead Adder ~ 4 bits / for the first CLA_4)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_4_1 is
port(
a, b: in std_logic_vector(3 downto 0);
s: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end entity CLA_4_1;
architecture STRUCTURAL of CLA_4_1 is
component PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end component PFA;
component CLL_1 is
port(
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(2 downto 0);
p_out, g_out: out std_logic);
end component CLL_1;
signal g, p: std_logic_vector(3 downto 0);
signal c: std_logic_vector(3 downto 0);
begin
L0: g(0) <= a(0) and b(0);
L1: p(0) <= a(0) xor b(0);
L2: s(0) <= p(0);
L3: CLL_1 port map(p=>p, g=>g, cout=>c(3 downto 1), p_out=>p_out, g_out=>g_out);
LK: for k in 3 downto 1 generate
PFAK: PFA port map(a=>a(k), b=>b(k), c=>c(k), s=>s(k), g=>g(k), p=>p(k));
end generate LK;
end architecture STRUCTURAL;
--- cll_1.vhd (Carry Lookahead Logic - for first 4 bits)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLL_1 is
port(
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(2 downto 0);
p_out, g_out: out std_logic);
end entity CLL_1;
architecture BEHAVIORAL of CLL_1 is
begin
L1: cout(0) <= g(0);
L2: cout(1) <= g(1) or (p(1) and g(0));
L3: cout(2) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0));
L4: p_out <= p(3) and p(2) and p(1) and p(0);
L5: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1))
or (p(3) and p(2) and p(1) and g(0));
end architecture BEHAVIORAL;
-- cll.vhd (Carry Lookahead Logic - for 4 bits)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLL is
port(
cin: in std_logic;
p, g: in std_logic_vector(3 downto 0);
cout: out std_logic_vector(3 downto 0);
p_out, g_out: out std_logic);
end entity CLL;
architecture BEHAVIORAL of CLL is
begin
L1: cout(0) <= cin;
L2: cout(1) <= g(0) or (p(0) and cin);
L3: cout(2) <= g(1) or (p(1) and g(0)) or (p(1) and p(0) and cin);
L4: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0))
or (p(2) and p(1) and p(0) and cin);
L5: p_out <= p(3) and p(2) and p(1) and p(0);
L6: g_out <= g(3) or (p(3) and g(2)) or (p(3) and p(2) and g(1))
or (p(3) and p(2) and p(1) and g(0));
end architecture BEHAVIORAL;
-- cll_2l.vhd (2nd Level of Carry Lookahead Logic - for 4 bits)
library IEEE;
use IEEE.std_logic_1164.all;
entity CLL_2L is
port(p, g: in std_logic_vector(2 downto 0);
cout: out std_logic_vector(3 downto 1));
end entity CLL_2L;
architecture BEHAVIORAL of CLL_2L is
begin
L1: cout(1) <= g(0);
L2: cout(2) <= g(1) or (p(1) and g(0));
L3: cout(3) <= g(2) or (p(2) and g(1)) or (p(2) and p(1) and g(0));
end architecture BEHAVIORAL;
-- csa.vhd (Carry Save Adder)
library IEEE;
use IEEE.std_logic_1164.all;
entity CSA is
generic(n: positive := 5);
port( a, b, c: in std_logic_vector(n-1 downto 0);
sum: out std_logic_vector(n-1 downto 0);
carry: out std_logic_vector(n downto 0));
end entity CSA;
architecture STRUCTURAL of CSA is
component FA is
port(a, b, cin: in std_logic;
s, cout: out std_logic);
end component FA;
begin
L1: carry(0) <= '0';
KL: for k in n-1 downto 0 generate
FAK: FA port map(a=>a(k), b=>b(k), cin=>c(k), s=>sum(k), cout=>carry(k+1));
end generate KL;
end architecture STRUCTURAL;
--- cla_2.vhd (Carry Lookahead Adder ~ 2 Bits (Last 2 Bits))
library IEEE;
use IEEE.std_logic_1164.all;
entity CLA_2 is
port(a, b: in std_logic_vector(1 downto 0);
cin: in std_logic;
s: out std_logic_vector(1 downto 0));
end entity CLA_2;
architecture STRUCTURAL of CLA_2 is
component SCLL is
port(cin, p, g: in std_logic;
cout: out std_logic);
end component SCLL;
component PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end component PFA;
signal c, p, g: std_logic;
begin
U1: PFA port map(a=>a(0), b=>b(0), c=>cin, s=>s(0), g=>g, p=>p);
U2: SCLL port map(cin=>cin, p=>p, g=>g, cout=>c);
L1: s(1) <= a(1) xor b(1) xor c;
end architecture STRUCTURAL;
--- scll.vhd (Carry Lookahead Logic for Bit position 13)
library IEEE;
use IEEE.std_logic_1164.all;
entity SCLL is
port(cin, p, g: in std_logic;
cout: out std_logic);
end entity SCLL;
architecture BEHAVIORAL of SCLL is
begin
L1: cout <= g or (p and cin);
end architecture BEHAVIORAL;
-- ha.vhd (Half Adder)
library IEEE;
use IEEE.std_logic_1164.all;
entity HA is
port( a, b: in std_logic;
s, cout: out std_logic);
end entity HA;
architecture BEHAVIORAL of HA is
begin
L1: s <= a xor b;
L2: cout <= a and b;
end architecture BEHAVIORAL;
-- fa.vhd (full adder)
library IEEE;
use IEEE.std_logic_1164.all;
entity FA is
port(a, b, cin: in std_logic;
s, cout: out std_logic);
end entity FA;
architecture BEHAVIORAL of FA is
begin
L1: s <= a xor b xor cin;
L2: cout <= (a and b) or (a and cin) or (b and cin);
end architecture BEHAVIORAL;
-- pfa.vhd (Partial Full Adder)
library IEEE;
use IEEE.std_logic_1164.all;
entity PFA is
port(a, b, c: in std_logic;
s, g, p: out std_logic);
end entity PFA;
architecture BEHAVIORAL of PFA is
begin
L1: s <= a xor b xor c;
L2: g <= a and b;
L3: p <= a xor b;
end architecture BEHAVIORAL;
-- reg_p.vhd (Positive edge clocked registers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity REG_P is
generic(n: positive := 5);
port( clk, rst: in std_logic;
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end entity REG_P;
architecture BEHAVIORAL of REG_P is
signal d_reg: signed(n-1 downto 0);
begin
STORE: process (clk, rst, d_in) is
begin
if (rst = '1') then
d_reg <= conv_signed('0', n);
elsif (rising_edge(clk)) then
d_reg <= signed(d_in);
end if;
end process STORE;
L1: d_out <= std_logic_vector(d_reg);
end architecture BEHAVIORAL;
-- reg_n.vhd (Negative edge clocked registers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
entity REG_N is
generic(n: positive := 5);
port( clk, rst: in std_logic;
d_in: in std_logic_vector(n-1 downto 0);
d_out: out std_logic_vector(n-1 downto 0));
end entity REG_N;
architecture BEHAVIORAL of REG_N is
signal d_reg: signed(n-1 downto 0);
begin
STORE: process (clk, rst, d_in) is
begin
if (rst = '1') then
d_reg <= conv_signed('0', n);
elsif (falling_edge(clk)) then
d_reg <= signed(d_in);
end if;
end process STORE;
L1: d_out <= std_logic_vector(d_reg);
end architecture BEHAVIORAL;
Appendix B
VHDL Codes, C++ Source Codes and Script File for Post-Synthesis Simulation
Adders
C++ source code
// This program generates all possible inputs to the Adders,
// with the ability to increase the increment (step) between successive values.
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>
int main()
{
ofstream out_file1, out_file2, out_file3;
out_file1.open("v_a.dat");
out_file2.open("v_b.dat");
out_file3.open("v_ans.dat");
int time, delay, a, b, choice;
int lo, hi, step;
time = 20;
delay = 0;
cout << "Please enter the selection by number:" << endl;
cout << "-------------------------------------" << endl;
cout << "(1) CLA 14" << endl;
cout << "(2) CLA 15" << endl;
cout << "(3) CLA 16" << endl;
cout << "(4) CLA 17" << endl;
cout << "(5) CLA 19" << endl;
cin >> choice;
cout << endl << "Please enter the step: ";
cin >> step;
switch (choice)
{
case 1:
lo = -8192;
hi = 8192;
break;
case 2:
lo = -16384;
hi = 16384;
break;
case 3:
lo = -32768;
hi = 32768;
break;
case 4:
lo = -65536;
hi = 65536;
break;
case 5:
lo = -262144;
hi = 262144;
break;
default:
lo = -8192;
hi = 8192;
break;
}
out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
for (a=lo; a<hi; a+=step)
{
out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << a << "\\H +" << endl;
for (b=lo; b<hi; b+=step)
{
out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << b << "\\H +" << endl;
out_file3 << setiosflags(ios::uppercase) << "@" << dec << (time + delay) << "ns="
<< hex << (a + b) << "\\H +" << endl;
time = time + 20;
}
}
out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file3 << "@" << dec << (time + delay) << "ns=" << 0 << "\\H" << endl;
out_file1.close();
out_file2.close();
out_file3.close();
return 0;
}
VHDL file for 14-bit CLA Testbench
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity TB_CLA14 is
port( a, b: in std_logic_vector(13 downto 0);
ans: in std_logic_vector(13 downto 0);
t: inout std_logic_vector(13 downto 0);
err: out std_logic );
end entity TB_CLA14;
architecture BEHAV of TB_CLA14 is
component CLA_14 is
port(
a, b: in std_logic_vector(13 downto 0);
s: out std_logic_vector(13 downto 0));
end component CLA_14;
begin
U1: CLA_14 port map(a=>a, b=>b, s=>t);
COMP: process(t, ans) is
begin
if (t = ans) then
err <= '0';
else
err <= '1';
end if;
end process COMP;
end architecture BEHAV;
VHDL file for 15-bit CLA Testbench
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity TB_CLA15 is
port( a, b: in std_logic_vector(14 downto 0);
ans: in std_logic_vector(14 downto 0);
t: inout std_logic_vector(14 downto 0);
err: out std_logic );
end entity TB_CLA15;
architecture BEHAV of TB_CLA15 is
component CLA_15 is
port(
a, b: in std_logic_vector(14 downto 0);
s: out std_logic_vector(14 downto 0));
end component CLA_15;
begin
U1: CLA_15 port map(a=>a, b=>b, s=>t);
COMP: process(t, ans) is
begin
if (t = ans) then
err <= '0';
else
err <= '1';
end if;
end process COMP;
end architecture BEHAV;
VHDL file for 16-bit CLA Testbench
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity TB_CLA16 is
port( a, b: in std_logic_vector(15 downto 0);
ans: in std_logic_vector(15 downto 0);
t: inout std_logic_vector(15 downto 0);
err: out std_logic );
end entity TB_CLA16;
architecture BEHAV of TB_CLA16 is
component CLA_16 is
port(
a, b: in std_logic_vector(15 downto 0);
s: out std_logic_vector(15 downto 0));
end component CLA_16;
begin
U1: CLA_16 port map(a=>a, b=>b, s=>t);
COMP: process(t, ans) is
begin
if (t = ans) then
err <= '0';
else
err <= '1';
end if;
end process COMP;
end architecture BEHAV;
VHDL file for 17-bit CLA Testbench
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity TB_CLA17 is
port( a, b: in std_logic_vector(16 downto 0);
ans: in std_logic_vector(16 downto 0);
t: inout std_logic_vector(16 downto 0);
err: out std_logic );
end entity TB_CLA17;
architecture BEHAV of TB_CLA17 is
component CLA_17 is
port(
a, b: in std_logic_vector(16 downto 0);
s: out std_logic_vector(16 downto 0));
end component CLA_17;
begin
U1: CLA_17 port map(a=>a, b=>b, s=>t);
COMP: process(t, ans) is
begin
if (t = ans) then
err <= '0';
else
err <= '1';
end if;
end process COMP;
end architecture BEHAV;
VHDL file for 19-bit CLA Testbench
-- tb_cla_19.vhd
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BFULIB.bfu_pckg.all;
entity TB_CLA19 is
port( a, b: in std_logic_vector(18 downto 0);
ans: in std_logic_vector(18 downto 0);
ovf: out std_logic;
t: inout std_logic_vector(18 downto 0);
err: out std_logic );
end entity TB_CLA19;
architecture BEHAV of TB_CLA19 is
begin
U1: CLA_19 port map(a=>a, b=>b, s=>t, ovf=>ovf);
COMP: process(t, ans) is
begin
if (t = ans) then
err <= '0';
else
err <= '1';
end if;
end process COMP;
end architecture BEHAV;
Multiplication Unit
C++ source code
// This program generates all possible inputs to the Multiplication
// Unit and the correct output result corresponding to each input pair.
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>
int main()
{
ofstream out_file1, out_file2, out_file3, out_file4, out_file5;
out_file1.open("coef.dat");
out_file2.open("mag.dat");
out_file3.open("x_ans.dat");
int delay, time, m, n;
time = 0;
delay = 40;
out_file1 << "@" << time << "ns=" << 0 << "\\H +" << endl;
out_file2 << "@" << time << "ns=" << 0 << "\\H +" << endl;
out_file3 << "@" << time << "ns=" << 0 << "\\H +" << endl;
for (m=-32; m<32; m++)
{
out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << m << "\\H +" << endl;
for (n=0; n<256; n++)
{
out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << n << "\\H +" << endl;
out_file3 << setiosflags(ios::uppercase) << "@" << dec << time + delay << "ns="
<< hex << (m*n) << "\\H +" << endl;
time = time + 20;
}
}
out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file3 << "@" << dec << time + delay << "ns=" << 0 << "\\H" << endl;
out_file1.close();
out_file2.close();
out_file3.close();
return 0;
}
VHDL code for Multiplication Unit testbench
-- testbench for multiplier (tb_mult.vhd)
library IEEE, BFULIB;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BFULIB.bfu_pckg.all;
entity TB_MULT is
port( clk, rst: in std_logic;
coef: in std_logic_vector(5 downto 0);
mag: in std_logic_vector(7 downto 0);
pro: in std_logic_vector(13 downto 0);
p: out std_logic_vector(13 downto 0);
err: out std_logic );
end entity TB_MULT;
architecture STRUCT of TB_MULT is
signal product: std_logic_vector(13 downto 0);
begin
M1: MULT port map(a=>mag, b=>coef, p=>product);
L1: p <= product;
COMP: process(clk, rst) is
begin
if (rst = '1') then
err <= '0';
elsif (clk'event and clk = '1') then
if (product = pro) then
err <= '0';
else
err <= '1';
end if;
end if;
end process COMP;
end architecture STRUCT;
Script File
| The file has been automatically generated by
| the Script Editor File Wizard version 2.0.1.89
|
| Copyright © 1998 Aldec, Inc.
| Initial settings
delete_signals
set_mode functional
restart
stepsize 10 ns
| Vector Definitions
|
| Add your vector definition commands here
vector coef coef5 coef4 coef3 coef2 coef1 coef0
radix hex coef
vector mag mag7 mag6 mag5 mag4 mag3 mag2 mag1 mag0
radix hex mag
vector product p[13:0]
radix hex product
vector t_ans pro[13:0]
radix hex t_ans
| Watched Signals and Vectors
|
| Define your signal and vector watch list here
watch coef mag product err
| Stimulators Assignment
|
| Select and/or define your own stimulators
| and assign them to the selected signals
wfm coef < coef.dat
wfm mag < mag.dat
wfm t_ans < x_ans.dat
| Set Breakpoint Conditions
|
| Define breakpoint conditions and
| breakpoint actions for selected signals here
break err 1-0 do (print err.out)
| Perform Simulation
|
| Run simulation for a selected number of
| clock cycles or a time range
sim
Appendix C
C++ Source Codes for Programs Used During Post-Implementation Simulation
Program 1 (Input Image Plane and Output Image Planes generator)
#include <iostream.h>
#include <stdlib.h>
#include <iomanip.h>
#include <fstream.h>
int main()
{
ifstream in_file1;
ofstream out_file1, out_file2;
in_file1.open("coef.txt");
out_file1.open("v_input.dat");
out_file2.open("input_mag.txt");
int row, col, nfc;
int a, b, k, m, n, mag;
int i_mag[5][60];
int fc[3][5][5];
unsigned int seed;
cout << "Seed number: ";
cin >> seed;
in_file1 >> nfc;
row = 5;
col = 60;
k = 0;
// Reading in the FC planes in coef.txt file
while (k < nfc)
{
for (a = 0; a < 5; a++)
{
for (b = 0; b < 5; b++)
{
in_file1 >> mag;
fc[k][a][b] = mag;
}
}
k++;
}
in_file1.close();
// Generate Randomized Input Image plane with rand function
srand(seed);
out_file2 << "Input Image Plane (" << row << "x" << col << ")" << endl;
out_file2 << "------------------------" << endl;
for (a = 0; a < row; a++)
{
for (b = 0; b < col; b++)
{
mag = 256;
while (mag > 255)
{
mag = rand();
}
i_mag[a][b] = mag;
out_file2 << setiosflags(ios::uppercase) << setw(3)
<< hex << mag << " ";
}
out_file2 << endl;
}
// This segment of the code generates the input image plane for simulation
for (a = 0; a < row; a++)
{
for (b = 0; b < col; b++)
{
out_file1 << setiosflags(ios::uppercase) << "assign input " << hex
<< i_mag[a][b] << "\\h" << endl;
out_file1 << "cycle 1" << endl;
}
if (a < 2)
{
out_file1 << "cycle 1" << endl;
}
else
{
out_file1 << "cycle 2" << endl;
}
}
// Generate Expected Output
for (k = 0; k < nfc; k++)
{
out_file2 << endl << endl << "Output Image Plane " << k+1 << endl
<< "--------------------" << endl;
for (a = 0; a < row; a++)
{
for (b = 0; b < col; b++)
{
mag = 0;
for (m = 0; m < 5; m++)
{
for (n = 0; n < 5; n++)
{
if (!(((a-m+2) < 0) || ((a-m+2) >= row) ||
((b-n+2) < 0) || ((b-n+2) >= col)))
mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]);
}
}
mag = mag & 0x7FFFF;
out_file2 << setw(5) << hex << setiosflags(ios::uppercase)
<< mag << " ";
}
out_file2 << endl;
}
}
out_file1.close();
out_file2.close();
return 0;
}
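For reference, the expected-output loop above evaluates, for each FC plane k and output pixel (a, b),

out[k][a][b] = ( sum over m = 0..4 and n = 0..4 of i_mag[a-m+2][b-n+2] * fc[k][m][n] ) masked to its 19 low-order bits,

where any index falling outside the 5x60 input plane contributes zero; that is, the input image is treated as zero-padded at its boundary.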
Program 2 (To generate the test vectors that program the FCs into each MAU)
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>
int main()
{
ifstream in_file1;
ofstream out_file1, out_file2, out_file3;
in_file1.open("coef.txt"); //Filter Coefficient file
out_file1.open("v_coef.dat");
out_file2.open("v_c_reg.dat");
out_file3.open("v_au_sel.dat");
int array[5][5];
int time, count, temp, a, b, k, nfc;
time = 40;
count = 1;
k = 1;
in_file1 >> nfc;
out_file1 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
out_file2 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
out_file3 << "@" << 0 << "ns=" << 0 << "\\H +" << endl;
while (k-1 < nfc)
{
out_file3 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << k << "\\H +" << endl;
for (a=4; a>=0; a--)
{
for (b=0; b<5; b++)
{
in_file1 >> temp;
cout << temp << endl;
array[b][a] = temp;
}
}
for (a=0; a<5; a++)
{
for (b=0; b<5; b++)
{
out_file1 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << array[a][b] << "\\H +" << endl;
out_file2 << setiosflags(ios::uppercase) << "@" << dec << time << "ns="
<< hex << count << "\\H +" << endl;
time += 80;
count++;
}
}
count = 1;
k++;
}
out_file1 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file2 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
out_file3 << "@" << dec << time << "ns=" << 0 << "\\H" << endl;
in_file1.close();
out_file1.close();
out_file2.close();
out_file3.close();
return 0;
}
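As written, Program 2 above places the three stimulus files on a common 80 ns grid: all three
signals are 0 at time 0, coefficient i (i = 0..24) of FC k (k = 1..nfc) is applied at

t_coef(k, i) = 40 + 80*(25*(k-1) + i) ns,

the AU-select value k is asserted at t = 40 + 2000*(k-1) ns, and all three signals return to 0
at t = 40 + 2000*nfc ns.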
Appendix D
C++ Source Code for Programs Used During Hardware Prototype Implementation
Program 1 (This program is responsible for sending the FC values)
//This is the driver for the system without the FIFO
#include <iostream.h>
#include <stdlib.h>
#include <conio.h>
#include <iomanip.h>
#include <fstream.h>
#define DATA 0x0378
#define STATUS DATA+1
#define CONTROL DATA+2
void delay(int);
void sentdata(int &);
main()
{
int reg_coef[3][25];
int reg_cfg[3][25];
int fc[3][5][5];
int k, nfc, a, b, count, wait, d, o_sel, o_cfg;
/* Reading the Filter Coefficient from the coef.txt */
ifstream in_file;
in_file.open("coef.txt"); // Open the coef.txt file
in_file >> nfc >> o_sel; // Read in the number of FC planes and the output selection
o_cfg = 0x80;
/* Make Sure Parallel Port is in forward mode and set strobe */
_outp(CONTROL, _inp(CONTROL) & 0xDE);
/* Make Sure the write enable (ppc(3)) is at low */
_outp(CONTROL, _inp(CONTROL) | 0x08);
k = 1; count = 1;
while (k-1 < nfc) // Repeat for the number of FC planes indicated
{
for (a=4; a>=0; a--)
{
for (b=0; b<5; b++)
{
in_file >> fc[k-1][b][a]; // Reading in the filter coefficients
}                         // in the arrangement of the FC in the AU
}
for (a=0; a<5; a++)
{
for (b=0; b<5; b++)
{
reg_coef[k-1][count-1] = fc[k-1][a][b] & 0x3F;
reg_cfg[k-1][count-1] = ((k & 0x03) << 5);
reg_cfg[k-1][count-1] = reg_cfg[k-1][count-1] | (count & 0x1F);
cout << hex << reg_cfg[k-1][count-1] << " " << reg_coef[k-1][count-1] << endl;
count++;
}
}
count = 1;
k++;
}
k = 1;
while (k-1 < nfc)
{
a = 1;
while(a<26)
{
while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3)
{
_outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3)
while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5)
sentdata(reg_coef[k-1][a-1]);
while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6)
sentdata(reg_cfg[k-1][a-1]);
a++;
while((_inp(STATUS) & 0x40) == 64){} //Wait for pps(6) to return low
}
}
k++;
} //Configuration of the MAUs is done
_outp(CONTROL, _inp(CONTROL) | 0x08); //Deassert ppc(3)
//
k = 0;
while (k<1)
{
//Program output selection according to o_sel read in
while(((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3)
{
_outp(CONTROL, _inp(CONTROL) & 0xF7); //Assert ppc(3)
while(!((_inp(STATUS) & 0x20) == 32)){} //Detect high on pps(5)
sentdata(o_sel);
while(!((_inp(STATUS) & 0x40) == 64)){} //Detect high on pps(6)
sentdata(o_cfg);
while((_inp(STATUS) & 0x40) == 64){} //Wait for pps(6) to return low
k++;
}
}
_outp(CONTROL, _inp(CONTROL) | 0x08); //Deassert ppc(3)
cout << "Configuration done!" << endl;
in_file.close(); //Close coef.txt
// Programming of the Filter Coefficients is done
exit(1); // Stop here; the input-sending section below is never executed by this driver
// This section sends the input data to the system
ifstream in_file1;
in_file1.open("input.txt"); //Open the input data file
wait = 1;
while (wait==1)
{
if ((_inp(STATUS) & 0x08) == 0) //Detect low on pps(3)
{
//Run the following segment of code if pps(3)==0
while(!((_inp(STATUS) & 0x08) == 8)) //Detect high on pps(3)
{
//Run the following segment of code if pps(4)==1
if ((_inp(STATUS) & 0x10) == 16) //Detect high on pps(4)
{
in_file1 >> d; //Read in the data from file
sentdata(d);
while((_inp(STATUS) & 0x08) == 0) //Detect high on pps(3)
{}
}
}
}
}
return 0;
}; //end of main
void sentdata(int &c)
{
cout << c << " sent..." << endl;
_outp(DATA, c^0x03); /* sending the data with the two LSB toggled */
_outp(CONTROL, _inp(CONTROL) & 0xFB); /* set strobe ~ one to zero */
delay(1000);
_outp(CONTROL, _inp(CONTROL) | 0x04); /* reset strobe ~ zero to one */
delay(1000);
};
/* A function to create delay */
void delay(int k)
{
int i;
for (i=0; i<=k; i++){}
};
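A caveat on the delay() routine above: the counting loop has no side effects, so an optimizing
compiler is free to remove it entirely, which would collapse the strobe timing. A minimal
sketch of an equivalent busy-wait that cannot be optimized away (same interface and call
sites, with the counter simply declared volatile) is:

/* Minimal sketch: a busy-wait delay the optimizer must keep; the volatile
   qualifier forces every iteration to be performed. */
void delay(int k)
{
    volatile int i;
    for (i = 0; i <= k; i++)
    {
        /* intentionally empty: burns time between parallel port strobe edges */
    }
}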
Program 2 (This program generates the VHDL file for the internal RAM holding the input image
pixels)
#include <iostream.h>
#include <fstream.h>
#include <iomanip.h>
#include <stdlib.h>
main()
{
int a, b, mag, row, col, i, j, k;
int i_mag[5][62];
int temp[32];
unsigned int seed;
ofstream outfile("input_ram.vhd", ios::out);
row = 5;
col = 62;
cout << "Seed number: ";
cin >> seed;
srand(seed);
for (a = 0; a < row; a++)
{
for (b = 0; b < (col-2); b++)
{
mag = 256;
while (mag > 255)
{
mag = rand();
}
i_mag[a][b] = mag;
}
i_mag[a][col-1] = 0;
i_mag[a][col-2] = 0;
}
outfile << "library IEEE;\n"
<< "use IEEE.std_logic_1164.all;\n"
<< "use IEEE.numeric_std.all;\n"
<< "\nentity IN_RAM is\n"
<< " port( clk: in std_logic;\n"
<< "
rst: in std_logic;\n"
<< "
req: in std_logic;\n"
<< "
dout: out std_logic_vector(7 downto 0) );\n"
<< "end entity IN_RAM;\n\n";
outfile << "architecture STRUCT of IN_RAM is\n\n"
<< " component RAMB4_S8 is\n"
<< " port( DI: in std_logic_vector(7 downto 0);\n"
<< "
EN: in std_logic;\n"
<< "
WE: in std_logic;\n"
<< "
RST: in std_logic;\n"
<< "
CLK: in std_logic;\n"
<< "
ADDR: in std_logic_vector(8 downto 0);\n"
<< "
DO: out std_logic_vector(7 downto 0) );\n"
<< " end component RAMB4_S8;\n\n";
a = b = j = k = 0;
while (k<(row*col))
{
i = 0;
while (i<32)
{
if (a < 5)
{
temp[31-i] = i_mag[a][b];
}
else
{
temp[31-i] = 0;
}
if (b == (col-1))
{
b = 0;
a++;
k++;
i++;
}
else
{
k++;
b++;
i++;
}
}
outfile << " attribute INIT_0" << setw(1) << hex << j << ": string;\n"
<< " attribute INIT_0" << setw(1) << hex << j << " of IRM: label is \"";
for (i=0; i<32; i++)
{
if (temp[i] < 16)
{
outfile << "0" << setw(1) << hex << temp[i];
}
else
{
outfile << setw(2) << hex << temp[i];
}
}
outfile << "\";\n";
j++;
}
outfile << "\n signal din : std_logic_vector(7 downto 0);\n"
<< " signal addr: unsigned(8 downto 0);\n"
<< " signal adr : std_logic_vector(8 downto 0);\n"
<< " signal en : std_logic;\n"
<< " signal we : std_logic;\n";
outfile << "\n begin\n\n"
<< " L1: din <= (others=>'0');\n"
<< " L2: en <= '1';\n"
<< " L3: we <= '0';\n"
<< " L4: adr <= std_logic_vector(addr);\n\n";
outfile << " P1: process(clk, rst, req) is\n"
<< " begin\n"
<< " if (rst = '1') then\n"
<< "
addr <= (others=>'0');\n"
<< " elsif (clk'event and clk = '1') then\n"
<< "
if (req = '1') then\n"
<< "
addr <= addr + 1;\n"
<< "
end if;\n"
<< " end if;\n"
<< " end process P1;\n\n";
outfile << " IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk, \n"
<< "
ADDR=>adr, DO=>dout);\n";
outfile << "\nend architecture STRUCT;\n";
outfile.close();
return 0;
}
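The INIT_0x attribute strings written by Program 2 above pack 32 pixels into each 64-hex-digit
string, with the lowest-addressed pixel in the rightmost two hex digits; this is why the
generator stores into temp[31-i] and then prints temp[0] through temp[31] from left to right.
A minimal stand-alone sketch of the same packing (the helper name pack_init_line and its use
of std::string are illustrative only, not part of the thesis code) is:

/* Minimal sketch: pack 32 bytes, given in address order, into one INIT string
   with the highest-addressed byte leftmost, matching the generator above. */
#include <cstdio>
#include <string>

std::string pack_init_line(const unsigned char bytes[32])
{
    std::string s;
    for (int i = 31; i >= 0; i--)
    {
        char buf[3];
        std::sprintf(buf, "%02x", (unsigned)bytes[i]); /* two hex digits per byte */
        s += buf;
    }
    return s; /* 64 hex digits, suitable as an INIT_0x attribute value */
}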
Program 3 (Test Program: this program is responsible for comparing the uploaded outputs with
the theoretically correct outputs)
#include <iostream.h>
#include <stdlib.h>
#include <iomanip.h>
#include <fstream.h>
void hex2file(ofstream &, int);
int main()
{
ifstream in_file1;
ofstream out_file1, out_file2;
in_file1.open("coef.txt");
out_file1.open("input.txt");
out_file2.open("exp_res.txt");
int row, col, nfc;
int a, b, k, m, n, mag, o_sel;
int i_mag[5][60];
int fc[3][5][5];
unsigned int seed;
cout << "Seed number: ";
cin >> seed;
in_file1 >> nfc >> o_sel;
row = 5;
col = 60;
k = 0;
// Reading in the FC planes in coef.txt file
while (k < nfc)
{
for (a = 0; a < 5; a++)
{
for (b = 0; b < 5; b++)
{
in_file1 >> mag;
fc[k][a][b] = mag;
}
}
k++;
}
in_file1.close();
// Generate Randomized Input Image plane with rand function
srand(seed);
out_file2 << "Input Image Plane (" << row << "x" << col << ") Seed: " << seed << endl;
out_file2 << "---------------------------------" << endl;
for (a = 0; a < row; a++)
{
for (b = 0; b < col; b++)
{
mag = 256;
while (mag > 255)
{
mag = rand();
}
i_mag[a][b] = mag;
out_file1 << dec << setw(3) << mag << " ";
out_file2 << setiosflags(ios::uppercase) << setw(3)
<< hex << mag << " ";
}
out_file1 << endl;
out_file2 << endl;
}
// Generate Expected Output
for (k = 0; k < nfc; k++)
{
out_file2 << endl << endl << "Output Image Plane " << k+1 << endl <<
"--------------------" << endl;
for (a = 0; a < row; a++)
{
for (b = 0; b < col; b++)
{
mag = 0;
for (m = 0; m < 5; m++)
{
for (n = 0; n < 5; n++)
{
if (!(((a-m+2) < 0) | ((a-m+2) >= row) |
((b-n+2) < 0) | ((b-n+2) >= col)))
mag = mag + (i_mag[a-m+2][b-n+2]*fc[k][m][n]);
}
}
mag = mag & 0x7FFFF;
hex2file(out_file2, mag);
}
out_file2 << endl;
}
}
out_file1.close();
out_file2.close();
return 0;
}
void hex2file(ofstream &outfile, int mag)
{
int i, temp;
for (i=0; i<5; i++)
{
temp = mag & 0xF0000;
temp = temp >> 16;
outfile << hex << setiosflags(ios::uppercase) << temp;
mag = mag << 4;
}
outfile << " ";
}
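The mask with 0x7FFFF in both expected-output generators and the five-hex-digit format produced
by hex2file() above correspond to the 19-bit output bus of the architecture (d_out(18 downto 0)
in the SYS component of Appendix E). Treating pixels and coefficients as unsigned, as these
programs do, nineteen bits are sufficient because each output pixel is the sum of at most 25
products of an 8-bit pixel (0..255) and a 6-bit coefficient (0..63):

25 * 255 * 63 = 401,625 < 2^19 = 524,288.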
Appendix E
VHDL Files for Modules External to the Convolution Architecture
1. Top Level Description of the whole system (the convolution architecture is included)
-- par2brd.vhd
library IEEE, BRDMOD;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use BRDMOD.brd_util.all;
entity PAR2BRD is
port( -- inputs from board
sw1: in std_logic; -- start signal
sw2: in std_logic; -- global reset
sw3: in std_logic; -- mux select for internal clock
clk: in std_logic; -- external clock
-- from parallel port
ppd: in std_logic_vector(7 downto 0);
ppc: in std_logic_vector(3 downto 2);
pps: out std_logic_vector(6 downto 3);
-- output to external SRAM (right bank)
cen_r : out std_logic;
wen_r : out std_logic;
oen_r : out std_logic;
addr_r: out std_logic_vector(18 downto 0);
data_r: out std_logic_vector(15 downto 0);
-- output to external SRAM (left bank)
cen_l : out std_logic;
wen_l : out std_logic;
oen_l : out std_logic;
addr_l: out std_logic_vector(18 downto 0);
data_l: out std_logic_vector(15 downto 0);
---
-- output from the interface
clk_led: out std_logic;
sl: out std_logic_vector(6 downto 0);
sr: out std_logic_vector(6 downto 0);
done: out std_logic );
end entity PAR2BRD;
architecture STRUCT of PAR2BRD is
component SYS is
port( clk, rst, str: in std_logic;
d_in: in std_logic_vector(7 downto 0); --(FIFO -> DM_IF)
coef: in std_logic_vector(5 downto 0); --(FCs from parallel port)
ld_reg: in std_logic_vector(4 downto 0); --(MAUs select from pp)
au_sel: in std_logic_vector(1 downto 0); --(AU select from pp)
o_sel: in std_logic;                      --(Output config from pp)
req: out std_logic;                       --(Controller -> FIFO)
sram_w: out std_logic;                    --(SYS -> SRAM)
d_out: out std_logic_vector(18 downto 0) );
end component SYS;
component IN_RAM is
port( clk: in std_logic;
rst: in std_logic;
req: in std_logic;
dout: out std_logic_vector(7 downto 0) );
end component IN_RAM;
component IBUF is
port( i: in std_logic;
o: out std_logic );
end component IBUF;
component BUFG is
port( i: in std_logic;
o: out std_logic );
end component BUFG;
-- These signals are from parallel port interface
signal d_clk, strobe, strobe_b: std_logic;
signal nsw1, nsw2, nsw3: std_logic;
-- Internal connection signals
signal req: std_logic;                      --(SYS -> IN_RAM)
signal v_d : std_logic_vector(7 downto 0); --(PINTFC -> REG_A)
signal v_t : std_logic_vector(18 downto 0); --(SYS -> SRAM)
signal v_in : std_logic_vector(7 downto 0); --(IN_RAM -> SYS)
-- signal v_led : std_logic_vector(7 downto 0); --(MUX -> SVNSEG)
signal c_out : std_logic_vector(5 downto 0); --(coefficient output from FC_MOD)
signal cf_out: std_logic_vector(7 downto 0); --(MAUs configuration output from FC_MOD)
signal sram_w: std_logic;                   --(SYS -> OUT_RAM)
signal cen : std_logic;
signal wen : std_logic;
signal oen : std_logic;
signal data : std_logic_vector(18 downto 0);
signal addr : std_logic_vector(18 downto 0);
-- Clock selection signal
signal c_sel : std_logic;
signal p_clk : std_logic; -- filter coefficients programming clk
begin
-- External strobe buffering and padding
B1: IBUF port map(i=>ppc(2), o=>strobe_b);
B2: BUFG port map(i=>strobe_b, o=>strobe);
-- Inverting the logic level of the push buttons.
S1: nsw1 <= not sw1;
S2: nsw2 <= not sw2;
S3: nsw3 <= not sw3;
-- Clock counter to reduce the clock frequency of the external clock
C1: C_CNTR generic map(n=>12500) port map(clk=>clk, rst=>nsw2, co=>p_clk);
-- First In First Out queue after the parallel port
-- F1: FIFO port map(rst=>nsw2, r_clk=>d_clk, r_en=>req, w_clk=>strobe, w_en=>ppc(3),
--          din=>ppd, dout=>v_d, empty=>pps(3));
-- This parallel port interface is aimed to replace the FIFO queue
P1: PINTFC port map(clk=>strobe, rst=>nsw2, ppd=>ppd, d_out=>v_d);
-- Drivers to the two seven segments LEDs
-- SV1: SVNSG port map(ldg=>v_led(7 downto 4), rdg=>v_led(3 downto 0), sl=>sl, sr=>sr);
-- Filter Coefficient Programming Module
FC1: FC_MOD port map(clk=>d_clk, rst=>nsw2, ppc=>ppc(3), ppd=>v_d,
pps1=>pps(5), pps2=>pps(6), coef_out=>c_out, cfg_out=>cf_out);
-- SRAM Interface module (Responsible for writing output pixels to the external SRAM)
SRM: OUT_RAM port map(clk=>d_clk, rst=>nsw2, w=>sram_w, d_in=>v_t,
cen=>cen, wen=>wen, oen=>oen, addr=>addr, data=>data);
L1 : cen_l <= cen;
L2 : cen_r <= cen;
L3 : wen_l <= wen;
L4 : wen_r <= wen;
L5 : oen_l <= oen;
L6 : oen_r <= oen;
L7 : addr_l <= addr;
L8 : addr_r <= addr;
L9 : data_l <= "0000000000000" & data(18 downto 16);
L10: data_r <= data(15 downto 0);
-- Clock LED on the bar(6) LED
L11: pps(3) <= d_clk;
L12: done <= sram_w;
-- MUX Select for SVNSEG LEDs display
-- MUX: v_led <= v_t(7 downto 0) when nsw3 = '0' else v_t(15 downto 8);
-- Convolution System (req is replaced by pps(4) to the parallel port pin)
-- (o_sel is the output config pin to select which output plane to output to svnseg)
U0: SYS port map(clk=>d_clk, rst=>nsw2, str=>nsw1, coef=>c_out,
ld_reg=>cf_out(4 downto 0), au_sel=>cf_out(6 downto 5),
o_sel=>cf_out(7), req=>req, d_in=>v_in, sram_w=>sram_w,
d_out=>v_t);
-- Input RAM to provide input image pixels to the convolution system
U1: IN_RAM port map(clk=>d_clk, rst=>nsw2, req=>req, dout=>v_in);
-- MUX select for the internal operation clock by SW3
-- SELP: process (nsw3, nsw2) is
-- begin
-- if (nsw2 = '1') then
-- c_sel <= '0';
-- elsif (nsw3 = '1') then
-- c_sel <= '1';
-- end if;
-- end process SELP;
-- assign the clock as indicated by the c_sel signal
L13: d_clk <= p_clk when nsw3 = '0' else clk;
L14: clk_led <= c_sel;
end architecture STRUCT;
2. Block RAM module (initialized with input image plane)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity IN_RAM is
port( clk: in std_logic;
rst: in std_logic;
req: in std_logic;
dout: out std_logic_vector(7 downto 0) );
end entity IN_RAM;
architecture STRUCT of IN_RAM is
component RAMB4_S8 is
port( DI: in std_logic_vector(7 downto 0);
EN: in std_logic;
WE: in std_logic;
RST: in std_logic;
CLK: in std_logic;
ADDR: in std_logic_vector(8 downto 0);
DO: out std_logic_vector(7 downto 0) );
end component RAMB4_S8;
attribute INIT_00: string;
attribute INIT_00 of IRM: label is "b127037aa60fa2fcd6d2ecfba0b29b472f795754f8c707915e72910a387ba32d";
attribute INIT_01: string;
attribute INIT_01 of IRM: label is "54c7000034647cf42036263aea806ef629ca8017f15167ae072fa406aa3c5f8e";
attribute INIT_02: string;
attribute INIT_02 of IRM: label is "29e3340c8d55d63104709180640dde6598abfa87834f3f8b862521afb27b05da";
attribute INIT_03: string;
attribute INIT_03 of IRM: label is "8c059a5e0000d10f29ef58e2343094fdf00c4875d7132b775a9d90eb0a2af23a";
attribute INIT_04: string;
attribute INIT_04 of IRM: label is "f4ffbba026cb9a95e76073e78cf8bb71ec7adb544571aba14108426ce0b65b48";
attribute INIT_05: string;
attribute INIT_05 of IRM: label is "562f3c0012a50000c449b579657f3430b517d329250b06e193764decbb8d2fae";
attribute INIT_06: string;
attribute INIT_06 of IRM: label is "0e1445292bf6efc7ca1a168a0b28ac6c95e4c5e1eafa97d4a739e68847e36db2";
attribute INIT_07: string;
attribute INIT_07 of IRM: label is "766d9cd4a6f5a327000025adece722128cec7c0d9ec1ecf1cb50f064f402ac29";
attribute INIT_08: string;
attribute INIT_08 of IRM: label is "48acd73b6335198a049f8709c6e40466c6c633d89e44e024272e09b37771f914";
attribute INIT_09: string;
attribute INIT_09 of IRM: label is "0000000000000000000000003f1b75a632e7d585d0d36448ec00ce97f55e0ad8";
signal din : std_logic_vector(7 downto 0);
signal addr: unsigned(8 downto 0);
signal adr : std_logic_vector(8 downto 0);
signal en : std_logic;
signal we : std_logic;
begin
L1: din <= (others=>'0');
L2: en <= '1';
L3: we <= '0';
L4: adr <= std_logic_vector(addr);
P1: process(clk, rst, req) is
begin
if (rst = '1') then
addr <= (others=>'0');
elsif (clk'event and clk = '1') then
if (req = '1') then
addr <= addr + 1;
end if;
end if;
end process P1;
IRM: RAMB4_S8 port map(DI=>din, EN=>en, WE=>we, RST=>rst, CLK=>clk,
ADDR=>adr, DO=>dout);
end architecture STRUCT;
3. FC Programming module
-- fc_pg_mod.vhd (Filter Coefficient Programming Module)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity FC_MOD is
port( clk, rst: in std_logic;
ppc: in std_logic;                          -- Control Pin from parallel port
ppd: in std_logic_vector(7 downto 0);       -- Input data from parallel port
pps1: out std_logic;                        -- Status pin for request data
pps2: out std_logic;                        -- Status pin for request cfg
coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program
cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals
end entity FC_MOD;
architecture STRUCTURAL of FC_MOD is
component FC_REG is
port( clk, rst: in std_logic;
rec_0: in std_logic;
rec_1: in std_logic;
prog: in std_logic;
d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port
coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program
cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals
end component FC_REG;
component FC_FSM is
port( clk, rst: in std_logic;
ctr_pin: in std_logic;
rec_0: out std_logic; -- receive_data state enable pin
rec_1: out std_logic; -- receive_config state enable pin
prog: out std_logic ); -- program state enable pin
end component FC_FSM;
signal rec_0, rec_1, prog: std_logic;
begin
FSM: FC_FSM port map(clk=>clk, rst=>rst, ctr_pin=>ppc, rec_0=>rec_0, rec_1=>rec_1,
prog=>prog);
FCG: FC_REG port map(clk=>clk, rst=>rst, rec_0=>rec_0, rec_1=>rec_1, prog=>prog,
d_in=>ppd, coef_out=>coef_out, cfg_out=>cfg_out);
-- Status pins to the parallel port
O1: pps1 <= rec_0;
O2: pps2 <= rec_1;
end architecture STRUCTURAL;
-- fc_pg_fsm.vhd (Filter Coefficient Programming Finite State Machine)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity FC_FSM is
port( clk, rst: in std_logic;
ctr_pin: in std_logic;
rec_0: out std_logic; -- receive_data state enable pin
rec_1: out std_logic; -- receive_config state enable pin
prog: out std_logic ); -- program state enable pin
end entity FC_FSM;
architecture BEHAVIORAL of FC_FSM is
type states is (idle, receive_data, receive_config, program);
signal c_state: states; -- Current State
signal n_state: states; -- Next State
begin
NST_PROC: process(c_state, ctr_pin) is
begin
case c_state is
when idle => if (ctr_pin = '0') then
n_state <= idle;
else
n_state <= receive_data;
end if;
when receive_data => n_state <= receive_config;
when receive_config => n_state <= program;
when program => if (ctr_pin = '0') then
n_state <= idle;
else
n_state <= receive_data;
end if;
end case;
end process NST_PROC;
CST_PROC: process(clk, rst, n_state) is
begin
if(rst='1') then
c_state <= idle;
elsif (clk'event and clk='0') then
c_state <= n_state;
end if;
end process CST_PROC;
OUT_PROC: process(c_state) is
begin
case c_state is
when idle => rec_0 <= '0';
rec_1 <= '0';
prog <= '0';
when receive_data => rec_0 <= '1';
rec_1 <= '0';
prog <= '0';
when receive_config => rec_0 <= '0';
rec_1 <= '1';
prog <= '0';
when program => rec_0 <= '0';
rec_1 <= '0';
prog <= '1';
end case;
end process OUT_PROC;
end architecture BEHAVIORAL;
-- fc_pg_reg.vhd (Filter Coefficient Programming Registers)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity FC_REG is
port( clk, rst: in std_logic;
rec_0: in std_logic;
rec_1: in std_logic;
prog: in std_logic;
d_in: in std_logic_vector(7 downto 0); -- Input data from the parallel port
coef_out: out std_logic_vector(5 downto 0); -- Filter Coefficients to program
cfg_out: out std_logic_vector(7 downto 0) ); -- MAUs configuration signals
end entity FC_REG;
architecture BEHAVIORAL of FC_REG is
signal d_reg: std_logic_vector(5 downto 0);
signal c_reg: std_logic_vector(7 downto 0);
begin
-- Registers for storing Filter Coefficients
REC_D: process(clk, rst, rec_0) is
begin
if (rst = '1') then
d_reg <= (others=>'0');
elsif (clk'event and clk='1') then
if (rec_0='1') then
d_reg <= d_in(5 downto 0);
end if;
end if;
end process REC_D;
-- MAUs configuration signals
REC_C: process(clk, rst, rec_1) is
begin
if (rst = '1') then
c_reg <= (others=>'0');
elsif (clk'event and clk='1') then
if (rec_1='1') then
c_reg <= d_in(7 downto 0);
end if;
end if;
end process REC_C;
-- Enable Output from the registers
O1: coef_out <= d_reg when prog = '1' else (others=>'0');
O2: cfg_out <= c_reg when prog = '1' else (others=>'0');
end architecture BEHAVIORAL;
4. SRAM Driver
-- out_ram.vhd (Output RAM for storing output pixels from the architecture)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
entity OUT_RAM is
port( clk, rst: in std_logic;
w: in std_logic; -- read or write to SRAM
d_in: in std_logic_vector(18 downto 0); -- Input data from architecture
cen, wen: out std_logic; -- cen=chip enable, wen=write enable (both active low)
oen: out std_logic; -- oen=out enable (active low)
addr: out std_logic_vector(18 downto 0); -- SRAM address bus
data: out std_logic_vector(18 downto 0) ); -- SRAM Data bus
end entity OUT_RAM;
architecture BEHAV of OUT_RAM is
signal w_address: unsigned(18 downto 0);
-- signal i_data : unsigned(7 downto 0);
begin
-- Asynchronous Reset and positive edge trigger events
P1: process (clk, rst, w) is
begin
if (rst = '1') then
wen <= '1';
oen <= '1';
addr <= (others => '0'); -- initial address during reset
elsif (clk'event and clk = '1') then
if (w = '1') then
wen <= '0';
oen <= '1';
addr <= std_logic_vector(w_address);
end if;
end if;
end process P1;
-- Address counter
P2: process(clk, rst, w) is
begin
if (rst = '1') then
w_address <= (others => '0');
elsif (clk'event and clk = '1') then
if (w = '1') then
w_address <= w_address + 1;
end if;
end if;
end process P2;
-- Chip enable signal
L1: cen <= clk;
-- Data Bus
L2: data <= d_in; --std_logic_vector(i_data);
-- L3: i_data <= unsigned(w_address(7 downto 0)) + 4;
end architecture BEHAV;
5. Parallel Port Interface Module
library IEEE;
use IEEE.std_logic_1164.all;
entity PINTFC is
port( clk, rst: in std_logic;
ppd: in std_logic_vector(7 downto 0);
d_out: out std_logic_vector(7 downto 0) );
end entity PINTFC;
architecture BEHAVIORAL of PINTFC is
signal data: std_logic_vector(7 downto 0);
begin
REC: process(clk, rst) is
begin
if (rst = '1') then
data <= "00000000";
elsif (clk'event and clk = '1') then
data <= ppd;
end if;
end process REC;
L1: d_out <= data;
end architecture BEHAVIORAL;
References
[1] Bernard Bosi and Guy Bois, “Reconfigurable Pipelined 2-D Convolvers for Fast
Digital Signal Processing”, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, Vol. 7, No. 3, pp. 299-308, Sep. 1999.
[2] Cheng-The Hsieh and Seung P. Kim, “A Highly-Modular Pipelined VLSI
Architecture for 2-D FIR Digital Filter,” Proceedings of the 1996 IEEE 39th
Midwest Symposium on Circuits and Systems, Part 1, pp. 137-140, Aug. 1996.
[3] D. D. Haule and A. S. Malowany, “High-speed 2-D Hardware Convolution
Architecture Based on VLSI Systolic Arrays”, IEEE Pacific Rim Conference on
Communications, Computers and Signal Processing, pp. 52-55, Jun. 1989.
[4] D. Patterson and J. Hennessy, Computer Organization & Design: The Hardware
/ Software Interface, Morgan Kaufmann, 1994.
[5] GSI Technology, Product Datasheet, http://www.gsitechnology.com.
[6] H. T. Kung, “Why Systolic Architectures?”, IEEE Computer, Vol. 15, pp. 37-46,
Jan. 1982.
[7] Hyun Man Chang and Myung H. Sunwoo, “An Efficient Programmable 2-D
Convolver Chip”, Proceedings of the 1998 IEEE International Symposium on
Circuits and Systems, ISCAS, Part 2, pp. 429-432, May 1998.
[8] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach,
Second Edition, Morgan Kaufmann, 1996.
[9] K. Hsu, L. J. D’Luna, H. Yeh, W. A. Cook and G. W. Brown, “A Pipelined
ASIC for Color Matrixing and Convolution”, Proceedings of the 3rd Annual
IEEE ASIC Seminar and Exhibit, Sep. 1990.
[10] Kai Hwang, Computer Arithmetic: Principles, Architecture, and Design, John
Wiley & Sons, 1979.
[11] M. Morris Mano and Charles R. Kime, Logic and Computer Design
Fundamentals, Prentice Hall, 1997.
[12] Michael J. Flynn and Stuart F. Oberman, Advanced Computer Arithmetic Design,
Wiley-Interscience, 2001.
[13] O. L. MacSorley, “High-Speed Arithmetic in Binary Computers”, Proceedings of
the IRE, vol. 49, pp. 67-91, Jan. 1961.
[14] V. Hecht, K. Rönner and P. Pirsch, “An Advanced Programmable 2D-Convolution
Chip for Real Time Image Processing”, Proceedings of IEEE International
Symposium on Circuits and Systems, pp. 1897-1900, 1991.
[15] Vijay K. Madisetti and Douglas B. Williams, The Digital Signal Processing
Handbook, CRC Press and IEEE Press, 1998.
[16] Wayne Niblack, An Introduction to Digital Image Processing, Prentice/Hall
International, 1986.
[17] Xess Co., XSV Board User Manual, http://www.xess.com/manuals/xsv-manualv1_1.pdf
[18] Xilinx Co., Foundation 4.1i Software Manual,
http://toolbox.xilinx.com/docsan/xilinx4/pdf/manuals.pdf
Vita
Albert Wong was born on January 1, 1975, in Sibu, Sarawak, Malaysia. He
attended SMB Methodist secondary school in Sibu and graduated in 1993. He obtained
his Bachelor of Science in Electrical Engineering degree in May 1998 from the
University of Kentucky, Lexington, Kentucky. He enrolled in the University of
Kentucky’s Graduate School in the fall semester of 1999.