
Smart IP of QR Decomposition for Rapid
Prototyping on FPGAs
by
Sunila Saqib
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2014
© Massachusetts Institute of Technology 2014. All rights reserved.
Author ..........................................................
Department of Electrical Engineering and Computer Science
January 09, 2014
Certified by....
Jacob White
Professor, Dept. of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by ......................................................
Leslie A. Kolodziejski
Chairman, Department Committee on Graduate Theses
Smart IP of QR Decomposition for Rapid Prototyping on
FPGAs
by
Sunila Saqib
Submitted to the Department of Electrical Engineering and Computer Science
on January 09, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science in Engineering
Abstract
The Digital Signal Processing (DSP) systems used in mobile wireless communication,
such as MIMO detection, beam formation in smart antennas, and compressed sensing,
all rely on quickly solving linear systems of equations. These applications of DSP have
vastly different throughput, latency and area requirements, necessitating substantially
different hardware solutions. The QR decomposition (QRD) method is an efficient
way of solving linear equation systems using specialized hardware, and is known to be
numerically stable [17]. We present the design and FPGA implementation of smart IP
(intellectual property) for QRD based on the Givens-Rotation (GR) and Modified-Gram-Schmidt (MGS) algorithms. Our configurable designs are flexible enough to meet a
wide variety of application requirements. We demonstrate that our area and timing
results are comparable, and in some cases superior, to state-of-the-art hardware-based
QRD implementations. Our QRD design based on a Log-domain GR Systolic array
achieved a throughput of 10.1M rows/sec for a complex-valued 3x3 matrix on a Virtex-6
FPGA, whereas our QRD design based on a Log-domain GR Linear array was found
to be an area-optimized solution requiring the fewest FPGA slices. Overall, the Log-domain GR Systolic array implementation was found to be the most resource-efficient
design (IP for all of our proposed architectures has been prepared and is available at
http://saqib.scripts.mit.edu/qr-code.php). Our set of IP can be configured to satisfy a
variety of application demands, and can be used to generate hardware designs with
nearly zero design and debugging time. Moreover, the reported results can be used to
pick the optimal design choice for a given set of design requirements. Since our
architectures are completely modular, their sub-units can be independently optimized
and tested without the need to re-test the whole system.
Thesis Supervisor: Jacob White
Title: Professor, Dept. of Electrical Engineering and Computer Science
To my family, for all the love, support,
and the many sacrifices made.
Acknowledgments
First of all I wish to offer my gratitude to Almighty God for all the blessings bestowed
upon me.
I would like to express my sincerest gratitude to my adviser Dr. Jacob White,
for his patience, timely guidance and insightful critique. Without his guidance this
project could not have been a success.
My deepest appreciation and thanks are extended to Professor Leslie Kolodziejski
and Professor Terry Orlando for their continuous support, encouragement, and valuable suggestions. I am also grateful to Janet Fischer, Lisa Bella, and the EECS graduate
office staff for their assistance with various procedures at MIT. I am especially grateful to
Nirav Dave and Richard Uhler for their encouragement, guidance, and professional
contributions.
It would not have been possible for me to complete this project without the
motivation from my family. I am grateful for their constant support and tireless
optimism.
It is impossible to note all those whose inspiration has been vital to the
completion of this thesis; I am thankful to them all.
Contents

1 Introduction 19
   1.1 Problem Statement and Motivation 19
   1.2 Intro. 20
   1.3 Thesis Structure 20

2 Application of Linear Equation System in Wireless Networks 21
   2.1 MIMO Detection 21
   2.2 Beam Formation 23
   2.3 Compressed Sensing 24

3 Selection of Algorithm for Solving Linear Equation System 27
   3.1 Linear Equation System 27
   3.2 LU Decomposition 29
      3.2.1 Cholesky Decomposition 29
      3.2.2 Doolittle LU Decomposition 29
      3.2.3 Crout LU Decomposition 30
   3.3 QR Decomposition 30
      3.3.1 Givens-Rotation Based QRD 30
      3.3.2 Modified Gram Schmidt Based QRD 33
      3.3.3 Householder Based QRD 34
   3.4 Comparison and Selection 36

4 Implementation Challenges 39
   4.1 Computation Complexity and Scalability 39
   4.2 Modularity 39
   4.3 Latency Insensitive 40

5 Proposed Parameterized Architecture 41
   5.1 GR Based QRD Systolic Array 41
      5.1.1 Data type 44
      5.1.2 Multiplier 46
      5.1.3 Storage Space 48
      5.1.4 Control Circuitry 48
      5.1.5 Implementation type 49
      5.1.6 Reuse within design 53
   5.2 GR Based QRD Linear Array 53
      5.2.1 Reuse across design 54
      5.2.2 Storage Space 54
      5.2.3 Control Circuitry 56
   5.3 MGS Based QRD Systolic Array 58
      5.3.1 Reuse across algorithm 59
      5.3.2 Vector operations 59
      5.3.3 Control circuitry 60
      5.3.4 Storage Space 61
   5.4 MGS Based QRD Linear Array 61
      5.4.1 Reuse across design and algorithm 61
      5.4.2 Control circuitry 61
      5.4.3 Storage Space 61

6 Results and Discussion 63
   6.1 Experiment Conditions and Setup 63
      6.1.1 Configuration Parameters 63
      6.1.2 Experiment Design 64
   6.2 Performance on FPGA 65
      6.2.1 Linear versus Systolic Arrays 65
         6.2.1.1 GR based Arrays 65
         6.2.1.2 MGS based Arrays 69
      6.2.2 GR versus MGS 72
      6.2.3 Comparison between Computational Unit Implementation Techniques 76
         6.2.3.1 Linear Arrays 76
         6.2.3.2 Systolic Arrays 79
      6.2.4 Multiplier Implementation (Firm versus Soft) 82
   6.3 Omega Notation Analysis of Design Parameters with Array Size 85
      6.3.1 Throughput 85
         6.3.1.1 Systolic Arrays 85
         6.3.1.2 Linear Arrays 86
      6.3.2 Latency 87
         6.3.2.1 Systolic Arrays 87
         6.3.2.2 Linear Arrays 87
         6.3.2.3 Internal Unit 88
         6.3.2.4 Boundary Unit 88
         6.3.2.5 Multipliers 90
      6.3.3 Area 90
         6.3.3.1 Systolic Array 90
         6.3.3.2 Linear Array 90
         6.3.3.3 Internal Unit 91
         6.3.3.4 Boundary Unit 91
         6.3.3.5 Multipliers 92
   6.4 Target Oriented Optimization 93
      6.4.1 MIMO 93
         6.4.1.1 Required Specifications 93
         6.4.1.2 Optimized Configurations 94
      6.4.2 Beam Formation 94
         6.4.2.1 Required Specifications 94
         6.4.2.2 Optimized Configurations 95
   6.5 Comparison with Previously Reported Results 95
      6.5.1 MIMO 96
      6.5.2 Beam Formation 96
   6.6 Guidelines for Architecture Selection 99

7 Conclusions 101

A Tables 107

B Source Code 113
List of Figures

2-1 4 x 4 MIMO 22
3-1 Givens-Rotation based QRD 32
3-2 Modified Gram Schmidt based QRD 34
3-3 Householder based QRD 35
5-1 Systolic Array Architecture 42
5-2 (a) Boundary unit (b) Internal unit for Givens-Rotation method 42
5-3 QR(n) top module containing one row and one QR(n-1) top module 43
5-4 Special case of QR top module, QR(1) 43
5-5 Typeclass QR 44
5-6 Instance of QR typeclass for width equal to 1, terminating case for recursive QR architecture 44
5-7 Instance of QR typeclass for width greater than 1 44
5-8 QR instantiation 45
5-9 Typeclass for Conjugate operation 46
5-10 Instance of typeclass Conjugate for FixedPoint data type 46
5-11 Instance of typeclass Conjugate for Complex data type 46
5-12 Data type and Matrix Size Configuration in Main Module 47
5-13 DSP and LUT based Multiplier 48
5-14 Configuring Type of Multiplier in core units 48
5-15 (a) boundary unit with FIFO (b) internal unit with FIFOs at input port and internal storage 49
5-16 Configuring Log Domain External Unit 51
5-17 Linear Approximation of 1/sqrt(x) 52
5-18 Configuring LA based External Unit 52
5-19 Configuring Newton Raphson Method based External Unit 53
5-20 QR systolic array for 11x11 matrix 54
5-21 QR linear array for 11x11 matrix 55
5-22 (a) Indexes of values of r generated while processing 2 consecutive rows interleaved (b) QRD linear array for m x m matrix 55
5-23 Algorithm to generate R state machine 57
5-24 Algorithm to generate R state machine 58
5-25 MGS Systolic array 59
5-26 (a) Batch Adder (b) Batch Multiplier 60
6-1 DSP block usage in GR Linear and Systolic Arrays with all DSP based Multipliers 66
6-2 Slice LUT usage in GR Linear and Systolic Arrays with all DSP based Multipliers 67
6-3 Registers usage in GR Linear and Systolic Arrays with all DSP based Multipliers 68
6-4 Throughput of GR Linear and Systolic Arrays with all DSP based Multipliers 68
6-5 Latency of GR Linear and Systolic Arrays with all DSP based Multipliers 69
6-6 DSP block usage in MGS Linear and Systolic Arrays with all DSP based Multipliers 70
6-7 Slice LUT usage in MGS Linear and Systolic Arrays with all DSP based Multipliers 70
6-8 Slice Register usage in MGS Linear and Systolic Arrays with all DSP based Multipliers 71
6-9 Throughput of MGS Linear and Systolic Arrays with all DSP based Multipliers 71
6-10 Latency of MGS Linear and Systolic Arrays with all DSP based Multipliers 72
6-11 DSP block usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers 73
6-12 Slice LUT usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers 74
6-13 Slice Register usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers 74
6-14 Throughput of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers 75
6-15 Latency of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers 75
6-16 DSP block usage in GR Linear Arrays with different Computational block implementations 76
6-17 Slice LUT usage in GR Linear Arrays with different Computational block implementations 77
6-18 Slice Register usage in GR Linear Arrays with different Computational block implementations 77
6-19 Throughput of GR Linear Arrays with different Computational block implementations 78
6-20 Latency of GR Linear Arrays with different Computational block implementations 78
6-21 DSP block usage in GR Systolic Arrays with different Computational block implementations 79
6-22 Slice LUT usage in GR Systolic Arrays with different Computational block implementations 80
6-23 Slice Register usage in GR Systolic Arrays with different Computational block implementations 80
6-24 Throughput of GR Systolic Arrays with different Computational block implementations 81
6-25 Latency of GR Systolic Arrays with different Computational block implementations 81
6-26 DSP block usage in GR Linear Arrays with different Multiplier implementations 83
6-27 Slice LUT usage in GR Linear Arrays with different Multiplier implementations 83
6-28 Slice Register usage in GR Linear Arrays with different Multiplier implementations 84
6-29 Throughput of GR Linear Arrays with different Multiplier implementations 84
6-30 Latency of GR Linear Arrays with different Multiplier implementations 85
List of Tables

6.1 Throughput observed for Systolic Arrays (LA based, with all DSP based shared Multipliers) 86
6.2 Throughput observed for Linear Arrays (LA based, with all DSP based shared Multipliers) 87
6.3 P&R results for GR and MGS based Systolic Arrays for complex valued input array of size 4x4 and word length (6.10) 94
6.4 P&R results for GR Linear Arrays for complex valued input array, word length (6.10) 95
6.5 Comparison of our study results with previously reported for MIMO 97
6.6 Comparison of our study results with previously reported for Beam formation 98
6.7 Selection of Appropriate Architecture 99
A.1 P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16) 108
A.2 P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16) 111
Chapter 1
Introduction
1.1
Problem Statement and Motivation
Rapid prototyping of digital systems is a technique used to evaluate and investigate
the feasibility of a technology without incurring the exorbitant development costs
associated with full product development. To further reduce the cost and design
time of digital system design, prototypes must make use of commodity and shared IP
(intellectual property) wherever possible.
FPGAs (Field Programmable Gate Arrays) enable rapid prototyping by providing a low-cost, re-configurable substrate capable of supporting high-performance
designs. However, the scale and maximum advantage of rapid prototyping are limited
by the availability of the required IPs.
In many digital system domains, a wide variety of applications share the same
core components. For example, mobile wireless communication systems like MIMO
detection, beam-formation in smart antennas, and compressed sensing involve solving
linear systems of equations, which can be implemented efficiently using QR matrix
decomposition (QRD). Such core components are therefore natural candidates for
shared IP, and can be used for rapid prototyping.
Unfortunately, every application has its own unique set of design limitations for
the core components, and therefore requires a different hardware implementation. For
example, a QRD in MIMO may only need to support decomposition of 4x4 matrices.
On the other hand, a QRD for beam-formation may need to support the decomposition of 40x40 matrices, resulting in a different resource-sharing strategy.
Modular, parameterized, "smart IP" components could enable rapid prototyping
by allowing a designer to configure and adapt existing components to their application. However, it remains to be proven whether smart IP components can be
developed with the flexibility to adapt to a full range of target applications while still
meeting each application's design requirements.
1.2
Intro.
In this project we present the design and the FPGA implementation of a smart IP
for QRD. The design is based on two algorithms, namely Givens-Rotation (GR) and
Modified-Gram-Schmidt (MGS). Our IPs are flexible enough to meet a wide variety of
application requirements, as they can be configured for arbitrary matrix sizes and an
extensible variety of number representations with a wide selection of pluggable sub-units.
We have demonstrated our area and timing results to be comparable with, and
in some cases superior to, state-of-the-art hardware-based QRD implementations. As an
added advantage, the implementation can be targeted to an ASIC with minimum
effort.
1.3
Thesis Structure
The manuscript is organized as follows: linear equation systems in wireless networks
are discussed in Chapter 2. Selection of an algorithm for solving the linear equation
systems is discussed in Chapter 3, and implementation challenges are listed in Chapter
4. The parameterized architecture design and the implementation of our model are
elaborated in Chapter 5. Our results and their comparison with previously published
results are discussed in Chapter 6. The conclusion of this study is presented in
Chapter 7.
Chapter 2
Application of Linear Equation
System in Wireless Networks
Linear equation system solvers and matrix inversion are critically important components for performance enhancement in signal processing systems and communication
networks.
2.1
MIMO Detection
MIMO technology increases the data throughput and improves network performance
as well as link reliability by using multiple antennas at both the transmitter and
receiver end.
The MIMO detection schemes zero forcing (ZF) and minimum mean square error (MMSE) are linear algorithms, which estimate the transmitted data symbols
by multiplying the OFDM (orthogonal frequency-division multiplexing) sub-channel signals received from different antennas by the inverse of the channel characteristic matrix.
These linear algorithms have proved to be efficient, with low computational complexity
[11].
In a MIMO system with n receiving and m transmitting antennas, with t as the number
of symbol periods over which data is transmitted under a flat-fading channel
condition, the channel model is represented by:
    Y = AX + Z    (2.1)
where Y is the n x t matrix of received complex symbols, A is the n x m complex-valued
channel characteristic matrix, X is the m x t matrix of transmitted complex symbols,
and Z is n x t additive complex-valued zero-mean white Gaussian noise. The
vectors of the channel matrices are dynamically coupled, i.e. there exists a correlation
between individual sub-channels.
The optimal number of antennas (the dimension of the channel matrix), based on channel
capacity maximization while considering the limitations due to correlation between
individual sub-channels, is found to be up to 8 on each side (preferably 4 antennas on
each side) [1] [2] [6] [10] [20]. A MIMO system can be configured to have n transmitters
and m receivers, where n ≠ m, resulting in a rectangular channel characteristic matrix.
Figure 2-1 shows a MIMO system with 4 transmitters and 4 receivers.
Figure 2-1: 4 x 4 MIMO
The MIMO received signal can be decoded in the following three ways: by computing the inverse of matrix A and multiplying A^-1 by Y; by decomposing
the matrix A into lower and upper triangular matrices (LU) followed by forward and
backward substitution; or by decomposing matrix A into orthogonal and upper triangular matrices (QR) followed by matrix multiplication with the Hermitian transpose
of Q (Q^H) and backward substitution.
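As an illustration of the third (QR-based) option, a small NumPy sketch with a synthetic, noiseless 4x4 channel follows. This is not code from this thesis; the channel A and symbols X are random stand-ins, and zero forcing is applied via back-substitution rather than explicit inversion:

```python
import numpy as np

# Hypothetical 4x4 MIMO link: recover transmitted symbols X from
# received symbols Y = A @ X using QR-based zero forcing.
rng = np.random.default_rng(0)
n = 4                                    # antennas on each side
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))   # channel matrix
X = rng.choice([1+1j, 1-1j, -1+1j, -1-1j], size=(n, 1))      # QPSK symbols
Y = A @ X                                # noiseless reception, for clarity

Q, R = np.linalg.qr(A)                   # A = Q R, R upper triangular
# Zero forcing: X = A^-1 Y = R^-1 Q^H Y; a triangular solve replaces inversion
X_hat = np.linalg.solve(R, Q.conj().T @ Y)
print(np.allclose(X_hat, X))             # True
```

With noise present, the same pipeline yields the ZF estimate; MMSE differs only in the matrix being decomposed.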
2.2
Beam Formation
Wireless signal beam formation reduces interference between wireless networks by
transmitting signals in a narrow beam directed only toward the desired destination.
The narrow beam is achieved by scaling the transmitted signals such that the signals
transmitted over multiple antennas interfere constructively (amplify) at the desired
destination and interfere destructively (diminish) in all other directions. For a
given destination the scaling or weight matrix is calculated as w = A^-1 v, where A is the
interference and channel noise covariance matrix and v is the direction vector.
The number of antennas in the beam-formation system dictates the potential width
of the beam; the larger the antenna array, the narrower the beam and the larger the
weight matrix.
The interference and channel noise covariance matrix is built over multiple observations of the channel. The observations recorded during a period in which only the
noise interference is present can be arranged into the matrix:

    X = [ x(n_1)  x(n_2)  ...  x(n_p) ]^T    (2.2)

This observation matrix X can be used to set up a covariance matrix for the random
variables, noise and interference:

    A = scalar * X^H X    (2.3)
The dimension of the observation matrix is m x n, where m is the number of observations
made and n is the number of antennas. The number of observations can be increased
or decreased depending on the number of antennas to obtain a square observation
matrix.
The computational complexity of the nontrivial task of computing the inverse of
A can be reduced significantly by decomposing X into simple matrices, such as Q and
R, where Q is an orthogonal matrix and R is an upper triangular matrix:

    A = X^H X = (QR)^H (QR) = R^H Q^H Q R = R^H R    (2.4)

Thus, A^-1 can be computed using the expression A^-1 = R^-1 (R^H)^-1, where both
R^H and R are triangular matrices. The inverse of a triangular matrix is also a triangular
matrix and can be computed by forward/backward substitution.
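The shortcut above, computing R from X and never forming Q, can be sketched in NumPy. The sizes and data here are synthetic stand-ins, and the scalar in (2.3) is taken as 1 for simplicity:

```python
import numpy as np

# Sketch of beam-forming weight computation w = A^-1 v, where
# A = X^H X is the covariance built from observation matrix X
# (names X, v, w follow the text; the data is synthetic).
rng = np.random.default_rng(1)
m, n = 8, 8                              # observations x antennas
X = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
v = rng.normal(size=(n, 1)) + 1j * rng.normal(size=(n, 1))   # direction vector

# Decompose X = Q R; then A = X^H X = R^H R, and Q is never needed.
R = np.linalg.qr(X, mode='r')
# Solve A w = v as two triangular solves: R^H y = v, then R w = y.
y = np.linalg.solve(R.conj().T, v)       # forward substitution
w = np.linalg.solve(R, y)                # backward substitution
print(np.allclose((X.conj().T @ X) @ w, v))   # True
```

The two triangular solves stand in for the forward/backward substitution hardware described in the text.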
Thus this simplification reduces the task of computing the inverse of A to computing the triangular matrices R and R^H, followed by forward and backward substitution and a
matrix product (computing the value of the Q matrix is not required).

2.3
Compressed Sensing
In a compressed sensing network, a sparse data signal is sampled at a rate lower
than the Shannon-Nyquist sampling rate, reducing the bandwidth requirement.
In the compressed sensing network, an n-dimensional input signal X is compressed
to m measurements C by taking m linear projections, i.e. C = AX, where A is a matrix
of size m x n and m < n. In this case the system is underdetermined, with fewer
constraints than degrees of freedom, resulting in infinitely many solutions that can satisfy
the set of equations.
Since the signal is known to be sparse, the sparsest signal representation satisfying
the system of equations can conditionally be shown to be unique; and the solution found
using min ||x||_1 subject to C = Ax can be shown to produce the sparsest solution [5].
Orthogonal matching pursuit (OMP) is a fast and relatively simple algorithm
for recovering a compressed signal, and is a greedy approach for finding the sparsest
solution. It iteratively improves its estimate of the signal by choosing the column
of the matrix that has the highest correlation with the residual (highest value of the dot
product between the column and the residual vector) [27].
The process of decoding the sparse signal using OMP, explained in [27], involves
the following three steps, summarized here for the reader's convenience:

1. Matching (vector product), in which it finds the correlation between every column
vector of the projection matrix Φ and the received signal. The maximum value
of this vector product gives the vector with the highest correlation:

    i_t = arg max_i |< r_{t-1}, φ_i >|    (2.5)

2. Projection (solving a linear equation), in which it estimates the signal by solving a
linear equation. It takes the measurements y and projects them onto the range of
the active subset of the sensing matrix Φ:

    x_t = (Φ^T Φ)^-1 Φ^T y    (2.6)

3. Residual computation, preparing the residual for the next iteration:

    r_t = y - Φ x_t    (2.7)
These three steps are repeated to improve the estimate of the signal. In the projection
step of each iteration, a linear equation system is solved by computing the inverse of a
tall rectangular matrix. Therefore, the performance and output quality of OMP depend
heavily on the latency of the linear equation solver, and can benefit from a low-latency
implementation.
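The three steps can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from this thesis; the projection step uses a least-squares solve in place of the explicit pseudo-inverse in (2.6), and the dimensions are arbitrary:

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal matching pursuit sketch, following the three steps in
    the text: matching, projection (least squares), residual update."""
    m, n = Phi.shape
    residual = y.copy()
    support = []
    for _ in range(sparsity):
        # 1. Matching: pick the column most correlated with the residual
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        # 2. Projection: least-squares estimate on the active columns,
        #    x = (Phi_S^T Phi_S)^-1 Phi_S^T y, computed via lstsq
        Phi_S = Phi[:, support]
        x_S, *_ = np.linalg.lstsq(Phi_S, y, rcond=None)
        # 3. Residual for the next iteration
        residual = y - Phi_S @ x_S
    x = np.zeros(n)
    x[support] = x_S
    return x

# Recover a 2-sparse length-20 signal from 10 random projections;
# with these dimensions OMP typically (not provably) finds the support.
rng = np.random.default_rng(2)
Phi = rng.normal(size=(10, 20))
x_true = np.zeros(20)
x_true[[3, 11]] = [1.5, -2.0]
x_hat = omp(Phi, Phi @ x_true, sparsity=2)
```

Each projection step is itself a tall-matrix linear solve, which is where a hardware QRD would be applied.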
Chapter 3
Selection of Algorithm for Solving
Linear Equation System
3.1
Linear Equation System
A linear arithmetic equation system consists of a set of m equations relating n unknowns,
expressed in the form:

    a_11 x_1 + a_12 x_2 + ... + a_1n x_n = c_1
    a_21 x_1 + a_22 x_2 + ... + a_2n x_n = c_2
    ...
    a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = c_m    (3.1)

where the m x n coefficients a_ij (coefficients of x) and the m right-hand-side constants
c_j are known, while the n unknowns x_i are sought. The matrix form of this system is:

    AX = C,  where A = [a_ij] is m x n, X = (x_1, ..., x_n)^T, and C = (c_1, ..., c_m)^T    (3.2)
If the solution exists then matrix inversion or matrix decomposition and back
substitution can be used to solve these equations.
There are a number of methodologies for computing a numerical solution of a
given set of equations, which can be categorized as direct, iterative, and relaxation
methods. Direct methods perform the decomposition with fixed throughput/latency for all
inputs, whereas in iterative and relaxation methods the convergence rate and throughput
depend on the nature of the data and on intelligent selection of initial conditions, reordering
of columns and rows, and other user-dependent factors. Hence, the iterative and
relaxation methods are not discussed here.
The direct methods include:
1. Cramer's rule: This method uses Laplace expansion to compute the elements
of the inverse matrix A^-1. It is the most computationally expensive and slowest
method for solving linear equations.
2. Elimination methods: These methods, including Gauss elimination, Jordan elimination, and LU decomposition, transform a given matrix A into a simpler matrix
form suitable for back/forward substitution. Both Gaussian elimination and the
Gauss-Jordan method involve the right-hand-side matrix C in the solution. Therefore,
in applications where the left-hand side remains constant between consecutive
problem setups, no computational effort will be reused. In such cases, LU is more
suitable.
3. Orthogonalization methods: These methods, including QR decomposition based on
Givens rotation, Gram-Schmidt, and Householder orthogonalization, transform an
input matrix into an upper triangular matrix using orthogonalization to eliminate
the elements in the lower triangle of the input matrix. These methods do not involve
the matrix C, and therefore consecutive problem setups with the same matrix A can
reuse the computational effort.
The elimination method in which computational effort can be reused (LU decomposition) and the orthogonalization methods (QR decomposition) are discussed in the
following sections.
3.2
LU Decomposition
LU decomposition decomposes a square matrix A into a lower and an upper
triangular matrix. After this decomposition, forward and backward substitution are
used to solve AX = C. The decomposition process is independent of the value of the right-hand-side vector C, so linear equation systems with the same left-hand side can reuse
the decomposition results. LU decomposition takes approximately n^3/3 floating-point
operations to decompose an n x n matrix. The decomposition of the matrix A into
a lower and upper triangular matrix is not unique, and it does not work for singular
matrices/under-determined equations. Row or column pivoting/reordering is required
in this method to avoid division by zero.
Some notable algorithms for achieving LU decomposition based
on Gaussian elimination are the Cholesky, Doolittle, and Crout decompositions.
3.2.1
Cholesky Decomposition
Cholesky decomposition is an optimized LU decomposition for Hermitian positive-definite matrices, including every real-valued symmetric positive-definite matrix. It
decomposes matrix A into the product of a lower triangular matrix and its conjugate
transpose, A = LL*.
3.2.2
Doolittle LU Decomposition
Doolittle LU decomposition proceeds column by column and decomposes a given matrix into a unit lower triangular matrix and an upper triangular matrix.
Pivoting is performed in such a way that for each k a pivot row is determined and
interchanged with row k; the rest of the algorithm works similarly to the Cholesky
decomposition.
3.2.3 Crout LU Decomposition
In the Crout method, matrix A is decomposed into a lower triangular matrix L and a unit upper triangular matrix U. The elements l_ij of L and u_ij of U are computed by solving the equation system LU = A.
3.3 QR Decomposition
Solutions for singular matrices expressing either over- or under-determined sets of equations can be computed using QR orthogonalization methods, wherein an m x n matrix A is decomposed into an m x m unitary matrix Q and an m x n upper triangular matrix R. QR decomposition methods take approximately 2mn^2 floating-point operations to decompose an m x n matrix.
The three widely used methods for computing QR orthogonalization are:
1. Givens Rotation method
2. Modified Gram Schmidt method
3. Householder orthogonalization method
In each of these methods the matrix A is decomposed into the product of an orthonormal matrix Q and an upper triangular matrix R, such that A = Q · R. If A is invertible, then the decomposition resulting in positive diagonal elements in R is unique.
3.3.1 Givens-Rotation Based QRD
Givens-Rotation rotates a vector in the (x, y) plane by an angle θ such that it becomes orthogonal to the y axis, diminishing its magnitude in that dimension to zero:

    [  cos θ   sin θ ] [ x ]     [ r ]
    [ -sin θ   cos θ ] [ y ]  =  [ 0 ]          (3.3)

where the values of the rotation matrix can be determined from the pivot elements: r = sqrt(x^2 + y^2), cos θ = x/r, sin θ = y/r.
Using Givens rotation repeatedly, a matrix A can be decomposed into an orthogonal matrix Q and an upper triangular matrix R. At each iteration A is rotated clockwise by angle θ in the (i, j) plane by multiplying it with a rotation matrix of the form:

    G(i, j, θ) =
        [ 1  ...    0        ...    0        ...  0 ]
        [ ...                                    ... ]
        [ 0  ...  cos θ_ij   ...  sin θ_ij   ...  0 ]
        [ ...                                    ... ]
        [ 0  ... -sin θ_ij   ...  cos θ_ij   ...  0 ]
        [ ...                                    ... ]
        [ 0  ...    0        ...    0        ...  1 ]          (3.4)

to generate a transformed A, which is used as input for the next iteration.
Since this multiplication process changes only the ith and jth rows of matrix A, full matrix multiplication is not required to compute these intermediate transformed matrices. The rotation matrix can therefore be reduced to a 2x2 matrix of the form:

    G(i, j, θ) = [  cos θ_ij   sin θ_ij ]
                 [ -sin θ_ij   cos θ_ij ]          (3.5)

and multiplied by only the ith and jth rows. Also, since after computing the elements of the rotation matrix the same operation is applied to the whole row, the operation can be distributed over computational blocks and done in parallel.
After n(n-1)/2 iterations of rotation, A will be transformed into the upper triangular matrix R, while performing the same rotation operations on an identity matrix will transform it into the orthogonal matrix Q.
Matlab code for this algorithm is shown in Figure 3-1. Time complexity of GR based QRD is 2mn^2.
% QR decomposition
[m,n] = size(A);
X = zeros(m,m);
R = zeros(m,m); % m-by-m zero matrix to store output

for i = 1:n
    [R(1,1), c, s] = external(R(1,1), A(i,1));
    [R(1,2:m), X(2,2:m)] = internal(c, s, R(1,2:m), A(i,2:m));
    for j = 2:m
        [R(j,j), c, s] = external(R(j,j), X(j,j));
        [R(j,j+1:m), X(j+1,j+1:m)] = internal(c, s, R(j,j+1:m), X(j,j+1:m));
    end
end

function [r, c, s] = external(rin, xin)
    if rin == 0 && xin == 0
        r = 0; c = 0; s = 0;
    else
        r = norm([rin, xin]);
        c = rin / r;
        s = xin / r;
    end
end

function [r, x] = internal(c, s, rin, xin)
    r = (c * rin) + (conj(s) * xin);
    x = (-s * rin) + (c * xin);
end

Figure 3-1: Givens-Rotation based QRD
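For reference, the same elimination pattern, a 2x2 rotation applied to the two affected rows until the lower triangle is zeroed, can be written as a compact Python sketch (NumPy assumed; the function name is illustrative):

```python
import numpy as np

# Sketch: Givens-rotation QR. Each 2x2 rotation zeroes one subdiagonal
# entry R[i, j] against the pivot R[j, j], touching only rows i and j.
def givens_qr(A):
    A = np.array(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for j in range(n):                      # column being eliminated
        for i in range(m - 1, j, -1):       # zero entries from the bottom up
            r = np.hypot(R[j, j], R[i, j])
            if r == 0.0:
                continue                    # both entries already zero
            c, s = R[j, j] / r, R[i, j] / r
            G = np.array([[c, s], [-s, c]])
            R[[j, i], :] = G @ R[[j, i], :]     # rotate the two affected rows
            Q[:, [j, i]] = Q[:, [j, i]] @ G.T   # accumulate the rotations in Q
    return Q, R
```

Because each rotation touches only two rows, the per-row work can be spread across independent units, which is the property the systolic-array implementation in Chapter 5 exploits.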
3.3.2 Modified Gram Schmidt Based QRD
The Gram-Schmidt process is used to construct an orthonormal basis Q for a set of linearly independent vectors, expressed as columns in matrix A. This process can be used to create the orthogonal matrix Q, and the values of the upper triangular matrix R can be computed from Q and A employing the formula:

    r_ij = q_i^H · a'_j    i < j (elements above the principal diagonal)
    r_ij = ||a'_j||        i = j (elements on the principal diagonal)
    r_ij = 0               i > j (elements below the principal diagonal)          (3.6)

where a'_j is the orthogonalized intermediate jth column vector of A during the jth iteration.
For a matrix of size n x m it will take n iterations to complete the decomposition of matrix A. In each iteration of orthogonalization, a row of R and a column of Q are computed, such that the jth element of the principal diagonal of R is computed by normalizing the jth column of A in the jth iteration, and the values in the jth column of Q are computed by scaling a'_j by 1/r_jj.

The non-diagonal elements in a row of R are computed in the jth iteration using the formula: for k = 0 ... n, r_j,k = 0 for k < j, and r_j,k = q_j^H · a_k for k > j, where a_k are the values in column k of matrix A during iteration j - 1. In the first iteration, these are the original values of the kth column of A.
After each jth iteration A is updated such that, for k = 0 ... n, the new a_k = 0 for k <= j, and a_k = a_k - q_j r_j,k for k > j.

After all n iterations, A will be transformed into the zero matrix, and the computation of the orthogonal matrix Q will be complete. Since in each iteration a vector of Q is computed and a vector of A is transformed to 0, they can be contained in a single memory location.
MGS based QRD performs vector operations to compute the column vectors of Q and the rows of R. Therefore the complete matrix needs to be in memory before the process can begin, unlike Givens-Rotation, which can begin its first iteration as soon as one row of data is available.

[m,n] = size(A);
Q = zeros(m,n);
R = zeros(n,n);
Q = A;

for i = 1:n
    R(i,i) = norm(Q(:,i));
    Q(:,i) = Q(:,i) / R(i,i);
    for j = i+1:n
        R(i,j) = Q(:,i)' * Q(:,j);
        Q(:,j) = Q(:,j) - Q(:,i) * R(i,j);
    end
end

Figure 3-2: Modified Gram Schmidt based QRD

Figure 3-2 shows the Matlab code for the MGS based QRD algorithm. Time complexity of MGS based QRD is 2mn^2 + 2mn - m; for a square matrix this amounts to approximately 2n^3 arithmetic operations [20].
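The MGS routine of Figure 3-2 maps directly onto this Python sketch (NumPy assumed; the function name is illustrative):

```python
import numpy as np

# Sketch: Modified Gram-Schmidt QR. Columns of Q are orthogonalized in
# place; R is filled one row per outer iteration.
def mgs_qr(A):
    Q = np.array(A, dtype=float)        # start from A, orthogonalize in place
    m, n = Q.shape
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(Q[:, i])
        Q[:, i] /= R[i, i]              # normalize the current column
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ Q[:, j]         # project onto q_i
            Q[:, j] -= R[i, j] * Q[:, i]        # remove that component
    return Q, R
```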
3.3.3 Householder Based QRD
The Householder transformation uses a unitary Hermitian matrix to reflect a given vector a_j across a plane such that all its coordinates but one disappear.

The elementary Householder matrix used for reflection across the plane orthogonal to the unit normal vector v can be expressed in matrix form as:

    H = I - 2vv^T          (3.7)

where I is the identity matrix of the same dimensions as H, and v^T is the transpose of the unit normal vector v.

The reflector matrix that maps a given vector a_j to a scalar multiple of another given vector e_1 (the first column vector of the identity matrix, (1, 0, ..., 0)^T) can be constructed by taking v = u/||u|| with u = a_j - sign·||a_j||·e_1, where sign = ±1; the product of the resulting Hermitian matrix and a_j is then:

    H a_j = (I - 2uu^T/(u^T u)) a_j = sign·||a_j||·e_1          (3.8)

and the product of the Hermitian matrix and the rest of the columns of A transforms A to A'.
Repeating this process min(m,n) times for a submatrix A_k (of size x x y, where x = m - k and y = n - k in the kth iteration), constructed by dropping the first k rows and columns from the given matrix A_mn, and computing the Hermitian reflecting matrix from its first column, will reduce the matrix A to an upper triangular matrix. The product of the Hermitian matrices used for this transformation in all iterations forms the unitary matrix Q.

Unlike Givens-Rotation based orthogonalization, this process requires the whole column vector to compute the reflection matrix, which is then used in a matrix multiplication. Therefore this method is not suitable for computation over distributed data.
Figure 3-3 shows the Matlab code for the Householder based QRD algorithm. Time complexity of Householder based QRD is 4mn^2/3.

[m,n] = size(A);
Q = eye(m);
for k = 1:min(m-1,n)
    ak = A(k:end,k);
    vk = ak + sign(ak(1))*norm(ak)*[1;zeros(m-k,1)];
    Hk = eye(m-k+1) - 2*(vk*vk')/(vk'*vk);
    Qk = [eye(k-1) zeros(k-1,m-k+1); zeros(m-k+1,k-1) Hk];
    Q = Q*Qk';
    A = Qk*A;
end
R = A;

Figure 3-3: Householder based QRD
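A Python sketch of the same procedure, with the transposes in the reflector formula written out explicitly (NumPy assumed; the function name is illustrative):

```python
import numpy as np

# Sketch: Householder QR. Each iteration reflects the trailing part of
# column k onto a multiple of e1, zeroing it below the diagonal.
def householder_qr(A):
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        a = A[k:, k]
        v = a.copy()
        v[0] += np.sign(a[0]) * np.linalg.norm(a)   # u = a + sign(a1)*||a||*e1
        H = np.eye(m - k) - 2.0 * np.outer(v, v) / (v @ v)
        Qk = np.eye(m)
        Qk[k:, k:] = H                  # embed the (m-k) x (m-k) reflector
        A = Qk @ A                      # zero column k below the diagonal
        Q = Q @ Qk.T                    # accumulate the reflections
    return Q, A                         # A is now R
```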
3.4 Comparison and Selection
As mentioned earlier, there exist many optimized variants of the LU decomposition algorithm for families of matrices which exhibit special characteristics. But LU decomposition fails to find a solution if the input matrix is singular. Also, like Gaussian elimination methods, LU requires normalization and row/column reordering to avoid division by zero, as well as for non-diagonally-dominant matrices in order to preserve accuracy.

Although QR decomposition is computationally more complex than most of the LU decomposition variants, the computations done in QR decomposition are unconditionally stable, and QR can be used to decompose a singular matrix [8]. Since error propagates at a slower rate in the orthogonalization process [17], QR decomposition is more accurate without maintaining diagonal dominance if a large enough word size is used. Column reordering and normalization can be used to further improve the solution's precision, but are not required to avoid division by zero. Also, since pivoting and row dominance are not required to maintain acceptable precision in the QR computation, systolic arrays can be used to distribute the data and operations among parallel computational blocks without the need for context/data migration. This results in reduced control logic complexity.
Both LU and QR decomposition reuse computational effort if consecutive problem set-ups have the same left hand side of Eq. 3.1 and Eq. 3.2.
In addition to solving linear equation systems, QR decomposition can also be used to determine the magnitude of the determinant of a matrix:

    A = QR
    det(A) = det(Q) · det(R)          (3.9)
    where |det(Q)| = 1
    so |det(A)| = 1 · |det(R)| = |det(R)|

where det(R) is the product of the values on the principal diagonal, since R is a triangular matrix.
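The identity |det(A)| = |det(R)| is easy to check numerically (NumPy's QR routine used here purely for the check):

```python
import numpy as np

# Check Eq. 3.9: |det(A)| equals the product of the magnitudes of R's
# diagonal entries, because |det(Q)| = 1 and R is triangular.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
Q, R = np.linalg.qr(A)
lhs = abs(np.linalg.det(A))
rhs = abs(np.prod(np.diag(R)))
```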
QR decomposition can also be used to find the inverse of the covariance matrix of multiple random variables, expressed in the form A = X^H · X, where the matrix X contains measurements/snapshots of the outcomes for these random variables. The computationally intensive task of computing the inverse of A can be reduced in complexity by decomposing X into its QR components. After restructuring, the formula becomes:

    A = X^H · X
      = (Q · R)^H · (Q · R)
      = R^H · Q^H · Q · R          (3.10)
      = R^H · R

Since the final output does not require the value of Q, computing the Q part of the decomposition can be omitted while calculating the inverse of the covariance matrix.
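A quick numeric check of Eq. 3.10 (NumPy's thin QR used for the check; the data is random and purely illustrative):

```python
import numpy as np

# Check Eq. 3.10: X^T X = R^T R, so inverting the covariance matrix only
# needs the triangular factor R, never Q.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))   # 6 snapshots of 3 variables
Q, R = np.linalg.qr(X)            # thin QR: Q is 6x3, R is 3x3
A = X.T @ X
```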
Because of the generality of its applications, its less restrictive limitations, and its suitability for fixed-point parallel hardware implementation, we selected QR decomposition for our parameterized prototypes.

Classically used algorithms for QRD, as discussed earlier, are: (a) the Householder method [9], suitable for software implementation with centralized storage but not for hardware [26]; (b) the Givens Rotation (GR) method [28], favorable for distributed parallel implementations; and (c) Modified Gram Schmidt (MGS) [18], mainly suitable only for smaller matrices.

We chose GR and MGS for parameterization because of their favorable nature for hardware implementation, in terms of cycle count, cycle frequency, and area-on-chip for differing sizes of matrix.
Chapter 4
Implementation Challenges
Challenges in designing and implementing a flexible, modular QRD prototype are discussed in the following sections.
4.1 Computation Complexity and Scalability
The main operations involved in QR decomposition using MGS are the norm, inner product, scaling of a vector by an inner product, and division of a vector by a scalar value; while the main operations involved in QR decomposition using GR are 1/sqrt(x) computation and multiplication.
The major challenge is that the prototype should be able to generate a scalable design for larger matrix sizes, for example pipelined resources that can be reused to shrink the hardware to fit a given FPGA board or the available space on the board. At the same time, it should be able to generate a highly parallel design when hardware resources are not the limiting design factor; for instance, a 4x4 matrix is small, and reusing units for it would reduce throughput unnecessarily.
4.2 Modularity
Fine-grained modularity is another challenge in implementation. The prototype design must be modular, with a plug-and-play configurable architecture, in order to facilitate unit or modular improvement with minimum effect on the rest of the architecture.
4.3 Latency Insensitivity
To be truly modular, the architecture needs to be insensitive to latency, so that when a lower-latency module becomes available, the rest of the architecture doesn't need to be redesigned to synchronize the data and control signals.
Chapter 5
Proposed Parameterized Architecture
We present here the implementation of 4 architectures that can be scaled up for beam formation with higher dimensions, and configured to use various unit implementation techniques, without the need to debug and test each time there is a change in the size of the control circuitry and data path. These four architectures are: (a) Systolic Givens rotation based; (b) Linear Givens rotation based; (c) Systolic MGS based; and (d) Linear MGS based.
5.1 GR Based QRD Systolic Array
Givens Rotation (GR) decomposes a matrix A into a unitary matrix Q and R by rotating it along one axis at a time and nullifying an element in a column vector of A. These rotation operations on the matrix elements are independent and can therefore be done in parallel. We can distribute the input matrix over a uniform array of computation units and then combine the generated output.

Figure 5-1 shows the block diagram of a 5x5 Systolic Array for Givens rotation based QRD. The Systolic Array consists of two specialized building blocks (Figure 5-2): the boundary unit and the internal unit.
Figure 5-1: Systolic Array Architecture

Figure 5-2: (a) Boundary unit (b) Internal unit for Givens-Rotation method
The boundary unit rotates the input element and generates (a) the rotation parameters (cos and sin of θ) and (b) the diagonal elements. The internal unit rotates the input element using the rotation parameters generated by the diagonal element in the same row, and generates the non-diagonal elements of the output matrix. These units require 3 and 4 multiplication operations, respectively. The operations can be performed using dedicated multipliers or a single pipelined multiplier, trading latency against area-on-chip.
We implemented QR as a combination of a row of size n containing one boundary unit and n-1 internal units, and a SubQR implementing QR of size (n - 1), as shown in Figure 5-3. As a special case, a QR of width equal to 1 is a row which contains only the boundary unit, as shown in Figure 5-4. This architecture facilitates automated connectivity between the rows for a re-configurable matrix size.
This is achieved by implementing a "typeclass" for QR and defining two instances: one for QR of size greater than 1, and a terminating case for QR of size 1, as shown in Figures 5-5, 5-6 and 5-7, respectively. A wrapper on the top level QRD module can define a QRD array of specific size by using the code shown in Figure 5-8.

Figure 5-3: QR(n) top module containing one row and one QR(n-1) top module

Figure 5-4: Special case of QR top module, QR(1)
typeclass QRtopModule#(numeric type width);
    module [m] mkQRtopModule#(...);
endtypeclass
Figure 5-5: Typeclass QR
instance QRtopModule#(1);
    module [m] mkQRtopModule (...);
        // return QR module of width equal to 1
        QR#(1, tnum) qrUnit <- mkQReqONE(...);
        return qrUnit;
    endmodule
endinstance
Figure 5-6: Instance of QR typeclass for width equal to 1, terminating case for
recursive QR architecture
instance QRtopModule#(width);
    module [m] mkQRtopModule (...);
        // return QR module of width greater than 1
        QR#(width, tnum) qrUnit <- mkQRgtONE(...);
        return qrUnit;
    endmodule
endinstance
Figure 5-7: Instance of QR typeclass for width greater than 1
The design parameters that can be reconfigured to generate a unique systolic array implementation are: (i) width of the input matrix, (ii) data type of the input element, (iii) implementation techniques of the computation units, and (iv) multiplier style.
5.1.1 Data Type
QR#(width, datatype) qr <- mkQRtopModule(...);

Figure 5-8: QR instantiation

Wireless communication systems usually operate on a Complex Valued Matrix, but for RVD MIMO signal detection approaches like RVD K-best [22], the Complex Valued Matrix is decomposed into a 2n x 2n Real Valued Matrix. The amount of arithmetic operations stays the same either way in the non-diagonal units. In the diagonal units, however, the required computation is doubled by decomposing the n x n matrix into 2n x 2n, as the value of the diagonal entries of R is always real.
These complex and real numbers can be represented in Fixed Point or Floating Point notation. A Fixed Point (FP) implementation is ideal for wireless communication devices because FP units consume comparatively less power and span a smaller area-on-chip [3].
The number of bits required to represent an FP number without losing information depends on the size of the matrix; as the size of the matrix increases, the amount of computation that each input element goes through to produce output also increases. In turn this leads to an increased amount of computational noise. In order to keep the computational noise below -10 dB, the number of bits used to communicate a single chunk has to be increased. The process of evaluating the ideal bit length and the length of the fractional part for a 4x4 matrix is discussed in [28]; a similar process can be used to determine the bit length for larger matrices.
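The relationship between fractional word length and quantization noise can be illustrated with a small experiment. The word lengths and data below are illustrative, not the values evaluated in [28]:

```python
import numpy as np

# Sketch: quantize to a fixed-point grid with f fractional bits and measure
# the resulting noise-to-signal ratio in dB. Each extra bit buys ~6 dB.
def to_fixed(x, frac_bits):
    scale = 1 << frac_bits
    return np.round(x * scale) / scale   # nearest multiple of 2^-frac_bits

def quant_noise_db(x, frac_bits):
    err = x - to_fixed(x, frac_bits)
    return 10 * np.log10(np.mean(err ** 2) / np.mean(x ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)            # stand-in for matrix data
```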
To reuse the implementation done for one size of input matrix, we parameterized the choice of data type, word length and length of the fractional part. The top level module can be configured at QRD module instantiation time to work on a single specific type of data, and can therefore be modified at a single edit point in the architecture before Verilog code generation and synthesis.

We achieve this effect by keeping the type a configurable parameter and defining Conjugate, a complex-number-specific operation, for the Real and Fixed Point numbers. Figures 5-9, 5-10 and 5-11 show the typeclass implementing the Conjugate operation, where 'is' and 'fs' are the sizes of the integer and fractional parts respectively, data_t is the abstract data type, and FixedPoint and Complex are Bluespec built-in data types. For the Fixed Point version of the module, the extra wires are trimmed during synthesis, and therefore cause no extraneous wires in the actual hardware and no loss of performance.
typeclass Conjugate#(type data_t)
    provisos (...);
    function data_t con (data_t x);
endtypeclass
Figure 5-9: Typeclass for Conjugate operation
instance Conjugate#(FixedPoint#(is, fs));
    function FixedPoint#(is, fs) con (FixedPoint#(is, fs) x);
        return x;
    endfunction
endinstance
Figure 5-10: Instance of typeclass Conjugate for FixedPoint data type
instance Conjugate#(Complex#(data_t))
    provisos (...);
    function Complex#(data_t) con (Complex#(data_t) x);
        let y = Complex {rel: x.rel, img: 0 - x.img};
        return y;
    endfunction
endinstance
Figure 5-11: Instance of typeclass Conjugate for Complex data type
Figure 5-12 demonstrates how our implementation can be configured for a 4x4 matrix of type Fixed Point with word length 16 (6 bits for the integer and 10 bits for the fractional part).
5.1.2 Multiplier
mkQR(4, FixedPoint#(6,10)) qr_FP6_10 <- mkQRtopmodule(...);

Figure 5-12: Data type and Matrix Size Configuration in Main Module

QR decomposition using either MGS or GR involves both complex and real multiplication. We implemented the complex multiplier using 3 fixed-point multipliers and 2 sets of adders, computing the complex product c = a × b using Eq. 5.1:

    t1 = b.rel · (a.rel + a.img)
    t2 = a.rel · (b.img - b.rel)
    t3 = a.img · (b.rel + b.img)
    c.rel = t1 - t3          (5.1)
    c.img = t1 + t2
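A quick numeric check of the 3-multiplication complex product. This is Gauss's scheme, which matches the 3-multiplier, 2-adder-set structure described above; the exact grouping used in the thesis hardware may differ:

```python
# Sketch: complex product with 3 real multiplications instead of 4.
def cmul3(ar, ai, br, bi):
    t1 = br * (ar + ai)       # shared partial product
    t2 = ar * (bi - br)
    t3 = ai * (br + bi)
    return t1 - t3, t1 + t2   # (real part, imaginary part)
```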
For fixed-point multiplier implementation on FPGAs, the built-in firm multipliers provided as part of the DSP blocks on an FPGA board can be used. These multipliers are optimized for area/time efficiency to bridge the performance gap between FPGAs and ASICs caused by the programmable nature of FPGAs. In case of scarcity of DSPs on a specific board, soft multipliers implemented using LUTs inside CLBs can be used instead. As each application has different DSP demands, we have parameterized this choice as well.
The choice of word length affects the size and architecture of a fixed-point multiplier. The accuracy requirement of a QRD system and the word length of a matrix element determine the critical path of a multiplier. As the word length increases, the critical path of the multiplier can exceed the critical path of the overall system, and thus become the limiting factor for the minimum period and clock frequency of the QRD. To manage the length of the critical path, we implemented a pipelined version in which the number of pipeline stages is a configurable parameter. The number of stages can be increased to shorten the data path, or decreased to reduce the number of cycles taken to generate a product, depending on specific requirements. This is achieved by passing the number of stages as a numeric type, which is then used by the mkMultiplier module to generate the stages using a for-loop.
The word length of the multiplier and the number of pipeline stages in a multiplier can be configured for both LUT based and DSP multipliers, as shown in Figure 5-13. A configured multiplier module can then be passed to the external and internal units as input for synthesis, as shown in Figure 5-14.
Multiplier#(Complex#(FixedPoint#(16,16))) mkmul <- mkMultiplierDSP;

Multiplier#(Complex#(FixedPoint#(16,16))) mkmul <- mkMultiplierLUT;

Figure 5-13: DSP and LUT based Multiplier
External#(Complex#(FixedPoint#(16,6))) m;
m <- mkExternal(mkRotation(mkmul));

Internal#(Complex#(FixedPoint#(16,16))) m;
m <- mkInternal(mkComplexMultiplier(mkMultiplierFP16LUT));

Figure 5-14: Configuring Type of Multiplier in core units
5.1.3 Storage Space
Each unit has a register for the interim r_ij value and a register for the input x_ij value. Both registers are word-length bits wide. The boundary unit also has registers for storing the x_out, c and s values.
5.1.4 Control Circuitry
The data dependency of one unit on the output of the previous unit is shown by the blue and magenta directed lines in Figure 5-1. To assist synchronized data availability for each block, we placed FIFOs at the input ports of all the units, as shown in Figures 5-15a and 5-15b. At the same time, all the computations are implicitly guarded by the availability of data in the FIFOs.
Because of the internal data dependencies of each subsequent unit on the previous row, and the symmetry of computational complexity across rows, a FIFO of length 2 is big enough to synchronize the flow without incurring extra delays due to FIFO overflow.

Figure 5-15: (a) Boundary unit with FIFO (b) Internal unit with FIFOs at input port and internal storage
5.1.5 Implementation Type
The constant throughput of this design is determined by the slower of the boundary node and the internal node. This throughput can vary depending on the implementation type of the unit blocks. The main operations involved in QR decomposition using GR are 1/sqrt(x) computation and multiplication. Figure 5-2 shows the equations implemented by each block.
We present implementations of three techniques, namely (i) Log Domain Computation, (ii) Linear Approximation and (iii) Newton Raphson Iteration, as discussed below.
(i) Log Domain Computation: A multiplication in the linear domain is equivalent to an addition in the log domain:

    log(a × b) = log(a) + log(b)          (5.2)
and the division operation becomes subtraction. Similarly, the power operation in the linear domain becomes a multiplication:

    log(a^b) = b × log(a)          (5.3)

If the power b is 2, the multiplication by b reduces to a left shift, and if it is 1/2 it reduces to a right shift.
By using this simplification technique, computationally expensive operations such as 1/sqrt(x) can be transformed into shift and addition operations. The log and exponential values are pre-computed and stored in look-up tables. The size of the look-up tables decides what range of inputs can be handled. Storing linear-to-log and log-to-linear domain conversions for the full range of input values can result in huge look-up tables. To reduce the size of these tables while maintaining the input range, the value can be normalized before and after the conversion. The following equations show how a look-up table that supports only the range of 'a' can be used to translate a bigger range of values:
    log2(a · 2^b) = log2(a) + log2(2^b)
                  = log2(a) + b          (5.4)

    2^(a + log2(b)) = 2^a · 2^(log2(b))
                    = 2^a · b          (5.5)
So an operation in the log domain can be broken down into the following steps:

(a) look up the log domain equivalent value of a chosen number of MSBs (equivalent to a right shift)

(b) add the de-normalization constant

(c) perform the desired operation in the log domain

(d) look up the linear domain equivalent value for the result divided by a normalization factor

(e) de-normalize it by multiplying with 2^(normalization factor) (a left shift)

The division in step (d) can be reduced to shift operations if the normalization factor is chosen to be a power of 2.
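The lookup-operate-denormalize recipe can be sketched in software. The table size and indexing below are illustrative assumptions; a hardware version would replace the `log2`/`floor` calls with leading-one detection and shifts:

```python
import math

# Sketch of log-domain 1/sqrt(x): normalize, look up log2, halve and negate
# in the log domain, look up the exponential, then shift back.
N = 256                       # assumed table size
LOG_LUT = [math.log2(1.0 + k / N) for k in range(N)]  # mantissa in [1, 2)
EXP_LUT = [2.0 ** (k / N) for k in range(N)]          # fraction in [0, 1)

def inv_sqrt(x):
    e = math.floor(math.log2(x))        # hardware: position of the leading 1
    idx = int(((x / 2 ** e) - 1.0) * N) # index from the MSBs of the mantissa
    log_x = LOG_LUT[idx] + e            # log2(x) = log2(mantissa) + exponent
    log_y = -log_x / 2                  # 1/sqrt: negate and right-shift by 1
    ei = math.floor(log_y)              # split into integer and fraction
    yi = int((log_y - ei) * N)
    return EXP_LUT[yi] * 2.0 ** ei      # exponential lookup, then shift
```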
Using this approach for the multiplication operation has a large area overhead (storage for the log and linear domain translation tables) with no significant improvement in the total operation time (clock cycles × clock period). Therefore the log domain operation is only favorable for computing 1/sqrt(x).

The ideal size of these lookup tables for maintaining the desired accuracy depends on the data type and word length [23]. Therefore we parameterized the LUT size and normalization factor for both log-to-linear and linear-to-log domain translations in our design. QR can be configured to have a Log based boundary unit as shown in Figure 5-16.
LogTable#(BitDis, FixedPoint#(16,16), LogLUTsize) logtbl <- mkLogtable();

Exptable#(BitDis, FixedPoint#(16,16), ExpLUTsize) exptbl <- mkExptable();

External#(Complex#(FixedPoint#(16,6))) mkext;
mkext <- mkExternal(mkLogrotation(mkmul, logtbl, exptbl));

Figure 5-16: Configuring Log Domain External Unit.
(ii) Linear Approximation: The value of 1/sqrt(x) can also be computed by linearly approximating the function along its tangent, using slope and offset values (Figure 5-17):

    f(x) ≈ f(a) + f'(a)(x - a)          (5.6)
Figure 5-17: Linear Approximation of 1/sqrt(x)
The values of the slope (f'(a)) and offset (f(a)) are pre-computed and stored in lookup tables. To increase the accuracy of this operation, finer granularity of the entries along the lower bound of input values needs to be stored. This can be achieved by using the upper fractional bits along with the integer bits to index the look-up tables. The size of these tables can be increased to cover a wider range of values. We have therefore parameterized both the size of the look-up table and the number of fractional bits used in the index. These two parameters can be tuned to fit the accuracy requirements. QR can be configured to have an LA based boundary unit as shown in Figure 5-18.
LAtable#(BitDis, FixedPoint#(16,16), LALUTsize) latbl <- mkLAtable();

External#(Complex#(FixedPoint#(16,6))) mkext;
mkext <- mkExternal(mkLArotation(mkmul, latbl));

Figure 5-18: Configuring LA based External Unit.
Figure 5-18: Configuring LA based External Unit.
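A software sketch of the slope/offset scheme. The segment width (i.e. how many upper fractional bits index the table) and the input range are illustrative assumptions:

```python
import math

# Sketch: piecewise-linear 1/sqrt(x) via precomputed slope/offset tables.
FRAC_IDX_BITS = 4                     # fractional bits used in the index
STEP = 1.0 / (1 << FRAC_IDX_BITS)     # width of each linear segment

def make_tables(x_max):
    slopes, offsets = [], []
    a = STEP                          # start above 0 where 1/sqrt blows up
    while a <= x_max:
        f = 1.0 / math.sqrt(a)
        fp = -0.5 * a ** -1.5         # derivative of x^(-1/2) at a
        slopes.append(fp)
        offsets.append(f - fp * a)    # store f(a) - f'(a)*a: eval is one MAC
        a += STEP
    return slopes, offsets

SLOPES, OFFSETS = make_tables(16.0)

def inv_sqrt_la(x):
    idx = int(x / STEP) - 1           # segment index from the upper bits of x
    return SLOPES[idx] * x + OFFSETS[idx]
```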
(iii) Newton Raphson Iteration: The value of 1/sqrt(x) can also be estimated using the Newton Raphson iteration method [26]. It performs iterative shift and add operations to compute 1/sqrt(x). It reduces complexity but increases latency because of its iterative nature. Because the required iteration count depends on the data length and the specific application, we have parameterized it. Figure 5-19 shows QR configured with an NR based boundary unit with iteration count = 32.
QR#(tnum) qr <- mkQR(mkExternal#(NR#(32, mkMultiplier)),
                     mkInternal#(mkMultiplier));

Figure 5-19: Configuring Newton Raphson Method based External Unit.
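The iteration itself is y_{k+1} = y_k (3 - x·y_k²) / 2, which converges quadratically to 1/sqrt(x). The sketch below writes it with multiplies for clarity; the seed value and iteration count are illustrative (hardware would seed from a small LUT):

```python
# Sketch: Newton-Raphson iteration for 1/sqrt(x) with a parameterized
# iteration count, mirroring the configurable count in the NR unit.
def nr_inv_sqrt(x, iters, y0):
    y = y0                                 # initial guess
    for _ in range(iters):
        y = y * (3.0 - x * y * y) * 0.5    # one NR refinement step
    return y
```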
5.1.6 Reuse within Design
Because of the parameterized and re-configurable nature of the proposed architecture, the same implementation can generate hardware specialized for varying sizes of matrix with different area requirements. The results for area and throughput versus varying matrix sizes, acquired after Place and Route of the Verilog implementation generated from our Bluespec code using BSC, are presented in Chapter 6.
5.2 GR Based QRD Linear Array
The hardware for the systolic array does not scale well for larger matrices, as shown in Figure 5-20, but it can be folded into a linear array of units as shown in Figure 5-21. This folding technique reduces the hardware requirement from a triangular array of units to a single row, but decreases the throughput by a factor of n, where n is the width of the matrix. For systems where the matrix size is large, area becomes a bigger concern than throughput.

We present a parameterized folded hardware implementation, wherein we modified Walke's folding technique [12] to suit automated generation of the control signals and output-reordering sequence. This folding technique, like Walke's, requires the size of the input array to be an odd number for 100% hardware efficiency. In the case where the size of the matrix is an even number, an extra empty column is appended at the end of the input matrix.
Our modified version has a different order of flow of inputs to the units: the inputs X, C and S are chosen from a unit's own output and the output of one of its two neighbors, whereas in the original sequence the inputs were chosen from the outputs of its two neighbors. The modified sequence can be seen in Figure 5-22a for an input array of width 5. The automatic generation of this sequence eliminates the possibility of human error. Figure 5-22b shows our architecture design for the implementation of parameterized QR decomposition hardware.
Figure 5-20: QR systolic array for 11x11 matrix
5.2.1 Reuse across Design
This architecture has the same building blocks as the GR based systolic array architecture. Therefore, it can directly reuse the design units: (i) boundary unit, (ii) internal unit, (iii) full row, (iv) multiplier, and (v) implementation types; and connect them with temporary storage and control logic in a new top level module. The internal and boundary units require slight modification, because in the linear array these units no longer need internal temporary registers for intermediate results.
5.2.2 Storage Space
Each unit in the row handles "n" entries of the final R matrix. Therefore, each one maintains a vector of n values of type DATA TYPE. This can result in a huge storage size to be implemented in LUT slices and register files. Therefore we moved this storage into BRAM. The size of the interim R values is n x m x word-length bits.

Figure 5-21: QR linear array for 11x11 matrix

Figure 5-22: (a) Indexes of values of r generated while processing 2 consecutive rows interleaved (b) QRD linear array for mxm matrix
All the interim X, C and S outputs generated in a cycle are consumed in the very next cycle. Therefore we only need a single set of output registers for each processing unit. The size of the vector of interim x_out, c and s values is (3 x m - 1) x word-length bits.

The size of the look-up table of control signals for the X, C and S inputs is n x m x 3 bits, while the control signals for the input and output R are stored in two log2(m)-bit counter registers. Details of the control signals are discussed in the next section.
5.2.3 Control Circuitry
Control circuitry is required to pick the right set of inputs and outputs for each
sub-unit in each of the iterations.
Both input and output R values consumed and generated in each iteration are one
row of the interim R memory block. A counter register is used to select the input row,
while a register storing the previous value of counter is used to select the destination
of the output generated in any given iteration. Figure
5-22a shows the indexes of
the entries of matrix R generated by each unit for a matrix size 5x5. These R values
can be re-routed to get the matrix R using the sequence generated by our proposed
algorithm, shown in Figure 5-23.
The X, C and S outputs generated in each cycle are stored directly in the output
registers array.
Inputs C and S for each internal block come from an output register: either its
own or that of its left neighboring unit. So the control circuitry required to pick
the right C, S input pair for the next iteration consists of a 2-input MUX per
unit (1-bit select signal).
Input X for each internal block comes from 3 sources: the input matrix row,
its own output register, or the output register of its right neighboring unit; whereas
input X for the external unit comes either from the input matrix row or the output register
of its right neighboring unit. So the control circuitry required to pick the right X
%% input:  n matrix size
%% output: nxm matrix of coordinates (a,b)
m = ceil(n/2);
iA = 1;        %% index for current input row
iB = m;        %% index for previous input row
cA = 2;        %% counter for current row
cB = n+2;      %% counter for previous row
a = 0;         %% x coordinate in the result matrix R
b = 0;         %% y coordinate in the result matrix R
flag = False;
strInd1 = zeros(n);  %% starting indexes
for (all str) strInd1(str) = 3;
for (str <= m-1) strInd1(str) = 2;
for steps = 1:1:n
    for j = iA:-1:1
        if (cA-j > 0 && cA-j <= n)
            a = j;
            b = cA - a;
            if (a == b)
                S(steps,1) = [a, b];
            elseif (a+1 == b)
                S(steps,2) = [a, b];
            else
                S(steps, strInd1(a)) = [a, b];
                if (strInd1(a)+1 <= m-a+1)
                    strInd1(a) = strInd1(a) + 1;
    end end end end
    for j = iB:-1:1
        if (cB-j > 0 && cB-j <= n)
            a = j;
            b = cB - a;
            if (a == b)
                S(steps,1) = [a, b];
            elseif (a+1 == b)
                S(steps,2) = [a, b];
            else
                for l = 3:m
                    if (strcmp(S(steps,l), ...))
                        S(steps,l) = [a, b];
                        break;
    end end end end
    cA = cA + 1;          %% increment index loop 1
    if (mod(steps,2) == 0)
        cB = cB + 1;      %% increment index loop 2
    else
        ...
end end

Figure 5-23: Algorithm to generate R state machine.
%% input:  n matrix size
%% output: nxm matrix of indexes for x and cs
m = ceil(n/2);
a = 1; b = m-a;
max = maximum value in available word length
outX = zeros(n,m);
outCS = zeros(n,m);
for i = 1:n
    c = 0;
    for j = 1:m
        if (j == i || (i > m && j >= m))
            outX(i,j) = max;
        else
            if (j-1 == c)     outX(i,j) = 0;  %% right neighbour's output
            elseif (j-2 == c) outX(i,j) = 1;  %% its own output
            end
            c = c + 1;
        end
        if (j == 1)
            outCS(i,j) = 0;   %% don't care condition
        elseif (i > n-j+1)
            outCS(i,j) = 1;   %% its own output
        else
            outCS(i,j) = 0;   %% left neighbour's output
        end
    end
end

Figure 5-24: Algorithm to generate the control sequence for the CS and X inputs.
input for the next iteration consists of a 2-input MUX for the external unit and one 3-input
MUX for each internal unit (1- and 2-bit select signals).
Control signals for these MUXes are stored in a look-up table of size n x m x (1 +
2) bits. Figure 5-24 shows the proposed algorithm for populating this look-up table for
varying sizes of n.
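As a cross-check, the look-up-table generation of Figure 5-24 can be transcribed directly into a runnable sketch (Python here; the sentinel `MAX` stands in for the "maximum value in available word length" of the listing, and the 0/1 encodings follow the listing's comments):

```python
def make_xcs_lut(n, word_length=32):
    """Populate the X and CS mux-control tables for an n-wide linear array,
    following the algorithm of Figure 5-24 (1-based loops mapped to 0-based lists)."""
    m = -(-n // 2)                    # ceil(n/2)
    MAX = 2**word_length - 1          # sentinel for "take X from the input matrix row"
    outX  = [[0] * m for _ in range(n)]
    outCS = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        c = 0
        for j in range(1, m + 1):
            if j == i or (i > m and j >= m):
                outX[i-1][j-1] = MAX
            else:
                if j - 1 == c:
                    outX[i-1][j-1] = 0     # right neighbour's output
                elif j - 2 == c:
                    outX[i-1][j-1] = 1     # its own output
                c += 1
            if j == 1:
                outCS[i-1][j-1] = 0        # don't-care condition
            elif i > n - j + 1:
                outCS[i-1][j-1] = 1        # its own output
            else:
                outCS[i-1][j-1] = 0        # left neighbour's output
    return outX, outCS
```

Running it for n = 5 produces the 5x3 control tables whose 1- and 2-bit entries would be packed into the n x m x 3 bit look-up table described above.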
5.3
MGS Based QRD Systolic Array
We arranged the block diagram for the MGS systolic array as shown in Figure 5-25 to
illustrate its similarity with the Givens systolic architecture. It shows the 5x5 systolic
architecture for MGS. The boundary unit, DP, in this specific architecture implements lines
7 and 8 of the algorithm shown in Figure 3-2 and involves norm computation, square-root
computation and vector scaling operations. The internal units, TPs, in a row implement
one iteration of the internal loop on line 9 of the algorithm shown in Figure 3-2 and involve
dot product, vector scaling and vector subtraction. The throughput of this design is
equal to the latency of each row, and consequently to the latency of each computational block.
Figure 5-25: MGS Systolic array
5.3.1
Reuse across algorithm
We implemented this design using the same implementation style as that of the GR
systolic array, but with different boundary and internal units. Each row now contains 1
new boundary and n-1 new internal units. The implementations of the computation unit
and the multiplier unit are reused from the previous architecture.
5.3.2
Vector operations
Both boundary and internal units in MGS based QRD involve vector operations, such
as dot product, vector scaling and vector difference, on vectors of size n, where n is the
width of the input array. Therefore, with an increase in the size of the input array, not only
does the size of the systolic array grow, but the size of each sub-unit also grows. This results
in poor scalability.
To improve the scalability of this design we present a pipe-lined architecture
of a batch processing unit, which processes an input array of size n while instantiating a
processing-unit (PU) array of size equal to only a fraction of n. Figures 5-26 a and b
show our architecture for batch product and batch accumulation.
A shift register of size p is used to feed the next p values of the input vector into the
batch processing units in each iteration. A Batch-Product unit takes a set of p values
and generates p products in each iteration, while the Batch-Accumulator takes in p
inputs and performs p-to-1 compression in each iteration, generating an output only
every n/p iterations. The p-to-1 compression tree implementation can affect the critical
path by performing multiple additions in one cycle. This module can be tuned and
tested with different configurations to find an ideal p value.
If it is configured for a 4x4 matrix with PU array size = 4, the extra wires will
be trimmed during the Bluespec compilation process, resulting in hardware equivalent to
a completely parallel design of the vector processing unit [13]. Whereas almost the same
amount of processing units can be used to implement an 8x8 QRD if the PU array size is
set to 1.
The size of the processing unit array (PU array) is configurable for both DP and TP,
and should be selected such that p is a factor of n.
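Functionally, the batch product and batch accumulation stages reduce a length-n dot product while using only p multipliers per iteration. A behavioural sketch (Python; this models only the arithmetic, not the shift registers or hardware pipelining, and the function name is ours):

```python
def batch_dot_product(x, y, p):
    """Dot product of two length-n vectors consumed in batches of p elements,
    as the PU array would see them; p must be a factor of n."""
    n = len(x)
    assert n == len(y) and n % p == 0, "p must be a factor of n"
    acc = 0
    for start in range(0, n, p):                                 # one batch per iteration
        products = [x[start+k] * y[start+k] for k in range(p)]   # Batch-Product stage
        acc += sum(products)                                     # p-to-1 compression + accumulate
    return acc
```

With p equal to n this degenerates to the fully parallel vector unit; with p = 1 the same loop serializes the whole dot product through a single multiplier, which is the trade-off the configurable PU array size exposes.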
Figure 5-26: (a) Batch Adder (b) Batch Multiplier
5.3.3
Control circuitry
The data dependency of sub-units on the output of the previous sub-unit in the MGS based
systolic array is similar to the data dependency in the GR based systolic array. Therefore,
as was done for the GR based systolic array, we used FIFOs of size 2 to synchronize the
inputs of the MGS based systolic array sub-units.
5.3.4
Storage Space
Each sub-unit has one register to store the interim output value r_ij. There are in total
n(n-1)/2 interim output registers, equal to the non-zero entries of the output matrix R.
Each unit has its own set of temporary registers to hold a column of the input matrix, as
this algorithm requires the complete matrix to be in memory before it can begin an
iteration. There are n(n-1)/2 sub-units.
5.4
MGS Based QRD Linear Array
The similarity between the systolic GR and systolic MGS architectures points towards
the possibility of a linear MGS, but due to the increasingly huge temporary storage space
required for a linear array, interleaving of inputs is not feasible for MGS. Nevertheless,
it reduces area by a factor of n, at the cost of a decrease in throughput by a factor
of n.
5.4.1
Reuse across design and algorithm
Like linear GR, this architecture re-uses modules/units from the systolic array
implementation. The only new module implemented for this architecture is the top module
enclosing the unit-row, i.e. the unit-row implemented in the systolic array is enclosed in a
new top level module which has specialized control circuitry.
5.4.2
Control circuitry
The control circuitry for this algorithm is a 2-entry state machine, which can be
implemented with a single MUX and a counter. We used FIFOs to guard and synchronize the
inputs for each unit.
5.4.3
Storage Space
Each sub-unit has one register to store the interim output value r_ij. There are only n
interim output registers, equal to the number of sub-units, as the output is pushed
out as soon as it is ready at the end of each iteration. Like the systolic array sub-units,
each of the n sub-units in the linear array has its own set of temporary registers to hold
a column of the input matrix.
Chapter 6
Results and Discussion
6.1
Experiment Conditions and Setup
6.1.1
Configuration Parameters
Following are the configurable parameters for our architecture:
1. Algorithm - Givens based QR, MGS based QR
2. Array structure - Systolic, Linear
3. Computational unit's implementation - Linear approximation, Log domain computation, Newton-Raphson
4. Multiplier implementation style - Firm (DSP based), Soft (LUT based)
5. Pipeline stages for multiplier
6. Word length for input array elements
7. Multiplier sharing - dedicated, shared pipe-lined multiplier
8. Size of computation vectors in MGS
6.1.2
Experiment Design
We tested implementations of four parameterized architectures, including:
(1) Systolic Givens rotations based (GR sys)
(2) Linear Givens based (GR lin)
(3) Systolic MGS based (MGS sys)
(4) Linear MGS based (MGS lin)
We used the following three implementation techniques for the above-mentioned architectures:
(a) Linear approximation (LA)
(b) Log domain computation (Log)
(c) Newton Raphson method (NR)
For each design we used 4 different configurations for the type of multiplier:
(i) All DSP multipliers
(ii) LUT based multipliers used in external units
(iii) LUT based multipliers used in internal units
(iv) All LUT based multipliers
All experiments are configured for complex-valued input matrices, with the word
length of both real and imaginary parts equal to 32 bits (where the integer
and fraction parts are each 16 bits long). A single shared 3-stage pipe-lined multiplier is
used in the internal units for GR based linear and systolic arrays. The length of the vector of
processing units for MGS based QR is set to 1 for all sizes.
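The 32-bit word with 16-bit integer and 16-bit fraction parts corresponds to a Q16.16 fixed-point format. A minimal conversion sketch (Python, illustrative only; truncation behaviour and helper names are our choices, not a description of the hardware's rounding):

```python
FRAC_BITS = 16  # 16-bit fraction part of the 32-bit word

def to_q16_16(value):
    """Quantize a real value to a Q16.16 word (truncating here)."""
    return int(value * (1 << FRAC_BITS))

def from_q16_16(word):
    """Recover the real value represented by a Q16.16 word."""
    return word / (1 << FRAC_BITS)

def q16_16_mul(a, b):
    """Fixed-point multiply: form the full product, then drop 16 fraction bits."""
    return (a * b) >> FRAC_BITS
```

For example, 1.5 x 2.0 in Q16.16 gives back exactly 3.0, since both operands and the result are representable; values with more than 16 fraction bits are truncated.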
6.2
Performance on FPGA
The experiment set-ups were evaluated by compiling the configured QRD from Bluespec
to Verilog code using BSC and subsequently acquiring Place & Route results for a
Xilinx Virtex-6 FPGA (XC6VLX240T).
FPGAs are composed of an array of configurable logic blocks (CLBs), each containing
multiple slices. Each of these slices contains LUTs, flip-flop registers, a carry chain and
combinatorial circuitry. An interconnect network, comprised of switch matrices, connects
slices to each other and CLBs with neighboring CLBs. In addition to the uniform
array of CLBs, FPGAs come equipped with specialized state-of-the-art IP blocks such
as Block RAM, digital signal processing blocks (DSPs), analog-to-digital converters,
high speed IOs etc. These blocks are aimed at bridging the performance gap between
custom ASICs and reusable FPGAs for general applications, as well as at reducing the
area overhead of the interconnect network used for programming the gate-arrays in an
FPGA.
The Virtex-6 FPGA (XC6VLX240T) has a total of 37,680 slices. Each slice contains
four 6-input LUTs, each of which can be broken into two 5-input LUTs for maximum
device utilization. Each slice also contains 8 registers (flip-flops). The total count of 6-input
LUTs is 150,720, and that of slice registers (flip-flops) is 301,440. It also has 768
DSP48E1s (each containing a 25 x 18 two's complement multiplier/accumulator), 3,770Kb
of distributed RAM and 1,885 shift registers.
The Place & Route results of the experiments are presented in Appendix A (Tables
A.1 and A.2) for the various configurations, comparing area/resource utilization as well as
throughput and latency. These results are discussed in detail in the following sections.
6.2.1
Linear versus Systolic Arrays
6.2.1.1
GR based Arrays
Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers,
for GR linear and systolic arrays, for all three implementation techniques, are
Figure 6-1: DSP block usage in GR Linear and Systolic Arrays with all DSP based
Multipliers
shown in Figures 6-1, 6-2 and 6-3, respectively. Figures 6-4 and 6-5 show throughput
and latency for each of these implementations. These results were computed for all
DSP based 3-stage pipe-lined shared multipliers.
From the data used to plot the resource utilization graphs, it can be inferred that
systolic arrays grow 9 to 11 times faster than linear arrays in terms of DSP utilization;
10-11 times faster in LUT utilization; and 6-11 times faster in register utilization,
for the three implementation techniques. Growth in resource utilization
is linear in linear arrays, and exponential in systolic arrays.
In terms of throughput, the systolic array always out-performs the linear array, as can be
seen in Figure 6-4. This is because of the increased level of parallelism achieved in
systolic arrays.
The throughput of the systolic array is not directly dependent on the size of the input array,
as depicted by the straight lines in Figure 6-4; although with the increase in array size
a larger word length is required to preserve precision, and this increase in word length
affects both the cycle time (longer critical path) and the cycle count (more pipe-line stages in
multipliers).
Throughput of linear arrays diminishes linearly as the array size increases.
From the above comparison between these implementations, it can also be inferred
that Log and NR based systolic arrays are suitable for smaller array sizes n. In this
analysis the threshold value is 5 for high-DSP-utilization applications. For larger
arrays, of size 6 or greater, the Log and NR linear arrays are better
suited.
Figure 6-2: Slice LUT usage in GR Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-3: Registers usage in GR Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-4: Throughput of GR Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-5: Latency of GR Linear and Systolic Arrays with all DSP based
Multipliers
6.2.1.2
MGS based Arrays
Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers,
for MGS linear and systolic arrays with the LA based implementation technique,
are shown in Figures 6-6, 6-7 and 6-8, respectively. Figures 6-9 and 6-10 show throughput
and latency for each of these implementations. These results were computed for
all-DSP based 3-stage pipe-lined shared multipliers. The size of the vector of multipliers
and adders used to implement the batch product, accumulator and subtraction
units was set to 1 for all array sizes.
It can be seen from Figure 6-6 that for input array sizes n larger than 4, the systolic
array implementation doesn't fit on the Virtex-6. DSP resource utilization in the linear
array for MGS grows linearly while in the systolic array it grows exponentially, as can be
seen in Figure 6-6. Register and LUT utilization grows exponentially in both linear
and systolic designs, but in the linear array the growth is slower than in the systolic array, as
can be seen in Figures 6-8 and 6-7. On the other hand there is a significant reduction
in throughput going from systolic to linear array, as can be seen in Figure 6-9.
The relationship between the area and throughput of linear versus systolic arrays is the
same as that of the GR based implementation, as discussed in the previous section.
Figure 6-6: DSP block usage in MGS Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-7: Slice LUT usage in MGS Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-8: Slice Register usage in MGS Linear and Systolic Arrays with all DSP
based Multipliers
Figure 6-9: Throughput of MGS Linear and Systolic Arrays with all DSP based
Multipliers
Figure 6-10: Latency of MGS Linear and Systolic Arrays with all DSP based
Multipliers
6.2.2
GR versus MGS
DSP block, LUT slice and register usage for both GR and MGS based linear and
systolic arrays, with LA based computational units, are presented in Figures 6-11, 6-12
and 6-13; whereas throughput and latency for each of these four implementations are
shown in Figures 6-14 and 6-15, respectively. These results were computed for
all-DSP based 3-stage pipe-lined dedicated multipliers. Word length was set to 32 bits
for all array sizes. For the MGS based implementation, the size of the vector of multipliers
and adders used to implement the batch product, accumulator and subtraction units
was set to 1 for all sizes.
It is obvious from Figures 6-11, 6-12 and 6-13 that the MGS based implementation
takes more area on chip than all implementations of GR, even when shared
multipliers/adders are used for each product/sum in the batch processing units. Compared to
MGS, the latency of GR is better in both array designs.
As discussed in the architecture design, for an array of size e where e is an even number,
the GR linear array is implemented for size e+1. For small arrays, of size less than 5, the
area and time overhead of this extra unit can be significant. In such cases,
MGS and GR systolic are more suitable than GR linear.
In terms of latency of decomposing a full matrix, MGS linear and systolic arrays
perform better than GR linear and systolic arrays for all sizes of matrix, respectively.
Figure 6-11: DSP block usage in GR and MGS, Linear and Systolic Arrays
implemented using LA blocks and all DSP based Multipliers
Figure 6-12: Slice LUT usage in GR and MGS, Linear and Systolic Arrays
implemented using LA blocks and all DSP based Multipliers
Figure 6-13: Slice Register usage in GR and MGS, Linear and Systolic Arrays
implemented using LA blocks and all DSP based Multipliers
Figure 6-14: Throughput of GR and MGS, Linear and Systolic Arrays implemented
using LA blocks and all DSP based Multipliers
Figure 6-15: Latency of GR and MGS, Linear and Systolic Arrays implemented
using LA blocks and all DSP based Multipliers
6.2.3
Comparison between Computational Unit Implementation Techniques
6.2.3.1
Linear Arrays
Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers,
for GR based linear arrays, for all three implementation techniques (LA, Log
and NR), are presented in Figures 6-16, 6-17 and 6-18 respectively. Figures 6-19 and
6-20 show throughput and latency for each of these implementations, respectively.
These results were computed for all-DSP based 3-stage pipe-lined dedicated multipliers.
Word length was set to 32 bits for all array sizes.
As is evident from these figures, NR is the most efficient in terms of DSP slice
utilization, whereas Log is most efficient in terms of LUT and register utilization as
well as throughput.
The throughput of the linear array implemented using Log based units is highest, closely
followed by that implemented using LA units, as shown in Figure 6-19.
Figure 6-16: DSP block usage in GR Linear Arrays with different Computational
block implementations
Figure 6-17: Slice LUT usage in GR Linear Arrays with different Computational
block implementations
Figure 6-18: Slice Register usage in GR Linear Arrays with different Computational
block implementations
Figure 6-19: Throughput of GR Linear Arrays with different Computational block
implementations
Figure 6-20: Latency of GR Linear Arrays with different Computational block
implementations
6.2.3.2
Systolic Arrays
Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers,
for GR based systolic arrays, for all three implementation techniques (LA,
Log and NR), are presented in Figures 6-21, 6-22 and 6-23 respectively. Figures 6-24
and 6-25 show the throughput and latency for each of these implementations, respectively.
These results were computed for all-DSP based 3-stage pipe-lined dedicated
multipliers. Word length was set to 32 bits for all array sizes.
It is obvious from these figures that trends similar to those observed in the
linear arrays are seen in the systolic arrays, in terms of DSP, LUT and register utilization
as well as throughput and latency, for all three implementation techniques.
Figure 6-21: DSP block usage in GR Systolic Arrays with different Computational
block implementations
Figure 6-22: Slice LUT usage in GR Systolic Arrays with different Computational
block implementations
Figure 6-23: Slice Register usage in GR Systolic Arrays with different
Computational block implementations
Figure 6-24: Throughput of GR Systolic Arrays with different Computational block
implementations
Figure 6-25: Latency of GR Systolic Arrays with different Computational block
implementations
6.2.4
Multiplier Implementation (Firm versus Soft)
In FPGAs two different types of multipliers can be inferred:
1) Firm, built-in multipliers in DSP slices; and
2) Soft, constructed using CLB LUTs or other memory elements.
Both of these multipliers come with their own pros and cons. For example, DSP
multipliers are optimized and most efficient in terms of area/time utilization, but
their position on the die is fixed; therefore, for applications which require a high number
of multipliers, the routing cost may overshadow the benefit achieved by using DSP
multipliers. On the other hand, soft multipliers can be plugged into any part of the
design by implementing them using LUTs in the CLBs. CLBs come geared with a carry
propagate channel, making it easier to implement accumulators. Since multipliers
are partial product generators and accumulators, this structure of the CLB comes into use.
There are multiple different implementations available for soft multipliers, and the
optimal design in terms of critical path and resource utilization depends on the
relationship between the multiplicand/multiplier word length and the LUT input size.
Utilization trends for FPGA resources, including DSP blocks, LUT slices and
registers, for GR based linear arrays with the LA based implementation (all implemented
for 32-bit word length, complex-valued input) are presented in Figures 6-26, 6-27 and
6-28, respectively; whereas Figures 6-29 and 6-30 show throughput and latency for
each of these implementations. For these tests, four different multiplier configurations
are used in the sub-units: (1) all multipliers implemented using DSP blocks (dsp);
(2) LUT based multipliers used in external units (lutex); (3) LUT based multipliers
used in internal units (lutin); and (4) all LUT based multipliers (lut).
As can be seen from these figures, throughput is slightly better for the all-DSP
multiplier configuration for smaller sizes of input array, but with the increase in
array size the difference between all four configurations becomes negligible.
Figure 6-26: DSP block usage in GR Linear Arrays with different Multiplier
implementations
Figure 6-27: Slice LUT usage in GR Linear Arrays with different Multiplier
implementations
Figure 6-28: Slice Register usage in GR Linear Arrays with different Multiplier
implementations
Figure 6-29: Throughput of GR Linear Arrays with different Multiplier
implementations
Figure 6-30: Latency of GR Linear Arrays with different Multiplier implementations
6.3
Omega Notation Analysis of Design Parameters with increase in Input Array Size
6.3.1
Throughput
Throughput of our implemented architectures is discussed in the following subsections.
6.3.1.1
Systolic Arrays
The throughput of QRD for a 3x3 array of complex elements represented in fixed-point
(word length 32, 16 bits for the fractional part) using the GR based systolic array
implementation is shown in Table 6.1. For these results, shared 3-stage pipe-lined multipliers
were used.
Both GR and MGS based systolic arrays can accept a new row in max(latency
of internal unit, latency of boundary unit) cycles for all input array sizes m x n.
Throughput for the GR based systolic array is O(1), because the latencies of the internal
and boundary units do not depend on the array size for a given word length.
Throughput of the MGS based systolic array diminishes linearly with the increase in
Table 6.1: Throughput observed for Systolic Arrays (LA based, with all DSP based
shared Multipliers)

Algorithm  array  PU array  min clock    throughput  max clock        throughput
           size   size      period (ns)  (cycles)    frequency (MHz)  (MRows/s)
GR         3      NA        9.628        16          103.86           6.49
GR         5      NA        9.796        16          102.08           6.38
GR         7      NA        9.983        16          100.17           6.26
MGS        2      1         9.955        18          100.45           5.58
MGS        3      1         9.983        28          100.17           3.58
MGS        4      1         9.972        42          100.28           2.39
array size (O(n)), because the latency of the MGS based internal and boundary units is
incremented by a constant number of cycles. This increase in cycles is due to the time
it takes a PU array to consume the extra elements of the batch input vector in series. If
the size of the PU array is set equal to the size of the array, then the throughput of the MGS
systolic architecture will also be constant for growing input array sizes.
6.3.1.2
Linear Arrays
The throughput of QRD for a 3x3 array of complex elements represented in fixed-point
(word length 32, 16 bits for the fractional part) using the GR based linear array
implementation is shown in Table 6.2. For these results, shared 3-stage pipe-lined multipliers
were used.
Both GR and MGS based linear arrays can accept a new row in m x max(
latency of internal unit, latency of boundary unit) cycles for all input array sizes m
x n.
Throughput for the GR based linear array decreases by O(n²), or O(mn), with increase
in array size m x n.
Throughput of the MGS based linear array diminishes by O(n³) for shared multipliers
(PU size < array size) and O(n²) for PU size equal to array size.
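The row-acceptance rates above can be summarized in one place. A sketch (Python; `l_int` and `l_bnd` stand for the internal and boundary unit latencies in cycles, which are assumptions to be measured per configuration, and the function names are ours):

```python
def systolic_row_interval(l_int, l_bnd):
    """Cycles between rows accepted by a systolic array; independent of array size."""
    return max(l_int, l_bnd)

def linear_row_interval(m, l_int, l_bnd):
    """Cycles between rows accepted by a linear array of m units; grows with m."""
    return m * max(l_int, l_bnd)
```

This makes the asymptotic claims explicit: the systolic interval is O(1) in the array size for a fixed word length, while the linear interval scales with m, so the per-matrix throughput falls off accordingly.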
Table 6.2: Throughput observed for Linear Arrays (LA based, with all DSP based
shared Multipliers)

Algorithm  array  PU array  min clock    throughput  max clock        throughput
           size   size      period (ns)  (cycles)    frequency (MHz)  (MRows/s)
GR         3      NA        9.543        47          104.79           2.23
GR         5      NA        9.755        77          102.51           1.33
GR         7      NA        9.567        107         104.53           0.98
GR         9      NA        9.642        137         103.71           0.76
MGS        2      1         9.594        85          104.23           1.23
MGS        3      1         9.987        152         100.13           0.66
MGS        4      1         9.962        241         100.38           0.42
MGS        5      1         10.41        338         96.06            0.28
6.3.2
Latency
The latency of our implemented architectures and their sub-units is discussed in the following subsections.
6.3.2.1
Systolic Arrays
The latency for 1 row, for both GR and MGS based systolic arrays, is (latency of internal
unit + latency of boundary unit) cycles for all array sizes m x n.
The latency for a full matrix for the GR based systolic array grows as O(n) for array size m x
n (where m < n), as each of the n rows passes through at least m rows of boundary
and internal units in parallel.
The latency for a full matrix for the MGS based systolic array increases as O(n²) for shared
multipliers (PU size < array size) and O(n) for PU size equal to array size, for all
array sizes m x n, where m < n.
6.3.2.2
Linear Arrays
The latency for 1 row, for both GR and MGS based linear arrays, is (2m - 1) x max(
latency of internal unit, latency of boundary unit) cycles for all array sizes m x
n.
The latency for a full matrix for the GR based linear array grows as O(m x n) for array size
m x n (where m < n), because each input row passes through at least m rows of
boundary and internal units sequentially, and 2 rows are processed in parallel.
The latency for a full matrix for the MGS based linear array increases as O(n³) for shared
multipliers (PU size < array size) and O(n²) for PU size equal to array size, for all
array sizes m x n, where m < n.
6.3.2.3
Internal Unit
1. The latency of the internal unit in GR based QRD is equal to the latency of 4 multiplications,
done in parallel or series, + the latency of 2 additions + 1 cycle of FIFO delay. The
latency of the internal unit implemented using a dedicated 3-stage pipe-lined
multiplier for each product was observed in our experiment to be 6 cycles. The
latency of the internal unit implemented using a shared 3-stage pipe-lined
multiplier for the 4 products was observed in our experiment to be 12 cycles.
2. The latency of the internal unit in MGS based QRD is equal to the latency of the dot product
computation unit (DOT) + the latency of offset correction (OC), where

latency of DOT = (size of input vector / size of vector of multipliers) x latency of a
multiplication + (size of input vector / size of vector of adders used for batch
accumulation) x latency of an addition,

and latency of OC = (size of input vector / size of vector of multipliers) x latency of a
multiplication + (size of input vector / size of vector of adders used for subtractions)
x latency of an addition.

The latency of the internal unit, implemented using a dedicated 3-stage pipe-lined
multiplier for each product with the size of the vector of computation units equal
to 3, was observed in our experiment to be 16 cycles.
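The DOT and OC expressions above can be written out as a small sketch (Python; all parameters are symbolic counts, the function names are ours, and the integer division assumes the vector sizes divide the input size evenly, as required for the batch units):

```python
def dot_latency(n, n_mul, n_add, l_mul, l_add):
    """Dot-product unit latency: n/n_mul batched multiplies
    plus n/n_add batch-accumulation additions."""
    return (n // n_mul) * l_mul + (n // n_add) * l_add

def oc_latency(n, n_mul, n_sub, l_mul, l_add):
    """Offset-correction latency: scaling multiplies plus subtractions."""
    return (n // n_mul) * l_mul + (n // n_sub) * l_add

def mgs_internal_latency(n, n_mul, n_add, n_sub, l_mul, l_add):
    """Internal-unit latency in MGS based QRD: DOT followed by OC."""
    return (dot_latency(n, n_mul, n_add, l_mul, l_add)
            + oc_latency(n, n_mul, n_sub, l_mul, l_add))
```

The formulas make the scalability trade-off visible: shrinking the multiplier/adder vectors (smaller `n_mul`, `n_add`, `n_sub`) saves area but inflates each term proportionally.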
6.3.2.4
Boundary Unit
1. The latency of the boundary unit in GR based QRD depends on the type of implementation.
(a) For the LA based implementation, the latency is equal to 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + the latency of a table look-up + one multiplication + one addition. The observed value of this latency in our experiment, for a 32-bit word length implemented using a dedicated 3-stage pipelined multiplier, is 14 cycles.
(b) For the Log domain based implementation, the latency is equal to 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + the latency of one Log-values table look-up + one shift + one addition + the latency of 2 Exponential-values table look-ups in parallel or series. The observed value of this latency in our experiment, for a 32-bit word length implemented using a dedicated 3-stage pipelined multiplier, is 11 cycles.
(c) For the NR based implementation, the latency is equal to 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + 1 square root computation + 1 division, where both the square root unit and the divider take ceil(word length / iterations per cycle) cycles. The observed value of this latency in our experiment, for a 32-bit word length implemented using a dedicated 3-stage pipelined multiplier, is 60 cycles.
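The ceil(word length / iterations per cycle) term can be made concrete with a small helper. This is a sketch; the one-iteration-per-cycle default is an assumption matching the bit-serial square-root and divider units described in the area section.

```python
from math import ceil

def nr_cycles(word_length, iterations_per_cycle=1):
    """Cycles taken by the bit-serial NR square-root or divider unit: each cycle
    retires `iterations_per_cycle` iterations, one iteration per result bit."""
    return ceil(word_length / iterations_per_cycle)

# For a 32-bit word at one iteration per cycle each unit takes 32 cycles, so the
# square root plus the division alone contribute 64 cycles to the boundary unit.
print(nr_cycles(32), nr_cycles(32) * 2)
```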
2. Latency of the boundary unit in MGS based QRD is equal to the latency of the norm computation unit (NORM) + the latency of the square root computation unit (SQR) + max(latency of a multiplication, latency of the vector product unit (VP)), where

latency of NORM = ceil(size of input vector / size of vector of multipliers) x latency of a multiplication + ceil(size of input vector / size of vector of adders used for batch accumulation) x latency of an addition;

latency of VP = ceil(size of input vector / size of vector of multipliers) x latency of a multiplication;

and the latency of SQR depends on the type of implementation.
(a) For the LA based implementation, the latency is equal to the latency of a table look-up + one multiplication + one addition.
(b) For the Log domain based implementation, the latency is equal to the latency of one Log-values table look-up + one shift + one addition + the latency of 2 Exponential-values table look-ups in parallel or series.
(c) For the NR based implementation, it takes ceil(word length / iterations per cycle) cycles.
The observed value of this latency in our experiment, for a 32-bit word length implemented using a dedicated 3-stage pipelined multiplier, is 28 cycles for the LA based implementation.
6.3.2.5
Multipliers
For both LUT and DSP based multiplier implementations, the latency is equal to the number of pipeline stages + 1. If a multiplier is shared to compute multiple products, then the latency for the last product is the number of pipeline stages + 1 buffering cycle + the number of inputs.
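The two multiplier latency rules above can be modeled directly (a sketch, with the stage count and input count as parameters):

```python
def dedicated_mult_latency(stages):
    """Dedicated pipelined multiplier: number of pipeline stages + 1."""
    return stages + 1

def shared_mult_last_product_latency(stages, num_inputs):
    """Shared pipelined multiplier: the last of `num_inputs` products completes
    after (pipeline stages) + (1 buffering cycle) + (number of inputs) cycles."""
    return stages + 1 + num_inputs

# A 3-stage multiplier, shared across the 4 products of a GR internal unit.
print(dedicated_mult_latency(3), shared_mult_last_product_latency(3, 4))
```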
6.3.3
Area
6.3.3.1
Systolic Array
For both GR and MGS, if the area of the boundary unit is AB gates and the area of the internal unit is AI gates, then the area of a systolic array implementation is m x AB + ((n + 1) x m/2 - m) x AI gates + the area of the connection network, for all array sizes m x n, with m < n.
6.3.3.2
Linear Array
For both GR and MGS, if the area of the boundary unit is AB and the area of the internal unit is AI, then the area of a linear array implementation is AB + (al - 1) x AI, where al is the number of processing units in the linear array, for all array sizes m x n, where m < n. This does not include the area occupied by the connection network.
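The systolic versus linear area tradeoff can be sketched with the unit areas as parameters. In this Python sketch the gate counts AB and AI, the systolic internal-unit count, and the linear-array PU count al are all placeholders, not figures from the P&R results:

```python
def systolic_area(m, n_internal, area_boundary, area_internal, area_network=0):
    """Systolic array: m boundary units, n_internal internal units,
    plus the connection network."""
    return m * area_boundary + n_internal * area_internal + area_network

def linear_area(al, area_boundary, area_internal):
    """Linear array: one boundary unit plus (al - 1) internal units,
    excluding the connection network."""
    return area_boundary + (al - 1) * area_internal

# Placeholder gate counts for a 4 x 4 array folded onto al = 4 processing units.
AB, AI = 12000, 5000
print(systolic_area(4, 6, AB, AI), linear_area(4, AB, AI))  # 78000 27000
```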
6.3.3.3
Internal Unit
The area of the internal units, in gates, is as follows:
1. Area of the internal unit in GR based QRD is equal to the area of 1 to 4 multipliers + the area of 1 to 2 adders.
2. Area of the internal unit in MGS based QRD is equal to the area of 1 dot product computation unit (DOT) + the area of 1 offset correction unit (OC), where the area of both DOT and OC = size of vector of multipliers x area of a multiplier + size of vector of adders x area of an adder.
6.3.3.4
Boundary Unit
1. Area of the boundary unit in GR based QRD depends on the implementation.
(a) For the LA based implementation, the area is equal to 2 to 4 fixed-point multipliers + 1 LA look-up table + 3 adders, where each entry in the LA look-up table is 2 x word length bits long.
(b) For the Log domain based implementation, the area is equal to 2 to 3 fixed-point multipliers + 1 Log-values look-up table + 1 shifter + 3 adders + 1 to 2 Exponential-values look-up tables, where each entry in the Log-values and Exponential-values look-up tables is word length bits long.
(c) For the NR based implementation, the area is equal to the area of 1 to 3 fixed-point multipliers + 3 adders + the area of 1 square root computation unit + the area of 3 dividers.
The square-root unit has an adder, an up-to-2-bit shifter, a 1-word input FIFO, a 2-word output FIFO, and 5 state registers. The five state registers are: 3 data registers each of length one word, 1 counter register of length log2(word length) bits, and 1 flag register of length one bit.
The divider unit has an adder, a 1-bit shifter, a 2-word input FIFO, a 2-word output FIFO, and 5 state registers. The five state registers are similar to the ones in the square-root unit.
2. Area of the boundary unit in MGS based QRD is equal to the area of the norm computation unit (NORM) + the area of the square root computation unit (SQR) + the area of a multiplier + the area of the vector product unit (VP), where the area of NORM = size of vector of multipliers x area of a multiplier + size of vector of adders x area of an adder; the area of VP = size of vector of multipliers x area of a multiplier; and the area of SQR depends on the implementation.
(a) For the LA based implementation, the area is equal to 1 LA look-up table + 1 adder + 1 fixed-point multiplier, where each entry in the LA look-up table is 2 x word length bits long.
(b) For the Log domain based implementation, the area is equal to 1 Log-values look-up table + 1 shifter + 1 adder + 1 to 2 Exponential-values look-up tables, where each entry in the Log-values and Exponential-values look-up tables is word length bits long.
(c) For the NR based implementation, the square root unit has 5 state registers (3 data registers of length word-length bits, 1 counter register of length log2(word length) bits, and one 1-bit flag register), in addition to a word-length incoming FIFO and a 2 x word-length outgoing FIFO. It also contains a word-length adder and an up-to-2-bit shifter.
6.3.3.5
Multipliers
For complex numbers with both real and imaginary parts represented by 32-bit fixed-point numbers, a multiplier takes 6 DSP multipliers on the Virtex-6. A 16-bit fixed-point multiplier takes 1 DSP block or around 223 LUTs.
The area used by multipliers can be reduced by taking advantage of the pipelined nature of the multiplier and sharing it to compute multiple products, instead of using dedicated multipliers. This reduction in area comes at the cost of increased latency. But since the total latency of a row of internal and boundary units in all architectures discussed here equals the larger of the two latencies, and the boundary unit takes significantly more cycles than the internal unit, we can safely share multipliers in the internal unit without any loss in overall performance.
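The argument that sharing is free can be checked numerically: the row latency is the maximum of the two unit latencies, so the internal unit's latency can grow until it reaches the boundary unit's. The cycle counts below are the observed values quoted earlier for the GR configuration with an LA based boundary unit:

```python
def row_latency(internal_cycles, boundary_cycles):
    """Total latency of one row of internal + boundary units: the larger of the two."""
    return max(internal_cycles, boundary_cycles)

# GR internal unit: 6 cycles with dedicated multipliers, 12 with a shared one;
# LA based boundary unit: 14 cycles. Sharing leaves the row latency unchanged.
print(row_latency(6, 14), row_latency(12, 14))  # 14 14
```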
6.4
Target Oriented Optimization
The target oriented optimal choice for a specific MIMO and beam formation setup is discussed in the following sections.
Requirements for compressed sensing are problem specific and may range anywhere from very small sizes with time as the critical design parameter to medium and large sizes with area as the critical design parameter. The optimal solution can be found by searching the configuration space for the given input data range and matrix size. Therefore, the target based optimal choice for compressed sensing is not discussed here.
6.4.1
MIMO
6.4.1.1
Required Specifications
For MIMO of size 4x4, as recommended in [1], 16 bits are enough for preserving the precision [19]. In the 3G LTE OFDM signal [4] there are up to 2048 sub-carriers. The channel matrix R must be computed for all the sub-carriers within the duration for which the channel impulse response is invariant (the coherence time), which is computed as:

t_c = c / (v x f_c)    (6.1)

where c is the speed of light (3 x 10^8 m/s), v is the speed of the receiver, and f_c is the carrier frequency. For v = 250 km/h and f_c = 2.4 GHz, t_c = 1.8 ms, within which 2048 decompositions must be performed.
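Equation 6.1 can be verified directly with the stated parameters (a quick numerical check, not part of the IP):

```python
# Coherence time t_c = c / (v * f_c), Eq. (6.1).
c = 3e8            # speed of light, m/s
v = 250 / 3.6      # receiver speed: 250 km/h converted to m/s
f_c = 2.4e9        # carrier frequency, Hz

t_c = c / (v * f_c)
print(round(t_c * 1e3, 2))  # 1.8 (ms): the window in which 2048 QRDs must finish
```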
6.4.1.2
Optimized Configurations
Due to the small input-matrix size, systolic arrays can be employed for the best throughput. Table 6.3 shows the P&R results for the systolic GR and MGS arrays.
All these implementations complete 2048 computations in less than 1.8 ms, except for the NR based QRD. Among them, the systolic Log based GR QRD provides the best performance for the minimum area.
Table 6.3: P&R results for GR and MGS based Systolic Arrays for complex valued input array of size 4x4 and word length (6.10)

| Algorithm | Impl. type | PU array size | DSP usage | LUT usage   | Reg usage   | Processing time for 2048 rows (ms) |
| gr        | la         | NA            | 99 (12%)  | 19189 (12%) | 12069 (4%)  | 1.283 |
| gr        | log        | NA            | 30 (3%)   | 19201 (12%) | 11447 (3%)  | 0.876 |
| gr        | nr         | NA            | 22 (2%)   | 19056 (12%) | 14026 (4%)  | 1.867 |
| mgs       | la         | 1             | 243 (31%) | 33297 (22%) | 31562 (10%) | 0.857 |
| mgs       | la         | 2             | 438 (57%) | 41124 (27%) | 33031 (10%) | 0.674 |
Comparing these results, the most optimized choice for this given problem is the systolic Log based GR QRD.
6.4.2
Beam Formation
6.4.2.1
Required Specifications
The beam-former weights and channel estimates are computed using pilot symbols transmitted through the dedicated physical control channel (DPCCH). The updated beam-former weights are used for multiplication with the data transmitted through the DPCCH. For a narrower beam, a larger weight matrix dimension is required. As the area constraints get tighter and chip area becomes scarcer, a linear array becomes the more suitable choice.
6.4.2.2
Optimized Configurations
The GR linear array QRD architectures designed in this study can fit on a Virtex-6 even for matrices larger than 25x25. However, the choice between NR and Log based GR linear QRD depends on the DSP block budget and the throughput requirements.
Table 6.4 shows the area and timing results for GR linear NR and Log based QRD, implemented for word size 16 (6-bit integer part, 10-bit fraction part) using all-DSP multipliers.
Table 6.4: P&R results for GR Linear Arrays for complex valued input array, word length (6.10)

| Algorithm | Array size | Impl. tech. | DSP usage | LUT usage   | Reg usage   | Throughput (micro sec/row) |
| gr        | 17         | nr          | 25 (3%)   | 18308 (12%) | 13997 (4%)  | 6.92  |
| gr        | 17         | log         | 27 (3%)   | 18468 (12%) | 13266 (4%)  | 2.28  |
| gr        | 19         | nr          | 28 (3%)   | 19078 (12%) | 14675 (4%)  | 7.43  |
| gr        | 19         | log         | 30 (3%)   | 19101 (12%) | 13978 (4%)  | 2.54  |
| gr        | 21         | nr          | 31 (4%)   | 20035 (13%) | 15384 (5%)  | 8.31  |
| gr        | 21         | log         | 33 (4%)   | 20177 (13%) | 14654 (4%)  | 2.82  |
| gr        | 25         | nr          | 37 (4%)   | 22469 (14%) | 16805 (5%)  | 10.45 |
| gr        | 25         | log         | 39 (5%)   | 21876 (14%) | 16075 (5%)  | 3.37  |
6.5
Comparison with Previously Reported Results
We have also compared the results of our proposed architecture with those previously reported, in terms of throughput, latency and area. Specific comparisons have been made with the results reported by [7], [12], [14], [15], [19], [25], [16], [24] and [21].
This comparison is presented in Tables 6.5 and 6.6 for MIMO and beam formation, respectively. Comparing our results with previously reported ones shows that our implementation outperforms the earlier reported results in terms of performance.
6.5.1
MIMO
From the comparison of results for MIMO, it is apparent that our GR based systolic array implemented using Log domain computational units has the best throughput, closely followed by [15] running at 160 MHz and [19] when operated at the highest clock frequency setting. Note that a higher clock frequency results in higher power consumption for any given circuit.
A comparison between the CLB count for [21] and our architecture, taken from the P&R reports for the Virtex-6, shows that our Log, NR and LA based 4x4 QRD implementations took 2.4, 2.6 and 2.7 times more CLBs than the implementation presented in [21], while improving on its throughput by 54, 25 and 37 times, respectively.
6.5.2
Beam Formation
Since area is the critical requirement for beam formation, we only compared the area reported by previous authors, as shown in Table 6.6.
From Table 6.6 it is clear that the minimum area is utilized by our GR linear Log based QRD implementation. By using the insight that the boundary unit takes more cycles than the internal unit, we were able to share the multiplier in the internal unit, taking advantage of the extra cycles available to the internal unit to process one input. Consequently we reduced the total number of multipliers used by 2.5x.
Table 6.5: Comparison of our study results with previously reported for MIMO

| Algorithm | Gate count (Kgates) or Slice count | Clock freq (MHz) | Processing time for 2048 rows (ms) | MRows/sec | Cycles per QRD |
Previously reported results:
| [19] high clock-freq    | 23.2         | 269    | 1.058 | 0.13* | 480*  |
| [19] medium clock-freq  | 17.7         | 212    | 1.343 | 0.16* | 773*  |
| [19] low clock-freq     | 16.3         | 160    | 1.779 | 0.22* | 1357* |
| [16]                    | 61.8         | 162    | 1.28* | 0.16* | 966*  |
| [15]                    | 48.7         | 166    | 0.96* | 0.12* | 707*  |
| [14]                    | 27           | -      | -     | -     | 88    |
| [25]                    | 6            | 60     | -     | -     | 252   |
| [24]                    | 72           | 100    | -     | -     | 67    |
| [21] (32 bits)          | 1380 slices  | -      | -     | 0.17  | -     |
| [7] (12 bits)           | -            | -      | -     | 0.09  | -     |
This study:
| GR systolic LA          | 7449 slices (19%)  | 102.12 | 1.28 | 6.38 | 64  |
| GR systolic log         | 6733 slices (17%)  | 102.83 | 0.88 | 9.35 | 44  |
| GR systolic NR          | 7079 slices (18%)  | 114.04 | 1.87 | 4.39 | 104 |
| MGS systolic LA (PU=1)  | 13470 slices (35%) | 100.37 | 3.43 | 2.39 | 68  |
| MGS systolic LA (PU=2)  | 15762 slices (41%) | 100.32 | 2.69 | 3.04 | 52  |

* calculated from given data
Table 6.6: Comparison of our study results with previously reported for Beam formation

[The table compares, for each cell type, the counts of real multipliers, real beta multipliers, divisors, rounders, shifters and adders. It covers the boundary, internal and output cells of the previously reported linear array of [12] (clock freq. 100 MHz) against the boundary and internal cells of the GR linear arrays of this study with LA, Log and NR based boundary units.]
6.6
Guidelines for Architecture Selection
From the preceding results and discussion, the following guidelines for selecting an appropriate architecture for a given problem can be drawn, presented in Table 6.7.
Table 6.7: Selection of Appropriate Architecture

| Critical design parameters                     | Appropriate architecture | Impl. for boundary unit |
| Area                                           | GR Linear Array          | any    |
| DSP blocks                                     | GR Linear/Systolic       | NR     |
| Throughput                                     | GR Systolic              | Log    |
| Dynamic range of input elements and accuracy   | any                      | LA     |
| Latency, input matrix size < 4                 | MGS Systolic             | LA/Log |
| Latency and area, input matrix size < 4        | MGS Linear               | LA/Log |
| Latency and area, input matrix size > 4        | GR Systolic              | Log    |
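These guidelines can be encoded as a small lookup helper. This is a sketch; the dictionary mirrors one reading of Table 6.7, and the key strings are illustrative names, not identifiers from the IP.

```python
# Maps a critical design parameter to (appropriate architecture, boundary-unit impl.),
# as read from Table 6.7.
SELECTION = {
    "area": ("GR Linear Array", "any"),
    "dsp blocks": ("GR Linear/Systolic", "NR"),
    "throughput": ("GR Systolic", "Log"),
    "dynamic range and accuracy": ("any", "LA"),
    "latency, size < 4": ("MGS Systolic", "LA/Log"),
    "latency and area, size < 4": ("MGS Linear", "LA/Log"),
    "latency and area, size > 4": ("GR Systolic", "Log"),
}

def select_architecture(critical_parameter):
    """Return (architecture, boundary-unit implementation) for a design driver."""
    return SELECTION[critical_parameter.lower()]

print(select_architecture("throughput"))  # ('GR Systolic', 'Log')
```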
Chapter 7
Conclusions
We present a highly modular and completely parameterized implementation of two different algorithms, Givens Rotation (GR) and Modified Gram-Schmidt (MGS), chosen for their suitability for hardware implementation. From the results of implementing four parameterized architectures (systolic GR based, linear GR based, systolic MGS based, and linear MGS based) with three different configurations (linear approximation, log domain, and Newton-Raphson method), we conclude:
(1) A maximum throughput of 10.1 Mrows/sec was achieved by the GR based systolic array with the log domain QRD configuration for a 3x3 complex valued matrix on a Virtex-6 FPGA.
(2) The minimum slice utilization was achieved by the GR based linear array with the log domain QRD configuration, at the cost of reduced throughput.
(3) The GR based systolic array with log domain units proved to be the most resource efficient design.
(4) MGS based QRD outperforms GR in terms of latency, but is suitable only for input array sizes < 4, because of the exponential growth in area with increasing input array size.
(5) IPs for all the proposed architectures have been prepared and are available at http://saqib.scripts.mit.edu/qr..code.php. This set of IPs can be configured to suit a variety of application demands, generating hardware with zero design and debugging time.
(6) The guidelines drawn from the reported results can be used to pick the optimal design choice for a given design requirement.
(7) Because our architecture is completely modular, sub-units can be independently optimized and tested without the need to retest the whole system.
Bibliography
[1] 3GPP TR 25.876: Multiple input multiple output in UTRA. 3rd Generation Partnership Project, Tech. Rep., October 2005.
[2] MIMO and smart antennas for mobile broadband systems. LTE standard, 2013.
[3] Naofal Al-Dhahir and Ali H. Sayed. CORDIC-based MMSE-DFE coefficient computation. Digital Signal Processing: A Review Journal, pages 178-194, 1999.
[4] R. Bachl, P. Gunreben, S. Das, and S. Tatesh. The long term evolution towards a new 3GPP air interface standard. Bell Labs Technical Journal, 11(4):25-51, 2007.
[5] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Inform. Theory, 51(12):4203-4215, 2005.
[6] L. Dai, S. Sfar, and K. B. Letaief. Optimal antenna selection based on capacity maximization for MIMO systems in correlated channels. IEEE Transactions on Communications, 54(3):563-573, March 2006.
[7] Fredrik Edman and Viktor Owall. A scalable pipelined complex valued matrix inversion architecture. ISCAS, IEEE, pages 4489-4492, 2005.
[8] S. Haykin. Adaptive Filter Theory. Prentice-Hall, third edition, 1994.
[9] S.-F. Hsiao and J.-M. Delosme. Householder CORDIC algorithms. IEEE Trans. Comput., 44:990-1001, August 1995.
[10] A. Jraifi and E. H. Saidi. A prediction of the number of antennas in a MIMO correlated channel. International Conference on Intelligent Engineering Systems, 2008 (INES 2008), pages 181-184, February 2008.
[11] T. Kailath, H. Vikalo, and B. Hassibi. MIMO receive algorithms. Space-Time Wireless Systems: From Array Processing to MIMO Communications, Cambridge University Press, 2005.
[12] G. Lightbody, R. Walke, R. Woods, and J. McCanny. Linear QR architecture for a single chip adaptive beamformer. The Journal of VLSI Signal Processing, 24(1):67-81, 2000.
[13] Chih-Hung Lin, R. C.-H. Chang, Chien-Lin Huang, and Feng-Chi Chen. Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm. IEEE International Symposium on Circuits and Systems, 2009 (ISCAS 2009), 2009.
[14] Kuang-Hao Lin, R. C. Chang, Chien-Lin Huang, and Feng-Chi Chen. Implementation of QR decomposition for MIMO-OFDM detection systems. 15th IEEE International Conference on Electronics, Circuits and Systems, 2008 (ICECS 2008), pages 57-60, 2008.
[15] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner. VLSI implementation of a high-speed iterative sorted MMSE QR decomposition. Proc. of IEEE ISCAS, pages 1421-1424, 2007.
[16] P. Luethi, C. Studer, S. Duetsch, and E. Zgraggen. Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison. IEEE Asia Pacific Conference on Circuits and Systems, 2008 (APCCAS 2008), pages 830-833, 2008.
[17] Robert L. Parker. Geophysical Inverse Theory. Princeton University Press, 1994.
[18] H. Sakai. Recursive least-squares algorithms of modified Gram-Schmidt type for parallel weight extraction. IEEE Trans. Signal Process., 42:429-433, February 1994.
[19] P. Salmela, A. Burian, H. Sorokin, and J. Takala. Complex-valued QR decomposition implementation for MIMO receivers. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008), pages 1433-1436, 2008.
[20] Perttu Salmela, Adrian Burian, Harri Sorokin, and Jarmo Takala. Complex-valued QR decomposition implementation for MIMO receivers. IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008), pages 1433-1436, March 2008.
[21] Anatoli Sergyienko and Oleg Maslennikov. Implementation of Givens QR decomposition in FPGA. PPAM '01: Proceedings of the International Conference on Parallel Processing and Applied Mathematics, Revised Papers, pages 458-465, 2001.
[22] M. Shabany and P. G. Gulak. A 0.13 um CMOS, 655 Mbps, 4x4, 64-QAM K-best MIMO detector. IEEE Int. Solid-State Circuits Conf. Dig. Tech., pages 256-257, February 2009.
[23] C. K. Singh, S. H. Prasad, and P. T. Balsara. A fixed-point implementation of QR decomposition. IEEE Dallas Workshop on Circuits and Systems, Dallas, TX, pages 795-825, October 2006.
[24] C. K. Singh, S. H. Prasad, and P. T. Balsara. VLSI architecture for matrix inversion using modified Gram-Schmidt based QR decomposition. Int. Conf. VLSI Design, pages 836-841, January 2007.
[25] F. Sobhanmanesh and S. Nooshabadi. Parametric minimum hardware QR-factoriser architecture for V-BLAST detection. IEE Proc. Circuits, Devices and Systems, pages 433-441, 2006.
[26] W. S. Song, D. V. Rabinkin, M. M. Vai, and H. T. Nguyen. VLSI bit-level systolic sample matrix inversion. MIT Lincoln Laboratory Report NTP-2, 2001.
[27] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Inform. Theory, 53(12):4655-4666, 2007.
[28] S. Wang and E. E. Swartzlander, Jr. The critically damped CORDIC algorithm for QR decomposition. IEEE Asilomar Conf. Signals, Systems and Computers, pages 908-911, November 1996.
Appendix A
Tables
Table A.1: P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16)

| Sr. no. | Algo | Array structure | Array size | Multiplier type (Boundary, Internal) | Impl. tech. | PU array size | DSP (%age) | LUT (%age)  | Reg (%age)  |
| 1  | GR  | Linear   | 3 | DSP, DSP | LA  | NA | 42 (5%)   | 14536 (9%)  | 10078 (3%)  |
| 2  | GR  | Linear   | 5 | DSP, DSP | LA  | NA | 60 (7%)   | 16252 (10%) | 11611 (3%)  |
| 3  | GR  | Linear   | 7 | DSP, DSP | LA  | NA | 78 (10%)  | 17810 (11%) | 13135 (4%)  |
| 4  | GR  | Linear   | 9 | DSP, DSP | LA  | NA | 96 (12%)  | 19570 (12%) | 14654 (4%)  |
| 5  | GR  | Systolic | 3 | DSP, DSP | LA  | NA | 120 (15%) | 20434 (13%) | 12835 (4%)  |
| 6  | GR  | Systolic | 5 | DSP, DSP | LA  | NA | 294 (38%) | 35541 (23%) | 22632 (7%)  |
| 7  | GR  | Systolic | 7 | DSP, DSP | LA  | NA | 540 (70%) | 58291 (38%) | 37588 (12%) |
| 8  | GR  | Linear   | 3 | DSP, DSP | Log | NA | 24 (3%)   | 13514 (8%)  | 9790 (3%)   |
| 9  | GR  | Linear   | 5 | DSP, DSP | Log | NA | 36 (4%)   | 15240 (10%) | 11281 (3%)  |
| 10 | GR  | Linear   | 7 | DSP, DSP | Log | NA | 48 (6%)   | 16853 (11%) | 12763 (4%)  |
| 11 | GR  | Linear   | 9 | DSP, DSP | Log | NA | 60 (7%)   | 18810 (12%) | 14241 (4%)  |
| 12 | GR  | Systolic | 3 | DSP, DSP | Log | NA | 72 (9%)   | 18422 (12%) | 11975 (3%)  |
| 13 | GR  | Systolic | 5 | DSP, DSP | Log | NA | 180 (23%) | 31877 (21%) | 20951 (6%)  |
| 14 | GR  | Systolic | 7 | DSP, DSP | Log | NA | 336 (43%) | 53798 (35%) | 35021 (11%) |
| 15 | GR  | Linear   | 3 | DSP, DSP | NR  | NA | 16 (2%)   | 14355 (9%)  | 11344 (3%)  |
| 16 | GR  | Linear   | 5 | DSP, DSP | NR  | NA | 28 (3%)   | 16223 (10%) | 12726 (4%)  |
| 17 | GR  | Linear   | 7 | DSP, DSP | NR  | NA | 40 (5%)   | 17946 (11%) | 14208 (4%)  |
| 18 | GR  | Linear   | 9 | DSP, DSP | NR  | NA | 52 (6%)   | 19690 (13%) | 15685 (5%)  |
| 19 | GR  | Systolic | 3 | DSP, DSP | NR  | NA | 48 (6%)   | 20699 (13%) | 15759 (5%)  |
| 20 | GR  | Systolic | 5 | DSP, DSP | NR  | NA | 140 (18%) | 35234 (23%) | 27917 (9%)  |
| 21 | GR  | Systolic | 7 | DSP, DSP | NR  | NA | 280 (36%) | 59335 (39%) | 45195 (14%) |
| 22 | GR  | Linear   | 3 | LUT, DSP | LA  | NA | 12 (1%)   | 18083 (11%) | 10501 (3%)  |
| 23 | GR  | Linear   | 5 | LUT, DSP | LA  | NA | 24 (3%)   | 19876 (13%) | 11989 (3%)  |
| 24 | GR  | Linear   | 7 | LUT, DSP | LA  | NA | 36 (4%)   | 21571 (14%) | 13468 (4%)  |
| 25 | GR  | Linear   | 9 | LUT, DSP | LA  | NA | 48 (6%)   | 23225 (15%) | 14957 (4%)  |
| 26 | GR  | Systolic | 3 | LUT, DSP | LA  | NA | 36 (4%)   | 30983 (20%) | 14077 (4%)  |
| 27 | GR  | Systolic | 5 | LUT, DSP | LA  | NA | 120 (15%) | 53334 (35%) | 24457 (8%)  |
| 28 | GR  | Linear   | 3 | LUT, LUT | LA  | NA | 0 (0%)    | 20939 (13%) | 10989 (3%)  |
| 29 | GR  | Linear   | 5 | LUT, LUT | LA  | NA | 0 (0%)    | 25733 (17%) | 13018 (4%)  |
| 30 | GR  | Linear   | 7 | LUT, LUT | LA  | NA | 0 (0%)    | 30514 (20%) | 15031 (4%)  |
| 31 | GR  | Systolic | 3 | LUT, LUT | LA  | NA | 0 (0%)    | 39815 (26%) | 15261 (5%)  |
| 32 | GR  | Systolic | 5 | LUT, LUT | LA  | NA | 0 (0%)    | 82603 (54%) | 28697 (9%)  |
| 33 | GR  | Linear   | 3 | DSP, LUT | LA  | NA | 24 (3%)   | 17018 (11%) | 10480 (3%)  |
| 34 | GR  | Linear   | 5 | DSP, LUT | LA  | NA | 24 (3%)   | 21558 (14%) | 12403 (4%)  |
| 35 | GR  | Linear   | 7 | DSP, LUT | LA  | NA | 24 (3%)   | 26284 (17%) | 14703 (4%)  |
| 36 | GR  | Systolic | 3 | DSP, LUT | LA  | NA | 66 (8%)   | 28655 (19%) | 14040 (4%)  |
| 37 | GR  | Systolic | 5 | DSP, LUT | LA  | NA | 114 (14%) | 63666 (42%) | 26992 (8%)  |
| 38 | MGS | Linear   | 2 | DSP, DSP | LA  | 1  | 102 (13%) | 17866 (11%) | 13671 (4%)  |
| 39 | MGS | Linear   | 3 | DSP, DSP | LA  | 1  | 150 (19%) | 22551 (14%) | 19629 (6%)  |
| 40 | MGS | Linear   | 4 | DSP, DSP | LA  | 1  | 198 (25%) | 30825 (20%) | 31333 (10%) |
| 41 | MGS | Linear   | 5 | DSP, DSP | LA  | 1  | 246 (32%) | 41146 (27%) | 44419 (14%) |
| 42 | MGS | Systolic | 2 | DSP, DSP | LA  | 1  | 138 (17%) | 19982 (13%) | 14482 (4%)  |
| 43 | MGS | Systolic | 3 | DSP, DSP | LA  | 1  | 288 (37%) | 33009 (21%) | 29353 (9%)  |
| 44 | MGS | Systolic | 4 | DSP, DSP | LA  | 1  | 486 (63%) | 55966 (37%) | 56403 (18%) |
Table A.2: P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16)

| Sr. no. | Algo | Array structure | Array size | Multiplier type (Boundary, Internal) | Impl. tech. | PU array size | Min. period (ns) | Throughput (Mrows/sec) | Latency (us) |
| 1  | GR  | Linear   | 3 | DSP, DSP | LA  | NA | 9.54  | 2.23  | 3.24  |
| 2  | GR  | Linear   | 5 | DSP, DSP | LA  | NA | 9.76  | 1.33  | 6.41  |
| 3  | GR  | Linear   | 7 | DSP, DSP | LA  | NA | 9.57  | 0.98  | 10.53 |
| 4  | GR  | Linear   | 9 | DSP, DSP | LA  | NA | 9.64  | 0.76  | 16.06 |
| 5  | GR  | Systolic | 3 | DSP, DSP | LA  | NA | 9.63  | 6.49  | 3.09  |
| 6  | GR  | Systolic | 5 | DSP, DSP | LA  | NA | 9.80  | 6.38  | 3.15  |
| 7  | GR  | Systolic | 7 | DSP, DSP | LA  | NA | 9.98  | 6.26  | 3.22  |
| 8  | GR  | Linear   | 3 | DSP, DSP | Log | NA | 9.44  | 2.58  | 2.80  |
| 9  | GR  | Linear   | 5 | DSP, DSP | Log | NA | 9.44  | 1.56  | 5.48  |
| 10 | GR  | Linear   | 7 | DSP, DSP | Log | NA | 9.17  | 1.15  | 8.97  |
| 11 | GR  | Linear   | 9 | DSP, DSP | Log | NA | 9.80  | 0.84  | 14.54 |
| 12 | GR  | Systolic | 3 | DSP, DSP | Log | NA | 9.99  | 10.01 | 2.20  |
| 13 | GR  | Systolic | 5 | DSP, DSP | Log | NA | 10.69 | 8.50  | 2.36  |
| 14 | GR  | Systolic | 7 | DSP, DSP | Log | NA | 9.94  | 9.15  | 2.21  |
| 15 | GR  | Linear   | 3 | DSP, DSP | NR  | NA | 9.24  | 0.57  | 12.75 |
| 16 | GR  | Linear   | 5 | DSP, DSP | NR  | NA | 9.87  | 0.32  | 26.89 |
| 17 | GR  | Linear   | 7 | DSP, DSP | NR  | NA | 9.35  | 0.24  | 43.06 |
| 18 | GR  | Linear   | 9 | DSP, DSP | NR  | NA | 9.59  | 0.18  | 67.13 |
| 19 | GR  | Systolic | 3 | DSP, DSP | NR  | NA | 9.36  | 2.89  | 6.93  |
| 20 | GR  | Systolic | 5 | DSP, DSP | NR  | NA | 9.98  | 2.71  | 7.43  |
| 21 | GR  | Systolic | 7 | DSP, DSP | NR  | NA | 9.93  | 2.72  | 7.42  |
| 22 | GR  | Linear   | 3 | LUT, DSP | LA  | NA | 9.97  | 2.13  | 3.39  |
| 23 | GR  | Linear   | 5 | LUT, DSP | LA  | NA | 9.75  | 1.33  | 6.41  |
| 24 | GR  | Linear   | 7 | LUT, DSP | LA  | NA | 9.90  | 0.94  | 10.90 |
| 25 | GR  | Linear   | 9 | LUT, DSP | LA  | NA | 9.92  | 0.74  | 16.53 |
| 26 | GR  | Systolic | 3 | LUT, DSP | LA  | NA | 9.99  | 6.25  | 3.21  |
| 27 | GR  | Systolic | 5 | LUT, DSP | LA  | NA | 9.99  | 6.26  | 3.22  |
| 28 | GR  | Linear   | 3 | LUT, LUT | LA  | NA | 9.94  | 2.14  | 3.38  |
| 29 | GR  | Linear   | 5 | LUT, LUT | LA  | NA | 9.99  | 1.30  | 6.56  |
| 30 | GR  | Linear   | 7 | LUT, LUT | LA  | NA | 9.98  | 0.94  | 10.99 |
| 31 | GR  | Systolic | 3 | LUT, LUT | LA  | NA | 9.89  | 6.32  | 3.17  |
| 32 | GR  | Systolic | 5 | LUT, LUT | LA  | NA | 10.38 | 6.02  | 3.34  |
| 33 | GR  | Linear   | 3 | DSP, LUT | LA  | NA | 9.98  | 2.13  | 3.39  |
| 34 | GR  | Linear   | 5 | DSP, LUT | LA  | NA | 9.93  | 1.31  | 6.52  |
| 35 | GR  | Linear   | 7 | DSP, LUT | LA  | NA | 9.79  | 0.95  | 10.78 |
| 36 | GR  | Systolic | 3 | DSP, LUT | LA  | NA | 9.99  | 6.25  | 3.21  |
| 37 | GR  | Systolic | 5 | DSP, LUT | LA  | NA | 9.98  | 6.26  | 3.21  |
| 38 | MGS | Linear   | 2 | DSP, DSP | LA  | 1  | 9.59  | 1.23  | 0.82  |
| 39 | MGS | Linear   | 3 | DSP, DSP | LA  | 1  | 9.99  | 0.66  | 1.52  |
| 40 | MGS | Linear   | 4 | DSP, DSP | LA  | 1  | 9.96  | 0.42  | 2.40  |
| 41 | MGS | Linear   | 5 | DSP, DSP | LA  | 1  | 10.41 | 0.28  | 3.52  |
| 42 | MGS | Systolic | 2 | DSP, DSP | LA  | 1  | 9.96  | 5.58  | 0.23  |
| 43 | MGS | Systolic | 3 | DSP, DSP | LA  | 1  | 9.98  | 3.58  | 0.43  |
| 44 | MGS | Systolic | 4 | DSP, DSP | LA  | 1  | 9.97  | 2.39  | 0.68  |
Appendix B
Source Code
Contents
1
DataType.bsv
117
2
Conjugate.bsv
118
3
Double.bsv
119
4
BSVDouble.c
121
5
BSVDouble.h
124
6
GR specific Rotate.bsv
125
7
GR specific LArotation.bsv
126
8 LAtable.bsv
129
9
GR specific Logrotation.bsv
131
10 Exptable.bsv
134
11 Logtable.bsv
136
12 GR Linear specific UnitRow.bsv
138
13 GR Linear specific mkExternal.bsv
140
14 GR Linear specific mkInternal.bsv
141
15 GR Linear specific QR.bsv
143
16 GR Linear specific mkQR.bsv
144
17 GR Linear specific Memory.bsv
148
18 GR Linear specific States.bsv
150
19 GR Linear specific FixedPointQR.bsv
152
20 GR Linear specific Scemi.bsv
154
21 GR Systolic specific FullRow.bsv
155
22 GR Systolic specific mkFullRow.bsv
156
23 GR Systolic specific mkExternal.bsv
157
24 GR Systolic specific mkInternalRow.bsv
158
25 GR Systolic specific mkInternal.bsv
159
26 GR Systolic specific QR.bsv
161
27 GR Systolic specific mkQR.bsv
162
28 GR Systolic specific FixedPointQR.bsv
164
29 GR Systolic specific mkStreamQR.bsv
166
30 GR Systolic specific Scemi.bsv
167
31 MGS specific BatchAcc.bsv
168
32 MGS specific BatchCS.bsv
169
33 MGS specific BatchProduct.bsv
170
34 MGS specific BatchSub.bsv
171
35 MGS specific mkDot.bsv
172
36 MGS specific mkNorm.bsv
174
37 MGS specific mkOffsetCorrection.bsv
176
38 MGS specific mkVecProd.bsv
179
39 MGS specific SqrtInv.bsv
181
40 MGS specific LASqrtInv.bsv
182
41 MGS specific LogSqrtInv.bsv
184
42 MGS specific NRSqrtInv.bsv
186
43 MGS specific mkDP.bsv
187
44 MGS specific mkTP.bsv
189
45 MGS specific UnitRow.bsv
191
46 MGS specific QR.bsv
193
47 MGS specific mkStreamQR.bsv
194
48 MGS specific Scemi.bsv
196
49 MGS Systolic specific FixedPointQR.bsv
197
50 MGS Systolic specific mkQR.bsv
200
51 MGS Linear specific FixedPointQR.bsv
202
52 MGS Linear specific mkQR.bsv
204
53 Multiplier.bsv
206
54 PipelinedMultiplier.bsv
209
55 GR specific ComplexFixedPointRotation.bsv
212
56 GR Systolic specific StreamQR.bsv
214
57 Divider.bsv
215
58 SquareRoot.bsv
219
1 DataType.bsv

// Author: Sunila Saqib saqib@mit.edu
// Configuration file.
typedef TMul#(TAdd#(16,16),2) BitLen; //Multiplier's bit length
typedef 3 Stages; //Multiplier's pipeline stages
typedef FixedPoint#(16,16) FP;
typedef Complex#(FP) CP_FP;
typedef 3 Dim; //Dimensions of input matrix
typedef 3 PUarrSize; //size of array of Processing units
typedef 1024 LAlutSize; //size of LUT for LA based 1/sqrt(X)
typedef 128 ExplutSize; //size of LUT for Log based 1/sqrt(X)
typedef 1024 LoglutSize; //size of LUT for Log based 1/sqrt(X)
typedef 4 BitDis; // bit displacement for lookup operation 1
typedef 4 BitDisExp; // bit displacement for lookup operation 2
typedef 1 Depth; //depth of sized fifos in dp and tp
typedef 5 NumOfMat; //for performance script
typedef 3 RowsPerMat; //for performance script
2 Conjugate.bsv

// Author: Sunila Saqib saqib@mit.edu
typeclass Conjugate#(type data_t);
    function data_t con (data_t x);
endtypeclass

instance Conjugate#(Double);
    function Double con (Double x);
        return x;
    endfunction
endinstance

instance Conjugate#(Real);
    function Real con (Real x);
        return x;
    endfunction
endinstance

instance Conjugate#(Complex#(tnum))
    provisos(Arith#(tnum));
    function Complex#(tnum) con (Complex#(tnum) x);
        let y = Complex {rel: x.rel, img: -x.img};
        return y;
    endfunction
endinstance

instance Conjugate#(FixedPoint#(is, fs));
    function FixedPoint#(is, fs) con (FixedPoint#(is, fs) x);
        return x;
    endfunction
endinstance
3 Double.bsv

// Author: Sunila Saqib saqib@mit.edu
import "BDPI" function Double add (Double a, Double b);
import "BDPI" function Double sub (Double a, Double b);
import "BDPI" function Double minus (Double a);
import "BDPI" function Double divide (Double a, Double b);
import "BDPI" function Double multiply (Double a, Double b);
import "BDPI" function Double absolute (Double a);
import "BDPI" function Double squareroot (Double a);
import "BDPI" function Bool lessthanequal (Double a, Double b);

typedef struct {
    Bit#(64) bits;
} Double deriving(Bits, Eq);

instance RealLiteral#(Double);
    function Double fromReal(Real x);
        return Double { bits: $realtobits(x) };
    endfunction
endinstance

import "BDPI" dbl_print = function Action dblWrite(Double d);

instance Arith#(Double);
    function Double \+ (Double x, Double y);
        Double result = add(x,y);
        return result;
    endfunction
    function Double \- (Double x, Double y);
        Double result = sub(x,y);
        return result;
    endfunction
    function Double negate(Double x) = minus(x);
    function Double \/ (Double x, Double y);
        Double result = divide(x,y);
        return result;
    endfunction
    function Double \* (Double x, Double y);
        Double result = multiply(x,y);
        return result;
    endfunction
    function Double abs (Double x);
        Double result = absolute(x);
        return result;
    endfunction
endinstance

function Double sqrt(Double x);
    Double result = squareroot(x);
    return result;
endfunction

instance Literal#(Double);
    function Double fromInteger(Integer x);
        return fromReal(fromInteger(x));
    endfunction
endinstance

instance Ord#(Double);
    function Bool \<= (Double x, Double y) = lessthanequal(x, y);
endinstance

instance FShow#(Real);
    function Fmt fshow(Real x);
        match {.n, .f} = splitReal(x);
        return $format(n, ".", trunc(10000*f));
    endfunction
endinstance
4
BSVDouble.c
/*
Author: Sunila Saqib saqibmit.edu
*/
4 #include <stdio.h>
#include "BSVDoubLe.h"
3
#include
"math.h"
double dbl-unpack(unsigned long long int x)
{
double* xp = (double*)(&x);
return *xp;
}
unsigned long long int dbl-pack(double x)
{
unsigned long long int* xp = (unsigned long long int*)(&x);
return *xp;
S}
void dbl-print(unsigned long long int d)
{
printf("X0.20f", dblunpack(d));
double asdouble(long long int x)
{
double* dblptr = (double*)&x;
return *dblptr;
}
long long int asllint(double x)
{
long long int* lliptr = (long long int*)&x;
return *lliptr;
}
long long int add (long long int a, long long int b)
{
double ain = asdouble(a);
double bin = asdouble(b);
double result = ain + bin;
long long int result_out = asllint(result);
return result_out;
}
long long int divide (long long int a, long long int b)
{
double ain = asdouble(a);
double bin = asdouble(b);
double result = ain / bin;
long long int result_out = asllint(result);
return result_out;
}
long long int absolute (long long int a)
{
double ain = asdouble(a);
double result = sqrt(ain*ain); //abs(ain);
long long int result_out = asllint(result);
return result_out;
}
long long int multiply (long long int a, long long int b)
{
double ain = asdouble(a);
double bin = asdouble(b);
double result = ain * bin;
long long int result_out = asllint(result);
return result_out;
}
long long int square (long long int a) //power of 2
{
double ain = asdouble(a);
double result = ain*ain; //(double)pow((double)ain,2);
long long int result_out = asllint(result);
return result_out;
}
long long int squareroot (long long int a)
{
double ain = asdouble(a);
double result = (double)sqrt((double)ain);
long long int result_out = asllint(result);
return result_out;
}
long long int minus (long long int a)
{
double ain = asdouble(a);
double result = -ain;
long long int result_out = asllint(result);
return result_out;
}
long long int sub (long long int a, long long int b)
{
double ain = asdouble(a);
double bin = asdouble(b);
double result = ain - bin;
long long int result_out = asllint(result);
return result_out;
}
unsigned char lessthanequal(long long int a, long long int b)
{
double ain = asdouble(a);
double bin = asdouble(b);
return ain <= bin ? 1 : 0;
}
5
BSVDouble.h
/*
Author: Sunila Saqib saqib@mit.edu
*/
#ifndef BSVDOUBLE_H
#define BSVDOUBLE_H
typedef unsigned long long BSVDouble;
// Convert between double and BSVDouble
BSVDouble dbl_pack(double x);
double dbl_unpack(BSVDouble x);
// Print the value of the given double to stdout.
void dbl_print(BSVDouble d);
#endif //BSVDOUBLE_H
6
GR specific Rotate.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
//interfaces for rotation units
typedef struct {
tnum x;
tnum r;
} RotateInput#(type tnum) deriving(Eq, Bits);
interface Rotate#(type tnum);
interface Put# (RotateInput# (tnum)) request;
interface Get#(RotationCS#(tnum)) csout;
interface Get#(tnum) rout;
endinterface
typedef struct {
tnum c;
tnum s;
} RotationCS#(type tnum) deriving(Bits, Eq);
7
GR specific LArotation.bsv
Author: Sunila Saqib saqib@mit.edu
//LA based rotation module
module [m] mkLArotation(m#(Multiplier#(FixedPoint#(is, fs)))
mkmul, m#(LAtable#(fb, FixedPoint#(is, fs), hight)) mkLUT,
Rotate#(Complex#(FixedPoint#(is, fs))) ifc)
provisos(Add#(a__, fs, TMul#(2, fs)),
Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
Mul#(2, fs, TAdd#(c__, fs)),
IsModule#(m, m__));
FIFO#(Complex#(FixedPoint#(is,fs))) rinReg <- mkFIFO();
FIFO#(Complex#(FixedPoint#(is,fs))) xinReg <- mkFIFO();
Reg#(FixedPoint#(is,fs)) offsetReg <- mkRegU();
Reg#(FixedPoint#(is,fs)) iresReg <- mkRegU();
Reg#(FixedPoint#(is,fs)) prodReg <- mkRegU();
LAtable#(fb, FixedPoint#(is,fs), hight) tbl <- mkLUT();
FIFO#(RotationCS#(Complex#(FixedPoint#(is, fs))))
csout_fifo <- mkFIFO1();
let csout_g = toGet(csout_fifo);
let csout_p = toPut(csout_fifo);
FIFO#(Complex#(FixedPoint#(is, fs))) rout_fifo <- mkFIFO();
let rout_g = toGet(rout_fifo);
let rout_p = toPut(rout_fifo);
Vector#(4, Multiplier#(FixedPoint#(is, fs))) multiplier <-
replicateM(mkmul());
function Action multiply(a, b, index) =
multiplier[index].request.put(tuple2(a, b));
Stmt interim = seq while(True) seq
action
let prod1 <- multiplier[0].response.get();
let prod2 <- multiplier[1].response.get();
let prod3 <- multiplier[2].response.get();
FixedPoint#(is, fs) ires = prod1 + prod2 + prod3;
iresReg <= ires;
tbl.tableIndex.put(ires);
endaction
action
let tblEntry <- tbl.tableEntry.get();
offsetReg <= tblEntry.offset;
Bit#(TSub#(fs, fb)) important = truncate(pack(iresReg));
FixedPoint#(is,fs) diff = unpack(zeroExtend(important));
multiply(tblEntry.slope, diff, 0);
endaction
action
let prod <- multiplier[0].response.get();
prodReg <= prod;
endaction
endaction
action
let prod = prodReg;
FixedPoint#(is, fs) temp = offsetReg + prod;
multiply(temp, iresReg, 0);
multiply(temp, rinReg.first().rel, 1);
multiply(temp, xinReg.first().rel, 2);
multiply(temp, xinReg.first().img, 3);
endaction
par
action
let prod1 <- multiplier[0].response.get();
Complex#(FixedPoint#(is,fs)) rout = Complex { rel: prod1,
img: 0};
rout_p.put(rout);
rinReg.deq();
xinReg.deq();
endaction
action
let prod2 <- multiplier[1].response.get();
let prod3 <- multiplier[2].response.get();
let prod4 <- multiplier[3].response.get();
Complex#(FixedPoint#(is,fs)) cout = Complex { rel: prod2,
img: 0};
Complex#(FixedPoint#(is,fs)) sout = Complex { rel: prod3,
img: prod4};
csout_p.put(RotationCS { c: cout, s: sout } );
endaction
endpar
endseq endseq;
mkAutoFSM(interim);
interface Put request;
method Action put(inputs);
//for complex
Complex#(FixedPoint#(is,fs)) rin = inputs.r;
Complex#(FixedPoint#(is,fs)) xin = inputs.x;
if (rin == Complex{rel:0,img:0} && xin ==
Complex{rel:0,img:0}) begin
rout_p.put(0);
csout_p.put(RotationCS { c: 0, s: 1 });
end else begin
//(rin*rin)+(xin*con(xin));
multiply(rin.rel, rin.rel, 0);
multiply(xin.rel, xin.rel, 1);
multiply(xin.img, xin.img, 2);
rinReg.enq(rin);
xinReg.enq(xin);
end
endmethod
endinterface
interface Get csout = csout_g;
interface Get rout = rout_g;
endmodule
8
LAtable.bsv
Author: Sunila Saqib saqib@mit.edu
*/
// Set of modules/functions to generate slope and offset
// look-up tables, for LA based operations.
// Get the appropriate linear approximation parameters
// This computes the slope
function FixedPoint#(is,fs) getSlope(Real index2LUT);
Real i = (-1)*(1/(2* pow(index2LUT,3/2) ));
return fromReal(i);
endfunction
// This computes the offset
function FixedPoint#(is,fs) getOffset(Real index2LUT);
Real i = 1 / sqrt(index2LUT);
return fromReal(i);
endfunction
//one entry of the LA LUT table - has a "offset" and a "slope"
typedef struct {
tnum offset;
tnum slope;
} LinearApproxStruct#(type tnum) deriving(Eq, Bits);
//Structure of the table
typedef Vector#(l, LinearApproxStruct#(FixedPoint#(is,fs)))
LinearApproxTable#(type l, type is, type fs);
// Generate the linear approximation look-up-table
function LinearApproxTable#(size, is, fs) genLAlut(Integer
fBits);
LinearApproxTable#(size, is, fs) la = newVector;
Integer tableSize = valueOf(size);
Integer iterationCount = tableSize;
Real step = 0;
for (Integer s = 1; s < iterationCount; s = s+1) begin
step = fromInteger(s)/(2.0**fromInteger(fBits));
la[s].slope = getSlope(step);
la[s].offset = getOffset(step);
end
return la;
endfunction
//interface
interface LAtable#(numeric type fb, type tnum, numeric type
tableSize);
interface Put#(tnum) tableIndex;
interface Get#(LinearApproxStruct#(tnum)) tableEntry;
endinterface
//module
module mkLAtable(LAtable#(fb, FixedPoint#(is,fs), tableSize) ifc)
provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
TMul#(2, fs))));
let actualTableSize = valueOf(tableSize);
let fractionBits = valueOf(fb);
let indexSize = valueof(TLog#(tableSize));
LinearApproxTable#(tableSize,is,fs) laLUT =
genLAlut(fractionBits);
FIFO#(LinearApproxStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();
interface Put tableIndex;
method Action put(index);
Bit#(TLog#(tableSize)) indexValue =
pack(index)[indexSize+(valueof(fs) - fractionBits)-1:(valueof(fs) - fractionBits)];
outfifo.enq(laLUT[indexValue]);
endmethod
endinterface
interface Get tableEntry = toGet(outfifo);
endmodule
9
GR specific Logrotation.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
//Log domain rotation module
module [m] mkLogrotation(m#(Multiplier#(FixedPoint#(is, fs)))
mkmul, m#(LogTable#(fbl, FixedPoint#(is, fs), hightL))
mkLog, m#(ExpTable#(fbe, FixedPoint#(is, fs), hightE)) mkExp,
Rotate#(Complex#(FixedPoint#(is, fs))) ifc)
provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
TMul#(2, fs))), Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, m__));
FIFO#(Complex#(FixedPoint#(is,fs))) rinReg <- mkFIFO();
FIFO#(Complex#(FixedPoint#(is,fs))) xinReg <- mkFIFO();
Reg#(FixedPoint#(is,fs)) prodReg <- mkRegU();
Integer offset = 1/(2**valueof(fbe));
Vector#(1, LogTable#(fbl, FixedPoint#(is,fs), hightL))
logtbl <- replicateM(mkLog());
Vector#(2, ExpTable#(fbe, FixedPoint#(is,fs), hightE))
exptbl <- replicateM(mkExp());
FIFO#(RotationCS#(Complex#(FixedPoint#(is, fs))))
csout_fifo <- mkFIFO1();
let csout_g = toGet(csout_fifo);
let csout_p = toPut(csout_fifo);
FIFO#(Complex#(FixedPoint#(is, fs))) rout_fifo <- mkFIFO();
let rout_g = toGet(rout_fifo);
let rout_p = toPut(rout_fifo);
Vector#(3, Multiplier#(FixedPoint#(is, fs))) multiplier <-
replicateM(mkmul());
function Action multiply(a, b, index) =
multiplier[index].request.put(tuple2(a, b));
function Action result(Reg#(FixedPoint#(is,fs)) dst,int index);
action
let res <- multiplier[index].response.get();
dst <= res;
endaction
endfunction
Stmt interim = seq while(True) seq
action
let prod1 <- multiplier[0].response.get();
let prod2 <- multiplier[1].response.get();
let prod3 <- multiplier[2].response.get();
FixedPoint#(is, fs) ires = prod1 + prod2 + prod3;
logtbl[0].tableIndex.put(ires);
endaction
action
let log_res <- logtbl[0].tableEntry.get();
let r_new_sqr_log = log_res.logval;
let r_new_log = r_new_sqr_log >> 1;
let r_new_inv_log = 0 - r_new_log;
exptbl[0].tableIndex.put(r_new_log);
exptbl[1].tableIndex.put(r_new_inv_log);
endaction
action
let r_new <- exptbl[0].tableEntry.get();
let r_inv <- exptbl[1].tableEntry.get();
Complex#(FixedPoint#(is,fs)) rout = Complex {
rel: r_new.expval, img: 0};
rout_p.put(rout);
prodReg <= r_inv.expval;
endaction
action
let r_inv = prodReg;
let xin = xinReg.first();
let rin = rinReg.first();
multiply(xin.rel, r_inv, 0);
multiply(xin.img, r_inv, 1);
multiply(rin.rel, r_inv, 2);
endaction
action
let prod1 <- multiplier[0].response.get();
let prod2 <- multiplier[1].response.get();
let prod3 <- multiplier[2].response.get();
Complex#(FixedPoint#(is,fs)) cout = Complex { rel: prod3,
img: 0};
Complex#(FixedPoint#(is,fs)) sout = Complex { rel: prod1,
img: prod2};
csout_p.put(RotationCS { c: cout, s: sout } );
xinReg.deq(); rinReg.deq();
endaction
endseq endseq;
mkAutoFSM(interim);
interface Put request;
method Action put(inputs);
Complex#(FixedPoint#(is,fs)) rin = inputs.r;
Complex#(FixedPoint#(is,fs)) xin = inputs.x;
if (rin == Complex{rel:0,img:0} && xin ==
Complex{rel:0,img:0}) begin
rout_p.put(0);
csout_p.put(RotationCS { c: 0, s: 1 });
end else begin
//(rin*rin)+(xin*con(xin));
multiply(rin.rel,rin.rel,0);
multiply(xin.rel,xin.rel,1);
multiply(xin.img,xin.img,2);
rinReg.enq(rin);
xinReg.enq(xin);
end
endmethod
endinterface
interface Get csout = csout_g;
interface Get rout = rout_g;
endmodule
10
Exptable.bsv
Author: Sunila Saqib saqib~mit.edu
// Set of modules/functions to generate Log-to-linear
// translation table
// Get the appropriate Exponential transformation parameter
// This computes the exponent
function FixedPoint#(is,fs) getExp(Real index2LUT);
Real i = 2**index2LUT;
return fromReal(i);
endfunction
//one entry of the Exp LUT table - has a "exp value"
typedef struct {
tnum expval;
} ExpStruct#(type tnum) deriving(Eq, Bits);
//structure of table
typedef Vector#(l, ExpStruct#(FixedPoint#(is,fs)))
ExpTableEntries#(type l, type is, type fs);
// Generate the Exponential value look-up-table
function ExpTableEntries#(size, is, fs) genExpLUT(Integer
fBits);
ExpTableEntries#(size, is, fs) la = newVector;
Integer tableSize = valueOf(size);
Integer iterationCount = tableSize;
Real step = 0;
for (Integer s = 0; s <iterationCount; s = s+1) begin
step=fromInteger(s)/(2.0**fromInteger(fBits));//not
sure if this works
la[s].expval = getExp(step);
end
return la;
endfunction
function ExpTableEntries#(size, is, fs) genExpLUTneg(Integer
fBits);
ExpTableEntries#(size, is, fs) la = newVector;
Integer tableSize = valueOf(size);
Integer iterationCount = tableSize;
Real step = 0;
for (Integer s = 0; s <iterationCount; s = s+1) begin
step=-(fromInteger(s)/(2.0**fromInteger(fBits)));
la[s].expval = getExp(step);
end
return la;
endfunction
//interface
interface ExpTable#(numeric type fb, type tnum, numeric type
tableSize);
interface Put#(tnum) tableIndex;
interface Get#(ExpStruct#(tnum)) tableEntry;
endinterface
//module
module mkExpTable(ExpTable#(fb,
FixedPoint#(is,fs), tableSize) ifc)
provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
TMul#(2, fs))));
let actualTableSize = valueOf(tableSize);
let fractionBits = valueOf(fb);
let indexSize = valueof(TLog#(tableSize));
ExpTableEntries#(tableSize,is,fs) expLUT =
genExpLUT(fractionBits);
ExpTableEntries#(tableSize,is,fs) expLUTneg =
genExpLUTneg(fractionBits);
FIFO#(ExpStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();
interface Put tableIndex;
method Action put(index);
if (index < 0) begin
Bit#(TLog#(tableSize)) indexValue =
pack(-index)[indexSize+(valueof(fs) - fractionBits)-1:(valueof(fs) - fractionBits)];
outfifo.enq(expLUTneg[indexValue]);
end else begin
Bit#(TLog#(tableSize)) indexValue =
pack(index)[indexSize+(valueof(fs) - fractionBits)-1:(valueof(fs) - fractionBits)];
outfifo.enq(expLUT[indexValue]);
end
endmethod
endinterface
interface Get tableEntry = toGet(outfifo);
endmodule
11
Logtable.bsv
Author: Sunila Saqib saqib@mit.edu
// Set of modules/functions to generate Linear-to-log
// translation table
// Get the appropriate log transformation parameter
// This computes the logval
function FixedPoint#(is,fs) getLog(Real index2LUT);
Real i = log2(index2LUT);
return fromReal(i);
endfunction
//one entry of the Log LUT table - has a "log value"
typedef struct {
tnum logval;
} LogStruct#(type tnum) deriving(Eq, Bits);
//structure of the table
typedef Vector#(l, LogStruct#(FixedPoint#(is,fs)))
LogTableEntries#(type l, type is, type fs);
// Generate the linear-to-log look-up-table
function LogTableEntries#(size, is, fs) genLogLUT(Integer
fBits);
LogTableEntries#(size, is, fs) la = newVector;
Integer tableSize = valueOf(size);
Integer iterationCount = tableSize;
Real step = 0;
for (Integer s = 1; s <iterationCount; s = s+1) begin
step=fromInteger(s)/(2.0**fromInteger(fBits));
la[s].logval = getLog(step);
end
return la;
endfunction
//interface
interface LogTable#(numeric type fb, type tnum, numeric type
tableSize);
interface Put#(tnum) tableIndex;
interface Get#(LogStruct#(tnum)) tableEntry;
endinterface
//module
module mkLogTable(LogTable#(fb, FixedPoint#(is,fs), tableSize) ifc)
provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
TMul#(2, fs))));
let actualTableSize = valueOf(tableSize);
let fractionBits = valueOf(fb);
let indexSize = valueof(TLog#(tableSize));
LogTableEntries#(tableSize,is,fs) logLUT =
genLogLUT(fractionBits);
FIFO#(LogStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();
interface Put tableIndex;
method Action put(index);
Bit#(TLog#(tableSize)) indexValue =
pack(index)[indexSize+(valueof(fs) - fractionBits)-1:(valueof(fs) - fractionBits)];
outfifo.enq(logLUT[indexValue]);
endmethod
endinterface
interface Get tableEntry = toGet(outfifo);
endmodule
12
GR Linear specific UnitRow.bsv
Author: Sunila Saqib saqib@mit.edu
//a = number of external nodes
//b = number of internal nodes
//tnum = data type of computation
interface UnitRow#(numeric type a,numeric type b, type tnum);
interface Vector#(TAdd#(a,b), Put#(tnum)) xin;
interface Vector#(TAdd#(a,b), Put#(tnum)) rin;
interface Vector#(b, Put#(RotationCS#(tnum))) csin;
interface Vector#(b, Get#(tnum)) xout;
interface Vector#(TAdd#(a,b), Get#(tnum)) rout;
interface Vector#(TAdd#(a,b), Get#(RotationCS#(tnum))) csout;
endinterface
module [m] mkUnitRow(m#(External#(tnum)) mkext,
m#(Internal#(tnum)) mkint, UnitRow#(a,b,tnum) ifc)
provisos (IsModule#(m,m__), Bits#(tnum, a__));
Vector#(a,External#(tnum)) vecExternal <- replicateM( mkext() );
Vector#(b,Internal#(tnum)) vecInternal <- replicateM( mkint() );
Vector#(TAdd#(a,b), Put#(tnum)) xins = newVector;
Vector#(TAdd#(a,b), Put#(tnum)) rins = newVector;
Vector#(b, Put#(RotationCS#(tnum))) csins = newVector;
Vector#(b, Get#(tnum)) xouts = newVector;
Vector#(TAdd#(a,b), Get#(tnum)) routs = newVector;
Vector#(TAdd#(a,b), Get#(RotationCS#(tnum))) csouts =
newVector;
for (Integer i = 0; i < valueof(a); i = i+1) begin
xins[i] = vecExternal[i].xin;
rins[i] = vecExternal[i].rin;
routs[i] = vecExternal[i].rout;
end
for (Integer i = valueof(a); i < valueof(TAdd#(a,b)); i =
i+1) begin
xins[i] = vecInternal[i-valueof(a)].xin;
rins[i] = vecInternal[i-valueof(a)].rin;
routs[i] = vecInternal[i-valueof(a)].rout;
end
for (Integer i = 0; i < valueof(a); i = i+1)
csouts[i] = vecExternal[i].csout;
for (Integer i = valueof(a); i < valueof(TAdd#(a,b)); i =
i+1)
csouts[i] = vecInternal[i-valueof(a)].csout;
for (Integer i = 0; i < valueof(b); i = i+1)
csins[i] = vecInternal[i].csin;
for (Integer i = 0; i < valueof(b); i = i+1)
xouts[i] = vecInternal[i].xout;
interface xin = xins;
interface rin = rins;
interface csin = csins;
interface xout = xouts;
interface rout = routs;
interface csout = csouts;
endmodule
13
GR Linear specific mkExternal.bsv
Author: Sunila Saqib saqib@mit.edu
interface External#(type tnum);
interface Put#(tnum) xin;
interface Put#(tnum) rin;
interface Get#(RotationCS#(tnum)) csout;
interface Get#(tnum) rout;
endinterface
module [m] mkExternal(m#(Rotate#(tnum)) mkrotate,
External#(tnum) ifc)
provisos(IsModule#(m,m__),Literal#(tnum),Bits#(tnum, a__));
Rotate#(tnum) rotationUnit <- mkrotate();
Reg#(Maybe#(tnum)) xinReg <-mkReg(tagged Invalid);
Reg#(Maybe#(tnum)) rinReg <-mkReg(tagged Invalid);
rule rotate if (rinReg matches tagged Valid .r &&& xinReg
matches tagged Valid .x);
rotationUnit.request.put(RotateInput {x: x, r: r});
rinReg <= Invalid;
xinReg <= Invalid;
endrule
interface Put xin;
method Action put(x) if(xinReg matches tagged Invalid);
xinReg <= tagged Valid (x);
endmethod
endinterface
interface Put rin;
method Action put(r) if(rinReg matches tagged Invalid);
rinReg <= tagged Valid (r);
endmethod
endinterface
interface Get csout = rotationUnit.csout;
interface Get rout = rotationUnit.rout;
endmodule
14
GR Linear specific mklnternal.bsv
Author: Sunila Saqib saqib@mit.edu
interface Internal#(type tnum);
interface Put#(tnum) xin;
interface Put#(RotationCS#(tnum)) csin;
interface Put#(tnum) rin;
interface Get#(tnum) xout;
interface Get#(RotationCS#(tnum)) csout;
interface Get#(tnum) rout;
endinterface
module [m] mkInternal(m#(Multiplier#(tnum)) mkmul,
Internal#(tnum) ifc)
provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
a__), Conjugate::Conjugate#(tnum), Print#(tnum));
let xins <- mkFIFO1();
let rins <- mkFIFO1();
FIFO#(RotationCS#(tnum)) csins <- mkFIFO1();
match {.xout_g, .xout_p} <- mkGPFIFO1();
match {.rout_g, .rout_p} <- mkGPFIFO1();
match {.csout_g, .csout_p} <- mkGPFIFO1();
Multiplier#(tnum) multiplier <- mkmul();
function Action multiply(a, b, index) =
multiplier.request.put(tuple2(a, b));
function Action result(Reg#(tnum) dst, int index);
action
let res <- multiplier.response.get();
dst <= res;
endaction
endfunction
let xi = xins.first();
let cs = csins.first();
let m_r = rins.first();
Reg#(tnum) cr <- mkRegU();
Reg#(tnum) sx <- mkRegU();
Reg#(tnum) cx <- mkRegU();
Reg#(tnum) sr <- mkRegU();
Reg#(Bit#(11)) clk <- mkReg(0);
Reg#(Bool) timeit <- mkReg(False);
rule tick;
clk <= clk + 1;
endrule
Stmt work = seq while (True) seq
par
seq
multiply(cs.c, m_r, 0);
multiply(con(cs.s), xi, 1);
multiply(cs.c, xi, 2);
action
multiply(cs.s, m_r, 3);
endaction
endseq
endseq
seq
result(cr,0);
result(sx,1);
result(cx,2);
result(sr,3);
endseq
endpar
action
let nr = cr + sx;
let xo = cx - sr;
xout_p.put(xo);
rout_p.put(nr);
csout_p.put(cs);
xins.deq();
csins.deq();
rins.deq();
endaction
endseq endseq;
mkAutoFSM(work);
interface Put xin;
method Action put(x);
xins.enq(x);
endmethod
endinterface
interface Put csin;
method Action put(cs);
csins.enq(cs);
endmethod
endinterface
interface Put rin;
method Action put(r);
rins.enq(r);
endmethod
endinterface
interface Get xout = xout_g;
interface Get rout = rout_g;
interface Get csout = csout_g;
endmodule
15
GR Linear specific QR.bsv
Author: Sunila Saqib saqib@mit.edu
interface QR#(numeric type width, type tnum);
interface Put#(Terminating#(tnum)) xin;
interface Get#(tnum) rout;
endinterface
16
GR Linear specific mkQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// Multiplexers
function tnum multiplexer3(Bit#(2) sel, tnum a, tnum b, tnum c);
return (sel[1]==0)?((sel[0]==0)?a:b):(c);
endfunction
function tnum multiplexer2(Bit#(1) sel, tnum a, tnum b);
return (sel[0]==0)?a:b;
endfunction
//
module [m] mkQR(m#(External#(tnum)) mkExt, m#(Internal#(tnum))
mkInt, QR#(nTyp, tnum) ifc)
provisos(IsModule#(m,m__), Literal#(tnum), Bits#(tnum, a__),
Log#(TDiv#(TMul#(nTyp,TAdd#(nTyp,1)),2),adsize),
Div#(nTyp,2,mTyp), Add#(bTyp,aTyp,mTyp), Add#(aTyp,0,1),
DefaultValue::DefaultValue#(tnum));
Vector#(nTyp,FIFO#(Terminating#(tnum))) xinFIFO
<- replicateM(mkFIFO1());
Reg#(Bit#(adsize)) incount <- mkReg(0);
Reg#(Bit#(adsize)) outcountI <- mkReg(0);
Reg#(Bit#(adsize)) outcountJ <- mkReg(0);
/* temporary storages */
Memory#(mTyp,Bit#(adsize),tnum) mem <- mkMemory();
FIFO#(Vector#(mTyp, tnum)) currentR <- mkFIFO1();
FIFO#(Vector#(mTyp, tnum)) routputFIFO <- mkFIFO1();
Vector#(mTyp,Reg#(RotationCS#(tnum))) csMem <-
replicateM(mkReg(RotationCS{c:0,s:0}));
Vector#(bTyp,Reg#(tnum)) xMem <- replicateM(mkReg(0));
/* flags */
Reg#(Bit#(adsize)) counter <- mkReg(0);
Reg#(Bit#(adsize)) prevCounter <- mkReg(0);
Reg#(Bool) putNext <- mkReg(True);
Reg#(Bool) acceptinput <-mkReg(True);
Reg#(Bool) resetall <-mkReg(False);
Reg#(Bool) resetR <-mkReg(False);
Reg#(Bool) set <-mkReg(False);
/* sub modules*/
UnitRow#(aTyp,bTyp,tnum) ur <- mkUnitRow(mkExt,mkInt);
StateMachine#(nTyp,mTyp,adsize) tbl <- mkStateMachine();
rule getOutput if (putNext==False);
/* taking care of r output */
Vector#(mTyp,tnum) r = newVector;
for(Integer i = 0; i < valueof(mTyp); i = i+1)
r[i] <- ur.rout[i].get;
if(!resetR) begin
mem.write.put(MemoryWrite {ad: prevCounter, val:r});
end else begin
routputFIFO.enq(r);
Vector#(mTyp,tnum) rReset = replicate(0);
mem.write.put(MemoryWrite {ad: prevCounter,val:rReset});
end
/* taking care of x output*/
Vector#(bTyp,tnum) x = newVector;
for(Integer i = 0; i < valueof(bTyp); i = i+1) begin
x[i] <- ur.xout[i].get;
if(!resetall) begin
xMem[i] <= x[i];
end else begin
xMem[i] <= 0;
end
end
/* taking care of cs output*/
Vector#(mTyp,tnum) c = newVector;
Vector#(mTyp,tnum) s = newVector;
for(Integer i = 0; i < valueof(mTyp); i = i+1) begin
let cs <- ur.csout[i].get;
c[i] = cs.c; s[i] = cs.s;
if(!resetall) begin
csMem[i] <= cs;
end else begin
csMem[i] <= RotationCS{c:0, s:0};
end
end
/* setting flag */
putNext <=True;
endrule
rule putInput if(putNext==True);
let currentState <- tbl.getState(counter);
/* putting in r */
Vector#(mTyp,tnum) rvalue = replicate(0);
if(!set) begin
set <= !set;
end else begin
rvalue = currentR.first();
currentR.deq();
end
for(Integer i = 0; i < valueof(mTyp); i = i+1) begin
let val = rvalue[i];
ur.rin[i].put(val);
end
/* putting in x */
for (Integer i = 0; i < valueof(mTyp); i = i+1) begin
tnum xins = fromInteger(0);
if (i==0)
xins = multiplexer2(tpl_1(currentState[i])[1],
xMem[i], xinFIFO[counter].first().data);
else if (i < valueof(bTyp))
xins = multiplexer3(tpl_1(currentState[i]), xMem[i],
xMem[i-1], xinFIFO[counter].first().data);
else
xins = multiplexer2(tpl_1(currentState[i])[1],
xMem[i-1], xinFIFO[counter].first().data);
ur.xin[i].put(xins);
end
/* putting in cs */
for (Integer i = 0; i < valueof(bTyp); i = i+1) begin
RotationCS#(tnum) csvec = multiplexer2(tpl_2(
currentState[i+valueof(aTyp)]), csMem[i], csMem[i+1]);
ur.csin[i].put(csvec);
end
/* setting flags */
/* counter: ranges from 0 to n-1, it represents the rows in
r-memory. prevCounter: ranges from 0 to n-2, it represents
the previous row in r-memory (where the output is inserted) */
prevCounter <= counter;
let nextCounter = counter + 1;
if(nextCounter >= fromInteger(valueof(nTyp))) begin
nextCounter = 0;
for(Integer d = 0; d < valueof(nTyp); d = d+1) begin
xinFIFO[d].deq();
end
let res = xinFIFO[0].first().islast;
if(res) begin
resetall <= True;
end
end
counter <= nextCounter;
putNext<=False;
mem.read.request.put(nextCounter);
if(xinFIFO[0].first().islast)
resetR <=True;
endrule
rule getCurrentR;
let r <- mem.read.response.get();
currentR.enq(r);
endrule
interface Put xin;
method Action put(xinval) if (acceptinput);
xinFIFO[incount].enq(xinval);
if(fromInteger(valueof(nTyp))==(incount+1)) begin
incount <= 0;
if (xinval.islast==True)
acceptinput <= False;
end else incount <= incount+1;
endmethod
endinterface
interface Get rout;
method ActionValue#(tnum) get() if (resetR);
let val = routputFIFO.first()[outcountJ];
let ci = outcountI; //0 to n (rows in rmem)
let cj = outcountJ; //0 to m (columns in rmem)
if (fromInteger(valueof(mTyp))==(outcountJ+1)) begin
outcountJ <= 0;
routputFIFO.deq();
if (fromInteger(valueof(nTyp))==(outcountI+1)) begin
outcountI <= 0;
acceptinput <= True;
resetR <= False;
resetall <= False;
end else
outcountI <= outcountI+1;
end else begin
outcountI <= outcountI;
outcountJ <= outcountJ+1;
end
return val;
endmethod
endinterface
endmodule
17
GR Linear specific Memory.bsv
Author: Sunila Saqib saqib@mit.edu
/* default value for Complex numbers */
instance DefaultValue #(Complex#(FixedPoint#(is,fs)) );
defaultValue = Complex { rel : FixedPoint {i:0,f:0}, img :
FixedPoint {i:0,f:0} };
endinstance
/* default value for Fixed-Point numbers */
instance DefaultValue #(FixedPoint#(is,fs));
defaultValue = FixedPoint {i:0,f:0};
endinstance
/* implementation of Memory starts here */
typedef Vector#(m, dnum) Values#(type m, type dnum);
typedef struct {
snum ad;
Values#(m,dnum) val;
} MemoryWrite#(numeric type m, type snum, type dnum)
deriving(Eq, Bits);
function BRAMRequest#(snum, dnum) makeRequest(Bool write, snum
addr, dnum data)
provisos(Arith#(snum),Bits#(dnum,a__));
return BRAMRequest{
write: write,
responseOnWrite: False,
address: addr,
datain: data
};
endfunction
interface Memory#(numeric type col, type snum, type dnum);
interface Server#(snum, Values#(col,dnum)) read;//send in
the count and get the whole row
interface Put#(MemoryWrite#(col,snum,dnum)) write;
endinterface
module mkMemory(Memory#(col,snum,tnum))
provisos(Arith#(snum), Bits#(tnum,a__), Bits#(Tuple3#(snum,
snum, snum), b__), Bits#(Memory::MemoryWrite#(col, snum, tnum),
e__), Bits#(snum, n__), Literal#(tnum), PrimIndex#(snum, c__));
BRAMConfigure cfg = defaultValue;
cfg.allowWriteResponseBypass = False;
cfg.loadFormat = tagged Hex "bram2.txt";
Vector#(col,BRAM2Port#(snum, tnum)) duts <-
replicateM(mkBRAM2Server(cfg));
FIFO#(Values#(col,tnum)) val <- mkFIFO1();
FIFO#(snum) ad <- mkFIFO1();
FIFO#(tnum) value <- mkFIFO();
FIFO#(Tuple2#(snum,snum)) address <- mkFIFO();
FIFO#(snum) addressBeingRead <- mkFIFO();
FIFO#(MemoryWrite#(col,snum,tnum)) wrt <- mkFIFO1();
rule writeVal;
let writeMem = wrt.first();
wrt.deq();
let adres = writeMem.ad;
let valus = writeMem.val;
for(Integer i = 0; i < valueof(col); i = i+1) begin
let value1 = valus[i];
duts[i].portB.request.put(makeRequest(True, adres,
value1));
end
endrule
rule readReq;
let readAd = ad.first();
ad.deq();
Vector#(col,tnum) value = newVector;
for(Integer i = 0; i < valueof(col); i = i+1) begin
let adres = readAd;
duts[i].portA.request.put(makeRequest(False, adres,
value[i]));
end
endrule
rule readRes;
Values#(col, tnum) values;
for(Integer i = 0; i < valueof(col); i = i+1) begin
values[i] <- duts[i].portA.response.get;
end
val.enq(values);
endrule
interface Put write = toPut(wrt);
interface Server read;
interface Put request = toPut(ad);
interface Get response = toGet(val);
endinterface
endmodule
18
GR Linear specific States.bsv
Author: Sunila Saqib saqib@mit.edu
typedef Vector#(m,Tuple2#(Bit#(adsize),Bit#(adsize)))
State#(type m, type adsize);
typedef Vector#(n,State#(m,adsize)) StateTable#(type n, type m,
type adsize);
function StateTable#(nTyp,mTyp,adsize) getStates();
State#(mTyp,adsize) row= newVector;
StateTable#(nTyp,mTyp,adsize) machine= newVector;
Bit#(adsize) outx; Bit#(adsize) outcs; Integer c = 0;
for(Integer i=1;i<=valueof(nTyp) ;i=i+1) begin
c=0;
for (Integer j=1;j<=valueof (mTyp) ;j=j+1) begin
if(j==i || (i>valueof(mTyp) &&
j>=valueof(mTyp))) begin
outx = maxBound;
end else begin
if(j-1==c) outx = 0;//right unit
else if(j-2==c) outx = 1;//same unit
c=c+1;
end
if (j==1) outcs = 0;
else if(i>valueof(nTyp) -j+1)
outcs = 1; //same unit
else
outcs = 0; // left unit
row[j-1] = tuple3(outx,outcs,0);
end
machine[i-1] = row;
end
return machine;
endfunction
// Generate the linear approximation look-up-table
function StateTable#(nTyp, mTyp, adsize) genTable();
StateTable#(nTyp,mTyp, adsize) la = newVector;
la = getStates();
return la;
endfunction
//interface
interface StateMachine#(numeric type n, numeric type m,
numeric type adsize);
method ActionValue#(State#(m, adsize))
getState(Bit#(adsize) stateCounter);
endinterface
//module
module mkStateMachine(StateMachine#(nTyp, mTyp, adsize) ifc);
StateTable#(nTyp,mTyp, adsize) sm = genTable();
method ActionValue#(State#(mTyp, adsize)) getState(indx);
let tpl = sm[indx];
return tpl;
endmethod
endmodule
19
GR Linear specific FixedPointQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
(* synthesize *)
module mkPipelinedMultiplierUGDSP(
PipelinedMultiplier#(Stages, Bit#(BitLen)));
PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
mkPipelinedMultiplierUG();
return m;
endmodule
(* synthesize *)
(* doc = "synthesis attribute mult-style of
mkPipelinedMultiplierUGLUT is pipejlut" *)
module dJPipelinedMultiplierUGLUT(
PipelinedMultiplier# (Stages, Bit# (BitLen)));
PipelinedMultiplier#(Stages, Bit#(BitLen))
mkPipelinedMultiplierUG();
return m;
endmodule
(* synthesize *)
module mkMultiplierFP16DSP (Multiplier#(FP));
let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
mkPipelinedMultiplierG(mkPipelinedMultiplierUGDSP)));
return m;
endmodule
(* synthesize *)
module mkMultiplierFP16LUT (Multiplier#(FP));
let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
mkPipelinedMultiplierG(mkPipelinedMultiplierUGLUT)));
return m;
endmodule
(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
let tbl <- mkLAtable();
return tbl;
endmodule
(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
let tbl <- mkLogTable();
return tbl;
endmodule
(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
let tbl <- mkExpTable();
return tbl;
endmodule
(* synthesize *)
module mkExternalFixedPoint (External#(CP_FP));
// a. DSP based
// let mkmul = mkMultiplierFP16DSP;
// b. LUT based
let mkmul = mkMultiplierFP16LUT;
// 1. LA based
let mkrot = mkLArotation(mkmul, mkLAtableFP);
// 2. Log based
// let mkrot = mkLogrotation(mkmul, mkLogtableFP, mkExptableFP);
// 3. NR based
// let mkrot = mkComplexFixedPointRotation(mkmul);
let m <- mkExternal(mkrot);
return m;
endmodule
(* synthesize *)
module mkInternalFixedPoint(Internal#(CP_FP));
// a. DSP based
// let mkmul = mkMultiplierFP16DSP;
// b. LUT based
let mkmul = mkMultiplierFP16LUT;
let m <- mkInternal(mkComplexMultiplier(mkmul));
return m;
endmodule
module mkQRFixedPoint(QR#(width, CP_FP))
provisos(Add#(a__, 1, TDiv#(width, 2)));
let mkext = mkExternalFixedPoint;
let mkint = mkInternalFixedPoint;
let m <- mkQR(mkext, mkint);
return m;
endmodule
20
GR Linear specific Scemi.bsv
Author: Sunila Saqib saqib@mit.edu
typedef Dim ScemiQRWidth;
typedef CP_FP ScemiQRData;
typedef QR#(ScemiQRWidth, ScemiQRData) ScemiQR;
(* synthesize *)
module [Module] mkScemiQR(ScemiQR);
let m <- mkQRFixedPoint;
return m;
endmodule
module [Module] mkScemiDut(Clock qrclk, ScemiQR ifc);
Reset myrst <- exposeCurrentReset();
Reset qrrst <- mkAsyncReset(1, myrst, qrclk);
ScemiQR qr <- mkScemiQR(clocked_by qrclk, reset_by qrrst);
ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst);
return myqr;
endmodule
module [SceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc);
SceMiClockConfiguration conf = defaultValue;
SceMiClockPortIfc clk_port <- mkSceMiClockPort(conf);
ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk_port);
Empty xin <- mkPutXactor(qr.xin, clk_port);
Empty rout <- mkGetXactor(qr.rout, clk_port);
Empty shutdown <- mkShutdownXactor();
endmodule
(* synthesize *)
module mkTCPBridge ();
Clock myclk <- exposeCurrentClock;
Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP);
endmodule
21
GR Systolic specific FullRow.bsv
Author: Sunila Saqib saqib@mit.edu
interface FullRow#(numeric type width, type tnum);
interface Vector#(width, Put#(Terminating#(tnum))) xin;
interface Vector#(TSub#(width, 1), Get#(Terminating#(tnum)))
xout;
interface Vector#(width, Get#(tnum)) r;
endinterf ace
22 GR Systolic specific mkFullRow.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface FullRow#(numeric type width, type tnum);
interface Vector#(width, Put#(Terminating#(tnum))) xin;
interface Vector#(TSub#(width, 1),
Get#(Terminating#(tnum))) xout;
interface Vector#(width, Get#(tnum)) r;
endinterface
module [m] mkFullRow(m#(External#(tnum)) mkext,
        m#(Internal#(tnum)) mkint, FullRow#(width, tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, a__), Add#(1, b__, width));
    External#(tnum) ex <- mkext;
    InternalRow#(TSub#(width, 1), tnum) intRow <- mkInternalRow(mkint);
    mkConnection(ex.cs, intRow.cs);
    interface Put xin = cons(ex.xin, intRow.xin);
    interface Get xout = intRow.xout;
    interface Get r = cons(ex.r, intRow.r);
endmodule
23 GR Systolic specific mkExternal.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface External#(type tnum);
interface Put#(Terminating#(tnum)) xin;
interface Get#(RotationCS#(tnum)) cs;
interface Get#(tnum) r;
endinterface
module [m] mkExternal(m#(Rotate#(tnum)) mkrotate, tnum
        diagonalLoad, External#(tnum) ifc)
    provisos(IsModule#(m, m__), Literal#(tnum), Bits#(tnum, a__),
        Print#(tnum));
    Rotate#(tnum) rotationUnit <- mkrotate();
    Reg#(Maybe#(tnum)) r_local_reg <- mkReg(tagged Valid diagonalLoad);
    FIFO#(tnum) r_local <- mkFIFO();
    rule external_node_get_output if (r_local_reg matches tagged Invalid);
        tnum r <- rotationUnit.rout.get();
        r_local_reg <= tagged Valid (r);
    endrule
    Reg#(Bool) dofinish <- mkReg(False);
    rule finish (r_local_reg matches tagged Valid .r &&& dofinish);
        r_local.enq(r);
        r_local_reg <= tagged Valid diagonalLoad;
        dofinish <= False;
    endrule
    interface Put xin;
        method Action put(x) if (r_local_reg matches tagged Valid .r
                &&& !dofinish);
            rotationUnit.request.put(RotateInput {x: x.data, r: r});
            r_local_reg <= Invalid;
            dofinish <= x.islast;
        endmethod
    endinterface
    interface Get cs = rotationUnit.csout;
    interface Get r = toGet(r_local);
endmodule
24 GR Systolic specific mkInternalRow.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface InternalRow#(numeric type width, type tnum);
interface Vector#(width, Put#(Terminating#(tnum))) xin;
interface Put#(RotationCS#(tnum)) cs;
interface Vector#(width, Get#(Terminating#(tnum))) xout;
interface Vector#(width, Get#(tnum)) r;
endinterface
module [m] mkInternalRow(m#(Internal#(tnum)) mkint,
        InternalRow#(width, tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
    Vector#(width, Internal#(tnum)) vecInternal <- replicateM(mkint());
    Vector#(width, Put#(Terminating#(tnum))) xins = newVector;
    Vector#(width, Get#(Terminating#(tnum))) xouts = newVector;
    Vector#(width, Get#(tnum)) routs = newVector;
    for (Integer i = 0; i < valueof(width); i = i+1) begin
        xins[i] = vecInternal[i].xin;
        xouts[i] = vecInternal[i].xout;
        routs[i] = vecInternal[i].r;
        if (i+1 < valueof(width)) begin
            mkConnection(vecInternal[i].csout,
                vecInternal[i+1].csin);
        end else begin
            rule eatcsout (True);
                let foo <- vecInternal[i].csout.get();
            endrule
        end
    end
    Put#(RotationCS#(tnum)) cs_;
    if (valueof(width) == 0) begin
        cs_ = interface Put;
            method Action put(x) = noAction;
        endinterface;
    end else begin
        cs_ = vecInternal[0].csin;
    end
    interface xin = xins;
    interface cs = cs_;
    interface xout = xouts;
    interface Get r = routs;
endmodule
25 GR Systolic specific mkInternal.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface Internal#(type tnum);
interface Put#(RotationCS#(tnum)) csin;
interface Put#(Terminating#(tnum)) xin;
interface Get#(Terminating#(tnum)) xout;
interface Get#(RotationCS#(tnum)) csout;
interface Get#(tnum) r;
endinterface
module [m] mkInternal(m#(Multiplier#(tnum)) mkmul,
        Internal#(tnum) ifc)
    provisos(IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
        a__), Conjugate::Conjugate#(tnum), Print#(tnum));
    let xins <- mkFIFO();
    FIFO#(RotationCS#(tnum)) csins <- mkFIFO1();
    match {.xout_g, .xout_p} <- mkGPFIFO1();
    match {.rout_g, .rout_p} <- mkGPFIFO1();
    match {.csout_g, .csout_p} <- mkGPFIFO1();
    Reg#(tnum) m_r <- mkReg(0);
    Multiplier#(tnum) multiplier <- mkmul();
    let xi = xins.first().data;
    let cs = csins.first();
    function Action multiply(a, b, index) =
        multiplier.request.put(tuple2(a, b));
    function Action result(Reg#(tnum) dst, int index);
        action
            let res <- multiplier.response.get();
            dst <= res;
        endaction
    endfunction
    Reg#(tnum) cr <- mkRegU();
    Reg#(tnum) sx <- mkRegU();
    Reg#(tnum) cx <- mkRegU();
    Reg#(tnum) sr <- mkRegU();
    Stmt work = seq while (True) seq
        par
            csout_p.put(cs);
            seq
                multiply(cs.c, m_r, 0);
                multiply(con(cs.s), xi, 1);
                multiply(cs.c, xi, 2);
                action
                    multiply(cs.s, m_r, 3);
                    csins.deq();
                endaction
            endseq
            seq
                result(cr, 0);
                result(sx, 1);
                result(cx, 2);
                result(sr, 3);
            endseq
        endpar
        action
            let nr = cr + sx;
            let xo = cx - sr;
            xout_p.put(Terminating { data: xo, islast:
                xins.first().islast });
            if (xins.first().islast) begin
                m_r <= 0;
                rout_p.put(nr);
            end else begin
                m_r <= nr;
            end
            xins.deq();
        endaction
    endseq endseq;
    mkAutoFSM(work);
    interface Put xin;
        method Action put(x);
            xins.enq(x);
        endmethod
    endinterface
    interface Put csin;
        method Action put(cs);
            csins.enq(cs);
        endmethod
    endinterface
    interface Get xout = xout_g;
    interface Get r = rout_g;
    interface Get csout = csout_g;
endmodule
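For reference, the update this internal cell performs can be modeled in a few lines of Python (a behavioral sketch only, not part of the Bluespec sources; plain Python numbers stand in for the fixed-point complex type, and `internal_cell` is a hypothetical name):

```python
# Behavioral model of the Givens internal cell: for each incoming
# (c, s) pair and column element x, the stored element r and the
# element passed to the next row are updated simultaneously as
#   r' = c*r + conj(s)*x
#   x' = c*x - s*r
def internal_cell(rotations, xs):
    r = 0.0
    passed = []
    for (c, s), x in zip(rotations, xs):
        # both right-hand sides use the old value of r
        r, x_out = c * r + s.conjugate() * x, c * x - s * r
        passed.append(x_out)
    return r, passed
```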
26 GR Systolic specific QR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface QR#(numeric type width, type tnum);
interface Vector#(width, Put#(Terminating#(tnum))) rowin;
interface Vector#(width, Get#(tnum)) rowout;
endinterface
27 GR Systolic specific mkQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// make QR with width Greater Than ONE
module [m] mkQRgtONE(m#(External#(tnum)) mkext,
        m#(Internal#(tnum)) mkint, QR#(width, tnum) ifc)
    provisos(Bits#(tnum, tnum_sz), Literal#(tnum),
        IsModule#(m, m__), QRtopModule#(TSub#(width, 1)),
        Add#(1, b__, width));
    FullRow#(width, tnum) row1 <- mkFullRow(mkext, mkint);
    QR#(TSub#(width, 1), tnum) subQR <- mkQRtopModule(mkext, mkint);
    mkConnection(row1.xout, subQR.rowin);
    Vector#(width, Reg#(Bit#(TAdd#(1, TLog#(width)))))
        rowsTaken <- replicateM(mkReg(0));
    Vector#(TSub#(width, 1), FIFO#(tnum)) subrouts <- replicateM(mkFIFO);
    mkConnection(subQR.rowout, map(toPut, subrouts));
    Vector#(width, Get#(tnum)) routs = newVector;
    for (Integer i = 0; i < valueof(width); i = i+1) begin
        routs[i] = interface Get
            method ActionValue#(tnum) get();
                if (rowsTaken[i] ==
                        fromInteger(valueof(width)-1))
                    rowsTaken[i] <= 0;
                else rowsTaken[i] <= rowsTaken[i] + 1;
                if (rowsTaken[i] == 0) begin
                    let r <- row1.r[i].get();
                    return r;
                end else if (i == 0) return 0;
                else begin
                    let r <- toGet(subrouts[i-1]).get();
                    return r;
                end
            endmethod
        endinterface;
    end
    interface Put rowin = row1.xin;
    interface Get rowout = routs;
endmodule
// make QR with width EQual to ONE
module [m] mkQReqONE(m#(External#(tnum)) mkext,
        m#(Internal#(tnum)) mkint, QR#(1, tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
    FullRow#(1, tnum) row1 <- mkFullRow(mkext, mkint);
    interface Put rowin = row1.xin;
    interface Get rowout = row1.r;
endmodule
typeclass QRtopModule#(numeric type width);
    module [m] mkQRtopModule(m#(External#(tnum)) mkext,
            m#(Internal#(tnum)) mkint, QR#(width, tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
endtypeclass
instance QRtopModule#(1);
    module [m] mkQRtopModule(m#(External#(tnum)) mkext,
            m#(Internal#(tnum)) mkint, QR#(1, tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
        QR#(1, tnum) qrUnit <- mkQReqONE(mkext, mkint);
        return qrUnit;
    endmodule
endinstance
instance QRtopModule#(width)
    provisos(QRtopModule#(TSub#(width, 1)), Add#(1, widthm1, width));
    module [m] mkQRtopModule(m#(External#(tnum)) mkext,
            m#(Internal#(tnum)) mkint, QR#(width, tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
        QR#(width, tnum) qrUnit <- mkQRgtONE(mkext, mkint);
        return qrUnit;
    endmodule
endinstance
module [m] mkQRtop(m#(External#(tnum)) mkext,
        m#(Internal#(tnum)) mkint, QR#(width, tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz),
        Literal#(tnum), QRtopModule#(width));
    QR#(width, tnum) qrUnit <- mkQRtopModule(mkext, mkint);
    return qrUnit;
endmodule
28 GR Systolic specific FixedPointQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
(* synthesize *)
module mkPipelinedMultiplierUG_DSP(
        PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
(* doc = "synthesis attribute mult_style of
    mkPipelinedMultiplierUG_LUT is pipe_lut" *)
module mkPipelinedMultiplierUG_LUT(
        PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
module mkPipelinedMultiplierUG_32(PipelinedMultiplier#(Stages,
        Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP16DSP(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
        mkPipelinedMultiplierG(mkPipelinedMultiplierUG_32)));
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP16LUT(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
        mkPipelinedMultiplierG(mkPipelinedMultiplierUG_LUT)));
    return m;
endmodule
(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
    let tbl <- mkLAtable();
    return tbl;
endmodule
(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
    let tbl <- mkLogTable();
    return tbl;
endmodule
(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
    let tbl <- mkExpTable();
    return tbl;
endmodule
(* synthesize *)
module mkExternalFixedPoint(External#(CP_FP));
    // a. DSP based
    let mkmul = mkMultiplierFP16DSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    // 1. LA based
    // let mkrot = mkLArotation(mkmul, mkLAtableFP);
    // 2. Log based
    // let mkrot = mkLogrotation(mkmul, mkLogtableFP, mkExptableFP);
    // 3. NR based
    let mkrot = mkComplexFixedPointRotation(mkmul);
    let m <- mkExternal(mkrot, 0);
    return m;
endmodule
(* synthesize *)
module mkInternalFixedPoint(Internal#(CP_FP));
    // a. DSP based
    let mkmul = mkMultiplierFP16DSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    let m <- mkInternal(mkComplexMultiplier3(mkmul));
    return m;
endmodule
module mkQRFixedPoint(QR#(width, CP_FP))
    provisos(QRtopModule#(width));
    let mkext = mkExternalFixedPoint;
    let mkint = mkInternalFixedPoint;
    let m <- mkQRtop(mkext, mkint);
    return m;
endmodule
29 GR Systolic specific mkStreamQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// Turn a normal QR implementation into a streaming QR
// implementation.
module [m] mkStreamQR(m#(QR#(width, tnum)) mkqr,
        StreamQR#(width, tnum) ifc)
    provisos(IsModule#(m, a__), Bits#(tnum, tnum__));
    Reg#(Bit#(TLog#(width))) xcin <- mkReg(0);
    Reg#(Bit#(TLog#(width))) rcout <- mkReg(0);
    QR#(width, tnum) qr <- mkqr();
interface Put xin;
method Action put(Terminating#(tnum) x);
qr.rowin[xcin].put(x);
if (xcin == fromInteger(valueof(width)-1)) begin
xcin <= 0;
end else begin
xcin <= xcin + 1;
end
endmethod
endinterface
interface Get rout;
        method ActionValue#(tnum) get();
            tnum r <- qr.rowout[rcout].get();
if (rcout == fromInteger(valueof(width)-1)) begin
rcout <= 0;
end else begin
rcout <= rcout + 1;
end
return r;
endmethod
endinterface
endmodule
module mkStreamQRTestFixedPoint(Empty);
    let mkqr = mkQRFixedPoint;
    StreamQR#(Dim, CP_FP) qr <- mkStreamQR(mkqr);
    mkStreamQR3Test(qr, 1e-3);
endmodule
30 GR Systolic specific Scemi.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
typedef Dim ScemiQRWidth;
typedef CP_FP ScemiQRData;
typedef StreamQR#(ScemiQRWidth, ScemiQRData) ScemiQR;
(* synthesize *)
module [Module] mkScemiQR(ScemiQR);
    let m <- mkStreamQR(mkQRFixedPoint);
    return m;
endmodule
module [Module] mkScemiDut(Clock qrclk, ScemiQR ifc);
    Reset myrst <- exposeCurrentReset();
    Reset qrrst <- mkAsyncReset(1, myrst, qrclk);
    ScemiQR qr <- mkScemiQR(clocked_by qrclk, reset_by qrrst);
    ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst);
    return myqr;
endmodule
module [SceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc);
    SceMiClockConfiguration conf = defaultValue;
    SceMiClockPortIfc clk_port <- mkSceMiClockPort(conf);
    ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk_port);
    Empty xin <- mkPutXactor(qr.xin, clk_port);
    Empty rout <- mkGetXactor(qr.rout, clk_port);
    Empty shutdown <- mkShutdownXactor();
endmodule
(* synthesize *)
module mkTCPBridge();
    Clock myclk <- exposeCurrentClock;
    Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP);
endmodule
31 MGS specific BatchAcc.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
/* Batch Accumulator : accumulates all the values in a vector
   tnum  : type of data accumulated
   tsize : count of data units accumulated
   mTyp  : count of vectors to be accumulated */
interface BatchAcc#(type tnum, numeric type tsize, numeric type mTyp);
    interface Put#(Vector#(tsize, tnum)) invec;
    interface Get#(tnum) outval;
endinterface
module mkBatchAcc(BatchAcc#(tnum, tsize, mTyp) ifc)
    provisos(Bits#(tnum, a__), Arith#(tnum));
    Bit#(TAdd#(TLog#(mTyp), 1)) mType = fromInteger(valueof(mTyp));
    FIFO#(tnum) outputVal <- mkFIFO1();
    Reg#(tnum) sumReg <- mkReg(0);
    Reg#(Bit#(TAdd#(TLog#(mTyp), 1))) counter <- mkReg(0);
    interface Put invec;
        method Action put(invector);
            let prev_sum = sumReg;
            let sum = fold(\+ , cons(prev_sum, invector));
            if (counter+1 == mType) begin
                counter <= 0;
                sumReg <= 0;
                outputVal.enq(sum);
            end else begin
                counter <= counter+1;
                sumReg <= sum;
            end
        endmethod
    endinterface
    interface Get outval = toGet(outputVal);
endmodule
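For reference, the accumulation behavior of this unit can be sketched in Python (a behavioral model only, not part of the Bluespec sources; `batch_acc` is a hypothetical name and plain numbers stand in for the fixed-point type):

```python
# Behavioral model of BatchAcc: each put() folds an incoming vector
# into the running sum; after m vectors the total is emitted and the
# accumulator resets, mirroring sumReg/counter in BatchAcc.bsv.
def batch_acc(vectors, m):
    out, acc, count = [], 0, 0
    for vec in vectors:
        acc += sum(vec)      # fold(\+ , cons(prev_sum, invector))
        count += 1
        if count == m:       # counter+1 == mType
            out.append(acc)
            acc, count = 0, 0
    return out
```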
32 MGS specific BatchCS.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
/* Batch Complex Square : computes individual rel*rel + img*img
for a vector of complex values.
tnum : type of complex data accumulated
tsize : count of data units in a vector */
interface BatchCS#(type tnum, numeric type tsize);
method Action put(Vector#(tsize,Complex#(tnum)) invector);
    interface Get#(Vector#(tsize, tnum)) outvec;
endinterface
module [m] mkBatchCS(m#(Multiplier#(tnum)) mkmul,
        BatchCS#(tnum, tsize) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, a__), Arith#(tnum));
    Vector#(tsize, Multiplier#(tnum)) relP <-
        replicateM(mkmul()); // product of real parts
    Vector#(tsize, Multiplier#(tnum)) imgP <-
        replicateM(mkmul()); // product of imaginary parts
    FIFO#(Vector#(tsize, Complex#(tnum))) inputVec <- mkFIFO();
    FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO();
    rule cloudblock;
        Vector#(tsize, tnum) outvectop = newVector;
        Vector#(tsize, tnum) outvectop1 = newVector;
        Vector#(tsize, tnum) outvectop2 = newVector;
        for (Integer i = 0; i < valueof(tsize); i = i+1) begin
            outvectop1[i] <- relP[i].response.get();
            outvectop2[i] <- imgP[i].response.get();
        end
        for (Integer j = 0; j < valueof(tsize); j = j+1) begin
            outvectop[j] = outvectop1[j] + outvectop2[j];
        end
        outputVec.enq(outvectop);
    endrule
    method Action put(invector);
        for (Integer i = 0; i < valueof(tsize); i = i+1) begin
            relP[i].request.put(tuple2(invector[i].rel,
                invector[i].rel));
            imgP[i].request.put(tuple2(invector[i].img,
                invector[i].img));
        end
    endmethod
    interface Get outvec = toGet(outputVec);
endmodule
33 MGS specific BatchProduct.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
/* Batch Product : computes vector product of 2 vectors.
tnum : type of data values
tsize : count of data units in a vector */
interface BatchProduct#(type tnum, numeric type tsize);
interface Put#(Tuple2#(Vector#(
tsize,tnum),Vector#(tsize,tnum))) invec;
interface Get#(Vector#(tsize,tnum)) outvec;
endinterface
module [m] mkBatchProduct(m#(Multiplier#(tnum)) mkmul,
        BatchProduct#(tnum, tsize) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, a__), Arith#(tnum));
    Vector#(tsize, Multiplier#(tnum)) mul <- replicateM(mkmul());
    FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO1();
    rule cloudblock;
        Vector#(tsize, tnum) outvectop = newVector;
        for (Integer i = 0; i < valueof(tsize); i = i+1)
            outvectop[i] <- mul[i].response.get();
        outputVec.enq(outvectop);
    endrule
    interface Put invec;
        method Action put(invector);
            for (Integer i = 0; i < valueof(tsize); i = i+1)
                mul[i].request.put(tuple2(tpl_1(invector)[i],
                    tpl_2(invector)[i]));
        endmethod
    endinterface
    interface Get outvec = toGet(outputVec);
endmodule
34 MGS specific BatchSub.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
/* Batch Subtraction : computes element wise difference, vec1 - vec2
   tnum  : type of data values
   tsize : count of data units in a vector */
interface BatchSub#(type tnum, numeric type tsize);
    interface Put#(Tuple2#(Vector#(
        tsize, tnum), Vector#(tsize, tnum))) invec;
    interface Get#(Vector#(tsize, tnum)) outvec;
endinterface
module mkBatchSub(BatchSub#(tnum, tsize) ifc)
    provisos(Bits#(tnum, a__), Arith#(tnum));
    FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO1();
    function tnum subtract(tnum x, tnum y) = x-y;
    interface Put invec;
        method Action put(invector);
            let res = zipWith(subtract, tpl_1(invector),
                tpl_2(invector));
            outputVec.enq(res);
        endmethod
    endinterface
    interface Get outvec = toGet(outputVec);
endmodule
35 MGS specific mkDot.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// take in two distinct vectors and compute their dot product
interface Dot#(type itnum, numeric type n, numeric type m);
    // n - length of the vector
    // m items will be handled in one cycle, m copies of hardware
    // n total entries in the vector
    interface Put#(Vector#(n, itnum)) invec1; // complex
    interface Put#(Vector#(n, itnum)) invec2; // complex
    interface Get#(itnum) outval; // real
endinterface
module [m] mkDot(m#(Multiplier#(tnum)) mkmul, Dot#(tnum, nTyp, mTyp) ifc)
    provisos(IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
        a__), Conjugate::Conjugate#(tnum), Print#(tnum),
        DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
    // storage
    FIFO#(Vector#(nTyp, tnum)) infifo1 <- mkFIFO();
    FIFO#(Vector#(nTyp, tnum)) infifo2 <- mkFIFO();
    Reg#(Vector#(nTyp, tnum)) inreg1 <- mkRegU();
    Reg#(Vector#(nTyp, tnum)) inreg2 <- mkRegU();
    FIFO#(tnum) outfifo <- mkFIFO();
    // units
    BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
    BatchAcc#(tnum, mTyp, TDiv#(nTyp, mTyp)) bacc <- mkBatchAcc();
    // control logic
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) counter <- mkReg(0);
    Reg#(Bool) processing <- mkReg(False);
    // logic cloud
    rule putting_input; // implicit guard : infifos not empty
        Vector#(nTyp, tnum) invector1 = newVector;
        Vector#(nTyp, tnum) invector2 = newVector;
        Vector#(nTyp, tnum) invec2 = newVector;
        if (!processing) begin
            invector1 = infifo1.first();
            invec2 = infifo2.first();
            invector2 = map(con, invec2);
        end else begin
            invector1 = inreg1;
            invector2 = inreg2;
        end
        Vector#(mTyp, tnum) my_invec_1 = take(invector1);
        Vector#(mTyp, tnum) my_invec_2 = take(invector2);
        bprod.invec.put(tuple2(my_invec_1, my_invec_2));
        let my_newinvector_1 =
            shiftOutFrom0(defaultValue, invector1, valueof(mTyp));
        let my_newinvector_2 =
            shiftOutFrom0(defaultValue, invector2, valueof(mTyp));
        inreg1 <= my_newinvector_1;
        inreg2 <= my_newinvector_2;
        if (counter + 1 ==
                fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            counter <= 0;
            processing <= False;
            infifo1.deq();
            infifo2.deq();
        end else begin
            counter <= counter+1;
            processing <= True;
        end
    endrule
    rule connect_dot_and_acc; // implicit guard : batch product
        // unit generates output
        let products <- bprod.outvec.get();
        bacc.invec.put(products);
    endrule
    rule getting_output; // implicit guard : batch acc unit
        // generates output
        let summation <- bacc.outval.get();
        outfifo.enq(summation);
    endrule
    interface Put invec1 = toPut(infifo1);
    interface Put invec2 = toPut(infifo2);
    interface Get outval = toGet(outfifo);
endmodule
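The value this unit computes can be modeled in Python for reference (a behavioral sketch, not part of the Bluespec sources; `dot` is a hypothetical name). As in mkDot.bsv, the second operand is conjugated and the vectors are consumed m elements at a time:

```python
# Behavioral model of mkDot: chunked complex dot product
#   <a, b> = sum_i a[i] * conj(b[i])
# computed m products per step, with BatchAcc summing the n/m
# partial-product groups into one scalar.
def dot(a, b, m):
    assert len(a) == len(b) and len(a) % m == 0
    total = 0
    for k in range(0, len(a), m):
        total += sum(x * y.conjugate()
                     for x, y in zip(a[k:k+m], b[k:k+m]))
    return total
```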
36 MGS specific mkNorm.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
//computes norm of a vector (dot product with its conjugate)
interface Norm#(type tnum, numeric type n, numeric type m);
//n - length of the vector
//m items will be handled in one cycle, m copies of hardware
//n total entries in the vector
interface Put#(Vector#(n,tnum)) invec;//complex
interface Get#(tnum) outval;//real
endinterface
module [m] mkNorm(m#(Multiplier#(tnum)) mkmul,
        Norm#(Complex#(tnum), nTyp, mTyp) ifc)
    provisos(IsModule#(m, m__), Bits#(Complex#(tnum), a__),
        Conjugate::Conjugate#(Complex#(tnum)), DefaultValue::
        DefaultValue#(Complex#(tnum)), Add#(mTyp, b__, nTyp));
    // storage
    FIFO#(Vector#(nTyp, Complex#(tnum))) infifo <- mkFIFO();
    Reg#(Vector#(nTyp, Complex#(tnum))) inreg <- mkRegU();
    FIFO#(Complex#(tnum)) outfifo <- mkFIFO();
    // units
    BatchCS#(tnum, mTyp) bprod <- mkBatchCS(mkmul);
    BatchAcc#(tnum, mTyp, TDiv#(nTyp, mTyp)) bacc <- mkBatchAcc();
    // control logic
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) counter <- mkReg(0);
    Reg#(Bool) processing <- mkReg(False);
    // logic cloud
    rule putting_input; // implicit guard : infifo.notEmpty()
        Vector#(nTyp, Complex#(tnum)) invector = newVector;
        if (!processing) begin
            invector = infifo.first();
        end else begin
            invector = inreg;
        end
        Vector#(mTyp, Complex#(tnum)) my_invector = take(invector);
        bprod.put(my_invector);
        // bprod.invec.put(my_invector);
        let my_newinvector =
            shiftOutFrom0(defaultValue, invector, valueof(mTyp));
        inreg <= my_newinvector;
        if (counter + 1 ==
                fromInteger(valueof(nTyp)/valueof(mTyp))) begin
            counter <= 0;
            processing <= False;
            infifo.deq();
        end else begin
            counter <= counter+1;
            processing <= True;
        end
    endrule
    rule connect_product_and_acc;
        // implicit guard : bprod completes generating output
        let products <- bprod.outvec.get();
        bacc.invec.put(products);
    endrule
    rule getting_output;
        // implicit guard : bacc completes generating output
        let summation <- bacc.outval.get();
        outfifo.enq(cmplx(summation, 0));
    endrule
    interface Put invec = toPut(infifo);
    interface Get outval = toGet(outfifo);
endmodule
37 MGS specific mkOffsetCorrection.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// computes A[i] - R * Q[i]
interface OffsetCorrection#(type itnum, numeric type n, numeric
type m);
interface Put#(Tuple3#(Vector#(n,itnum), itnum,
Vector#(n,itnum))) invec; // Q, R, A.
interface Get#(Vector#(n,itnum)) outvec; // A[i] - R * Q[i]
endinterface
module [m] mkOffsetCorrection(m#(Multiplier#(tnum)) mkmul,
        OffsetCorrection#(tnum, nTyp, mTyp) ifc)
    provisos(IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
        a__), Conjugate::Conjugate#(tnum),
        DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
    // storage
    FIFO#(Vector#(nTyp, tnum)) infifoQ <- mkFIFO();
    FIFO#(tnum) infifoR <- mkFIFO();
    FIFO#(Vector#(nTyp, tnum)) infifoA <- mkFIFO();
    FIFO#(Vector#(nTyp, tnum)) outfifo <- mkFIFO();
    Reg#(Vector#(nTyp, tnum)) inregQ <- mkRegU();
    Reg#(Vector#(nTyp, tnum)) inregA <- mkRegU();
    Reg#(Vector#(nTyp, tnum)) outreg <- mkRegU();
    // units
    BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
    BatchSub#(tnum, mTyp) bsub <- mkBatchSub();
    // control logic
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) inputCounter1
        <- mkReg(0);
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) inputCounter2
        <- mkReg(0);
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) outputCounter
        <- mkReg(0);
    Reg#(Bool) processing1 <- mkReg(False);
    Reg#(Bool) processing2 <- mkReg(False);
    // logic cloud
    rule putting_input; // implicit guard : infifos.notEmpty
        Vector#(nTyp, tnum) invecQtop = newVector;
        tnum invalRtop = infifoR.first();
        if (!processing1) begin
            invecQtop = infifoQ.first();
        end else begin
            invecQtop = inregQ;
        end
        Vector#(mTyp, tnum) my_Q = take(invecQtop);
        bprod.invec.put(tuple2(my_Q, replicate(invalRtop)));
        let my_newinvector = shiftOutFrom0(
            defaultValue, invecQtop, valueof(mTyp));
        // shift m elements out from 0 index.
        inregQ <= my_newinvector;
        if (inputCounter1 + 1
                == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            inputCounter1 <= 0;
            processing1 <= False;
            infifoQ.deq();
            infifoR.deq();
        end else begin
            inputCounter1 <= inputCounter1+1;
            processing1 <= True;
        end
    endrule
    rule connecting_prod_and_sub;
        // implicit guard : batch product unit generates output
        Vector#(nTyp, tnum) invecAtop = newVector;
        if (!processing2) begin
            invecAtop = infifoA.first();
        end else begin
            invecAtop = inregA;
        end
        let products <- bprod.outvec.get();
        Vector#(mTyp, tnum) my_A = take(invecAtop);
        bsub.invec.put(tuple2(my_A, products));
        let my_newinvector = shiftOutFrom0(defaultValue,
            invecAtop, valueof(mTyp));
        inregA <= my_newinvector;
        if (inputCounter2 + 1
                == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            inputCounter2 <= 0;
            processing2 <= False;
            infifoA.deq();
        end else begin
            inputCounter2 <= inputCounter2+1;
            processing2 <= True;
        end
    endrule
    rule getting_output;
        // implicit guard : batch sub unit generates output
        Vector#(mTyp, tnum) res <- bsub.outvec.get();
        Vector#(nTyp, tnum) outputs = outreg;
        Vector#(TAdd#(nTyp, mTyp), tnum) newoutput =
            append(outreg, res);
        Vector#(nTyp, tnum) newoutreg = drop(newoutput);
        if (outputCounter + 1
                == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            outputCounter <= 0;
            outfifo.enq(newoutreg);
        end else begin
            outputCounter <= outputCounter+1;
            outreg <= newoutreg;
        end
    endrule
    interface Put invec;
        method Action put(in);
            infifoQ.enq(tpl_1(in));
            infifoR.enq(tpl_2(in));
            infifoA.enq(tpl_3(in));
        endmethod
    endinterface
    interface Get outvec = toGet(outfifo);
endmodule
38 MGS specific mkVecProd.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface VecProd#(type tnum, numeric type n, numeric type m);
    // n total entries in the vector, m entries will be handled
    // simultaneously
    interface Put#(Tuple2#(Vector#(n, tnum), tnum)) invec;
    interface Get#(Vector#(n, tnum)) outvec;
endinterface
module [m] mkVecProd(m#(Multiplier#(tnum)) mkmul,
        VecProd#(tnum, nTyp, mTyp) ifc)
    provisos(IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
        a__), Conjugate::Conjugate#(tnum), Print#(tnum),
        DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
    // storage
    FIFO#(Vector#(nTyp, tnum)) infifoQ <- mkFIFO(); // inputs
    FIFO#(tnum) infifoR <- mkFIFO();
    Reg#(Vector#(nTyp, tnum)) inregQ <- mkRegU(); // temporary
    Reg#(Vector#(nTyp, tnum)) outreg <- mkRegU();
    FIFO#(Vector#(nTyp, tnum)) outfifo <- mkFIFO(); // output
    // units
    BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
    // control logic
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) inputCounter
        <- mkReg(0);
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp), 1)))) outputCounter
        <- mkReg(0);
    Reg#(Bool) processing <- mkReg(False);
    // logic cloud
    rule putting_input; // will be implicitly guarded by the infifo.
        Vector#(nTyp, tnum) invecQtop = newVector;
        tnum invalRtop = infifoR.first();
        if (!processing) invecQtop = infifoQ.first();
        else invecQtop = inregQ;
        Vector#(mTyp, tnum) my_Q = take(invecQtop);
        Vector#(mTyp, tnum) my_R = replicate(invalRtop);
        bprod.invec.put(tuple2(my_Q, my_R));
        let my_newinvector =
            shiftOutFrom0(defaultValue, invecQtop, valueof(mTyp));
        // shift m elements out from 0 index.
        inregQ <= my_newinvector;
        if (inputCounter + 1
                == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            inputCounter <= 0;
            processing <= False;
            infifoQ.deq();
            infifoR.deq();
        end else begin
            inputCounter <= inputCounter+1;
            processing <= True;
        end
    endrule
    rule getting_output; // implicit guard: bprod produces output
        Vector#(mTyp, tnum) res <- bprod.outvec.get();
        Vector#(nTyp, tnum) outputs = outreg;
        Vector#(TAdd#(nTyp, mTyp), tnum) newoutput =
            append(outreg, res);
        Vector#(nTyp, tnum) newoutreg = drop(newoutput);
        if (outputCounter + 1
                == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
            // n should be multiple of m
            outputCounter <= 0;
            outfifo.enq(newoutreg);
        end else begin
            outputCounter <= outputCounter+1;
            outreg <= newoutreg;
        end
    endrule
    interface Put invec;
        method Action put(in);
            infifoQ.enq(tpl_1(in));
            infifoR.enq(tpl_2(in));
        endmethod
    endinterface
    interface Get outvec = toGet(outfifo);
endmodule
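For reference, the scalar-by-vector product this unit computes can be modeled in Python (a behavioral sketch, not part of the Bluespec sources; `vec_prod` is a hypothetical name):

```python
# Behavioral model of mkVecProd: m elements of q are multiplied by
# the scalar r per step; chunk results are shifted into an output
# register and emitted once all n/m chunks are done.
def vec_prod(q, r, m):
    assert len(q) % m == 0  # n should be a multiple of m
    out = []
    for k in range(0, len(q), m):
        out.extend(x * r for x in q[k:k+m])
    return out
```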
39 MGS specific SqrtInv.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// complex square root module
interface SqrtInv#(type tnum);
    interface Put#(tnum) x;   // x
    interface Get#(tnum) xs;  // x square root
    interface Get#(tnum) xsi; // x square root inverse
endinterface
module [m] mkSqrtInvCPFP(m#(SqrtInv#(FixedPoint#(is, fs)))
        mksi, SqrtInv#(Complex#(FixedPoint#(is, fs))) ifc)
    provisos(IsModule#(m, m__));
    SqrtInv#(FixedPoint#(is, fs)) sq <- mksi;
    interface Put x;
        method Action put(x);
            sq.x.put(x.rel);
        endmethod
    endinterface
    interface Get xs;
        method ActionValue#(Complex#(FixedPoint#(is, fs))) get();
            let xs <- sq.xs.get();
            return cmplx(xs, 0);
        endmethod
    endinterface
    interface Get xsi;
        method ActionValue#(Complex#(FixedPoint#(is, fs))) get();
            let xsi <- sq.xsi.get();
            return cmplx(xsi, 0);
        endmethod
    endinterface
endmodule
40 MGS specific LASqrtInv.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// fixed point square root, sqrt(A) = offset + ((A - A_i) * slope)
module [m] mkSqrtInvFP(m#(Multiplier#(FixedPoint#(is, fs)))
        mkmul, m#(LAtable#(fb, FixedPoint#(is, fs), tableSize)) mkLUT,
        SqrtInv#(FixedPoint#(is, fs)) ifc)
    provisos(IsModule#(m, m__));
    FIFO#(FixedPoint#(is, fs)) xfifo <- mkFIFO();
    FIFO#(FixedPoint#(is, fs)) offsetReg <- mkFIFO(); // stage 2
    FIFO#(FixedPoint#(is, fs)) xsififo <- mkFIFO();
    FIFO#(FixedPoint#(is, fs)) xsfifo <- mkFIFO();
    Reg#(FixedPoint#(is, fs)) xsiReg <- mkRegU();
    LAtable#(fb, FixedPoint#(is, fs), tableSize) tbl <- mkLUT();
    Multiplier#(FixedPoint#(is, fs)) multiplier1 <- mkmul;
    Stmt interim = seq while (True) seq
        action
            let tblEntry <- tbl.tableEntry.get();
            offsetReg.enq(tblEntry.offset);
            let xin = xfifo.first();
            Bit#(TSub#(fs, fb)) important = truncate(pack(xin));
            FixedPoint#(is, fs) diff = unpack(zeroExtend(important));
            multiplier1.request.put(tuple2(tblEntry.slope, diff));
        endaction
        action
            let prod <- multiplier1.response.get();
            let offset = offsetReg.first();
            offsetReg.deq();
            FixedPoint#(is, fs) temp = offset + prod;
            xsififo.enq(temp);
            xsiReg <= temp;
        endaction
        action
            let xin = xfifo.first();
            xfifo.deq();
            multiplier1.request.put(tuple2(xin, xsiReg));
        endaction
        action
            let prod <- multiplier1.response.get();
            xsfifo.enq(prod);
        endaction
    endseq endseq;
    mkAutoFSM(interim);
    interface Put x;
        method Action put(x);
            tbl.tableIndex.put(x);
            xfifo.enq(x);
        endmethod
    endinterface
    interface Get xsi = toGet(xsififo);
    interface Get xs = toGet(xsfifo);
endmodule
41 MGS specific LogSqrtInv.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// fixed point square root
module [m] mkSqrtInvLogFP(m#(Multiplier#(FixedPoint#(is,
        fs))) mkmul, m#(LogTable#(fbl, FixedPoint#(is, fs), hightL))
        mkLog, m#(ExpTable#(fbe, FixedPoint#(is, fs), hightE)) mkExp,
        SqrtInv#(FixedPoint#(is, fs)) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
        TMul#(2, fs))), Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, m__));
    Vector#(1, LogTable#(fbl, FixedPoint#(is, fs), hightL))
        logtbl <- replicateM(mkLog());
    Vector#(2, ExpTable#(fbe, FixedPoint#(is, fs), hightE))
        exptbl <- replicateM(mkExp());
    FIFO#(FixedPoint#(is, fs)) xs_fifo <- mkFIFO();
    let xs_g = toGet(xs_fifo);
    let xs_p = toPut(xs_fifo);
    FIFO#(FixedPoint#(is, fs)) xsi_fifo <- mkFIFO();
    let xsi_g = toGet(xsi_fifo);
    let xsi_p = toPut(xsi_fifo);
    Stmt interim = seq while (True) seq
        action
            let log_res <- logtbl[0].tableEntry.get();
            let r_new_sqr_log = log_res.offset; // x +
                // fromInteger(valueof(fbl));
            let r_new_log = r_new_sqr_log >> 1;
            let r_new_inv_log = 0 - r_new_log;
            exptbl[0].tableIndex.put(r_new_log);
            exptbl[1].tableIndex.put(r_new_inv_log);
        endaction
        par
            action
                let r_new <- exptbl[0].tableEntry.get();
                xs_p.put(r_new.offset);
            endaction
            action
                let r_inv <- exptbl[1].tableEntry.get();
                xsi_p.put(r_inv.offset);
            endaction
        endpar
    endseq endseq;
    mkAutoFSM(interim);
    interface Put x;
        method Action put(x);
            logtbl[0].tableIndex.put(x);
        endmethod
    endinterface
    interface Get xs = xs_g;
    interface Get xsi = xsi_g;
endmodule
42 MGS specific NRSqrtInv.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// fixed point square root
module [m] mkSqrtInvNRFP(m#(Multiplier#(FixedPoint#(is, fs)))
        mkmul, SqrtInv#(FixedPoint#(is, fs)) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
        Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
        Mul#(2, fs, TAdd#(c__, fs)),
        IsModule#(m, m__));
    SquareRoot#(FixedPoint#(is, fs)) sqrt <- mkFixedPointSquareRoot(1);
    Divider#(FixedPoint#(is, fs)) dr <- mkFixedPointDivider(2);
    match {.xs_g, .xs_p} <- mkGPFIFO();
    match {.xsi_g, .xsi_p} <- mkGPFIFO();
    Reg#(Bit#(11)) clk <- mkReg(0);
    Reg#(Bool) timeit <- mkReg(False);
    rule tick;
        clk <= clk + 1;
    endrule
    rule dodivide (True);
        match {.nr, .*} <- sqrt.response.get();
        xs_p.put(nr);
        dr.request.put(tuple2(fromInteger(1), nr));
    endrule
    rule dofinalize (True);
        match {.xsi, .*} <- dr.response.get();
        xsi_p.put(xsi);
    endrule
    interface Put x;
        method Action put(x);
            sqrt.request.put(x);
        endmethod
    endinterface
    interface Get xs = xs_g;
    interface Get xsi = xsi_g;
endmodule
186
43
MGS specific mkDP.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// boundary unit
interface DP#(type tnum, numeric type n, numeric type m);
    // n - length of the vector; n total entries in the vector
    // m items will be handled in one cycle, m copies of hardware
    interface Put#(Vector#(n,tnum)) invec;
    interface Get#(tnum) rout; //r
    interface Get#(Vector#(n,tnum)) qout; //q
endinterface
module [m] mkDP(m#(Multiplier#(tnum)) mkmul,
        m#(Norm#(tnum,nTyp,mTyp)) mknorm, m#(SqrtInv#(tnum)) mksqrt,
        DP#(tnum,nTyp,mTyp) ifc)
    provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
        Conjugate::Conjugate#(tnum), Print#(tnum),
        DefaultValue::DefaultValue#(tnum), Add#(mTyp, d__, nTyp));
    Integer depth = valueof(Depth);
    FIFOF#(Vector#(nTyp, tnum)) infifo <- mkSizedFIFOF(depth);
    FIFOF#(tnum) rfifo <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) qfifo <- mkSizedFIFOF(depth);
    //units
    Norm#(tnum,nTyp,mTyp) norm <- mknorm;
    SqrtInv#(tnum) si <- mksqrt;
    VecProd#(tnum,nTyp,mTyp) oc <- mkVecProd(mkmul);
    //control logic
    Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp,mTyp),1)))) counter <- mkReg(0);
    //logic cloud
    rule connect_dot_and_si;
        //implicit guard: norm completes generating output
        let x <- norm.outval.get();
        si.x.put(x);
    endrule
    rule connect_si_and_out; //implicit guards:
        //si generates output, xfifo and infifo not empty
        let xsi <- si.xsi.get();
        let inputvector = infifo.first();
        infifo.deq();
        oc.invec.put(tuple2(inputvector, xsi));
    endrule
    rule getting_output_r_val;
        //implicit guard: si generates xs output
        let r <- si.xs.get();
        rfifo.enq(r);
    endrule
    rule getting_output_q_vec;
        //implicit guard: batch sub unit completes generating output
        let q <- oc.outvec.get();
        qfifo.enq(q);
    endrule
    interface Put invec;
        method Action put(in);
            infifo.enq(in);
            norm.invec.put(in);
        endmethod
    endinterface
    interface Get rout = toGet(rfifo);
    interface Get qout = toGet(qfifo);
endmodule
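Functionally, the boundary unit consumes a column vector, computes its norm r via the Norm and SqrtInv units, and scales the vector by 1/r to emit the unit vector q. A Python sketch of that behavior (function and variable names are illustrative, not from the source):

```python
import math

def dp_boundary(a):
    """MGS boundary cell: r = ||a||, q = a / r."""
    norm_sq = sum(x * x for x in a)   # Norm unit: sum of squares
    r = math.sqrt(norm_sq)            # SqrtInv unit: xs output
    r_inv = 1.0 / r                   # SqrtInv unit: xsi output
    q = [x * r_inv for x in a]        # VecProd unit: scale by 1/r
    return r, q
```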
44  MGS specific mkTP.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface TP#(type tnum, numeric type n, numeric type m);
    //n total entries in the vector, m entries will be handled
    //simultaneously
    interface Put#(Vector#(n,tnum)) invec1;
    interface Put#(Vector#(n,tnum)) invec2; //q vector
    interface Get#(tnum) rout;
    interface Get#(Vector#(n,tnum)) qout;
    interface Get#(Vector#(n,tnum)) invec2out; //passing the q
endinterface
module [m] mkTP(m#(Multiplier#(tnum)) mkmul,
        TP#(tnum,nTyp,mTyp) ifc)
    provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
        Conjugate::Conjugate#(tnum), Print#(tnum),
        DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
    //storage
    Integer depth = valueof(Depth);
    FIFOF#(Vector#(nTyp,tnum)) infifoQ <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) infifoA <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) infifoQinterim <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) infifoAinterim <- mkSizedFIFOF(depth);
    FIFOF#(tnum) outfifoR <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) outfifoQ <- mkSizedFIFOF(depth);
    FIFOF#(Vector#(nTyp,tnum)) invec2outfifo <- mkBypassFIFOF();
    //units
    Dot#(tnum,nTyp,mTyp) dot <- mkDot(mkmul);
    OffsetCorrection#(tnum,nTyp,mTyp) vecp <- mkOffsetCorrection(mkmul);
    //logic cloud
    rule connecting_dot_and_vecp;
        let dotproduct <- dot.outval.get();
        outfifoR.enq(dotproduct);
        let inQtop = infifoQ.first();
        let inAtop = infifoA.first();
        infifoQ.deq();
        infifoA.deq();
        vecp.invec.put(tuple3(inQtop, dotproduct, inAtop));
    endrule
    rule getting_output;
        //implicit guard: sub produces output
        let outvec <- vecp.outvec.get();
        outfifoQ.enq(outvec);
    endrule
    interface Put invec1;
        method Action put(in1);
            infifoA.enq(in1);
            dot.invec1.put(in1);
        endmethod
    endinterface
    interface Put invec2;
        method Action put(in2);
            infifoQ.enq(in2);
            dot.invec2.put(in2);
            invec2outfifo.enq(in2);
        endmethod
    endinterface
    interface Get rout = toGet(outfifoR);
    interface Get qout = toGet(outfifoQ);
    interface Get invec2out = toGet(invec2outfifo);
endmodule
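The internal unit's Dot and OffsetCorrection sub-units implement the two MGS steps for an off-diagonal cell: project the incoming column onto the q vector to get the R entry, then subtract that projection. A Python sketch (names illustrative):

```python
def tp_internal(q, a):
    """MGS internal cell: r = <q, a>, then subtract the projection."""
    r = sum(qi * ai for qi, ai in zip(q, a))        # Dot unit
    a_new = [ai - r * qi for qi, ai in zip(q, a)]   # OffsetCorrection unit
    return r, a_new
```

After the correction, the residual column is orthogonal to q, which is what lets the next cell in the row reuse it.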
45  MGS specific UnitRow.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface UnitRow#(numeric type tsize, numeric type tdim, type tnum);
    //tsize = size of row, tnum = number type
    //tdim = dimension of the nxn matrix to be decomposed
    interface Vector#(tsize, Put#(Vector#(tdim,tnum))) xin;
    interface Vector#(tsize, Get#(tnum)) rout;
    interface Vector#(TSub#(tsize,1), Get#(Vector#(tdim,tnum))) qout;
endinterface
module [m] mkUnitRow(m#(DP#(tnum,nTyp,mTyp)) mkdp,
        m#(TP#(tnum,nTyp,mTyp)) mktp, UnitRow#(n,nTyp, tnum) ifc)
    provisos (IsModule#(m,m__), Bits#(tnum, a__));
    //units
    DP#(tnum,nTyp,mTyp) vecDP <- mkdp; //boundary unit
    Vector#(TSub#(n,1),TP#(tnum,nTyp,mTyp)) vecTP <-
        replicateM( mktp() ); //vector of internal units
    //interface variables
    Vector#(n, Put#(Vector#(nTyp,tnum))) xins = newVector;
    Vector#(n, Get#(tnum)) routs = newVector;
    Vector#(TSub#(n,1), Get#(Vector#(nTyp,tnum))) qouts =
        newVector; //n for linear case
    //connecting output from internal unit to next internal
    //unit in the row - tunnel for q vector
    for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
        if (i+1 < valueof(TSub#(n,1)))
            mkConnection(vecTP[i].invec2out, vecTP[i+1].invec2);
        else
            rule eatit (True);
                let x <- vecTP[i].invec2out.get();
            endrule
    //connecting output from boundary unit with input of
    //internal unit - q vector
    if (1 < valueof(n))
        mkConnection(vecDP.qout, vecTP[0].invec2);
    else
        rule eatitagain;
            let x <- vecDP.qout.get();
        endrule
    //connecting input x vectors
    xins[0] = vecDP.invec;
    for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
        xins[i+1] = vecTP[i].invec1;
    //connecting output q vectors
    for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
        qouts[i] = vecTP[i].qout;
    //connecting output r vectors
    routs[0] = vecDP.rout;
    for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
        routs[i+1] = vecTP[i].rout;
    //connecting interface variables to the module interfaces
    interface qout = qouts;
    interface rout = routs;
    interface xin = xins;
endmodule
46  MGS specific QR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
interface QR#(numeric type width, numeric type n, type tnum);
    interface Vector#(width, Put#(Vector#(n,tnum))) rowin;
    interface Vector#(width, Get#(tnum)) rowout;
endinterface
47  MGS specific mkStreamQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
// Turn a normal QR implementation into a streaming QR
// implementation.
module [m] mkStreamQR(m#(QR#(width,nn,tnum)) mkqr,
        StreamQR#(width, tnum) ifc)
    provisos(IsModule#(m, a__), Bits#(tnum, tnum_),
        Print#(tnum), Add#(width,nn,TAdd#(width,nn)),
        DefaultValue::DefaultValue#(tnum), Print::Print#(tnum));
    Reg#(Bit#(TLog#(width))) xcin <- mkReg(0);
    Reg#(Bit#(TLog#(width))) rcout <- mkReg(0);
    Vector#(nn, Vector#(nn, Reg#(tnum))) xins <-
        replicateM(replicateM(mkRegU()));
    Reg#(Bit#(TAdd#(TLog#(nn),1))) i <- mkReg(0);
    Reg#(Bit#(TAdd#(TLog#(nn),1))) j <- mkReg(0);
    let size = valueof(nn);
    QR#(width,nn,tnum) qr <- mkqr();
    interface Put xin;
        method Action put(Terminating#(tnum) x);
            if (i+1 == fromInteger(size)) begin
                if (j+1 == fromInteger(size)) j <= 0;
                else j <= j+1;
                i <= 0;
            end else begin
                i <= i+1;
            end
            if (j+1 == fromInteger(size)) begin
                Vector#(nn, tnum) column = newVector;
                for (Integer o = 0; o < size-1; o = o+1)
                    column[o] = xins[i][o];
                column[size-1] = x.data;
                qr.rowin[i].put(column);
            end else
                xins[i][j] <= x.data;
        endmethod
    endinterface
    interface Get rout;
        method ActionValue#(tnum) get();
            tnum r <- qr.rowout[rcout].get();
            if (rcout == fromInteger(valueof(width)-1))
                rcout <= 0;
            else rcout <= rcout + 1;
            return r;
        endmethod
    endinterface
endmodule
module mkStreamQRTestFixedPoint (Empty);
    let mkqr = mkQRCPFP;
    StreamQR#(Dim,CPFP) qr <- mkStreamQR(mkqr);
    mkStreamQR3Test(qr, 0.001);
endmodule
48  MGS specific Scemi.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
typedef Dim ScemiQRWidth;
typedef CPFP ScemiQRData;
typedef StreamQR#(ScemiQRWidth, ScemiQRData) ScemiQR;
(* synthesize *)
module [Module] mkScemiQR(ScemiQR);
    let m <- mkStreamQR(mkQRCPFP);
    return m;
endmodule
module [Module] mkScemiDut(Clock qrclk, ScemiQR ifc);
    Reset myrst <- exposeCurrentReset();
    Reset qrrst <- mkAsyncReset(1, myrst, qrclk);
    ScemiQR qr <- mkScemiQR(clocked_by qrclk, reset_by qrrst);
    ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst);
    return myqr;
endmodule
module [SceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc);
    SceMiClockConfiguration conf = defaultValue;
    SceMiClockPortIfc clk_port <- mkSceMiClockPort(conf);
    ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk_port);
    Empty xin <- mkPutXactor(qr.xin, clk_port);
    Empty rout <- mkGetXactor(qr.rout, clk_port);
    Empty shutdown <- mkShutdownXactor();
endmodule
(* synthesize *)
module mkTCPBridge ();
    Clock myclk <- exposeCurrentClock;
    Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP);
endmodule
49  MGS Systolic specific FixedPointQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
(* synthesize *)
module mkPipelinedMultiplierUG_DSP
        (PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
(* doc = "synthesis attribute mult_style of
    mkPipelinedMultiplierUG_LUT is pipe_lut" *)
module mkPipelinedMultiplierUG_LUT
        (PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP_DSP (Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(
        mkDePipelinedMultiplier(mkPipelinedMultiplierG(mkPipelinedMultiplierUG_DSP)));
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP_LUT (Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(
        mkDePipelinedMultiplier(mkPipelinedMultiplierG(mkPipelinedMultiplierUG_LUT)));
    return m;
endmodule
(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
    let tbl <- mkLAtable();
    return tbl;
endmodule
(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
    let tbl <- mkLogTable();
    return tbl;
endmodule
(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
    let tbl <- mkExpTable();
    return tbl;
endmodule
(* synthesize *)
module mkSqrtInvCPFP (SqrtInv#(CPFP));
    // a. DSP based
    let mkmul = mkMultiplierFP_DSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    //
    // 1. LA
    // let mksi = mkSqrtInvFP(mkmul, mkLAtableFP);
    // 2. Log
    // let mksi = mkSqrtInvLogFP(mkmul,
    //     mkLogtableFP, mkExptableFP);
    // 3. NR
    let mksi = mkSqrtInvNRFP(mkmul);
    let m <- mkSqrtInvCPFP(mksi);
    return m;
endmodule
(* synthesize *)
module mkNormCPFP (Norm#(CPFP, Dim, PUarrSize));
    let m <- mkNorm(mkMultiplierFP_DSP);
    return m;
endmodule
(* synthesize *)
module mkDPCPFP(DP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFP_DSP);
    let m <- mkDP(mkmul, mkNormCPFP, mkSqrtInvCPFP);
    return m;
endmodule
(* synthesize *)
module mkTPCPFP(TP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFP_DSP);
    let m <- mkTP(mkmul);
    return m;
endmodule
module mkQRCPFP(QR#(width, Dim, CPFP))
    provisos(QRtopModule#(width));
    let mkdp = mkDPCPFP;
    let mktp = mkTPCPFP;
    let m <- mkQRtop(mkdp, mktp);
    return m;
endmodule
50  MGS Systolic specific mkQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
//make QR with width Greater Than ONE
module [m] mkQRgtONE(m#(DP#(tnum,nTyp,mTyp)) mkdp,
        m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(width,nTyp, tnum) ifc)
    provisos(Bits#(tnum, tnum_sz), Literal#(tnum),
        IsModule#(m,m__), QRtopModule#(TSub#(width,1)),
        Add#(1, b__, width));
    UnitRow#(width,nTyp, tnum) row1 <- mkUnitRow(mkdp,
        mktp); //make one full row
    QR#(TSub#(width,1),nTyp, tnum) subQR <-
        mkQRtopModule(mkdp, mktp);
    mkConnection(row1.qout, subQR.rowin);
    Vector#(width, Reg#(Bit#(TAdd#(1, TLog#(width)))))
        rowsTaken <- replicateM(mkReg(0));
    Vector#(TSub#(width,1), FIFO#(tnum)) subrouts <-
        replicateM(mkFIFO);
    mkConnection(subQR.rowout, map(toPut, subrouts));
    Vector#(width, Get#(tnum)) routs = newVector;
    for (Integer i = 0; i < valueof(width); i = i+1) begin
        routs[i] = interface Get
            method ActionValue#(tnum) get();
                if (rowsTaken[i] == fromInteger(valueof(width)-1))
                    rowsTaken[i] <= 0;
                else begin
                    rowsTaken[i] <= rowsTaken[i] + 1;
                end
                if (rowsTaken[i] == 0) begin
                    let r <- row1.rout[i].get();
                    return r;
                end else if (i == 0) return 0;
                else begin
                    let r <- toGet(subrouts[i-1]).get();
                    return r;
                end
            endmethod
        endinterface;
    end
    interface Put rowin = row1.xin;
    interface Get rowout = routs;
endmodule
//make QR with width EQual to ONE
module [m] mkQReqONE(m#(DP#(tnum,nTyp,mTyp)) mkdp,
        m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(1,nTyp, tnum) ifc)
    provisos(IsModule#(m,m__), Bits#(tnum, tnum_sz));
    UnitRow#(1,nTyp, tnum) row1 <- mkUnitRow(mkdp, mktp);
    interface Put rowin = row1.xin;
    interface Get rowout = row1.rout;
endmodule
typeclass QRtopModule#(numeric type width);
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
        m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(width,nTyp, tnum) ifc)
        provisos(IsModule#(m,m__), Bits#(tnum, tnum_sz), Literal#(tnum));
endtypeclass
instance QRtopModule#(1);
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
            m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(1,nTyp, tnum) ifc)
        provisos(IsModule#(m,m__), Bits#(tnum, tnum_sz));
        QR#(1,nTyp, tnum) qrUnit <- mkQReqONE(mkdp, mktp);
        return qrUnit;
    endmodule
endinstance
instance QRtopModule#(width)
        provisos (QRtopModule#(TSub#(width,1)), Add#(1, width_m1, width));
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
            m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(width,nTyp, tnum) ifc)
        provisos(IsModule#(m,m__), Bits#(tnum, tnum_sz), Literal#(tnum));
        QR#(width,nTyp,tnum) qrUnit <- mkQRgtONE(mkdp, mktp);
        return qrUnit;
    endmodule
endinstance
module [m] mkQRtop(m#(DP#(tnum,n,mTyp)) mkdp,
        m#(TP#(tnum,n,mTyp)) mktp, QR#(width,n, tnum) ifc)
    provisos(IsModule#(m,m__), Bits#(tnum, tnum_sz),
        Literal#(tnum), QRtopModule#(width));
    QR#(width,n,tnum) qrUnit <- mkQRtopModule(mkdp, mktp);
    return qrUnit;
endmodule
51  MGS Linear specific FixedPointQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
(* synthesize *)
module mkPipelinedMultiplierUG_DSP
        (PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
(* doc = "synthesis attribute mult_style of
    mkPipelinedMultiplierUG_LUT is pipe_lut" *)
module mkPipelinedMultiplierUG_LUT
        (PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <-
        mkPipelinedMultiplierUG();
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP_DSP (Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(
        mkDePipelinedMultiplier(mkPipelinedMultiplierG(mkPipelinedMultiplierUG_DSP)));
    return m;
endmodule
(* synthesize *)
module mkMultiplierFP_LUT (Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(
        mkDePipelinedMultiplier(mkPipelinedMultiplierG(mkPipelinedMultiplierUG_LUT)));
    return m;
endmodule
(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
    let tbl <- mkLAtable();
    return tbl;
endmodule
(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
    let tbl <- mkLogTable();
    return tbl;
endmodule
(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
    let tbl <- mkExpTable();
    return tbl;
endmodule
(* synthesize *)
module mkSqrtInvCPFP (SqrtInv#(CPFP));
    // a. DSP based
    let mkmul = mkMultiplierFP_DSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    // 1. LA
    // let mksi = mkSqrtInvFP(mkmul, mkLAtableFP);
    // 2. Log
    // let mksi = mkSqrtInvLogFP(mkmul,
    //     mkLogtableFP, mkExptableFP);
    // 3. NR
    let mksi = mkSqrtInvNRFP(mkmul);
    let m <- mkSqrtInvCPFP(mksi);
    return m;
endmodule
(* synthesize *)
module mkNormCPFP (Norm#(CPFP, Dim, PUarrSize));
    let m <- mkNorm(mkMultiplierFP_DSP);
    return m;
endmodule
(* synthesize *)
module mkDPCPFP(DP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFP_DSP);
    let m <- mkDP(mkmul, mkNormCPFP, mkSqrtInvCPFP);
    return m;
endmodule
(* synthesize *)
module mkTPCPFP(TP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFP_DSP);
    let m <- mkTP(mkmul);
    return m;
endmodule
module mkQRCPFP(QR#(Dim, Dim, CPFP));
    let mkdp = mkDPCPFP;
    let mktp = mkTPCPFP;
    let m <- mkQRtop(mkdp, mktp);
    return m;
endmodule
52  MGS Linear specific mkQR.bsv
/*
Author: Sunila Saqib saqib@mit.edu
*/
module [m] mkQRtop(m#(DP#(tnum,nTyp,mTyp)) mkdp,
        m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(n,nTyp,tnum) ifc)
    provisos(IsModule#(m,m__), Literal#(tnum), Bits#(tnum, a__),
        Add#(TLog#(n),1,adsize), DefaultValue::DefaultValue#(tnum));
    Vector#(n,FIFO#(Vector#(nTyp,tnum))) xinFIFO <- replicateM(mkFIFO1);
    Vector#(n,FIFO#(tnum)) routFIFO <- replicateM(mkFIFO);
    Vector#(n,Put#(Vector#(nTyp,tnum))) xinPut;
    Vector#(n,Get#(tnum)) routGet;
    Vector#(n,FIFO#(Vector#(nTyp,tnum))) qFIFO <- replicateM(mkFIFO1);
    Reg#(Bit#(adsize)) counter <- mkReg(0);
    Reg#(Bit#(adsize)) counterR <- mkReg(0);
    Reg#(Bit#(adsize)) counterQ <- mkReg(0);
    UnitRow#(n,nTyp,tnum) ur <- mkUnitRow(mkdp,mktp);
    Reg#(Bool) resetall <- mkReg(False);
    rule get_qout; //implicit guard: ur generates output
        for (Integer i=0; i<valueof(TSub#(n,1)); i=i+1) begin
            let x <- ur.qout[i].get();
            if (counterQ != fromInteger(valueof(nTyp)-1))
                qFIFO[i].enq(x);
        end
        if (counterQ+1 == fromInteger(valueof(nTyp)))
            counterQ <= 0;
        else counterQ <= counterQ+1;
    endrule
    rule get_rout;
        Vector#(n, tnum) rs = newVector;
        for (Integer i=0; i<valueof(n); i=i+1)
            rs[i] <- ur.rout[i].get;
        Vector#(n, tnum) shiftedRout = shiftOutFromN(defaultValue,
            rs, counterR);
        for (Integer i=0; i<valueof(n); i=i+1)
            routFIFO[i].enq(shiftedRout[i]);
        if (counterR+1 == fromInteger(valueof(nTyp))) counterR <= 0;
        else counterR <= counterR+1;
    endrule
    rule putInput; //implicit guard: either xinFIFO has new input
        //and counter is 0 (machine = idle).. or qFIFO has now the output
        //generated by unitrow and counter > 0 (in progress)
        for (Integer i=0; i<valueof(n); i=i+1) begin
            if (counter == 0) begin
                let xins = xinFIFO[i].first();
                ur.xin[i].put(xins);
                xinFIFO[i].deq();
            end else begin
                if (i != valueof(TSub#(n,1))) begin
                    let xins = qFIFO[i].first();
                    qFIFO[i].deq();
                    ur.xin[i].put(xins);
                end else begin
                    Vector#(nTyp,tnum) xins = replicate(0);
                    ur.xin[i].put(xins);
                end
            end
        end
        if (counter+1 == fromInteger(valueof(n)))
            counter <= 0;
        else counter <= counter+1;
    endrule
    xinPut = map(fifoToPut, xinFIFO);
    routGet = map(fifoToGet, routFIFO);
    interface Put rowin = xinPut;
    interface Get rowout = routGet;
endmodule
53  Multiplier.bsv
/*
Author: Richard Uhler ruhler@mit.edu
Revised by: Sunila Saqib saqib@mit.edu
*/
typedef Server#(Tuple2#(tnum, tnum), tnum) Multiplier#(type tnum);
module [m] mkDePipelinedMultiplier(m#(
        PipelinedMultiplier#(stages, tnum)) mkmul, Multiplier#(tnum) ifc)
    provisos(IsModule#(m, m__));
    PipelinedMultiplier#(stages, tnum) m <- mkmul;
    interface Put request = m.request;
    interface Get response = m.response;
endmodule
module [m] mkPipelinedMultiplierFixedPoint(
        m#(Multiplier#(Bit#(bLen))) mkmul,
        Multiplier#(FixedPoint#(is, fs)) ifc)
    provisos(IsModule#(m, m__), Add#(a__, TAdd#(is, fs), bLen));
    Multiplier#(Bit#(bLen)) mul <- mkmul;
    FIFO#(Bool) negative <- mkSizedFIFO(4);
    interface Put request;
        method Action put(Tuple2#(FixedPoint#(is, fs),
                FixedPoint#(is, fs)) x);
            match {.x0, .x1} = x;
            let s_x = fxptGetInt(x0) < 0;
            let s_y = fxptGetInt(x1) < 0;
            let a = s_x ? -x0 : x0;
            let b = s_y ? -x1 : x1;
            mul.request.put(tuple2(zeroExtend(pack(a)),
                zeroExtend(pack(b))));
            negative.enq((s_x && !s_y) || (!s_x && s_y));
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(FixedPoint#(is, fs)) get();
            Bit#(bLen) bits <- mul.response.get();
            FixedPoint#(is, fs) rv = unpack(bits[(2*valueof(fs)
                +valueof(is)-1):valueof(fs)]);
            Bool neg <- toGet(negative).get();
            return (neg ? -rv : rv);
        endmethod
    endinterface
endmodule
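The fixed-point wrapper above builds a signed multiply out of an unsigned core: it strips the operand signs, multiplies the magnitudes, drops the extra fs fraction bits from the double-width product, and reapplies the sign (XOR of the input signs, carried alongside the pipeline in the `negative` FIFO). A Python sketch of the arithmetic, treating operands as integers scaled by 2**fs:

```python
def fxpt_mul(x0, x1, fs):
    """Signed fixed-point multiply on an unsigned multiplier.
    x0, x1 are integers scaled by 2**fs; the magnitude product is
    rescaled by dropping fs fraction bits, and the sign of the
    result is the XOR of the input signs."""
    neg = (x0 < 0) != (x1 < 0)   # tracked in the 'negative' FIFO
    a, b = abs(x0), abs(x1)      # magnitudes into the unsigned core
    prod = (a * b) >> fs         # keep bits [2*fs+is-1 : fs]
    return -prod if neg else prod
```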
// A Multiplexed Multiplier.
module [m] mkMultiplexedMultiplier(m#(Multiplier#(tnum)) mkmul,
        Multiplier#(Vector#(n, tnum)) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
    FIFO#(Tuple2#(Vector#(n, tnum), Vector#(n, tnum))) infifo <-
        mkFIFO1();
    match {.out_g, .out_p} <- mkGPFIFO();
    Reg#(Vector#(n, tnum)) pending <- mkRegU();
    Multiplier#(tnum) multiplier <- mkmul;
    Reg#(Bit#(TAdd#(1, TLog#(n)))) inloc <- mkReg(0);
    rule domultiply (True);
        let a = tpl_1(infifo.first)[inloc];
        let b = tpl_2(infifo.first)[inloc];
        multiplier.request.put(tuple2(a, b));
        if (inloc + 1 == fromInteger(valueof(n))) begin
            infifo.deq();
            inloc <= 0;
        end else
            inloc <= inloc + 1;
    endrule
    Reg#(Bit#(TAdd#(1, TLog#(n)))) outloc <- mkReg(0);
    rule getresult (True);
        let res <- multiplier.response.get();
        let npending = pending;
        npending[outloc] = res;
        if (outloc + 1 == fromInteger(valueof(n))) begin
            outloc <= 0;
            out_p.put(npending);
        end else begin
            outloc <= outloc + 1;
            pending <= npending;
        end
    endrule
    interface Put request = toPut(infifo);
    interface Get response = out_g;
endmodule
module [m] mkComplexMultiplier3(m#(Multiplier#(tnum)) mkmul,
        Multiplier#(Complex#(tnum)) ifc)
    provisos(Arith#(tnum), Bits#(tnum, tnum__), IsModule#(m, m__));
    Vector#(3, Multiplier#(tnum)) multiplier <- replicateM(mkmul);
    interface Put request;
        method Action put(Tuple2#(Complex#(tnum),
                Complex#(tnum)) x);
            match {.a, .b} = x;
            Vector#(4, tnum) as = ?;
            as[0] = a.rel - a.img;
            as[1] = b.rel - b.img;
            as[2] = b.rel + b.img;
            Vector#(4, tnum) bs = ?;
            bs[0] = b.img;
            bs[1] = a.rel;
            bs[2] = a.img;
            for (Integer i = 0; i < 3; i = i+1)
                multiplier[i].request.put(tuple2(as[i], bs[i]));
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Complex#(tnum)) get();
            Vector#(3, tnum) z = replicate(0);
            for (Integer i = 0; i < 3; i = i+1)
                z[i] <- multiplier[i].response.get();
            return cmplx(z[0] + z[1], z[0] + z[2]);
        endmethod
    endinterface
endmodule
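mkComplexMultiplier3 uses the classic three-multiplication identity for a complex product: with z0 = (ar - ai)·bi, z1 = (br - bi)·ar and z2 = (br + bi)·ai, the result is (z0 + z1) + j(z0 + z2), saving one hardware multiplier over the schoolbook four at the cost of extra additions. A Python check of that identity, mirroring the operand assignment above:

```python
def cmul3(a, b):
    """Complex multiply with three real multiplications, matching
    the operand wiring in mkComplexMultiplier3."""
    z0 = (a.real - a.imag) * b.imag
    z1 = (b.real - b.imag) * a.real
    z2 = (b.real + b.imag) * a.imag
    return complex(z0 + z1, z0 + z2)
```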
54  PipelinedMultiplier.bsv
/*
Author: Richard Uhler ruhler@mit.edu
Revised by: Sunila Saqib saqib@mit.edu
*/
interface PipelinedMultiplier#(numeric type stages, type tnum);
    interface Put#(Tuple2#(tnum, tnum)) request;
    interface Get#(tnum) response;
endinterface
// Implementation of an unsigned multiplier which should be
// inferred as pipelined by the xilinx synthesis tools.
// Methods are not guarded. The response to a request will be
// available exactly stages cycles after the request is made,
// and will only be available for that one cycle.
// If you don't get the response on time, it will be dropped.
module mkPipelinedMultiplierUG(PipelinedMultiplier#(stages, Bit#(t)))
    provisos (Add#(sm1, 1, stages));
    Reg#(Bit#(t)) a <- mkRegU();
    Reg#(Bit#(t)) b <- mkRegU();
    Vector#(sm1, Reg#(Bit#(t))) shiftregs <-
        replicateM(mkRegU());
    (* fire_when_enabled *)
    (* no_implicit_conditions *)
    rule multiplyandshift (True);
        shiftregs[0] <= a * b;
        for (Integer i = 1; i < valueof(sm1); i = i+1) begin
            shiftregs[i] <= shiftregs[i-1];
        end
    endrule
    interface Put request;
        method Action put(Tuple2#(Bit#(t), Bit#(t)) operands);
            a <= tpl_1(operands);
            b <= tpl_2(operands);
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Bit#(t)) get();
            return shiftregs[valueof(sm1)-1];
        endmethod
    endinterface
endmodule
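The timing contract described in the comments (result visible exactly `stages` cycles after a request) follows from the structure: one cycle to register the operands, then the product travels down a chain of stages-1 registers. A cycle-level Python model of this (illustrative, not from the source):

```python
class PipelinedMultiplierUG:
    """Cycle model: operand registers feed a product into a chain of
    stages-1 shift registers, so a result appears at the output
    exactly `stages` tick() calls after the corresponding put()."""
    def __init__(self, stages):
        self.next_ab = (0, 0)
        self.ab = (0, 0)
        self.shift = [0] * (stages - 1)
    def put(self, a, b):
        self.next_ab = (a, b)      # request presented this cycle
    def tick(self):                # one clock edge
        self.shift = [self.ab[0] * self.ab[1]] + self.shift[:-1]
        self.ab = self.next_ab
    def get(self):
        return self.shift[-1]
```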
// Provide a semi-safe, semi-guarded interface to an unguarded
// multiplier.
// The get method is not enabled until a value is available to
// be gotten, but you must take the result as soon as it is
// available, otherwise it will be lost.
// To use this safely, ensure the rule which calls response.get
// fires when enabled (use the fire_when_enabled attribute),
// has no explicit condition, and has only response.get.ready
// as the implicit conditions.
// The unguarded multiplier module should be a module with
// (* synthesize *).
module [m] mkPipelinedMultiplierSG(m#(PipelinedMultiplier#(stages, tnum))
        mkmul, PipelinedMultiplier#(stages, tnum) ifc)
    provisos(IsModule#(m, a__));
    PipelinedMultiplier#(stages, tnum) mul <- mkmul();
    PulseWire incoming <- mkPulseWire();
    Vector#(stages, Reg#(Bool)) valids <-
        replicateM(mkReg(False));
    (* fire_when_enabled *)
    (* no_implicit_conditions *)
    rule shift (True);
        valids[0] <= incoming;
        for (Integer i = 1; i < valueof(stages); i = i+1) begin
            valids[i] <= valids[i-1];
        end
    endrule
    interface Put request;
        method Action put(Tuple2#(tnum, tnum) operands);
            mul.request.put(operands);
            incoming.send();
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(tnum) get() if
                (valids[valueof(stages)-1]);
            tnum x <- mul.response.get();
            return x;
        endmethod
    endinterface
endmodule
// Provide a safe, guarded interface to an unguarded multiplier.
// The unguarded multiplier module should be a module with
// (* synthesize *).
module [m] mkPipelinedMultiplierG(
        m#(PipelinedMultiplier#(stages, tnum))
        mkmul, PipelinedMultiplier#(stages, tnum) ifc)
    provisos(Bits#(tnum, tnum_sz), IsModule#(m, b__));
    PipelinedMultiplier#(stages, tnum) mul <-
        mkPipelinedMultiplierSG(mkmul);
    FIFOF#(tnum) results <- mkGSizedFIFOF(True, False,
        valueof(stages) + 1);
    Counter#(TAdd#(1, TLog#(stages))) pending <- mkCounter(0);
    (* fire_when_enabled *)
    rule takeresult (True);
        tnum x <- mul.response.get();
        results.enq(x);
    endrule
    interface Put request;
        method Action put(Tuple2#(tnum, tnum) operands) if
                (pending.value() < fromInteger(valueof(stages)));
            pending.up();
            mul.request.put(operands);
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(tnum) get();
            pending.down();
            results.deq();
            return results.first();
        endmethod
    endinterface
endmodule
55  GR specific ComplexFixedPointRotation.bsv
/*
Author: Richard Uhler ruhler@mit.edu
*/
// This is a naive implementation. It has the following stages:
//1. a = r.r*r.r + x.r*x.r + x.i*x.i where r.i is assumed to be 0
//2. r' = sqrt(a) using a fixed point square root.
//3. c = r.r/r', s.r = x.r/r', s.i = x.i/r' using fixed point divides
module [m] mkComplexFixedPointRotation
        (m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
        Rotate#(Complex#(FixedPoint#(is, fs))) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is,
        TMul#(2, fs))), Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, m__));
    match {.in_g, .in_p} <- mkGPFIFO();
    match {.csout_g, .csout_p} <- mkGPFIFO();
    Multiplier#(Vector#(3, FixedPoint#(is, fs))) mul <-
        mkMultiplexedMultiplier(mkmul);
    SquareRoot#(FixedPoint#(is, fs)) sqrt <-
        mkFixedPointSquareRoot(1);
    Divider#(FixedPoint#(is, fs)) dr <- mkFixedPointDivider(2);
    Divider#(FixedPoint#(is, fs)) dxr <- mkFixedPointDivider(2);
    Divider#(FixedPoint#(is, fs)) dxi <- mkFixedPointDivider(2);
    match {.xr_g, .xr_p} <- mkGPFIFO();
    GetPut#(Complex#(FixedPoint#(is, fs))) rout_gp <- mkGPFIFO();
    match {.rout_g, .rout_p} = rout_gp;
    rule domultiply (True);
        RotateInput#(Complex#(FixedPoint#(is, fs))) i <- in_g.get();
        xr_p.put(i);
        Vector#(3, FixedPoint#(is, fs)) as = ?;
        as[0] = i.r.rel;
        as[1] = i.x.rel;
        as[2] = i.x.img;
        Vector#(3, FixedPoint#(is, fs)) bs = ?;
        bs[0] = i.r.rel;
        bs[1] = i.x.rel;
        bs[2] = i.x.img;
        mul.request.put(tuple2(as, bs));
    endrule
    rule dosqrt (True);
        let z <- mul.response.get();
        sqrt.request.put(z[0] + z[1] + z[2]);
    endrule
    rule dodivide (True);
        match {.nr, .*} <- sqrt.response.get();
        let i <- xr_g.get();
        dr.request.put(tuple2(i.r.rel, nr));
        dxr.request.put(tuple2(i.x.rel, nr));
        dxi.request.put(tuple2(i.x.img, nr));
        rout_p.put(cmplx(nr, 0));
    endrule
    rule dofinalize (True);
        match {.cout, .*} <- dr.response.get();
        match {.srout, .*} <- dxr.response.get();
        match {.siout, .*} <- dxi.response.get();
        csout_p.put(RotationCS {c: cmplx(cout, 0),
            s: cmplx(srout, siout)});
    endrule
    interface Put request = in_p;
    interface Get rout = rout_g;
    interface Get csout = csout_g;
endmodule
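The three stages listed in the header comment can be checked numerically. For a real diagonal value r and a complex element x, the unit produces c = r/r' and s = x/r' with r' = sqrt(r² + |x|²); these satisfy c² + |s|² = 1 and rotate (r, x) to (r', 0). A Python sketch (function name illustrative):

```python
import math

def givens(r, x):
    """Rotation stages from mkComplexFixedPointRotation:
    r is the real diagonal value, x the complex element to annihilate."""
    a = r*r + x.real*x.real + x.imag*x.imag   # stage 1: multiplexed multiplies
    r_new = math.sqrt(a)                      # stage 2: fixed point sqrt
    c = r / r_new                             # stage 3: three divides
    s = complex(x.real / r_new, x.imag / r_new)
    return c, s, r_new
```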
56  GR Systolic specific StreamQR.bsv
/*
Author: Richard Uhler ruhler@mit.edu
*/
// Stream QR
// An implementation of QR with an interface where each element
// of the matrix is input one at a time, from first row
// to last, and within each row from first column to last.
// Every input belonging to the last row of a matrix should be
// annotated as such.
interface StreamQR#(numeric type width, type tnum);
    interface Put#(Terminating#(tnum)) xin;
    interface Get#(tnum) rout;
endinterface
// Cross Clock Domain StreamQR
//   sqr  - stream qr in the source clock domain
//   sclk - the source clock
//   srst - the source reset
// Implements a StreamQR in the current clock domain.
module mkSyncStreamQR(StreamQR#(w, t) sqr, Clock sclk, Reset srst,
        StreamQR#(w, t) ifc)
    provisos(Bits#(t, t_sz));
    Clock myclk <- exposeCurrentClock();
    Reset myrst <- exposeCurrentReset();
    SyncFIFOIfc#(Terminating#(t)) xi <- mkSyncFIFO(2, myclk,
        myrst, sclk);
    SyncFIFOIfc#(t) ro <- mkSyncFIFO(2, sclk, srst, myclk);
    mkConnection(toGet(xi), sqr.xin);
    mkConnection(sqr.rout, toPut(ro));
    interface Put xin = toPut(xi);
    interface Get rout = toGet(ro);
endmodule
57  Divider.bsv
// The MIT License
// Copyright (c) 2010 Massachusetts Institute of Technology
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to
// whom the Software is furnished to do so, subject to the
// following conditions:
// The above copyright notice and this permission notice shall
// be included in all copies or substantial portions of the
// Software.
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
// KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
// WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
// PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
// OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
// OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
// OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
// Author: Richard Uhler ruhler@mit.edu
typedef Server#(
    Tuple2#(word, word),
    Tuple2#(word, word)
) Divider#(type word);
// Unsigned division
// Input: n, d
// Output: q, r where n = d*q + r.
// This implementation uses the non-restoring algorithm (or so
// I'm told). itersPerCycle is an integer specifying how many iterations
// of the algorithm should be performed each clock cycle.
// This number should divide the bit width evenly.
module mkNonRestoringDivider(Integer itersPerCycle,
        Divider#(Bit#(ws)) ifc)
    provisos(Add#(a__, 1, ws));
    if (valueof(ws) % itersPerCycle != 0) begin
        error("itersPerCycle must evenly divide operand bit width");
    end
    FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) incoming <- mkFIFO();
    FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) outgoing <- mkFIFO();
    Reg#(Bool) busy <- mkReg(False);
    Reg#(Bit#(ws)) xReg <- mkRegU();
    Reg#(Bit#(ws)) dReg <- mkRegU();
    Reg#(Bit#(ws)) pReg <- mkRegU();
    Reg#(Bit#(TAdd#(1, TLog#(ws)))) iReg <- mkRegU();
    rule start (!busy);
        match {.ix, .id} <- toGet(incoming).get();
        busy <= True;
        dReg <= id;
        xReg <= ix;
        pReg <= 0;
        iReg <= 0;
    endrule
    // Return the most significant bit of a bit vector
    function Bit#(1) top(Bit#(n) x) = x[valueof(n)-1];
    // Return all but the most significant bit of a bit vector
    function Bit#(TSub#(n,1)) rest(Bit#(n) x) = x[valueof(n)-2:0];
    // Returns new x, p after a single iteration of the
    // division algorithm.
    function Tuple2#(Bit#(ws), Bit#(ws)) iterate(Bit#(ws) x,
            Bit#(ws) p, Bit#(ws) d);
        if (top(p) == 1) begin
            p = ((p << 1) | zeroExtend(top(x))) + d;
        end else begin
            p = ((p << 1) | zeroExtend(top(x))) - d;
        end
        x = (x << 1) | zeroExtend(~top(p));
        return tuple2(x, p);
    endfunction
    rule doiterate (busy);
        Bit#(ws) x = xReg;
        Bit#(ws) p = pReg;
        for (Integer i = 0; i < itersPerCycle; i = i+1) begin
            let iout = iterate(x, p, dReg);
            x = tpl_1(iout);
            p = tpl_2(iout);
        end
        if (iReg + fromInteger(itersPerCycle) ==
                fromInteger(valueof(ws))) begin
            if (top(p) == 1) p = p + dReg;
            outgoing.enq(tuple2(x, p));
            busy <= False;
        end else begin
            xReg <= x;
            pReg <= p;
            iReg <= iReg + fromInteger(itersPerCycle);
        end
    endrule
    interface Put request = toPut(incoming);
    interface Get response = toGet(outgoing);
endmodule
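The per-iteration step above (add the divisor when the partial remainder is negative, subtract it otherwise, and shift a quotient bit into x) can be modeled bit-for-bit in Python. A sketch of one full ws-bit division, with the partial remainder kept in ws-bit two's complement:

```python
def nonrestoring_divide(n, d, ws):
    """Unsigned non-restoring division on ws-bit operands,
    following the iterate() step: returns (q, r) with n == d*q + r."""
    mask = (1 << ws) - 1
    x, p = n & mask, 0
    for _ in range(ws):
        top_x = (x >> (ws - 1)) & 1
        if (p >> (ws - 1)) & 1:                  # p negative: add divisor
            p = (((p << 1) | top_x) + d) & mask
        else:                                    # p non-negative: subtract
            p = (((p << 1) | top_x) - d) & mask
        top_p = (p >> (ws - 1)) & 1
        x = ((x << 1) | (1 - top_p)) & mask      # shift in quotient bit
    if (p >> (ws - 1)) & 1:                      # final remainder correction
        p = (p + d) & mask
    return x, p
```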
// Fixed Point divider
module HFixedPointDivider(Integer itersPerCycle,
c, Divider#(FixedPoint#(iw, fw)) ifc)
216
provisos(
Add#(a__, TAdd#(iw, fw), TAdd#(iw, TMul#(2, fw))),
Add#(b__, 1, TAdd#(iw, TMul#(2, fw)))
// We use the integer division algorithm to do fixed point
// division. If you pack a fixed point number x,
/7 you get a new number which is x * 2-fw
// If you unpack a number y into a fixed point number, you
// get a fixed point number which is y / 2^fw.
// Let the inputs be two fixed point numbers a, b. We want
// to generate the fixed point number a / b.
// We unpack both numbers, multiply the numerator by 2^fw,
// perform integer division and get the integer:
7/ (a * 2^fw) * 2^fw / (b & 2-fw) = (a/b) * 2^fw
// Simply unpack and we get our fixed point a/b.
Divider#(Bit#(TAdd#(iw, TMul#(2, fw)))) div <-
    mkNonRestoringDivider(itersPerCycle);
FIFO#(Bool) negate <- mkFIFO();
interface Put request;
method Action put(Tuple2#(FixedPoint#(iw, fw),
FixedPoint#(iw, fw)) x);
match {.a, .b} = x;
Bool neg = False;
if (pack(fxptGetInt(a))[valueof(iw)-1] == 1'b1) begin
    neg = !neg;
    a = -a;
end
if (pack(fxptGetInt(b))[valueof(iw)-1] == 1'b1) begin
    neg = !neg;
    b = -b;
end
negate.enq(neg);
Bit#(TAdd#(iw, fw)) aa = pack(a);
Bit#(TAdd#(iw, TMul#(2, fw))) aaa = zeroExtend(aa);
aaa = aaa << valueof(fw);
Bit#(TAdd#(iw, fw)) bb = pack(b);
Bit#(TAdd#(iw, TMul#(2, fw))) bbb = zeroExtend(bb);
div.request.put(tuple2(aaa, bbb));
endmethod
endinterface
interface Get response;
method ActionValue#(Tuple2#(FixedPoint#(iw, fw),
FixedPoint#(iw, fw))) get();
match {.q, .r} <- div.response.get();
Bit#(TAdd#(iw, fw)) qb = q[valueof(iw) +
valueof(fw)-1:0];
FixedPoint#(iw, fw) qf = unpack(qb);
if (negate.first())
qf = -qf;
negate.deq();
return tuple2(qf, ?);
endmethod
endinterface
endmodule
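The scaling trick described in the comments of the module above can be modeled directly in software. This Python sketch (illustrative only; `FW` and the function name are my own, and floats stand in for the packed/unpacked conversions) shows why shifting the numerator by fw bits before an integer division yields the fixed-point quotient:

```python
FW = 8  # number of fractional bits (illustrative choice)

def fixed_div(a, b, fw=FW):
    """Fixed-point division via integer division, as in the comments:
    raw values are x * 2^fw, so (a_raw << fw) // b_raw == (a/b) * 2^fw."""
    a_raw = round(a * (1 << fw))          # 'pack': a * 2^fw
    b_raw = round(b * (1 << fw))          # 'pack': b * 2^fw
    # sign handling mirrors the module: divide magnitudes, negate at the end
    neg = (a_raw < 0) != (b_raw < 0)
    q_raw = (abs(a_raw) << fw) // abs(b_raw)
    if neg:
        q_raw = -q_raw
    return q_raw / (1 << fw)              # 'unpack': q_raw / 2^fw
```

For example, `fixed_div(3.0, 2.0)` gives exactly `1.5`, while `fixed_div(1.0, 3.0)` gives the nearest representable value below 1/3 (truncation toward zero, as in the integer divider).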
SquareRoot.bsv
// The MIT License
// Copyright (c) 2010 Massachusetts Institute of Technology
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to
// whom the Software is furnished to do so, subject to the
// following conditions:
// The above copyright notice and this permission notice shall
// be included in all copies or substantial portions of the
// Software.
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
// KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
// WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
// PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
// OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
// OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
// OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
// Author: Richard Uhler ruhler@mit.edu
typedef Server#(
word,
Tuple2#(word, word)
) SquareRoot#(type word);
// Square root
// Input: x
// Output: q, r where x = q*q + r.
// Algorithm developed based on description at
// http://www.itl.nist.gov/div897/sqg/dads/HTML/squareRoot.html
// itersPerCycle specifies how many iterations of the algorithm
// to perform each clock cycle. This should divide evenly half
// the bitwidth of the operand.
module mkSquareRoot(Integer itersPerCycle,
        SquareRoot#(Bit#(ws)) ifc);
if (valueof(ws) % (2 * itersPerCycle) != 0) begin
error("itersPerCycle must evenly divide half the bitwidth");
end
FIFO#(Bit#(ws)) incoming <- mkFIFO();
FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) outgoing <- mkFIFO();
Reg#(Bool) busy <- mkReg(False);
Reg#(Bit#(ws)) qReg <- mkRegU();
Reg#(Bit#(ws)) rReg <- mkRegU();
Reg#(Bit#(ws)) xReg <- mkRegU();
Reg#(Bit#(TLog#(ws))) iReg <- mkRegU();
rule start (!busy);
    let xi <- toGet(incoming).get();
busy <= True;
xReg <= xi;
rReg <= 0;
qReg <= 0;
iReg <= 0;
endrule
// Get the top 2 bits of a bit vector
function Bit#(2) top2(Bit#(n) a) =
a[valueof(n)-1:valueof(n)-2];
// Perform a single iteration of the square root algorithm.
// Returns new r,q,x
function Tuple3#(Bit#(ws), Bit#(ws), Bit#(ws))
iterate(Bit#(ws) r, Bit#(ws) q, Bit#(ws) x);
Bit#(TAdd#(2, ws)) d = {q, 2'b01};
Bit#(TAdd#(2, ws)) r2 = {r, top2(x)};
if (d > r2) begin
q = q << 1;
    r = r2[valueof(ws)-1:0];
end else begin
    q = (q << 1) | 1;
Bit#(TAdd#(2, ws)) diff = r2 - d;
r = diff[valueof(ws)-1:0];
end
return tuple3(r, q, x << 2);
endfunction
rule doiterate (busy);
Bit#(ws) r = rReg;
Bit#(ws) q = qReg;
Bit#(ws) x = xReg;
for (Integer i = 0; i < itersPerCycle; i = i+1) begin
    let iout = iterate(r, q, x);
    r = tpl_1(iout);
    q = tpl_2(iout);
    x = tpl_3(iout);
end
if (iReg + fromInteger(itersPerCycle) ==
        fromInteger(valueof(ws)/2)) begin
outgoing.enq(tuple2(q, r));
busy <= False;
end else begin
    iReg <= iReg + fromInteger(itersPerCycle);
    rReg <= r;
    qReg <= q;
    xReg <= x;
end
endrule
interface Put request = toPut(incoming);
interface Get response = toGet(outgoing);
endmodule
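The digit-by-digit iteration in `mkSquareRoot` consumes two bits of the operand per step. The following Python sketch (illustrative only, not from the thesis; the function name is my own) mirrors the `iterate` function and produces the same (q, r) pair with x = q*q + r:

```python
def int_sqrt(x, ws):
    """Digit-by-digit integer square root of a ws-bit x, modeling the
    BSV 'iterate' function. Returns (q, r) with x == q*q + r."""
    mask = (1 << ws) - 1
    q = r = 0
    for _ in range(ws // 2):                # two operand bits per step
        d = (q << 2) | 0b01                 # d  = {q, 2'b01}
        r2 = (r << 2) | ((x >> (ws - 2)) & 0b11)  # r2 = {r, top2(x)}
        if d > r2:
            q = q << 1                      # trial subtraction fails
            r = r2                          # (ws-bit truncation is a no-op
        else:                               #  here: Python ints are unbounded
            q = (q << 1) | 1                #  and r stays small)
            r = r2 - d
        x = (x << 2) & mask                 # shift the next two bits up
    return q, r
```

For example, `int_sqrt(16, 8)` returns `(4, 0)` and `int_sqrt(10, 8)` returns `(3, 1)`, since 10 = 3*3 + 1.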
// Fixed Point square root
module mkFixedPointSquareRoot(Integer itersPerCycle,
        SquareRoot#(FixedPoint#(iw, fw)) ifc)
    provisos(
        Add#(a__, TAdd#(iw, fw), TAdd#(iw, TMul#(2, fw)))
    );
// We use the integer square root algorithm to do fixed
// point square root. If you pack a fixed point number x,
// you get a new number which is x * 2^fw.
// If you unpack a number y into a fixed point number, you
// get a fixed point number which is y / 2^fw.
// Let the input be a fixed point number x. We want to
// generate the fixed point number sqrt(x).
// We pack x, multiply by 2^fw, perform integer square
// root, and get the integer:
// sqrt((x * 2^fw) * 2^fw) = sqrt(x) * 2^fw
// Simply unpack and we get our fixed point sqrt(x).
SquareRoot#(Bit#(TAdd#(iw, TMul#(2, fw)))) sqrt <-
    mkSquareRoot(itersPerCycle);
interface Put request;
method Action put(FixedPoint#(iw, fw) x);
Bit#(TAdd#(iw, fw)) xx = pack(x);
Bit#(TAdd#(iw, TMul#(2, fw))) xxx = zeroExtend(xx);
xxx = xxx << valueof(fw);
sqrt.request.put(xxx);
endmethod
endinterface
interface Get response;
method ActionValue#(Tuple2#(FixedPoint#(iw, fw),
FixedPoint#(iw, fw))) get();
match {.q, .r} <- sqrt.response.get();
Bit#(TAdd#(iw, fw)) qb = q[valueof(iw) +
valueof(fw)-1:0];
FixedPoint#(iw, fw) qf = unpack(qb);
return tuple2(qf, ?);
endmethod
endinterface
endmodule
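As with the divider, the fixed-point square root reduces to a shift plus an integer operation. This Python sketch (illustrative only; `math.isqrt` stands in for the integer square root hardware, and the function name is my own) models the scaling described in the comments above:

```python
import math

def fixed_sqrt(x, fw=8):
    """Fixed-point square root via integer square root, per the comments:
    isqrt((x * 2^fw) << fw) == isqrt(x * 2^(2*fw)) == sqrt(x) * 2^fw."""
    x_raw = round(x * (1 << fw))        # 'pack': x * 2^fw
    q_raw = math.isqrt(x_raw << fw)     # integer sqrt of x * 2^(2*fw)
    return q_raw / (1 << fw)            # 'unpack': q_raw / 2^fw
```

For example, `fixed_sqrt(4.0)` gives exactly `2.0`, and `fixed_sqrt(2.0)` gives sqrt(2) to within one unit in the last fractional place (the integer square root rounds down, like the hardware).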