Smart IP of QR Decomposition for Rapid Prototyping on FPGAs

by

Sunila Saqib

Submitted to the Department of Electrical Engineering and Computer Science on January 09, 2014, in partial fulfillment of the requirements for the degree of Master of Science in Engineering at the Massachusetts Institute of Technology, February 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 09, 2014

Certified by: Jacob White, Professor, Dept. of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Theses

Abstract

The Digital Signal Processing (DSP) systems used in mobile wireless communication, such as MIMO detection, beam formation in smart antennas, and compressed sensing, all rely on quickly solving linear systems of equations. These applications of DSP have vastly different throughput, latency, and area requirements, necessitating substantially different hardware solutions. The QR decomposition (QRD) method is an efficient way of solving linear equation systems using specialized hardware, and is known to be numerically stable [17]. We present the design and FPGA implementation of smart IP (intellectual property) for QRD based on the Givens-Rotation (GR) and Modified-Gram-Schmidt (MGS) algorithms. Our configurable designs are flexible enough to meet a wide variety of application requirements. We demonstrate that our area and timing results are comparable, and in some cases superior, to state-of-the-art hardware-based QRD implementations.
Our QRD design based on a Log-domain GR Systolic array achieved a throughput of 10.1M rows/sec for a complex-valued 3x3 matrix on a Virtex-6 FPGA, whereas our QRD design based on a Log-domain GR Linear array was found to be an area-optimized solution requiring the fewest FPGA slices. Overall, the Log-domain GR Systolic array implementation was found to be the most resource-efficient design (IP for all of our proposed architectures has been prepared and is available at http://saqib.scripts.mit.edu/qr-code.php). Our set of IP can be configured to satisfy a variety of application demands, and can be used to generate hardware designs with nearly zero design and debugging time. Moreover, the reported results can be used to pick the optimal design choice for a given set of design requirements. Since our architectures are completely modular, their sub-units can be independently optimized and tested without the need to re-test the whole system.

Thesis Supervisor: Jacob White
Title: Professor, Dept. of Electrical Engineering and Computer Science

To my family, for all the love, support, and the many sacrifices made.

Acknowledgments

First of all I wish to offer my gratitude to Almighty God for all the blessings bestowed upon me. I would like to express my sincerest gratitude to my adviser, Dr. Jacob White, for his patience, timely guidance, and insightful critique. Without his guidance this project could not have been a success. My deepest appreciation and thanks are extended to Professor Leslie Kolodziejski and Professor Terry Orlando for their continuous support, encouragement, and valuable suggestions. I am also grateful to Janet Fischer, Lisa Bella, and the EECS graduate office staff for assistance with various procedures at MIT. I am especially grateful to Nirav Dave and Richard Uhler for their encouragement, guidance, and professional contributions. It would not have been possible for me to complete this project without the motivation from my family.
I am grateful for their constant support and tireless optimism. It is impossible to make a note of all those whose inspiration has been vital in the completion of this thesis; I am thankful to all of them.

Contents

1 Introduction
  1.1 Problem Statement and Motivation
  1.2 Intro.
  1.3 Thesis Structure

2 Application of Linear Equation Systems in Wireless Networks
  2.1 MIMO Detection
  2.2 Beam Formation
  2.3 Compressed Sensing

3 Selection of Algorithm for Solving Linear Equation System
  3.1 Linear Equation System
  3.2 LU Decomposition
    3.2.1 Cholesky Decomposition
    3.2.2 Doolittle LU Decomposition
    3.2.3 Crout LU Decomposition
  3.3 QR Decomposition
    3.3.1 Givens-Rotation Based QRD
    3.3.2 Modified Gram Schmidt Based QRD
    3.3.3 Householder Based QRD
  3.4 Comparison and Selection

4 Implementation Challenges
  4.1 Computation Complexity and Scalability
  4.2 Modularity
  4.3 Latency Insensitive

5 Proposed Parameterized Architecture
  5.1 GR Based QRD Systolic Array
    5.1.1 Data type
    5.1.2 Multiplier
    5.1.3 Storage Space
    5.1.4 Control Circuitry
    5.1.5 Implementation type
    5.1.6 Reuse within design
  5.2 GR Based QRD Linear Array
    5.2.1 Reuse across design
    5.2.2 Storage Space
    5.2.3 Control Circuitry
  5.3 MGS Based QRD Systolic Array
    5.3.1 Reuse across algorithm
    5.3.2 Vector operations
    5.3.3 Control circuitry
    5.3.4 Storage Space
  5.4 MGS Based QRD Linear Array
    5.4.1 Reuse across design and algorithm
    5.4.2 Control circuitry
    5.4.3 Storage Space

6 Results and Discussion
  6.1 Experiment Conditions and Setup
    6.1.1 Configuration Parameters
    6.1.2 Experiment Design
  6.2 Performance on FPGA
    6.2.1 Linear versus Systolic Arrays
      6.2.1.1 GR based Arrays
      6.2.1.2 MGS based Arrays
    6.2.2 GR versus MGS
    6.2.3 Comparison between Computational Unit Implementation Techniques
      6.2.3.1 Linear Arrays
      6.2.3.2 Systolic Arrays
    6.2.4 Multiplier Implementation (Firm versus Soft)
  6.3 Omega Notation Analysis of Design Parameters with Array Size
    6.3.1 Throughput
      6.3.1.1 Systolic Arrays
      6.3.1.2 Linear Arrays
    6.3.2 Latency
      6.3.2.1 Systolic Arrays
      6.3.2.2 Linear Arrays
      6.3.2.3 Internal Unit
      6.3.2.4 Boundary Unit
      6.3.2.5 Multipliers
    6.3.3 Area
      6.3.3.1 Systolic Array
      6.3.3.2 Linear Array
      6.3.3.3 Internal Unit
      6.3.3.4 Boundary Unit
      6.3.3.5 Multipliers
  6.4 Target Oriented Optimization
    6.4.1 MIMO
      6.4.1.1 Required Specifications
      6.4.1.2 Optimized Configurations
    6.4.2 Beam Formation
      6.4.2.1 Required Specifications
      6.4.2.2 Optimized Configurations
  6.5 Comparison with Previously Reported Results
    6.5.1 MIMO
    6.5.2 Beam Formation
  6.6 Guidelines for Architecture Selection

7 Conclusions

A Tables

B Source Code

List of Figures

2-1 4 x 4 MIMO
3-1 Givens-Rotation based QRD
3-2 Modified Gram Schmidt based QRD
3-3 Householder based QRD
5-1 Systolic Array Architecture
5-2 (a) Boundary unit (b) Internal unit for Givens-Rotation method
5-3 QR(n) top module containing one row and one QR(n-1) top module
5-4 Special case of QR top module, QR(1)
5-5 Typeclass QR
5-6 Instance of QR typeclass for width equal to 1, terminating case for recursive QR architecture
5-7 Instance of QR typeclass for width greater than 1
5-8 QR instantiation
5-9 Typeclass for Conjugate operation
5-10 Instance of typeclass Conjugate for FixedPoint data type
5-11 Instance of typeclass Conjugate for Complex data type
5-12 Data type and Matrix Size Configuration in Main Module
5-13 DSP and LUT based Multiplier
5-14 Configuring Type of Multiplier in core units
5-15 (a) Boundary unit with FIFO (b) Internal unit with FIFOs at input port and internal storage
5-16 Configuring Log Domain External Unit
5-17 Linear Approximation of 1/sqrt(x)
5-18 Configuring LA based External Unit
5-19 Configuring Newton Raphson Method based External Unit
5-20 QR systolic array for 11x11 matrix
5-21 QR linear array for 11x11 matrix
5-22 (a) Indexes of values of r generated while processing 2 consecutive rows interleaved (b) QRD linear array for m x m matrix
5-23 Algorithm to generate R state machine
5-24 Algorithm to generate R state machine
5-25 MGS Systolic array
5-26 (a) Batch Adder (b) Batch Multiplier
6-1 DSP block usage in GR Linear and Systolic Arrays with all DSP based Multipliers
6-2 Slice LUT usage in GR Linear and Systolic Arrays with all DSP based Multipliers
6-3 Registers usage in GR Linear and Systolic Arrays with all DSP based Multipliers
6-4 Throughput of GR Linear and Systolic Arrays with all DSP based Multipliers
6-5 Latency of GR Linear and Systolic Arrays with all DSP based Multipliers
6-6 DSP block usage in MGS Linear and Systolic Arrays with all DSP based Multipliers
6-7 Slice LUT usage in MGS Linear and Systolic Arrays with all DSP based Multipliers
6-8 Slice Register usage in MGS Linear and Systolic Arrays with all DSP based Multipliers
6-9 Throughput of MGS Linear and Systolic Arrays with all DSP based Multipliers
6-10 Latency of MGS Linear and Systolic Arrays with all DSP based Multipliers
6-11 DSP block usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
6-12 Slice LUT usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
6-13 Slice Register usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
6-14 Throughput of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
6-15 Latency of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
6-16 DSP block usage in GR Linear Arrays with different Computational block implementations
6-17 Slice LUT usage in GR Linear Arrays with different Computational block implementations
6-18 Slice Register usage in GR Linear Arrays with different Computational block implementations
6-19 Throughput of GR Linear Arrays with different Computational block implementations
6-20 Latency of GR Linear Arrays with different Computational block implementations
6-21 DSP block usage in GR Systolic Arrays with different Computational block implementations
6-22 Slice LUT usage in GR Systolic Arrays with different Computational block implementations
6-23 Slice Register usage in GR Systolic Arrays with different Computational block implementations
6-24 Throughput of GR Systolic Arrays with different Computational block implementations
6-25 Latency of GR Systolic Arrays with different Computational block implementations
6-26 DSP block usage in GR Linear Arrays with different Multiplier implementations
6-27 Slice LUT usage in GR Linear Arrays with different Multiplier implementations
6-28 Slice Register usage in GR Linear Arrays with different Multiplier implementations
6-29 Throughput of GR Linear Arrays with different Multiplier implementations
6-30 Latency of GR Linear Arrays with different Multiplier implementations

List of Tables

6.1 Throughput observed for Systolic Arrays (LA based, with all DSP based shared Multipliers)
6.2 Throughput observed for Linear Arrays (LA based, with all DSP based shared Multipliers)
6.3 P&R results for GR and MGS based Systolic Arrays for complex valued input array of size 4x4 and word length (6.10)
6.4 P&R results for GR Linear Arrays for complex valued input array, word length (6.10)
6.5 Comparison of our study results with previously reported for MIMO
6.6 Comparison of our study results with previously reported for Beam formation
6.7 Selection of Appropriate Architecture
A.1 P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16)
A.2 P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD config. for word length (16.16)

Chapter 1

Introduction

1.1 Problem Statement and Motivation

Rapid prototyping of digital systems is a technique used to evaluate and investigate the feasibility of a technology without incurring the exorbitant development costs associated with full product development. To further reduce the cost and design time of digital system design, prototypes must make use of commodity and shared IP (intellectual property) wherever possible. FPGAs (Field Programmable Gate Arrays) enable rapid prototyping by providing a low-cost and re-configurable substrate capable of supporting high-performance designs. However, the scale and maximum advantage of rapid prototyping is limited by the availability of the required IPs.

In many digital system domains, a wide variety of applications share the same core components. For example, mobile wireless communication systems like MIMO detection, beam formation in smart antennas, and compressed sensing involve solving linear systems of equations, which can be implemented efficiently using QR matrix decomposition (QRD). Such core components are therefore natural candidates for shared IP, and can be used for rapid prototyping.

Unfortunately, every application has its own unique set of design limitations for the core components, and therefore requires a different hardware implementation. For example, a QRD in MIMO may only need to support decomposition of 4x4 matrices. On the other hand, a QRD for beam formation may need to support the decomposition of 40x40 matrices, resulting in a different resource-sharing strategy.
Modular, parameterized, "smart IP" components could enable rapid prototyping by allowing a designer to configure and adapt existing components to their application. However, it remains to be proven whether smart IP components can be developed with the flexibility to adapt to a full range of target applications while still meeting each application's design requirements.

1.2 Intro.

In this project we present the design and the FPGA implementation of a smart IP for QRD. The design is based on two algorithms, namely Givens-Rotation (GR) and Modified-Gram-Schmidt (MGS). Our IPs are flexible enough to meet a wide variety of application requirements, as they can be configured for arbitrary matrix sizes and an extensible variety of number representations, with a wide selection of pluggable sub-units. We have demonstrated our area and timing results to be comparable with, and in some cases superior to, state-of-the-art hardware-based QRD implementations. As an added advantage, the implementation can be targeted to an ASIC with minimal effort.

1.3 Thesis Structure

The manuscript is organized as follows: linear equation systems in wireless networks are discussed in Chapter 2. The selection of an algorithm for solving linear equation systems is discussed in Chapter 3, and implementation challenges are listed in Chapter 4. The parameterized architecture design and the implementation of our model are elaborated in Chapter 5. Our results and their comparison with previously published results are discussed in Chapter 6. The conclusion of this study is presented in Chapter 7.

Chapter 2

Application of Linear Equation Systems in Wireless Networks

Linear equation system solvers and matrix inversion are critically important components for performance enhancements in signal processing systems and communication networks.
2.1 MIMO Detection

MIMO technology increases data throughput and improves network performance as well as link reliability by using multiple antennas at both the transmitter and the receiver ends. The MIMO detection schemes zero forcing (ZF) and minimum mean square error (MMSE) are linear algorithms, which estimate the transmitted data symbols by multiplying the OFDM (orthogonal frequency-division multiplexing) sub-channel signals received from different antennas by the inverse of the channel characteristic matrix. These linear algorithms have proved to be efficient, with low computational complexity [11].

In a MIMO system with n receiving and m transmitting antennas, with t the number of symbol periods over which data is transmitted under flat-fading channel conditions, the channel model is represented by:

Y = AX + Z    (2.1)

where Y is the n x t matrix of received complex symbols, A is the n x m complex-valued channel characteristic matrix, X is the m x t matrix of transmitted complex symbols, and Z is the n x t matrix of additive complex-valued zero-mean white Gaussian noise. The vectors of the channel matrices are dynamically coupled, i.e., there exists a correlation between individual sub-channels. The optimal number of antennas (the dimension of the channel matrix), based on channel capacity maximization while considering the limitations due to correlation between individual sub-channels, is found to be up to 8 on each side (preferably 4 antennas on each side) [1][2][6][10][20]. A MIMO system can be configured to have n transmitters and m receivers, where n != m, resulting in a rectangular channel characteristic matrix. Figure 2-1 shows a MIMO system with 4 transmitters and 4 receivers.
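To make the linear-algebra step concrete, the recovery of the transmitted symbols via QRD can be sketched in NumPy. This is a hedged illustration only: the matrix sizes, QPSK constellation, noise level, and variable names are our assumptions, not values from this thesis, and np.linalg stands in for the hardware back-substitution units.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 4, 4, 1  # 4x4 MIMO, one symbol period (illustrative sizes)

A = rng.normal(size=(n, m)) + 1j * rng.normal(size=(n, m))   # channel matrix
x = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=(m, t))  # QPSK symbols
z = 0.01 * (rng.normal(size=(n, t)) + 1j * rng.normal(size=(n, t)))
y = A @ x + z                                                # received signal

# ZF-style estimate via QR: A = Q R, so x_hat solves R x_hat = Q^H y,
# which in hardware is a back substitution on the triangular R.
Q, R = np.linalg.qr(A)
x_hat = np.linalg.solve(R, Q.conj().T @ y)
symbols = np.sign(x_hat.real) + 1j * np.sign(x_hat.imag)     # slice to QPSK
```

The QR route avoids forming A^-1 explicitly: only a triangular solve and a matrix-vector product with Q^H are needed.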
Figure 2-1: 4 x 4 MIMO

The MIMO received signal can be decoded in the following three ways: by computing the inverse of matrix A and multiplying A^-1 by Y; by decomposing the matrix A into lower and upper triangular matrices (LU), followed by forward and backward substitution; or by decomposing matrix A into an orthogonal and an upper triangular matrix (QR), followed by matrix multiplication with the Hermitian transpose of Q (Q^H) and backward substitution.

2.2 Beam Formation

Wireless signal beam formation reduces interference between wireless networks by transmitting signals in a narrow beam directed only toward the desired destination. The narrow beam is achieved by scaling the transmitted signals such that the signals transmitted over multiple antennas interfere constructively (amplify) at the desired destination and interfere destructively (diminish) in all other directions. For a given destination, the scaling or weight matrix is calculated as w = A^-1 * v, where A is the interference and channel-noise covariance matrix and v is the direction vector. The number of antennas in the beam-formation system dictates the potential width of the beam; the larger the antenna array, the narrower the beam and the larger the size of the weight matrix.

The interference and channel-noise covariance matrix is built over multiple observations of the channel. The observations recorded over a period in which only the noise and interference are present can be stacked into the matrix:

X = [ x(n_1) ; x(n_2) ; ... ; x(n_p) ]    (2.2)

This observation matrix X can be used to set up a covariance matrix for the random variables (noise and interference):

A = scalar * X^H * X    (2.3)

The dimension of the observation matrix is m x n, where m is the number of observations made and n is the number of antennas. The number of observations can be increased or decreased, depending on the number of antennas, to obtain a square observation matrix.
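The weight computation w = A^-1 * v built from the observation matrix above can be sketched in a few lines of NumPy. The dimensions, random data, and omission of the scalar factor are our illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
m_obs, n_ant = 8, 4  # observations x antennas (illustrative sizes)

# Stacked noise-and-interference observations, one row per observation.
X = rng.normal(size=(m_obs, n_ant)) + 1j * rng.normal(size=(m_obs, n_ant))
v = rng.normal(size=n_ant) + 1j * rng.normal(size=n_ant)  # direction vector

A = X.conj().T @ X          # covariance matrix A = X^H X (scalar omitted)
w = np.linalg.solve(A, v)   # weight vector w = A^-1 v
```

In hardware, as the following text explains, A^-1 is never formed: the QRD of X yields the triangular factor R directly, and w is obtained by forward/backward substitution.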
The computational complexity of the nontrivial task of computing the inverse of A can be reduced significantly by decomposing X into simple matrices Q and R, where Q is an orthogonal matrix and R is an upper triangular matrix:

A = X^H * X = (Q * R)^H * (Q * R) = R^H * Q^H * Q * R = R^H * R    (2.4)

Thus, A^-1 can be computed using the expression A^-1 = R^-1 * (R^H)^-1, where both R^H and R are triangular matrices. The inverse of a triangular matrix is itself triangular and can be computed by forward/backward substitution. This simplification thus reduces the task of computing the inverse of A to computing the triangular matrices R and R^H, followed by forward and backward substitution and a matrix product (computing the Q matrix is not required).

2.3 Compressed Sensing

In a compressed sensing network, a sparse data signal is sampled at a rate lower than the Shannon-Nyquist sampling rate, reducing the bandwidth requirement. In the compressed sensing network, an n-dimensional input signal X is compressed to m measurements C by taking m linear projections, i.e. C = AX, where A is a matrix of size m x n and m < n. In this case the system is underdetermined, with fewer constraints than degrees of freedom, so infinitely many solutions can satisfy the set of equations. Since the signal is known to be sparse, the sparsest signal representation satisfying the system of equations can conditionally be shown to be unique, and the solution found using min ||x||_1 subject to c = Ax can be shown to produce the sparsest solution [5].

Orthogonal matching pursuit (OMP) is a fast and relatively simple algorithm for recovering the compressed signal, and is a greedy approach to finding the sparsest solution. It iteratively improves its estimate of the signal by choosing the column of the matrix that has the highest correlation with the residual (highest value of the dot product between column and residual vector) [27].
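The greedy match-project-refine loop just described can be sketched in NumPy. This is an illustrative software sketch, not the thesis hardware implementation; the matrix sizes, sparsity level, and names are our assumptions, and np.linalg.lstsq stands in for the linear-equation solver in the projection step.

```python
import numpy as np

def omp(Phi, y, n_iter):
    """Greedy OMP sketch: match, project, update residual."""
    m, n = Phi.shape
    support, r = [], y.copy()
    for _ in range(n_iter):
        j = int(np.argmax(np.abs(Phi.T @ r)))      # matching: most correlated column
        if j not in support:
            support.append(j)
        # projection: least-squares solve on the active columns
        x_s, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        r = y - Phi[:, support] @ x_s              # residual for next iteration
    x = np.zeros(n)
    x[support] = x_s
    return x

rng = np.random.default_rng(2)
Phi = rng.normal(size=(20, 40))
Phi /= np.linalg.norm(Phi, axis=0)                 # unit-norm columns
x_true = np.zeros(40)
x_true[[3, 17]] = [1.5, -2.0]                      # 2-sparse signal
y = Phi @ x_true
x_rec = omp(Phi, y, n_iter=2)
```

Each pass through the loop solves one tall least-squares problem, which is why the latency of the linear-equation solver dominates OMP performance, as the steps below detail.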
The process of decoding the sparse signal using OMP, explained in [27], involves the following three steps, summarized here for the reader's convenience:

1. Matching (vector product), which finds the correlation between every column vector phi_i of the projection matrix and the current residual. The maximum value of this vector product identifies the most correlated column:

lambda_t = argmax_i | <r_{t-1}, phi_i> |    (2.5)

2. Projection (solving a linear equation), which estimates the signal by solving a linear equation: it takes the measurements y and projects them onto the range of the active subset Phi_t of the sensing matrix:

x_t = (Phi_t^T Phi_t)^-1 Phi_t^T y    (2.6)

3. Residual computation, preparing the residual for the next iteration:

r_t = y - Phi_t x_t    (2.7)

These three steps are repeated to improve the estimate of the signal. In the projection step of each iteration, a linear equation system is solved by computing the pseudo-inverse of a tall rectangular matrix. Therefore, the performance and output quality of OMP depend strongly on the latency of the linear equation solver, and can benefit from a low-latency implementation.

Chapter 3

Selection of Algorithm for Solving Linear Equation System

3.1 Linear Equation System

A linear equation system consists of a set of m equations relating n unknowns, expressed in the form:

a_11 x_1 + a_12 x_2 + ... + a_1n x_n = c_1
a_21 x_1 + a_22 x_2 + ... + a_2n x_n = c_2
...
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = c_m    (3.1)

where the m x n coefficients a_ij (coefficients of x) and the m right-hand-side constants c_j are known, while the n unknowns x_i are to be determined. In matrix form, AX = C:

[ a_11  a_12  ...  a_1n ] [ x_1 ]   [ c_1 ]
[ a_21  a_22  ...  a_2n ] [ x_2 ] = [ c_2 ]    (3.2)
[ ...                   ] [ ... ]   [ ... ]
[ a_m1  a_m2  ...  a_mn ] [ x_n ]   [ c_m ]

If a solution exists, then matrix inversion, or matrix decomposition followed by back substitution, can be used to solve these equations. The methodologies for computing a numerical solution from a given set of equations can be categorized as direct, iterative, and relaxation methods.
Direct methods perform the decomposition with fixed throughput/latency for all inputs, whereas in iterative and relaxation methods the convergence rate and throughput depend on the nature of the data and on intelligent selection of initial conditions, reordering of columns and rows, and other user-dependent factors. Hence, the iterative and relaxation methods are not discussed here. The direct methods include:

1. Cramer's rule: This method uses Laplace expansion to compute the elements of the inverse matrix A^-1. It is computationally expensive and the slowest method for solving linear equations.

2. Elimination methods: These methods, including Gauss elimination, Jordan elimination, and LU decomposition, transform a given matrix A into a simpler matrix form suitable for back/forward substitution. Both the Gaussian elimination and the Gauss-Jordan method involve the right-hand-side matrix C in the solution. Therefore, in applications where the left-hand side remains constant between consecutive problem setups, no computation effort can be reused. In such cases, LU is more suitable.

3. Orthogonalization methods: These methods, including Givens rotation, Gram Schmidt, and Householder orthogonalization based QR decomposition, transform an input matrix into an upper triangular matrix, using orthogonalization to eliminate the elements in the lower triangle of the input matrix. These methods do not involve the matrix C, and therefore consecutive problem setups with the same matrix A can reuse the computational effort.

The elimination method in which computational effort can be reused (LU decomposition) and the orthogonalization methods (QR decomposition) are discussed in the following sections.

3.2 LU Decomposition

LU decomposition decomposes a square matrix A into a lower and an upper triangular matrix. After this decomposition, forward and backward substitution are used to solve AX = C.
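The forward and backward substitution steps that follow an LU decomposition can be sketched as below. This is an illustrative NumPy sketch with sizes and names of our choosing; the L and U factors are constructed directly rather than produced by a decomposition routine.

```python
import numpy as np

def forward_sub(L, c):
    """Solve L y = c for lower-triangular L, top row first."""
    n = len(c)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (c[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def backward_sub(U, y):
    """Solve U x = y for upper-triangular U, bottom row first."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

rng = np.random.default_rng(3)
n = 4
L = np.tril(rng.normal(size=(n, n)), -1) + np.eye(n)              # unit lower triangular
U = np.triu(rng.normal(size=(n, n)), 1) + np.diag(rng.uniform(1, 2, n))
A = L @ U
x_true = rng.normal(size=n)
c = A @ x_true
x = backward_sub(U, forward_sub(L, c))  # solve A x = c in two triangular sweeps
```

Each triangular solve costs O(n^2) operations, which is why reusing a factorization across right-hand sides is attractive.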
The decomposition process is independent of the value of the right-hand-side vector C, so linear equation systems with the same left-hand side can reuse the decomposition results. LU decomposition takes approximately 2n^3/3 floating-point operations to decompose an n x n matrix.

The decomposition of the matrix A into a lower and an upper triangular matrix is not unique, and it does not work for singular matrices or under-determined equation systems. Row or column pivoting/reordering is required in this method to avoid division by zero. Notable algorithms that achieve LU decomposition based on Gaussian elimination are the Cholesky, Doolittle, and Crout decompositions.

3.2.1 Cholesky Decomposition

Cholesky decomposition is an optimized LU decomposition for Hermitian positive definite matrices, including every real-valued symmetric positive definite matrix. It decomposes matrix A into the product of a lower triangular matrix and its conjugate transpose, A = LL*.

3.2.2 Doolittle LU Decomposition

Doolittle LU decomposition performs the decomposition column by column and decomposes a given matrix into a unit lower triangular matrix and an upper triangular matrix. The pivoting is performed in such a way that for each k a pivot row is determined and interchanged with row k; the rest of the algorithm works similarly to the Cholesky decomposition.

3.2.3 Crout LU Decomposition

In the Crout method, matrix A is decomposed into a lower triangular matrix L and a unit upper triangular matrix U. The elements l_ij of L and u_ij of U are computed by solving the equation system LU = A.

3.3 QR Decomposition

The solution of singular matrices expressing either over- or under-determined sets of equations can be computed using QR orthogonalization methods, wherein an m x n matrix A is decomposed into a unitary matrix Q and an upper triangular matrix R. QR decomposition methods take approximately 2mn^2 floating-point operations to decompose an m x n matrix.
The three widely used methods for computing QR orthogonalization are:

1. Givens Rotation method
2. Modified Gram Schmidt method
3. Householder orthogonalization method

In each of these methods the matrix A is decomposed into the product of an orthonormal matrix Q and an upper triangular matrix R, such that A = Q · R. If A is invertible then the decomposition resulting in positive diagonal elements in R is unique.

3.3.1 Givens-Rotation Based QRD

A Givens rotation rotates a vector in the (x, y) plane by an angle θ such that it becomes orthogonal to the y axis, diminishing its magnitude in that dimension to zero:

    [ cos θ    sin θ ] [ x ]   [ r ]
    [ -sin θ   cos θ ] [ y ] = [ 0 ]        (3.3)

where the values of the rotation matrix can be determined from the pivot elements: r = sqrt(x^2 + y^2), cos θ = x/r, sin θ = y/r. Using Givens rotations repeatedly, a matrix A can be decomposed into an orthogonal matrix Q and an upper triangular matrix R. At each iteration A is rotated by angle θ in the (i, j) plane by multiplying it with a rotation matrix of the form:

    G(i, j, θ) =
        [ 1  ...    0           0         ...  0 ]
        [ 0  ...  cos θ_ij    sin θ_ij    ...  0 ]
        [ 0  ... -sin θ_ij    cos θ_ij    ...  0 ]
        [ 0  ...    0           0         ...  1 ]        (3.4)

(with cos θ_ij at positions (i,i) and (j,j), sin θ_ij at (i,j), and -sin θ_ij at (j,i)) to generate a transformed A, which is used as input for the next iteration. Since this multiplication changes only the ith and jth rows of matrix A, full matrix multiplication is not required to compute these intermediate transformed matrices. Therefore the rotation matrix can be reduced to a 2x2 matrix of the form:

    G(i, j, θ) = [ cos θ_ij    sin θ_ij ]
                 [ -sin θ_ij   cos θ_ij ]        (3.5)

and multiplied by only the ith and jth rows. Also, since after computing the elements of the rotation matrix the same operation is applied across the whole row, the operation can be distributed over computational blocks and done in parallel. After n(n-1)/2 rotation iterations, A will be transformed into the upper triangular matrix R, while performing the same rotation operations on an identity matrix will transform it into the orthogonal matrix Q. Matlab code for this algorithm is shown in Figure 3-1.
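The basic rotation of Eq. 3.3 can be checked numerically. The following sketch (plain Python, real-valued inputs, helper names are ours) computes r, cos θ and sin θ from a pivot pair and applies the 2x2 rotation of Eq. 3.5:

```python
import math

def givens(x, y):
    """Compute r = sqrt(x^2 + y^2) and the rotation parameters c, s."""
    if x == 0 and y == 0:
        return 0.0, 0.0, 0.0
    r = math.hypot(x, y)
    return r, x / r, y / r

def apply_rotation(c, s, x, y):
    """Apply the 2x2 rotation; the second component becomes 0
    when (c, s) were derived from (x, y) themselves."""
    return c * x + s * y, -s * x + c * y

r, c, s = givens(3.0, 4.0)                 # r = 5, c = 0.6, s = 0.8
top, bottom = apply_rotation(c, s, 3.0, 4.0)
```

The magnitude of the pair is preserved in the top component while the bottom component is annihilated, which is the elementary step repeated across rows in the full decomposition.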
The time complexity of GR based QRD is 2mn^2.

    % QR decomposition
    [m,n] = size(A);
    X = zeros(m,m);
    R = zeros(m,m);   % m-by-m zero matrix to store output

    for i = 1:n
        [R(1,1), c, s] = external(R(1,1), A(i,1));
        [R(1,2:m), X(2,2:m)] = internal(c, s, R(1,2:m), A(i,2:m));
        for j = 2:m
            [R(j,j), c, s] = external(R(j,j), X(j,j));
            [R(j,j+1:m), X(j+1,j+1:m)] = internal(c, s, R(j,j+1:m), X(j,j+1:m));
        end
    end

    function [r, c, s] = external(rin, xin)
        if rin == 0 && xin == 0
            r = 0; c = 0; s = 0;
        else
            r = norm([rin, xin]);
            c = rin / r;
            s = xin / r;
        end
    end

    function [r, x] = internal(c, s, rin, xin)
        r = (c * rin) + (conj(s) * xin);
        x = (-s * rin) + (c * xin);
    end

Figure 3-1: Givens-Rotation based QRD

3.3.2 Modified Gram Schmidt Based QRD

The Gram-Schmidt process is used to construct an orthonormal basis Q for a set of linearly independent vectors, expressed as columns in matrix A. This process can be used to create the orthogonal matrix Q, and the values of the upper triangular matrix R can be computed from Q and A using the formula:

    r_ij = q_i^H · a'_j   for i < j (elements above the principal diagonal)
    r_ij = ||a'_j||       for i = j (elements on the principal diagonal)        (3.6)
    r_ij = 0              for i > j (elements below the principal diagonal)

where a'_j is the orthogonalized intermediate jth column vector of A during the jth iteration. For a matrix of size n x m it will take n iterations to complete the decomposition of matrix A. In each iteration of orthogonalization, a row of R and a column of Q are computed: the jth element of the principal diagonal of R is computed by normalizing the jth column of A in the jth iteration, and the values in the jth column of Q are computed by scaling a'_j by 1/r_jj. The non-diagonal elements in a row of R are computed in the jth iteration using the formula: for k = 0 ... n, r_j,k = 0 for k < j, and r_j,k = q_j^H · a_k for k > j, where a_k are the values in column k of matrix A during iteration j - 1. In the first iteration, these are the original values of the kth column of A.
After each jth iteration A is updated such that for k = 0 ... n, a_k = 0 for k ≤ j, and a_k = a_k - q_j · r_j,k for k > j. After n iterations, A will transform into the zero matrix, and the computation of the orthogonal matrix Q will be complete. Since in each iteration a vector in Q is computed and a vector in A is transformed to 0, they can be contained in a single memory location. MGS based QRD performs vector operations to compute the column vectors of Q and the rows of R. Therefore the complete matrix needs to be in memory before the process can begin, unlike Givens-Rotation, which can begin its first iteration as soon as one row of data is available.

    [m,n] = size(A);
    R = zeros(n,n);
    Q = A;

    for i = 1:n
        R(i,i) = norm(Q(:,i));
        Q(:,i) = Q(:,i) / R(i,i);
        for j = i+1:n
            R(i,j) = Q(:,i)' * Q(:,j);
            Q(:,j) = Q(:,j) - Q(:,i) * R(i,j);
        end
    end

Figure 3-2: Modified Gram Schmidt based QRD

Figure 3-2 shows the Matlab code for the MGS based QRD algorithm. The time complexity of MGS based QRD is 2mn^2 + 2mn - m; the process takes approximately 2n^3 arithmetic operations [20].

3.3.3 Householder Based QRD

The Householder transformation uses a unitary Hermitian matrix to reflect a given vector a_j across a plane such that all its coordinates but one disappear. The elementary Householder matrix used for reflection across the plane orthogonal to the unit normal vector v can be expressed in matrix form as:

    H = I - 2vv^T        (3.7)

where I is the identity matrix of the same dimensions as H, and v^T is the transpose of the unit normal vector v. The reflector matrix that maps a given vector a_j to a scalar multiple of e_1 (the first column vector of the identity matrix, (1, 0, ..., 0)^T) can be constructed by taking v = u/||u|| with u = a_j - sign·||a_j||·e_1, where sign = ±1; then the product of the resulting Hermitian matrix and a_j results in:

    H·a_j = (I - 2uu^T/(u^T u))·a_j = sign·||a_j||·e_1        (3.8)

and the product of the Hermitian matrix and the rest of the columns of A transforms A to A'.
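A single reflection step of Eq. 3.8 can be verified with a short sketch (pure Python; the helper name is ours, and the sign is fixed to +1 for brevity):

```python
import math

def householder_reflect(a):
    """Build u = a - ||a||*e1 (Eq. 3.8 with sign = +1) and apply
    H = I - 2*u*u^T/(u^T u) to a; the result should be ||a||*e1."""
    norm_a = math.sqrt(sum(v * v for v in a))
    u = list(a)
    u[0] -= norm_a                       # u = a - ||a||*e1
    uu = sum(v * v for v in u)           # u^T u
    ua = sum(ui * ai for ui, ai in zip(u, a))  # u^T a
    return [ai - 2.0 * ua / uu * ui for ai, ui in zip(a, u)]

ha = householder_reflect([3.0, 0.0, 4.0])   # ||a|| = 5, so expect [5, 0, 0]
```

All coordinates of a except the first are annihilated in one shot, which is why a whole column vector must be available before the reflector can be formed.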
Repeating this process min(m, n) times reduces the matrix A to an upper triangular matrix: in the kth iteration, the Hermitian reflecting matrix is computed from the first column of the submatrix A_xy (of size x by y, with x = m - k and y = n - k), constructed by dropping the first k rows and columns from the given matrix A_mn. The product of the Hermitian matrices used for this transformation in all iterations forms the unitary matrix Q. Unlike Givens-Rotation based orthogonalization, this process requires the whole column vector to compute the reflection matrix, and the reflection matrix is then used in a matrix multiplication. Therefore this method is not suitable for data-distributed computation. Figure 3-3 shows the Matlab code for the Householder based QRD algorithm. The time complexity of Householder based QRD is 4mn^2/3.

    [m,n] = size(A);
    Q = eye(m);
    for k = 1:min(m-1,n)
        ak = A(k:end,k);
        vk = ak + sign(ak(1))*norm(ak)*[1; zeros(m-k,1)];
        Hk = eye(m-k+1) - 2*(vk*vk')/(vk'*vk);
        Qk = [eye(k-1) zeros(k-1,m-k+1); zeros(m-k+1,k-1) Hk];
        Q = Q*Qk;
        A = Qk*A;
    end
    R = A;

Figure 3-3: Householder based QRD

3.4 Comparison and Selection

As mentioned earlier, there exist many optimized variants of the LU decomposition algorithm for families of matrices that exhibit special characteristics, but LU decomposition fails to find a solution if the input matrix is singular. Also, like the Gaussian elimination methods, LU requires normalization and row/column reordering to avoid division by zero, as well as for non-diagonally-dominant matrices, to preserve accuracy. Although QR decomposition is computationally more complex than most of the LU decomposition variants, the computations done in QR decomposition are unconditionally stable and can be used to decompose singular matrices [8]. Since error propagates at a slower rate in the orthogonalization process [17], QR decomposition is more accurate without maintaining diagonal dominance if a large enough word size is used.
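The claim that QR handles singular inputs without pivoting can be illustrated directly: the sketch below (plain Python, names are ours) triangularizes a rank-1 matrix with Givens rotations, a case where elimination-based LU would hit a zero pivot:

```python
import math

def givens_triangularize(A):
    """Reduce a square matrix to upper-triangular R using Givens rotations.
    No pivoting and no division by matrix entries is needed, so singular
    inputs are handled without special cases."""
    n = len(A)
    R = [row[:] for row in A]
    for col in range(n):
        for row in range(col + 1, n):
            x, y = R[col][col], R[row][col]
            r = math.hypot(x, y)
            if r == 0.0:
                continue                 # both entries already zero
            c, s = x / r, y / r
            for k in range(n):           # rotate rows `col` and `row`
                top, bot = R[col][k], R[row][k]
                R[col][k] = c * top + s * bot
                R[row][k] = -s * top + c * bot
    return R

R = givens_triangularize([[1.0, 2.0], [2.0, 4.0]])   # rank-1 input
```

The second row of R collapses to zeros, correctly exposing the rank deficiency instead of failing.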
Column reordering and normalization can be used to further improve the solution's precision, but they are not required to avoid division by zero. Also, since pivoting and row dominance are not required to maintain acceptable precision in the QR computation, systolic arrays can be used to distribute the data and operations among parallel computational blocks without the need for context/data migration. This results in reduced control logic complexity. Both LU and QR decomposition reuse computational effort if consecutive problem setups have the same left hand side of Eq. 3.1 and Eq. 3.2.

In addition to solving linear equation systems, QR decomposition can also be used to determine the magnitude of the determinant of a matrix:

    A = QR
    det(A) = det(Q) · det(R)        (3.9)
    |det(Q)| = 1, so |det(A)| = 1 · |det(R)|

where det(R) is the product of the values on the principal diagonal, since R is a triangular matrix.

QR decomposition can also be used to find the inverse of the covariance matrix of multiple random variables, expressed in the form A = X^H · X, where the matrix X contains measurements/snapshots of the outcomes of these random variables. The computationally intensive task of computing A can be reduced in complexity by decomposing X into its QR components. After restructuring, the formula becomes:

    A = X^H · X
      = (Q · R)^H · (Q · R)        (3.10)
      = R^H · Q^H · Q · R
      = R^H · R

Since the final output does not require the value of Q, computing the Q part of the decomposition can be omitted while calculating the inverse of the covariance matrix. Because of the generality of its applications, its less restrictive limitations, and its suitability for fixed-point parallel hardware implementation, we selected QR decomposition for our parameterized prototypes.
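The covariance shortcut of Eq. 3.10 is easy to verify numerically. The sketch below (plain Python, real-valued so X^H reduces to X^T; function names are ours) computes R with Givens rotations, never forms Q, and checks that R^T · R reproduces X^T · X:

```python
import math

def gram(X):
    """A = X^T * X for a square real matrix X (real case of A = X^H * X)."""
    n = len(X)
    return [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def givens_r(X):
    """Upper-triangular R of X via Givens rotations; Q is never formed."""
    n = len(X)
    R = [row[:] for row in X]
    for col in range(n):
        for row in range(col + 1, n):
            r = math.hypot(R[col][col], R[row][col])
            if r == 0.0:
                continue
            c, s = R[col][col] / r, R[row][col] / r
            for k in range(n):
                top, bot = R[col][k], R[row][k]
                R[col][k] = c * top + s * bot
                R[row][k] = -s * top + c * bot
    return R

X = [[1.0, 2.0], [3.0, 4.0]]
R = givens_r(X)
A_direct = gram(X)       # X^T * X computed directly
A_from_r = gram(R)       # R^T * R, per Eq. 3.10
```

The two results agree, confirming that the Q part of the decomposition can be skipped when only the covariance (or its inverse) is needed.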
The classically used algorithms for QRD, as discussed earlier, are: (a) the Householder method [9], suitable for software implementation with centralized storage but not for hardware [26]; (b) the Givens Rotation (GR) method [28], favorable for distributed parallel implementations; and (c) Modified Gram Schmidt (MGS) [18], mainly suitable only for smaller matrices. We chose GR and MGS for parameterization because of their favorable nature for hardware implementation, in terms of cycle count, cycle frequency, and area-on-chip for differing sizes of matrix.

Chapter 4

Implementation Challenges

Challenges in designing and implementing a flexible, modular QRD prototype are discussed in the following sections.

4.1 Computation Complexity and Scalability

The main operations involved in QR decomposition using MGS are norm, inner product, scaling of a vector by an inner product, and division of a vector by a scalar value, while the main operations involved in GR based QR decomposition are 1/sqrt(x) computation and multiplication. The major challenge is that the prototype should be able to generate a scalable design for larger matrix sizes, for example pipelined resources that can be reused to shrink the hardware to fit a given FPGA board or the available space on the board. At the same time, it should be able to generate a highly parallel design when hardware resources are not the limiting design factor; for instance, a 4x4 matrix is small, and reusing units for it would reduce the throughput unnecessarily.

4.2 Modularity

Fine-grained modularity is another implementation challenge. The prototype design must be modular, with a plug-and-play configurable architecture, in order to facilitate unit-level or modular improvement with minimum effect on the rest of the architecture.
4.3 Latency Insensitivity

To be truly modular, the architecture needs to be insensitive to latency, so that when a lower-latency module becomes available, the rest of the architecture does not need to be redesigned to synchronize the data and control signals.

Chapter 5

Proposed Parameterized Architecture

We present here implementations of 4 architectures that can be scaled up for beam formation with higher dimensions and configured to use various unit implementation techniques, without the need to debug and test each time there is a change in the size of the control circuitry and data path. These four architectures are: (a) Systolic Givens rotation based; (b) Linear Givens rotation based; (c) Systolic MGS based; and (d) Linear MGS based.

5.1 GR Based QRD Systolic Array

Givens Rotation (GR) decomposes a matrix A into a unitary matrix Q and an upper triangular matrix R by rotating it along one axis at a time, nullifying one element of a column vector of A per rotation. These rotation operations on the matrix elements are independent and can therefore be done in parallel: we can distribute the input matrix over a uniform array of computation units and then combine the generated output. Figure 5-1 shows the block diagram of the 5x5 systolic array for Givens rotation based QRD.

Figure 5-1: Systolic Array Architecture

The systolic array consists of two specialized building blocks (Figure 5-2): the boundary unit and the internal unit.

Figure 5-2: (a) Boundary unit (b) Internal unit for Givens-Rotation method

The boundary unit rotates the input element and generates (a) the rotation parameters (cos and sin of θ) and (b) the diagonal elements. The internal unit rotates the input element using the rotation parameters generated by the boundary unit in the same row, and generates the non-diagonal elements of the output matrix. These units require 3 and 4 multiplication operations, respectively.
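The behavior of the two unit types can be modeled in software. The sketch below (plain Python, real-valued; the names and the sequential schedule are ours, the actual array runs the units concurrently) feeds the rows of A through a triangular arrangement of boundary and internal operations, leaving the entries of R behind in the units:

```python
import math

def boundary(r, x):
    """Boundary unit: update the stored diagonal r and emit (c, s)."""
    if r == 0 and x == 0:
        return 0.0, 0.0, 0.0
    rn = math.hypot(r, x)
    return rn, r / rn, x / rn

def internal(c, s, r, x):
    """Internal unit: rotate the stored r with the incoming x;
    the rotated x is passed down to the next row of units."""
    return c * r + s * x, -s * r + c * x

def systolic_qr_r(A):
    """Stream the rows of A through the triangular array; the r values
    left in the units form the upper triangular matrix R."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for row in A:
        x = row[:]
        for i in range(n):
            R[i][i], c, s = boundary(R[i][i], x[i])
            for j in range(i + 1, n):
                R[i][j], x[j] = internal(c, s, R[i][j], x[j])
    return R

R = systolic_qr_r([[3.0, 4.0], [0.0, 5.0]])
```

Each input row is consumed as soon as it arrives, mirroring the row-streaming property of the GR array noted in Chapter 3.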
The multiplication operations in these units can be performed using dedicated multipliers or a single pipelined multiplier, with a trade-off between latency and area-on-chip. We implemented QR of size n as a combination of a row containing one boundary unit and n-1 internal units, plus a SubQR implementing QR of size (n - 1), as shown in Figure 5-3. As a special case, a QR of width equal to 1 is a row containing only the boundary unit, as shown in Figure 5-4.

Figure 5-3: QR(n) top module containing one row and one QR(n-1) top module

Figure 5-4: Special case of QR top module, QR(1)

This implementation facilitates automated connectivity between the rows for a reconfigurable matrix size. It is achieved by implementing a "typeclass" of QR and defining two instances: one for QR of size greater than 1, and a terminating case for QR of size 1, as shown in Figures 5-5, 5-6 and 5-7, respectively. A wrapper on the top level QRD module can define a QRD array of a specific size by using the code shown in Figure 5-8.

    typeclass QRtopModule#(numeric type width);
        module [m] mkQRtopModule#(...)();
    endtypeclass

Figure 5-5: Typeclass QR

    instance QRtopModule#(1);
        module [m] mkQRtopModule (...);
            // return QR module of width EQual to 1
            QR#(1, tnum) qrUnit <- mkQReqONE(...);
            return qrUnit;
        endmodule
    endinstance

Figure 5-6: Instance of QR typeclass for width equal to 1, terminating case for recursive QR architecture

    instance QRtopModule#(width);
        module [m] mkQRtopModule (...);
            // return QR module of width Greater Than 1
            QR#(width, tnum) qrUnit <- mkQRgtONE(...);
            return qrUnit;
        endmodule
    endinstance

Figure 5-7: Instance of QR typeclass for width greater than 1

The design parameters that can be reconfigured to generate a unique systolic array implementation are: (i) width of the input matrix, (ii) data type of the input elements, (iii) implementation technique of the computation units, and (iv) multiplier style.
5.1.1 Data type

Wireless communication systems usually operate on complex valued matrices, but for RVD MIMO signal detection approaches such as RVD K-best [22], the complex valued n x n matrix is decomposed into a 2n x 2n real valued matrix.

    QR#(width, datatype) qr <- mkQRtopModule(...);

Figure 5-8: QR instantiation

The amount of arithmetic in the non-diagonal units stays the same either way. However, in the diagonal units the required computation is doubled by decomposing the n x n matrix into 2n x 2n, as the diagonal entries of R are always real. These complex and real numbers can be represented in fixed point or floating point notation. The Fixed Point (FP) implementation is ideal for wireless communication devices because FP units consume comparatively less power and span a smaller area-on-chip [3]. The number of bits required to represent an FP number without losing information depends on the size of the matrix: as the matrix size increases, the amount of computation each input element goes through to produce the output also increases, which in turn increases the amount of computational noise. In order to keep the computational noise below -10 dB, the number of bits used to communicate a single chunk has to be increased. The process of evaluating the ideal bit length, and the length of the fractional part, for a 4x4 matrix is discussed in [28]; a similar process can be used to determine the bit length for larger matrices. To reuse the implementation done for one size of input matrix, we parameterized the choice of data type, word length, and length of the fractional part. The top level module can be configured at QRD module instantiation time to work on a single specific data type, and can therefore be modified at a single edit point in the architecture before Verilog code generation and synthesis.
We achieve this effect by keeping the type a configurable parameter and defining Conjugate, a complex-number-specific operation, for the Real and Fixed Point number types as well. Figures 5-9, 5-10 and 5-11 show the typeclass implementing the Conjugate operation, where 'is' and 'fs' are the sizes of the integer and fractional parts respectively, data_t is the abstract data type, and FixedPoint and Complex are Bluespec built-in data types. For the Fixed Point version of the module, the extra wires are trimmed during synthesis, and therefore cause no extraneous wires in the actual hardware and no loss of performance.

    typeclass Conjugate#(type data_t)
        provisos (...);
        function data_t con (data_t x);
    endtypeclass

Figure 5-9: Typeclass for Conjugate operation

    instance Conjugate#(FixedPoint#(is, fs));
        function FixedPoint#(is, fs) con (FixedPoint#(is, fs) x);
            return x;
        endfunction
    endinstance

Figure 5-10: Instance of typeclass Conjugate for FixedPoint data type

    instance Conjugate#(Complex#(data_t))
        provisos (...);
        function Complex#(data_t) con (Complex#(data_t) x);
            let y = Complex {rel: x.rel, img: 0 - x.img};
            return y;
        endfunction
    endinstance

Figure 5-11: Instance of typeclass Conjugate for Complex data type

Figure 5-12 demonstrates how our implementation can be configured for a 4x4 matrix of type Fixed Point with word length 16 (6 bits for the integer and 10 bits for the fractional part).

    QR#(4, FixedPoint#(6,10)) qr_FP6_10 <- mkQRtopModule(...);

Figure 5-12: Data type and Matrix Size Configuration in Main Module

5.1.2 Multiplier

QR decomposition using either MGS or GR involves both complex and real multiplication. We implemented the complex multiplier using 3 fixed-point multipliers and 2 sets of adders, computing the complex product using Eq. 5.1:

    c = a × b
    t1 = a.rel × b.rel
    t2 = a.img × b.img
    c.rel = t1 - t2        (5.1)
    c.img = (a.rel + a.img) × (b.rel + b.img) - t1 - t2

For fixed-point multiplier implementation on FPGAs, the built-in firm multipliers provided as part of the DSP blocks on an FPGA board can be used.
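A standard 3-multiplication complex product (consistent with the 3-multiplier, 2-adder-set structure described above; the exact grouping used in the thesis hardware may differ) can be checked as follows:

```python
def cmul3(a, b):
    """Complex product with 3 real multiplications instead of 4:
    only t1, t2 and t3 involve a multiplier; the rest are add/subtract."""
    t1 = a.real * b.real
    t2 = a.imag * b.imag
    t3 = (a.real + a.imag) * (b.real + b.imag)
    return complex(t1 - t2, t3 - t1 - t2)

c = cmul3(1 + 2j, 3 + 4j)   # should equal (1+2j)*(3+4j)
```

Trading one multiplier for extra adders is attractive on FPGAs, where DSP multiplier blocks are the scarce resource and adders map cheaply onto fabric logic.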
The firm DSP multipliers are optimized for area/time efficiency, bridging the performance gap between FPGAs and ASICs caused by the programmable nature of FPGAs. When DSP blocks are scarce on a specific board, soft multipliers implemented using the LUTs inside CLBs can be used instead. As each application has different DSP demands, we have parameterized this choice as well. The choice of word length affects the size and architecture of a fixed-point multiplier. The accuracy requirement of a QRD system and the word length of a matrix element determine the critical path of the multiplier; as the word length increases, the critical path of the multiplier can exceed the critical path of the overall system, becoming the limiting factor for the minimum period and clock frequency of the QRD. To control the critical path length, and with it the minimum period and maximum frequency, we implemented a pipelined version in which the number of pipeline stages is a configurable parameter. The number of stages can be increased to shorten the data path, or decreased to reduce the number of cycles taken to generate a product, depending on the specific requirements. This is achieved by passing the number of stages as a numeric type, which the mkMultiplier module then uses to generate the stages with a for-loop. The word length of the multiplier and the number of pipeline stages can be configured for both LUT based and DSP multipliers, as shown in Figure 5-13. A configured multiplier module can then be passed to the external and internal units as input for synthesis, as shown in Figure 5-14.
    Multiplier#(Complex#(FixedPoint#(16,16))) mkmul <- mkMultiplier_DSP;
    Multiplier#(Complex#(FixedPoint#(16,16))) mkmul <- mkMultiplier_LUT;

Figure 5-13: DSP and LUT based Multiplier

    External#(Complex#(FixedPoint#(16,6))) m;
    m <- mkExternal(mkRotation(mkmul));

    Internal#(Complex#(FixedPoint#(16,16))) m;
    m <- mkInternal(mkComplexMultiplier(mkMultiplierFP16LUT));

Figure 5-14: Configuring Type of Multiplier in core units

5.1.3 Storage Space

Each unit has a register for the interim r_ij value and a register for the input x_ij value; both registers are word-length bits wide. The boundary unit also has registers for storing the x_out, c and s values.

5.1.4 Control Circuitry

The data dependency of one unit on the output of the previous unit is shown by the blue and magenta directed lines in Figure 5-1. To ensure synchronized data availability for each block, we placed FIFOs at the input ports of all the units, as shown in Figures 5-15a and 5-15b; at the same time, all computations are implicitly guarded by the availability of data in the FIFOs.

Figure 5-15: (a) Boundary unit with FIFO (b) Internal unit with FIFOs at the input port and internal storage

Because of the internal data dependencies of each unit on the previous row, and the symmetry of the computational complexity of each row, a FIFO of length 2 is big enough to synchronize the flow without incurring extra delays due to FIFO overflow.

5.1.5 Implementation type

The constant throughput of this design is max(throughput of boundary node, throughput of internal node), and it can vary depending on the implementation type of the unit blocks. The main operations involved in QR decomposition using GR are 1/sqrt(x) computation and multiplication; Figure 5-2 shows the equations implemented by each block. We present implementations of three techniques, namely (i) Log domain computation, (ii) Linear Approximation, and (iii) Newton Raphson Iteration, as discussed below.
(i) Log Domain Computation: A multiplication in the linear domain is equivalent to an addition in the log domain:

    log(a × b) = log(a) + log(b)        (5.2)

and the division operation becomes subtraction. Similarly, the power operation in the linear domain becomes multiplication:

    log(a^b) = b × log(a)        (5.3)

If the power b is 2 the operation reduces to a left shift, and if it is 1/2 the operation reduces to a right shift. Using this simplification technique, computationally expensive operations such as 1/sqrt(x) can be transformed into shift and addition operations. The log and exponential values are pre-computed and stored in look-up tables, whose size decides what range of inputs can be handled. Storing linear-to-log and log-to-linear domain conversions for the full range of input values can result in huge look-up tables. To reduce the size of these tables while maintaining the input range, the value can be normalized before and after the conversion. The following equations show how a look-up table that supports only the range of 'a' can be used to translate a bigger range of values:

    log2(a · 2^b) = log2(a) + log2(2^b) = log2(a) + b        (5.4)

    2^(a + log2(b)) = 2^a · 2^(log2(b)) = 2^a · b        (5.5)

So an operation in the log domain can be broken down into the following steps:

(a) look up the log domain equivalent of a chosen number of MSBs (equivalent to a right shift)
(b) add the de-normalization constant
(c) perform the desired operation in the log domain
(d) look up the linear domain equivalent of the result divided by a normalization factor
(e) de-normalize it by multiplying with 2^(normalization factor) (a left shift)

The division in step (d) can be reduced to shift operations if the normalization factor is chosen to be a power of 2. For the multiplication operation, this approach has a large area overhead (storage for the log and linear domain translation tables) with no significant improvement in the total time of operation (clock cycles × clock period).
Therefore log domain operation is only favorable for computing 1/sqrt(x). The ideal size of these lookup tables for maintaining the desired accuracy depends on the data type and word length [23]; we therefore parameterized the LUT size and normalization factor for both log-to-linear and linear-to-log domain translations in our design. QR can be configured to have a Log based boundary unit as shown in Figure 5-16.

    LogTable#(BitDis, FixedPoint#(16,16), LogLUTsize) logtbl <- mkLogTable();
    ExpTable#(BitDis, FixedPoint#(16,16), ExpLUTsize) exptbl <- mkExpTable();
    External#(Complex#(FixedPoint#(16,6))) mkext <- mkExternal(mkLogRotation(mkmul, logtbl, exptbl));

Figure 5-16: Configuring Log Domain External Unit.

(ii) Linear Approximation: The value of 1/sqrt(x) can also be computed by linearly approximating the function along its tangent, using slope and offset values (Figure 5-17):

    f(x) ≈ f(a) + f'(a)(x - a)        (5.6)

Figure 5-17: Linear Approximation of 1/sqrt(x)

The values of the slope f'(a) and offset f(a) are pre-computed and stored in look-up tables. To increase the accuracy of this operation, finer granularity of the entries near the lower bound of the input values needs to be stored; this is achieved by using the upper fractional bits along with the integer bits to index the look-up tables. The size of these tables can be increased to cover a larger range of values. We have therefore parameterized both the size of the look-up table and the number of fractional bits used in the index; these two parameters can be tuned to fit the accuracy requirements. QR can be configured to have an LA based boundary unit as shown in Figure 5-18.

    LAtable#(BitDis, FixedPoint#(16,16), LALUTsize) latbl <- mkLAtable();
    External#(Complex#(FixedPoint#(16,6))) mkext <- mkExternal(mkLArotation(mkmul, latbl));

Figure 5-18: Configuring LA based External Unit.

(iii) Newton Raphson Iteration: The value of 1/sqrt(x) can also be estimated using the Newton Raphson Iteration method [26].
It performs iterative shift and add operations to compute 1/sqrt(x). It reduces complexity but increases latency because of its iterative nature. Because the required iteration count depends on the data length and the specific application, we have parameterized it. Figure 5-19 shows QR configured with an NR based boundary unit with iteration count = 32.

    QR#(tnum) qr <- mkQR(mkExternal#(NR#(32, mkMultiplier)), mkInternal#(mkMultiplier));

Figure 5-19: Configuring Newton Raphson Method based External Unit.

5.1.6 Reuse within design

Because of the parameterized and reconfigurable nature of the proposed architecture, the same implementation can generate hardware specialized for varying matrix sizes with different area requirements. The area and throughput results for varying matrix sizes, acquired after place-and-route of the Verilog implementation generated from our Bluespec code using BSC, are presented in Chapter 6.

5.2 GR Based QRD Linear Array

The hardware for the systolic array does not scale well for larger matrix sizes, as shown in Figure 5-20, but it can be folded into a linear array of units as shown in Figure 5-21. This folding technique reduces the hardware requirements by a factor of approximately n, but it also decreases the throughput by a factor of n, where n is the width of the matrix. For systems where the matrix size is large, area is a bigger concern than throughput. We present a parameterized folded hardware implementation, wherein we modified Walke's folding technique [12] to suit automated generation of the control signals and the output-reordering sequence. This folding technique, like Walke's folding, requires the size of the input array to be an odd number for 100% hardware efficiency. In the case where the matrix size is even, an extra empty column is appended at the end of the input matrix.
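Returning to the unit implementations of Section 5.1.5, the Newton-Raphson recurrence for 1/sqrt(x) can be sketched in floating point (the fixed-point shift-and-add details of the hardware are not reproduced here; the iteration count is a parameter, as in the design):

```python
def inv_sqrt_nr(x, iterations, y0=0.5):
    """Newton-Raphson estimate of 1/sqrt(x):
        y_{k+1} = y_k * (3 - x * y_k^2) / 2
    Each step needs only multiplications and a halving (a right shift in
    fixed point); the initial guess y0 must be small enough to converge."""
    y = y0
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) / 2.0
    return y
```

Convergence is quadratic once the estimate is close, so a handful of iterations suffices for short word lengths, while the parameterized count lets longer word lengths buy more accuracy at the cost of latency.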
Our modified version has a different order of flow of inputs to the units: in the modified version, the inputs X, C and S are chosen from a unit's own output and the output of one of its two neighbors, whereas in the original sequence the inputs were chosen from the outputs of its two neighbors. The modified sequence can be seen in Figure 5-22a for an input array of width 5. The automatic generation of this sequence eliminates the probability of human error. Figure 5-22b shows our architecture design for the implementation of parameterized QR decomposition hardware.

Figure 5-20: QR systolic array for 11x11 matrix

Figure 5-21: QR linear array for 11x11 matrix

Figure 5-22: (a) Indexes of the values of r generated while processing 2 consecutive rows interleaved (b) QRD linear array for mxm matrix

5.2.1 Reuse across design

This architecture has the same building blocks as the GR based systolic array architecture. Therefore, it can directly reuse the design units: (i) boundary unit, (ii) internal unit, (iii) full row, (iv) multiplier, (v) implementation types; and connect them with temporary storage and control logic in a new top level module. The internal and boundary units require slight modification, because in the linear array these units no longer need internal temporary registers for intermediate results.

5.2.2 Storage Space

Each unit in the row handles n entries of the final R matrix. Therefore, each one maintains a vector of n values of type DATA TYPE. This can result in a huge size if implemented in LUT slices and register files, so we moved this storage into BRAM. The size of the interim R values is n x m x word-length bits. All the interim X, C and S outputs generated in a cycle are utilized in the very next cycle; therefore we only need a single set of output registers for each processing unit.
The size of the vector of interim xout, c and s values is (3 x m - 1) x word-length bits. The size of the look-up table of control signals for the X, C and S inputs is n x m x 3 bits, while the control signals for the input and output R are stored in two log2(m)-bit counter registers. Details of the control signals are discussed in the next section.

5.2.3 Control Circuitry

Control circuitry is required to pick the right set of inputs and outputs for each sub-unit in each iteration. Both the input and output R values consumed and generated in each iteration form one row of the interim R memory block. A counter register is used to select the input row, while a register storing the previous value of the counter is used to select the destination of the output generated in any given iteration. Figure 5-22a shows the indexes of the entries of matrix R generated by each unit for a matrix of size 5x5. These R values can be re-routed to reconstruct the matrix R using the sequence generated by our proposed algorithm, shown in Figure 5-23. The X, C and S outputs generated in each cycle are stored directly in the output register array. The C and S inputs for each internal block come from the output register of either the unit itself or its left neighboring unit, so the control circuitry required to pick the right C, S input pair for the next iteration consists of a 2-input MUX for each unit (1-bit select signal). The X input for each internal block comes from 3 sources: the input matrix row, the unit's own output register, or the output register of its right neighboring unit, whereas the X input for the external unit comes either from the input matrix row or from the output register of its right neighboring unit.
So the control circuitry required to pick the right X input for the next iteration consists of a 2-input MUX for the external unit and a 3-input MUX for each internal unit (1- and 2-bit select signals).

    % input:  n, the matrix size
    % output: n x m matrix S of coordinates (a,b)
    m = ceil(n/2);
    iA = 1;       % index for current input row
    iB = m;       % index for previous input row
    cA = 2;       % counter for current row
    cB = n + 2;   % counter for previous row
    strInd1 = zeros(n);          % starting indexes
    for str = 1:n; strInd1(str) = 3; end
    strInd1(m-1) = 2;
    for steps = 1:n
        for j = iA:-1:1
            if (cA-j > 0 && cA-j <= n)
                a = j; b = cA - a;
                if (a == b)
                    S(steps,1) = [a, b];
                elseif (a+1 == b)
                    S(steps,2) = [a, b];
                else
                    S(steps, strInd1(a)) = [a, b];
                    if (strInd1(a)+1 <= m-a+1); strInd1(a) = strInd1(a)+1; end
                end
            end
        end
        for j = iB:-1:1
            if (cB-j > 0 && cB-j <= n)
                a = j; b = cB - a;
                if (a == b)
                    S(steps,1) = [a, b];
                elseif (a+1 == b)
                    S(steps,2) = [a, b];
                else
                    for l = 3:m
                        if (isempty(S(steps,l))); S(steps,l) = [a, b]; break; end
                    end
                end
            end
        end
        cA = cA + 1;                              % increment index, loop 1
        if (mod(steps,2) == 0); cB = cB + 1; end  % increment index, loop 2
    end

Figure 5-23: Algorithm to generate the R state machine.

    % input:  n, the matrix size
    % output: n x m matrices of select indexes for the X and CS inputs
    m = ceil(n/2);
    max = maximum value in available word length
    outX = zeros(n,m); outCS = zeros(n,m);
    for i = 1:n
        c = 0;
        for j = 1:m
            if (j == i || (i > m && j >= m))
                outX(i,j) = max;
            else
                if (j-1 == c); outX(i,j) = 0;      % right neighbour's output
                elseif (j-2 == c); outX(i,j) = 1;  % its own output
                end
                c = c + 1;
            end
            if (j == 1); outCS(i,j) = 0;           % don't care condition
            elseif (i > n-j+1); outCS(i,j) = 1;    % its own output
            else; outCS(i,j) = 0;                  % left neighbour's output
            end
        end
    end

Figure 5-24: Algorithm to generate the select sequence for the X and CS inputs.
The control signals for these MUXes are stored in a look-up table of size n x m x (1 + 2) bits. Figure 5-24 shows the proposed algorithm for populating this look-up table for varying n.

5.3 MGS Based QRD Systolic Array

We arranged the block diagram for the MGS systolic array as shown in Figure 5-25 to illustrate its similarity with the Givens systolic architecture; the figure shows a 5x5 systolic architecture for MGS. The boundary unit, DP, in this architecture implements lines 7 and 8 of the algorithm shown in Figure 3-2 and involves norm computation and vector scaling operations. The internal units, TPs, in a row implement one iteration of the internal loop on line 9 of the algorithm shown in Figure 3-2 and involve dot product, vector scaling and vector subtraction operations. The throughput interval of this design equals the latency of each row, and consequently the latency of each computational block.

Figure 5-25: MGS Systolic array

5.3.1 Reuse across algorithm

We implemented this design in the same implementation style as the GR systolic array, with different boundary and internal units. Each row now contains 1 new boundary unit and n-1 new internal units. The implementations of the computation unit and the multiplier unit are reused from the previous architecture.

5.3.2 Vector operations

Both boundary and internal units in MGS based QRD involve vector operations, such as dot product, vector scaling and vector difference, on vectors of size n, where n is the width of the input array. Therefore, as the input array grows, not only does the systolic array grow, but the size of each sub-unit grows as well, resulting in poor scalability. To improve the scalability of this design we present a pipe-lined batch processing architecture, which processes an input array of size n while inferring a processing-unit (PU) array whose size is only a fraction of n. Figures 5-26a and 5-26b show our architecture for batch product and batch accumulation.
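The batch product/accumulation scheme can be sketched behaviourally as follows (illustrative Python; the function name is ours, not a Bluespec module name). A dot product of length n is computed with only p multipliers and a p-to-1 adder tree, over n/p iterations:

```python
def batch_dot(a, b, p):
    """Dot product of two length-n vectors using only p multipliers:
    each iteration consumes the next p elements (fed, in hardware, from
    shift registers) and compresses the p products down to one sum."""
    n = len(a)
    assert n == len(b) and n % p == 0, "p must be a factor of n"
    acc = 0
    for k in range(0, n, p):
        products = [a[k+i] * b[k+i] for i in range(p)]  # Batch-Product: p multipliers
        acc += sum(products)                            # Batch-Accumulator: p-to-1 tree
    return acc
```

With p equal to n this degenerates to one fully parallel iteration, while p = 1 reuses a single multiplier over n iterations, mirroring the configurability of the PU array described in the text.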
A shift register of size p feeds the next p values of the input vector into the batch processing units in each iteration. The Batch-Product unit takes a set of p values and generates p products per iteration, while the Batch-Accumulator takes in p inputs, performs a p-to-1 compression per iteration, and generates an output only once every n/p iterations. The p-to-1 compression tree can affect the critical path, since it performs multiple additions in one cycle; the module can be tuned and tested with different configurations to find an ideal value of p. If it is configured for a 4x4 matrix with PU array size = 4, the extra wires are trimmed during the Bluespec compilation process, resulting in hardware equivalent to a completely parallel vector processing unit [13]; almost the same number of processing units can instead implement an 8x8 QRD if the PU array size is set to 1. The size of the processing unit array (PU array) is configurable for both DP and TP, and should be selected such that p is a factor of n.

Figure 5-26: (a) Batch Adder (b) Batch Multiplier

5.3.3 Control circuitry

The data dependency of sub-units on the output of the previous sub-unit in the MGS based systolic array is similar to that in the GR based systolic array. Therefore, as for the GR based systolic array, we used FIFOs of size 2 to synchronize the inputs of the MGS based systolic array sub-units.

5.3.4 Storage Space

Each sub-unit has one register to store the interim output value rij. There are n(n-1)/2 interim output registers in total, equal to the non-zero entries of the output matrix R. Each unit also has its own set of temporary registers to hold a column of the input matrix, as this algorithm requires the complete matrix to be in memory before it can begin an iteration. There are n(n-1)/2 sub-units.
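The storage counts above multiply out as follows (an illustrative Python sketch; the function name is ours, and the column-buffer term assumes one length-n column of word-length elements per sub-unit, as stated in the text):

```python
def mgs_systolic_storage(n, word_length):
    """Storage of the MGS systolic array as described above: n(n-1)/2
    sub-units, each with one interim r_ij register plus a buffer for one
    column of the input matrix. Returns (interim_bits, column_bits)."""
    sub_units = n * (n - 1) // 2
    interim_bits = sub_units * word_length       # one r_ij register per unit
    column_bits = sub_units * n * word_length    # one length-n column buffer per unit
    return interim_bits, column_bits
```

For n = 5 and 32-bit words this gives 10 interim registers (320 bits) and 1,600 bits of column buffering.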
5.4 MGS Based QRD Linear Array

The similarity between the systolic GR and systolic MGS architectures points towards the possibility of a linear MGS array. Due to the large temporary storage space required for the linear array, however, interleaving of inputs is not feasible for MGS. Nevertheless, the linear arrangement reduces area by a factor of n, at the cost of an n-fold decrease in throughput.

5.4.1 Reuse across design and algorithm

Like the linear GR array, this architecture re-uses modules/units from the systolic array implementation. The only new module implemented for this architecture is the top module enclosing the unit-row: the unit-row implemented in the systolic array is enclosed in a new top-level module with specialized control circuitry.

5.4.2 Control circuitry

The control circuitry for this architecture is a 2-entry state machine, which can be implemented with a single MUX and a counter. We used FIFOs to guard and synchronize the inputs of each unit.

5.4.3 Storage Space

Each sub-unit has one register to store the interim output value rij. There are only n interim output registers, equal to the number of sub-units, as the output is pushed out as soon as it is ready at the end of each iteration. Like the systolic array sub-units, each of the n sub-units in the linear array has its own set of temporary registers to hold a column of the input matrix.

Chapter 6

Results and Discussion

6.1 Experiment Conditions and Setup

6.1.1 Configuration Parameters

The following are the configurable parameters of our architecture:

1. Algorithm - Givens based QR, MGS based QR
2. Array structure - Systolic, Linear
3. Computational unit's implementation - Linear approximation, Log domain computation, Newton Raphson
4. Multiplier implementation style - Firm (DSP based), Soft (LUT based)
5. Pipeline stages for multiplier
6. Word length for input array elements
7. Multiplier sharing - dedicated, shared pipe-lined multiplier
8.
Size of computation vectors in MGS

6.1.2 Experiment Design

We tested implementations of four parameterized architectures: (1) Systolic Givens rotations based (GR sys); (2) Linear Givens based (GR lin); (3) Systolic MGS based (MGS sys); (4) Linear MGS based (MGS lin). We used the following three implementation techniques for these architectures: (a) Linear approximation (LA); (b) Log domain computation (Log); (c) Newton Raphson method (NR). For each design we used 4 different configurations for the type of multiplier: (i) all DSP multipliers; (ii) LUT based multipliers in the external units; (iii) LUT based multipliers in the internal units; (iv) all LUT based multipliers.

All experiments are configured for complex valued input matrices, with the word length of both real and imaginary parts equal to 32 bits (16 bits each for the integer and fraction parts). A single shared 3-stage pipe-lined multiplier is used in the internal units of the GR based linear and systolic arrays. The length of the vector of processing units for MGS based QR is set to 1 for all sizes.

6.2 Performance on FPGA

The experiment set-ups were evaluated by compiling the configured QRD from Bluespec to Verilog code using BSC, and subsequently acquiring Place & Route results for the Xilinx Virtex-6 FPGA (XC6VLX240T). FPGAs are composed of an array of configurable logic blocks (CLBs), each containing multiple slices. Each slice contains LUTs, flip-flop registers, a carry chain and combinatorial circuitry. An interconnect network, comprised of switch matrices, connects slices to each other and CLBs to neighboring CLBs. In addition to the uniform array of CLBs, FPGAs come equipped with specialized state-of-the-art IP blocks such as block RAM, digital signal processing blocks (DSPs), analog-to-digital converters, high speed IOs, etc.
These blocks aim to bridge the performance gap between custom ASICs and reusable FPGAs for general applications, as well as to reduce the area overhead of the interconnect network used for programming the gate arrays in an FPGA.

The Virtex-6 FPGA (XC6VLX240T) has 37,680 slices in total. Each slice contains four 6-input LUTs, each of which can be broken into two 5-input LUTs for maximum device utilization, and 8 registers (flip-flops). The total count of 6-input LUTs is 150,720, and that of slice registers (flip-flops) is 301,440. The device also has 768 DSP48E1 slices (each containing a 25 x 18 two's complement multiplier/accumulator), 3,770 Kb of distributed RAM and 1,885 shift registers.

The Place & Route results of the experiments are presented in Appendix A (Tables A.1 and A.2) for the various configurations, comparing area/resource utilization as well as throughput and latency. These results are discussed in detail in the following sections.

6.2.1 Linear versus Systolic Arrays

6.2.1.1 GR based Arrays

Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers, for GR linear and systolic arrays under all three implementation techniques, are shown in Figures 6-1, 6-2 and 6-3, respectively. Figures 6-4 and 6-5 show the throughput and latency of each of these implementations. These results were computed with all-DSP based 3-stage pipe-lined shared multipliers.

Figure 6-1: DSP block usage in GR Linear and Systolic Arrays with all DSP based Multipliers

From the data used to plot the resource utilization graphs, it can be inferred that systolic arrays grow 9 to 11 times faster than linear arrays in terms of DSP utilization, 10-11 times faster in LUT utilization, and 6-11 times faster in register utilization, across the three implementation techniques.
Growth in resource utilization is linear in linear arrays and quadratic in systolic arrays. In terms of throughput, the systolic array always out-performs the linear array, as can be seen in Figure 6-4, because of the increased level of parallelism achieved in systolic arrays. The throughput of the systolic array is not directly dependent on the size of the input array, as depicted by the flat lines in Figure 6-4. However, with increasing array size a larger word length is required to preserve precision, and this increase in word length affects both the cycle time (longer critical path) and the cycle count (more pipe-line stages in the multipliers). The throughput of linear arrays diminishes linearly as the array size increases.

From the above comparison it can also be inferred that Log and NR based systolic arrays are suitable for smaller array sizes n; in this analysis the threshold value is 5 for high-DSP-utilization applications. For larger arrays, with size equal to or greater than 6, the Log and NR based linear arrays are better suited.
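The size threshold reported above can be encoded as a simple selection rule (an illustrative sketch of the reported guidance, not part of the IP; the function name is ours):

```python
def suggest_gr_structure(n):
    """Encode the crossover reported above for Log/NR based GR arrays:
    systolic arrays up to the n = 5 threshold (best throughput at small
    sizes), linear arrays for n >= 6 (better resource scaling)."""
    return "systolic" if n <= 5 else "linear"
```

For example, a 3x3 MIMO detector would be steered to the systolic array, while an 8x8 problem would be steered to the linear array.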
Figure 6-2: Slice LUT usage in GR Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-3: Registers usage in GR Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-4: Throughput of GR Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-5: Latency of GR Linear and Systolic Arrays with all DSP based Multipliers

6.2.1.2 MGS based Arrays

Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers, for MGS linear and systolic arrays with the LA based implementation technique, are shown in Figures 6-6, 6-7 and 6-8, respectively. Figures 6-9 and 6-10 show the throughput and latency of each of these implementations. These results were computed with all-DSP based 3-stage pipe-lined shared multipliers. The size of the vector of multipliers and adders used to implement the batch product, accumulator and subtraction units was set to 1 for all array sizes.

It can be seen from Figure 6-6 that for input array sizes n larger than 4, the systolic implementation does not fit on the Virtex-6. DSP resource utilization in the linear array for MGS grows linearly, while in the systolic array it grows quadratically, as can be seen in Figure 6-6.
Register and LUT utilization grows faster than linearly in both the linear and systolic designs, but the growth in the linear array is slower than in the systolic array, as can be seen in Figures 6-8 and 6-7. On the other hand, there is a significant reduction in throughput going from the systolic to the linear array, as can be seen in Figure 6-9. The relationship between the area and throughput of the linear versus the systolic array is the same as for the GR based implementation, as discussed in the previous section.

Figure 6-6: DSP block usage in MGS Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-7: Slice LUT usage in MGS Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-8: Slice Register usage in MGS Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-9: Throughput of MGS Linear and Systolic Arrays with all DSP based Multipliers

Figure 6-10: Latency of MGS Linear and Systolic Arrays with all DSP based Multipliers

6.2.2 GR versus MGS

DSP block, LUT slice and register usage for both GR and MGS based linear and systolic arrays with LA based computational units are presented in Figures 6-11, 6-12 and 6-13; the throughput and latency of each of these four implementations are shown in Figures 6-14 and 6-15, respectively. These results were computed with all-DSP based 3-stage pipe-lined dedicated multipliers. The word length was set to 32 bits for all array sizes.
For the MGS based implementation, the size of the vector of multipliers and adders used to implement the batch product, accumulator and subtraction units was set to 1 for all sizes. It is evident from Figures 6-11, 6-12 and 6-13 that the MGS based implementation takes more area on chip than all implementations of GR, even when shared multipliers/adders are used for each product/sum in the batch processing units. Compared to MGS, the per-row latency of GR is better in both array designs. As discussed in the architecture design, for an array of size e where e is an even number, the GR linear array is implemented for size e+1. For small arrays, of size less than 5, the area and time overhead of this extra unit can be significant; in such cases, MGS and GR systolic arrays are more suitable than the GR linear array. In terms of the latency of decomposing a full matrix, the MGS linear and systolic arrays perform better than the GR linear and systolic arrays, respectively, for all matrix sizes.

Figure 6-11: DSP block usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers

Figure 6-12: Slice LUT usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers

Figure 6-13: Slice Register usage in GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers

Figure 6-14: Throughput of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers
Figure 6-15: Latency of GR and MGS, Linear and Systolic Arrays implemented using LA blocks and all DSP based Multipliers

6.2.3 Comparison between Computational Unit Implementation Techniques

6.2.3.1 Linear Arrays

Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers, for GR based linear arrays under all three implementation techniques (LA, Log and NR) are presented in Figures 6-16, 6-17 and 6-18, respectively. Figures 6-19 and 6-20 show the throughput and latency of each of these implementations. These results were computed with all-DSP based 3-stage pipe-lined dedicated multipliers. The word length was set to 32 bits for all array sizes. As is evident from these figures, NR is the most efficient in terms of DSP slice utilization, whereas Log is the most efficient in terms of LUT and register utilization as well as throughput. The throughput of the linear array implemented using Log based units is the highest, closely followed by the linear array implemented using LA based units, as shown in Figure 6-19.

Figure 6-16: DSP block usage in GR Linear Arrays with different Computational block implementations

Figure 6-17: Slice LUT usage in GR Linear Arrays with different Computational block implementations
Figure 6-18: Slice Register usage in GR Linear Arrays with different Computational block implementations

Figure 6-19: Throughput of GR Linear Arrays with different Computational block implementations

Figure 6-20: Latency of GR Linear Arrays with different Computational block implementations

6.2.3.2 Systolic Arrays

Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers, for GR based systolic arrays under all three implementation techniques (LA, Log and NR) are presented in Figures 6-21, 6-22 and 6-23, respectively. Figures 6-24 and 6-25 show the throughput and latency of each of these implementations. These results were computed with all-DSP based 3-stage pipe-lined dedicated multipliers. The word length was set to 32 bits for all array sizes. It is evident from these figures that trends similar to those observed in the linear arrays are seen in the systolic arrays, in terms of DSP, LUT and register utilization as well as throughput and latency, for all three implementation techniques.
Figure 6-21: DSP block usage in GR Systolic Arrays with different Computational block implementations

Figure 6-22: Slice LUT usage in GR Systolic Arrays with different Computational block implementations

Figure 6-23: Slice Register usage in GR Systolic Arrays with different Computational block implementations

Figure 6-24: Throughput of GR Systolic Arrays with different Computational block implementations

Figure 6-25: Latency of GR Systolic Arrays with different Computational block implementations

6.2.4 Multiplier Implementation (Firm versus Soft)

In FPGAs, two different types of multipliers can be inferred: 1) firm, i.e. the built-in multipliers in DSP slices; and 2) soft, constructed using CLB LUTs or other memory elements. Both come with their own sets of pros and cons. DSP multipliers are optimized and are the most efficient in terms of area/time utilization, but their position on the die is fixed; for applications that require a high number of multipliers, the routing cost may overshadow the benefit of using DSP multipliers. Soft multipliers, on the other hand, can be plugged into any part of the design by implementing them with LUTs in the CLBs. CLBs come geared with a carry propagate channel, making it easier to implement accumulators; since multipliers are partial product generators plus accumulators, the structure of the CLB is put to good use.
Multiple implementations are available for soft multipliers; the optimal design in terms of critical path and resource utilization depends on the relationship between the multiplicand/multiplier word length and the LUT input size.

Utilization trends for FPGA resources, including DSP blocks, LUT slices and registers, for GR based linear arrays with the LA based implementation (all implemented for 32-bit word length, complex valued input) are presented in Figures 6-26, 6-27 and 6-28, respectively; Figures 6-29 and 6-30 show the throughput and latency of each of these implementations. For these tests, four different multiplier configurations are used in the sub-units: (1) all multipliers implemented using DSP blocks (dsp); (2) LUT based multipliers in the external units (lutex); (3) LUT based multipliers in the internal units (lutin); and (4) all LUT based multipliers (lut). As can be seen from these figures, throughput is slightly better for the all-DSP configuration at smaller input array sizes, but as the array size increases, the difference between the four configurations becomes negligible.
Figure 6-26: DSP block usage in GR Linear Arrays with different Multiplier implementations

Figure 6-27: Slice LUT usage in GR Linear Arrays with different Multiplier implementations

Figure 6-28: Slice Register usage in GR Linear Arrays with different Multiplier implementations

Figure 6-29: Throughput of GR Linear Arrays with different Multiplier implementations

Figure 6-30: Latency of GR Linear Arrays with different Multiplier implementations

6.3 Asymptotic Analysis of Design Parameters with Increasing Input Array Size

6.3.1 Throughput

The throughput of our implemented architectures is discussed in the following subsections.

6.3.1.1 Systolic Arrays

The throughput of QRD for a 3x3 array of complex elements represented in fixed point (word length 32, with 16 bits for the fractional part) using the GR based systolic array implementation is shown in Table 6.1. For these results, shared 3-stage pipe-lined multipliers were used.
Both GR and MGS based systolic arrays can accept a new row every max(latency of internal unit, latency of boundary unit) cycles for all input array sizes m x n. The throughput of the GR based systolic array is O(1), because the latencies of the internal and boundary units do not depend on the array size for a given word length. The throughput of the MGS based systolic array diminishes linearly with increasing array size (O(n)), because the latency of the MGS based internal and boundary units is incremented by a constant number of cycles, namely the time it takes the PU array to consume the extra elements of the batch input vector in series. If the size of the PU array is set equal to the array size, then the throughput of the MGS systolic architecture is also constant with growing input array size.

Table 6.1: Throughput observed for Systolic Arrays (LA based, with all DSP based shared Multipliers)

Algorithm  array size  PU array size  min clock period (ns)  throughput (cycles)  max clock freq (MHz)  throughput (MRows/s)
GR         3           NA             9.628                  16                   103.86                6.49
GR         5           NA             9.796                  16                   102.08                6.38
GR         7           NA             9.983                  16                   100.17                6.26
MGS        2           1              9.955                  18                   100.45                5.58
MGS        3           1              9.983                  28                   100.17                3.58
MGS        4           1              9.972                  42                   100.28                2.39

6.3.1.2 Linear Arrays

The throughput of QRD for a 3x3 array of complex elements represented in fixed point (word length 32, with 16 bits for the fractional part) using the GR based linear array implementation is shown in Table 6.2. For these results, shared 3-stage pipe-lined multipliers were used. Both GR and MGS based linear arrays can accept a new row every m x max(latency of internal unit, latency of boundary unit) cycles for all input array sizes m x n. The throughput of the GR based linear array decreases as O(n^2), or O(mn), with increasing array size m x n. The throughput of the MGS based linear array diminishes as O(n^3) for shared multipliers (PU size < array size) and as O(n^2) for PU size equal to the array size.
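The MRows/s columns of Tables 6.1 and 6.2 follow directly from the clock period and the number of cycles per accepted row; a quick Python spot-check (the function name is ours, for illustration only):

```python
def mrows_per_sec(period_ns, cycles_per_row):
    """Accepted rows per second: 1 / (clock period x cycles per row),
    reported in MRows/s (the clock period is given in nanoseconds)."""
    return 1e3 / (period_ns * cycles_per_row)

# Spot-checks against the reported P&R numbers.
gr_sys_3 = mrows_per_sec(9.628, 16)    # GR systolic, n = 3: ~6.49 MRows/s
mgs_sys_4 = mrows_per_sec(9.972, 42)   # MGS systolic, n = 4: ~2.39 MRows/s
gr_lin_3 = mrows_per_sec(9.543, 47)    # GR linear, n = 3: ~2.23 MRows/s
```

The GR systolic entries share the same 16-cycle interval, so their throughput varies only with the achievable clock period, which is the O(1) behaviour discussed above.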
Table 6.2: Throughput observed for Linear Arrays (LA based, with all DSP based shared Multipliers)

Algorithm  array size  PU array size  min clock period (ns)  throughput (cycles)  max clock freq (MHz)  throughput (MRows/s)
GR         3           NA             9.543                  47                   104.79                2.23
GR         5           NA             9.755                  77                   102.51                1.33
GR         7           NA             9.567                  107                  104.53                0.98
GR         9           NA             9.642                  137                  103.71                0.76
MGS        2           1              9.594                  85                   104.23                1.23
MGS        3           1              9.987                  152                  100.13                0.66
MGS        4           1              9.962                  241                  100.38                0.42
MGS        5           1              10.41                  338                  96.06                 0.28

6.3.2 Latency

The latency of our implemented architectures and their sub-units is discussed in the following subsections.

6.3.2.1 Systolic Arrays

The latency for 1 row, for both GR and MGS based systolic arrays, is (latency of internal unit + latency of boundary unit) cycles for all array sizes m x n. The latency for a full matrix for the GR based systolic array grows as O(n) for array size m x n (where m < n), as each of the n rows passes through at least m rows of boundary and internal units in parallel. The latency for a full matrix for the MGS based systolic array increases as O(n^2) for shared multipliers (PU size < array size) and as O(n) for PU size equal to the array size, for all array sizes m x n, where m < n.

6.3.2.2 Linear Arrays

The latency for 1 row, for both GR and MGS based linear arrays, is (2m - 1) x max(latency of internal unit, latency of boundary unit) cycles for all array sizes m x n. The latency for a full matrix for the GR based linear array grows as O(m x n) for array size m x n (where m < n), because each input row passes through at least m rows of boundary and internal units sequentially, and 2 rows are processed in parallel. The latency for a full matrix for the MGS based linear array increases as O(n^3) for shared multipliers (PU size < array size) and as O(n^2) for PU size equal to the array size, for all array sizes m x n, where m < n.

6.3.2.3 Internal Unit

1. The latency of the internal unit in GR based QRD equals the latency of 4 multiplications done in parallel or in series + the latency of 2 additions + 1 cycle of FIFO delay.
With a dedicated 3-stage pipe-lined multiplier for each product, the latency of the internal unit was observed in our experiments to be 6 cycles; with a shared 3-stage pipe-lined multiplier for all 4 products, it was observed to be 12 cycles.

2. The latency of the internal unit in MGS based QRD equals the latency of the dot product computation unit (DOT) + the latency of the offset correction unit (OC), where

latency of DOT = (size of input vector / size of vector of multipliers) x latency of a multiplication + (size of input vector / size of vector of adders used for batch accumulation) x latency of an addition

latency of OC = (size of input vector / size of vector of multipliers) x latency of a multiplication + (size of input vector / size of vector of adders used for batch subtraction) x latency of an addition

The latency of the internal unit implemented with a dedicated 3-stage pipe-lined multiplier for each product, with the size of the vector of computation units equal to 3, was observed in our experiments to be 16 cycles.

6.3.2.4 Boundary Unit

1. The latency of the boundary unit in GR based QRD depends on the type of implementation.

(a) For LA based, the latency equals 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + the latency of a table look-up + one multiplication + one addition. The observed value of this latency in our experiments, for a 32-bit word length implemented with dedicated 3-stage pipe-lined multipliers, is 14 cycles.

(b) For Log domain based, the latency equals 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + the latency of one Log-values table look-up + one shift + one addition + the latency of 2 Exponential-values table look-ups in parallel or in series. The observed value of this latency in our experiments, for a 32-bit word length implemented with dedicated 3-stage pipe-lined multipliers, is 11 cycles.
(c) For NR based, the latency equals 2 or 3 multiplications (for fixed-point or complex numbers, respectively) + 1 square root computation + 1 division, where both the square root and the divider take (word length / iterations per cycle) cycles. The observed value of this latency in our experiments, for a 32-bit word length implemented with dedicated 3-stage pipe-lined multipliers, is 60 cycles.

2. The latency of the boundary unit in MGS based QRD equals the latency of the norm computation unit (NORM) + the latency of the square root computation unit (SQR) + max(latency of a multiplication, latency of the vector product unit (VP)), where

latency of NORM = (size of input vector / size of vector of multipliers) x latency of a multiplication + (size of input vector / size of vector of adders used for batch accumulation) x latency of an addition

latency of VP = (size of input vector / size of vector of multipliers) x latency of a multiplication

and the latency of SQR depends on the type of implementation.

(a) For LA based, the latency equals the latency of a table look-up + one multiplication + one addition.

(b) For Log domain based, the latency equals the latency of one Log-values table look-up + one shift + one addition + the latency of 2 Exponential-values table look-ups in parallel or in series.

(c) For NR based, it takes (word length / iterations per cycle) cycles.

The observed value of this latency in our experiments, for a 32-bit word length implemented with dedicated 3-stage pipe-lined multipliers, is 28 cycles for LA based implementations.

6.3.2.5 Multipliers

For both LUT and DSP based multiplier implementations, the latency equals the number of pipeline stages + 1. If a multiplier is shared to compute multiple products, then the latency for the last product is the number of pipeline stages + 1 buffering cycle + the number of inputs.
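The multiplier latency rules above, in executable form (illustrative Python; the function name is ours):

```python
def multiplier_latency(pipeline_stages, shared_inputs=None):
    """Latency of a pipelined multiplier as described above: stages + 1
    for a dedicated multiplier; stages + 1 buffering cycle + the number
    of inputs for the last product of a shared multiplier."""
    if shared_inputs is None:
        return pipeline_stages + 1
    return pipeline_stages + 1 + shared_inputs
```

With the 3-stage multipliers used in our experiments, a dedicated multiplier takes 4 cycles, while sharing it across the 4 products of a GR internal unit brings the last product to 8 cycles.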
6.3.3 Area

6.3.3.1 Systolic Array

For both GR and MGS, if the area of the boundary unit is AB gates and the area of the internal unit is AI gates, then the area of a systolic array implementation is m x AB + (n(n+1)/2 - m) x AI gates + the area of the connection network, for all array sizes m x n with m < n.

6.3.3.2 Linear Array

For both GR and MGS, if the area of the boundary unit is AB and the area of the internal unit is AI, then the area of a linear array implementation is AB + (al - 1) x AI, where al is the number of units in the unit-row, for all array sizes m x n with m < n. This does not include the area occupied by the connection network.

6.3.3.3 Internal Unit

The area of the internal units, in gates, is as follows:

1. The area of the internal unit in GR based QRD equals the area of 1 to 4 multipliers + the area of 1 to 2 adders.

2. The area of the internal unit in MGS based QRD equals the area of 1 dot product computation unit (DOT) + the area of 1 offset correction unit (OC), where the area of both DOT and OC = size of vector of multipliers x area of a multiplier + size of vector of adders x area of an adder.

6.3.3.4 Boundary Unit

1. The area of the boundary unit in GR based QRD depends on the implementation.

(a) For LA based, the area equals 2 to 4 fixed-point multipliers + 1 LA look-up table + 3 adders, where each entry in the LA look-up table is 2 x word-length bits long.

(b) For Log domain based, the area equals 2 to 3 fixed-point multipliers + 1 Log-values look-up table + 1 shifter + 3 adders + 1 to 2 Exponential-values look-up tables, where each entry in the Log-values and Exponential-values look-up tables is word-length bits long.

(c) For NR based, the area equals the area of 1 to 3 fixed-point multipliers + 3 adders + the area of 1 square root computational unit + the area of a divider. The square root unit has an adder, an up-to-2-bit shifter, a 1-word input FIFO, a 2-word output FIFO, and 5 state registers: 3 data registers each of length one word, 1 counter register of length log2(word length) bits, and 1 flag register of length one bit.
The divider unit has an adder, a 1-bit shifter, a 2-word input FIFO, a 2-word output FIFO, and 5 state registers. The five state registers are similar to the ones in the square-root unit.

2. The area of the boundary unit in MGS based QRD is equal to the area of the norm computation unit (NORM) + the area of the square-root computation unit (SQR) + the area of a multiplier + the area of the vector product unit (VP), where area of NORM = size of vector of multipliers x area of a multiplier + size of vector of adders x area of an adder; area of VP = size of vector of multipliers x area of a multiplier; and the area of SQR depends on the implementation.

(a) For LA based, the area is equal to 1 LA look-up table + 1 adder + 1 fixed-point multiplier, where each entry in the LA look-up table is 2 x word length bits long.

(b) For Log domain based, the area is equal to 1 Log-values look-up table + 1 shifter + 1 adder + 1 to 2 Exponential-values look-up tables, where each entry in the Log-values and Exponential-values look-up tables is word length bits long.

(c) For NR based, the square-root unit has 5 state registers (3 data registers of length word length bits, 1 counter register of length Log2(word length) bits, and 1 1-bit flag register), in addition to a word length incoming FIFO and a 2 x word length outgoing FIFO. It also contains a word length adder and an up to 2-bit shifter.

6.3.3.5 Multipliers

For a complex number with both real and imaginary parts represented by 32-bit fixed-point numbers, a multiplier takes 6 DSP multipliers on Virtex-6. A 16-bit fixed-point multiplier takes 1 DSP block or around 223 LUTs.

The area used by multipliers can be reduced by taking advantage of the pipelined nature of the multiplier, and sharing it to compute multiple products instead of using dedicated multipliers. This reduction in area comes at the cost of increased latency.
But since the total latency of a row of internal and external/boundary units in all architectures discussed here is equal to the larger of the two latencies, and the boundary unit takes significantly more cycles than the internal unit, we can safely share multipliers in the internal unit without loss in overall performance.

6.4 Target Oriented Optimization

The target oriented optimal choice for a specific MIMO and beam formation setup is discussed in the following sections. Requirements for compressed sensing are problem specific and may range anywhere from very small size with time as the critical design parameter to medium and large size with area as the critical design parameter. The optimal solution can be found by searching the configuration space for the given input data range and matrix size. Therefore, the target based optimal choice for compressed sensing is not discussed here.

6.4.1 MIMO

6.4.1.1 Required Specifications

For MIMO of size 4x4, as recommended in [1], 16 bits are enough for preserving the precision [19]. In a 3G LTE OFDM signal [4] there are up to 2048 sub-carriers. The channel matrix R must be computed for all the sub-carriers within the duration for which the channel impulse response is invariant (the coherence time), which is computed as:

    t_c = c / (v * f_c)                                    (6.1)

where c is the speed of light (3 x 10^8 m/s), v is the speed of the receiver, and f_c is the carrier frequency. For v = 250 km/h and f_c = 2.4 GHz, t_c = 1.8 ms, in which 2048 decompositions must be performed.

6.4.1.2 Optimized Configurations

Due to the small input-matrix size, systolic arrays can be employed for best throughput. Table 6.3 shows the P&R results for the systolic GR and MGS arrays. All these implementations complete 2048 computations in less than 1.8 ms except for the NR based QRD. For minimum area, the Systolic Log based GR QRD provides the best performance.

Table 6.3: P&R results for GR and MGS based Systolic Arrays for a complex valued input array of size 4x4 and word length (6.10)
Algorithm  Impl. type  PU array size  DSP usage   LUT usage    Reg usage    Time for 2048 rows (ms)
GR         LA          NA             99 (12%)    19189 (12%)  12069 (4%)   1.283
GR         Log         NA             30 (3%)     19201 (12%)  11447 (3%)   0.876
GR         NR          NA             22 (2%)     19056 (12%)  14026 (4%)   1.867
MGS        LA          1              243 (31%)   33297 (22%)  31562 (10%)  0.857
MGS        LA          2              438 (57%)   41124 (27%)  33031 (10%)  0.674

Comparing these results, the most optimized choice for this given problem is the Systolic Log based GR QRD.

6.4.2 Beam Formation

6.4.2.1 Required Specifications

The beam-former weights and channel estimates are computed using pilot symbols transmitted through the dedicated physical control channel (DPCCH). The updated beam-former weights are used for multiplication with the data transmitted through the DPCCH. For a narrower beam, a larger weight matrix dimension is required. As the area constraints get tighter, and chip area becomes scarcer, a linear array becomes the more suitable choice.

6.4.2.2 Optimized Configurations

The GR linear array QRD architectures designed in this study can fit on a Virtex-6 even for matrices of size 25x25 and larger. However, the choice between NR based and Log based GR linear QRD depends on the DSP block and throughput requirements. Table 6.4 shows area and timing results for GR Linear NR and Log based QRD, implemented for word size 16 (6-bit integer part, 10-bit fraction part) using all DSP multipliers.
Table 6.4: P&R results for GR Linear Arrays for complex valued input arrays, word length (6.10)

Algorithm  Array size  Impl. tech.  DSP usage  LUT usage    Reg usage    Throughput (micro sec/row)
GR         17          NR           25 (3%)    18308 (12%)  13997 (4%)   6.92
GR         17          Log          27 (3%)    18468 (12%)  13266 (4%)   2.28
GR         19          NR           28 (3%)    19078 (12%)  14675 (4%)   7.43
GR         19          Log          30 (3%)    19101 (12%)  13978 (4%)   2.54
GR         21          NR           31 (4%)    20035 (13%)  15384 (5%)   8.31
GR         21          Log          33 (4%)    20177 (13%)  14654 (4%)   2.82
GR         25          NR           37 (4%)    22469 (14%)  16805 (5%)   10.45
GR         25          Log          39 (5%)    21876 (14%)  16075 (5%)   3.37

6.5 Comparison with Previously Reported Results

We have also compared the results of our proposed architectures with those previously reported, in terms of throughput, latency and area. The specific comparison has been made with the results reported by [7], [12], [14], [15], [16], [19], [21], [24] and [25]. This comparison is presented in Tables 6.5 and 6.6 for MIMO and beam formation, respectively. Comparing our results with the previously reported ones shows that our implementation outperforms the earlier reported results in terms of performance.

6.5.1 MIMO

From the comparison of results for MIMO, it is apparent that our GR based systolic array implemented using Log domain computational units has the best throughput, closely followed by [15] running at 160 MHz and [19] when operated at its highest clock frequency setting. Note that a higher clock frequency results in higher power consumption for any given circuit. A comparison between the CLB count for [21] and for our architectures, acquired from P&R reports for Virtex-6, shows that our Log based, NR based and LA based 4x4 QRD implementations took 2.4, 2.6 and 2.7 times more CLBs than the implementation presented in [21], while improving the throughput by 54, 25 and 37 times over [21].

6.5.2 Beam Formation

Since for beam formation the area is the critical requirement, we only compared the area reported by previous authors, as shown in Table 6.6.
From Table 6.6 it is clear that the minimum area is utilized by our GR Linear Log based QRD implementation. By using the insight that the boundary unit takes more cycles than the internal unit, we were able to share the multipliers in the internal unit, taking advantage of the extra cycles available to the internal unit to process one input. Consequently we reduced the total number of multipliers used by 2.5x.

Table 6.5: Comparison of our study results with previously reported results for MIMO (* calculated from given data)

Algorithm                Gate count or       Slice count  Clock freq  Time for 2048  MRows  Cycles
                         gate equiv. (Kgates)             (MHz)       rows (ms)      /sec   per QRD
Previously reported:
[19] high clock-freq     23.2                480*         269         1.058          0.13*  -
[19] medium clock-freq   17.7                773*         212         1.343          0.16*  -
[19] low clock-freq      16.3                1357*        160         1.779          0.22*  -
[16]                     61.8                966*         162         1.28*          0.16*  -
[15]                     48.7                707*         166         0.96*          0.12*  -
[14]                     27                  88           -           -              -      -
[25]                     6                   252          -           -              -      -
[24]                     72                  67           60          -              -      -
[21] (32 bits)           1380                -            100         -              0.17   -
[7] (12 bits)            -                   -            -           -              0.09   -
This study:
GR systolic LA           -                   7449 (19%)   102.12      1.28           6.38   64
GR systolic Log          -                   6733 (17%)   102.83      0.88           9.35   44
GR systolic NR           -                   7079 (18%)   114.04      1.87           4.39   104
MGS systolic LA (PU=1)   -                   13470 (35%)  100.37      3.43           2.39   68
MGS systolic LA (PU=2)   -                   15762 (41%)  100.32      2.69           3.04   52

Table 6.6: Comparison of our study results with previously reported results for beam formation

                           Real         Real beta    Divisors  Rounders  Shifters  Adders
                           multipliers  multipliers
Previously reported, [12] at 100 MHz clk:
  boundary cell            2            8            1         1         2         0
  internal cell            8            8            1         0         1         0
  output cell              0            2            0         0         0         0
  linear array (total)     162          170          21        1         22        0
This study, GR Linear LA:
  boundary cell            4            0            0         0         0         3
  internal cell            3            0            0         0         0         7
  linear array (total)     67           0            0         0         0         150
This study, GR Linear Log:
  boundary cell            3            0            0         0         1         3
  internal cell            3            0            0         0         0         7
  linear array (total)     66           0            0         0         1         150
This study, GR Linear NR:
  boundary cell            1            0            3         0         0         4
  internal cell            3            0            0         0         0         7
  linear array (total)     64           0            3         0         0         151

6.6 Guidelines for Architecture Selection

From the preceding results and discussion, the following guidelines for the appropriate selection of an architecture for a given problem can be concluded, as presented in Table 6.7.

Table 6.7: Selection of Appropriate Architecture

Critical Design Parameters                    Appropriate Architecture  Impl. tech. for boundary unit
Area                                          GR Linear Array           any
DSP blocks                                    GR Linear/Systolic        NR
Throughput                                    GR Systolic               Log
Dynamic range of input elements and accuracy  any                       LA
Latency, input matrix size < 4                MGS Systolic              LA/Log
Latency and area, input matrix size < 4       MGS Linear                LA/Log
Latency and area, input matrix size > 4       GR Systolic               Log

Chapter 7

Conclusions

We present a highly modular and completely parameterized implementation of two different algorithms, Givens-Rotation (GR) and Modified-Gram-Schmidt (MGS), chosen for their suitability for hardware implementation. From the results of the implementation of four parameterized architectures (systolic Givens rotation based, linear Givens based, systolic MGS based, and linear MGS based), each with three different configurations (linear approximation, log domain, and Newton-Raphson), it was concluded that:

(1) A maximum throughput of 10.1 M rows/sec was achieved by the Givens based systolic array with the log domain QRD configuration, for a 3x3 complex valued matrix on a Virtex-6 FPGA.

(2) The minimum slice utilization was achieved by the Givens based linear array with the log domain QRD configuration, at the cost of reduced throughput.
(3) The Givens based systolic array with the log domain configuration proved to be the most resource efficient design.

(4) MGS based QRD outperforms GR in terms of latency, but is suitable only for input array sizes < 4 because of the exponential growth in area with increasing input array size.

(5) IP for all the proposed architectures have been prepared and are available at http://saqib.scripts.mit.edu/qr-code.php. This set of IPs can be configured to suit a variety of application demands to generate hardware with nearly zero design and debugging time.

(6) The guidelines concluded from the reported results can be used to pick the optimal design choice for a given set of design requirements.

(7) Because our architectures are completely modular, their sub-units can be independently optimized and tested without the need for re-testing the whole system.

Bibliography

[1] 3GPP TR 25.876: Multiple input multiple output in UTRA. 3rd Generation Partnership Project, Tech. Rep., October 2005.

[2] MIMO and smart antennas for mobile broadband systems. LTE standard, 2013.

[3] Naofal Al-Dhahir and Ali H. Sayed. CORDIC-based MMSE-DFE coefficient computation. Digital Signal Processing: A Review Journal, pages 178-194, 1999.

[4] R. Bachl, P. Gunreben, S. Das, and S. Tatesh. The long term evolution towards a new 3GPP air interface standard. Bell Labs Technical Journal, 11(4):25-51, 2007.

[5] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Inform. Theory, 51(12):4203-4215, 2005.

[6] L. Dai, S. Sfar, and K. B. Letaief. Optimal antenna selection based on capacity maximization for MIMO systems in correlated channels. IEEE Transactions on Communications, 54(3):563-573, March 2006.

[7] Fredrik Edman and Viktor Owall. A scalable pipelined complex valued matrix inversion architecture. Proc. IEEE ISCAS, pages 4489-4492, 2005.

[8] S. Haykin. Adaptive Filter Theory. Prentice-Hall, third edition, 1994.

[9] S.-F. Hsiao and J.-M. Delosme. Householder CORDIC algorithms. IEEE Trans. Comput., 44:990-1001, August 1995.

[10] A.
Jraifi and E. H. Saidi. A prediction of the number of antennas in a MIMO correlated channel. International Conference on Intelligent Engineering Systems (INES 2008), pages 181-184, February 2008.

[11] T. Kailath, H. Vikalo, and B. Hassibi. MIMO receive algorithms. In Space-Time Wireless Systems: From Array Processing to MIMO Communications, Cambridge University Press, 2005.

[12] G. Lightbody, R. Walke, R. Woods, and J. McCanny. Linear QR architecture for a single chip adaptive beamformer. The Journal of VLSI Signal Processing, 24(1):67-81, 2000.

[13] Chih-Hung Lin, R. C.-H. Chang, Chien-Lin Huang, and Feng-Chi Chen. Iterative QR decomposition architecture using the modified Gram-Schmidt algorithm. IEEE International Symposium on Circuits and Systems (ISCAS 2009), 2009.

[14] Kuang-Hao Lin, R. C. Chang, Chien-Lin Huang, and Feng-Chi Chen. Implementation of QR decomposition for MIMO-OFDM detection systems. 15th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2008), pages 57-60, 2008.

[15] P. Luethi, A. Burg, S. Haene, D. Perels, N. Felber, and W. Fichtner. VLSI implementation of a high-speed iterative sorted MMSE QR decomposition. Proc. of IEEE ISCAS, pages 1421-1424, 2007.

[16] P. Luethi, C. Studer, S. Duetsch, and E. Zgraggen. Gram-Schmidt-based QR decomposition for MIMO detection: VLSI implementation and comparison. IEEE Asia Pacific Conference on Circuits and Systems (APCCAS 2008), pages 830-833, 2008.

[17] Robert L. Parker. Geophysical Inverse Theory. Princeton University Press, 1994.

[18] H. Sakai. Recursive least-squares algorithms of modified Gram-Schmidt type for parallel weight extraction. IEEE Trans. Signal Process., 42:429-433, February 1994.

[19] P. Salmela, A. Burian, H. Sorokin, and J. Takala. Complex-valued QR decomposition implementation for MIMO receivers. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), pages 1433-1436, 2008.
[20] Perttu Salmela, Adrian Burian, Harri Sorokin, and Jarmo Takala. Complex-valued QR decomposition implementation for MIMO receivers. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), pages 1433-1436, March 2008.

[21] Anatoli Sergyienko and Oleg Maslennikov. Implementation of Givens QR decomposition in FPGA. PPAM '01: Proceedings of the International Conference on Parallel Processing and Applied Mathematics, Revised Papers, pages 458-465, 2001.

[22] M. Shabany and P. G. Gulak. A 0.13um CMOS 655 Mb/s 4x4 64-QAM K-best MIMO detector. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, pages 256-257, February 2009.

[23] C. K. Singh, S. H. Prasad, and P. T. Balsara. A fixed-point implementation of QR decomposition. IEEE Dallas Workshop on Circuits and Systems, Dallas, TX, pages 795-825, October 2006.

[24] C. K. Singh, S. H. Prasad, and P. T. Balsara. VLSI architecture for matrix inversion using modified Gram-Schmidt based QR decomposition. Int. Conf. VLSI Design, pages 836-841, January 2007.

[25] F. Sobhanmanesh and S. Nooshabadi. Parametric minimum hardware QR-factoriser architecture for V-BLAST detection. IEE Proc. Circuits, Devices and Systems, pages 433-441, 2006.

[26] W. S. Song, D. V. Rabinkin, M. M. Vai, and H. T. Nguyen. VLSI bit-level systolic sample matrix inversion. MIT Lincoln Laboratory Report NTP-2, 2001.

[27] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Inform. Theory, 53(12):4655-4666, 2007.

[28] S. Wang and E. E. Swartzlander, Jr. The critically damped CORDIC algorithm for QR decomposition. IEEE Asilomar Conf. on Signals, Systems and Computers, pages 908-911, November 1996.

Appendix A

Tables

Table A.1: P&R results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD configurations, for word length (16.16)
Sr.  Alg  Structure  Size  Mult. (Bnd, Int)  Impl.  PU  DSP usage   LUT usage    Reg usage
1    GR   Linear     3     DSP, DSP          LA     NA  42 (5%)     14536 (9%)   10078 (3%)
2    GR   Linear     5     DSP, DSP          LA     NA  60 (7%)     16252 (10%)  11611 (3%)
3    GR   Linear     7     DSP, DSP          LA     NA  78 (10%)    17810 (11%)  13135 (4%)
4    GR   Linear     9     DSP, DSP          LA     NA  96 (12%)    19570 (12%)  14654 (4%)
5    GR   Systolic   3     DSP, DSP          LA     NA  120 (15%)   20434 (13%)  12835 (4%)
6    GR   Systolic   5     DSP, DSP          LA     NA  294 (38%)   35541 (23%)  22632 (7%)
7    GR   Systolic   7     DSP, DSP          LA     NA  540 (70%)   58291 (38%)  37588 (12%)
8    GR   Linear     3     DSP, DSP          Log    NA  24 (3%)     13514 (8%)   9790 (3%)
9    GR   Linear     5     DSP, DSP          Log    NA  36 (4%)     15240 (10%)  11281 (3%)
10   GR   Linear     7     DSP, DSP          Log    NA  48 (6%)     16853 (11%)  12763 (4%)
11   GR   Linear     9     DSP, DSP          Log    NA  60 (7%)     18810 (12%)  14241 (4%)
12   GR   Systolic   3     DSP, DSP          Log    NA  72 (9%)     18422 (12%)  11975 (3%)
13   GR   Systolic   5     DSP, DSP          Log    NA  180 (23%)   31877 (21%)  20951 (6%)
14   GR   Systolic   7     DSP, DSP          Log    NA  336 (43%)   53798 (35%)  35021 (11%)
15   GR   Linear     3     DSP, DSP          NR     NA  16 (2%)     14355 (9%)   11344 (3%)
16   GR   Linear     5     DSP, DSP          NR     NA  28 (3%)     16223 (10%)  12726 (4%)
17   GR   Linear     7     DSP, DSP          NR     NA  40 (5%)     17946 (11%)  14208 (4%)
18   GR   Linear     9     DSP, DSP          NR     NA  52 (6%)     19690 (13%)  15685 (5%)
19   GR   Systolic   3     DSP, DSP          NR     NA  48 (6%)     20699 (13%)  15759 (5%)
20   GR   Systolic   5     DSP, DSP          NR     NA  140 (18%)   35234 (23%)  27917 (9%)
21   GR   Systolic   7     DSP, DSP          NR     NA  280 (36%)   59335 (39%)  45195 (14%)
22   GR   Linear     3     LUT, DSP          LA     NA  12 (1%)     18083 (11%)  10501 (3%)
23   GR   Linear     5     LUT, DSP          LA     NA  24 (3%)     19876 (13%)  11989 (3%)
24   GR   Linear     7     LUT, DSP          LA     NA  36 (4%)     21571 (14%)  13468 (4%)
25   GR   Linear     9     LUT, DSP          LA     NA  48 (6%)     23225 (15%)  14957 (4%)
26   GR   Systolic   3     LUT, DSP          LA     NA  36 (4%)     30983 (20%)  14077 (4%)
27   GR   Systolic   5     LUT, DSP          LA     NA  120 (15%)   53334 (35%)  24457 (8%)
28   GR   Linear     3     LUT, LUT          LA     NA  0 (0%)      20939 (13%)  10989 (3%)
29   GR   Linear     5     LUT, LUT          LA     NA  0 (0%)      25733 (17%)  13018 (4%)
30   GR   Linear     7     LUT, LUT          LA     NA  0 (0%)      30514 (20%)  15031 (4%)
31   GR   Systolic   3     LUT, LUT          LA     NA  0 (0%)      39815 (26%)  15261 (5%)
32   GR   Systolic   5     LUT, LUT          LA     NA  0 (0%)      82603 (54%)  28697 (9%)
33   GR   Linear     3     DSP, LUT          LA     NA  24 (3%)     17018 (11%)  10480 (3%)
34   GR   Linear     5     DSP, LUT          LA     NA  24 (3%)     21558 (14%)  12403 (4%)
35   GR   Linear     7     DSP, LUT          LA     NA  24 (3%)     26284 (17%)  14703 (4%)
36   GR   Systolic   3     DSP, LUT          LA     NA  66 (8%)     28655 (19%)  14040 (4%)
37   GR   Systolic   5     DSP, LUT          LA     NA  114 (14%)   63666 (42%)  26992 (8%)
38   MGS  Linear     2     DSP, DSP          LA     1   102 (13%)   17866 (11%)  13671 (4%)
39   MGS  Linear     3     DSP, DSP          LA     1   150 (19%)   22551 (14%)  19629 (6%)
40   MGS  Linear     4     DSP, DSP          LA     1   198 (25%)   30825 (20%)  31333 (10%)
41   MGS  Linear     5     DSP, DSP          LA     1   246 (32%)   41146 (27%)  44419 (14%)
42   MGS  Systolic   2     DSP, DSP          LA     1   138 (17%)   19982 (13%)  14482 (4%)
43   MGS  Systolic   3     DSP, DSP          LA     1   288 (37%)   33009 (21%)  29353 (9%)
44   MGS  Systolic   4     DSP, DSP          LA     1   486 (63%)   55966 (37%)  56403 (18%)

Table A.2: Timing results for GR and MGS based Linear and Systolic Arrays with LA, Log and NR based QRD configurations, for word length (16.16)

Sr.  Alg  Structure  Size  Mult. (Bnd, Int)  Impl.  PU  Min. period (ns)  Throughput (Mrows/sec)  Latency (micro sec)
1    GR   Linear     3     DSP, DSP          LA     NA  9.54   2.23   3.24
2    GR   Linear     5     DSP, DSP          LA     NA  9.76   1.33   6.41
3    GR   Linear     7     DSP, DSP          LA     NA  9.57   0.98   10.53
4    GR   Linear     9     DSP, DSP          LA     NA  9.64   0.76   16.06
5    GR   Systolic   3     DSP, DSP          LA     NA  9.63   6.49   3.09
6    GR   Systolic   5     DSP, DSP          LA     NA  9.80   6.38   3.15
7    GR   Systolic   7     DSP, DSP          LA     NA  9.98   6.26   3.22
8    GR   Linear     3     DSP, DSP          Log    NA  9.44   2.58   2.80
9    GR   Linear     5     DSP, DSP          Log    NA  9.44   1.56   5.48
10   GR   Linear     7     DSP, DSP          Log    NA  9.17   1.15   8.97
11   GR   Linear     9     DSP, DSP          Log    NA  9.80   0.84   14.54
12   GR   Systolic   3     DSP, DSP          Log    NA  9.99   10.01  2.20
13   GR   Systolic   5     DSP, DSP          Log    NA  10.69  8.50   2.36
14   GR   Systolic   7     DSP, DSP          Log    NA  9.94   9.15   2.21
15   GR   Linear     3     DSP, DSP          NR     NA  9.24   0.57   12.75
16   GR   Linear     5     DSP, DSP          NR     NA  9.87   0.32   26.89
17   GR   Linear     7     DSP, DSP          NR     NA  9.35   0.24   43.06
18   GR   Linear     9     DSP, DSP          NR     NA  9.59   0.18   67.13
19   GR   Systolic   3     DSP, DSP          NR     NA  9.36   2.89   6.93
20   GR   Systolic   5     DSP, DSP          NR     NA  9.98   2.71   7.43
21   GR   Systolic   7     DSP, DSP          NR     NA  9.93   2.72   7.42
22   GR   Linear     3     LUT, DSP          LA     NA  9.97   2.13   3.39
23   GR   Linear     5     LUT, DSP          LA     NA  9.75   1.33   6.41
24   GR   Linear     7     LUT, DSP          LA     NA  9.90   0.94   10.90
25   GR   Linear     9     LUT, DSP          LA     NA  9.92   0.74   16.53
26   GR   Systolic   3     LUT, DSP          LA     NA  9.99   6.25   3.21
27   GR   Systolic   5     LUT, DSP          LA     NA  9.99   6.26   3.22
28   GR   Linear     3     LUT, LUT          LA     NA  9.94   2.14   3.38
29   GR   Linear     5     LUT, LUT          LA     NA  9.99   1.30   6.56
30   GR   Linear     7     LUT, LUT          LA     NA  9.98   0.94   10.99
31   GR   Systolic   3     LUT, LUT          LA     NA  9.89   6.32   3.17
32   GR   Systolic   5     LUT, LUT          LA     NA  10.38  6.02   3.34
33   GR   Linear     3     DSP, LUT          LA     NA  9.98   2.13   3.39
34   GR   Linear     5     DSP, LUT          LA     NA  9.93   1.31   6.52
35   GR   Linear     7     DSP, LUT          LA     NA  9.79   0.95   10.78
36   GR   Systolic   3     DSP, LUT          LA     NA  9.99   6.25   3.21
37   GR   Systolic   5     DSP, LUT          LA     NA  9.98   6.26   3.21
38   MGS  Linear     2     DSP, DSP          LA     1   9.59   1.23   0.82
39   MGS  Linear     3     DSP, DSP          LA     1   9.99   0.66   1.52
40   MGS  Linear     4     DSP, DSP          LA     1   9.96   0.42   2.40
41   MGS  Linear     5     DSP, DSP          LA     1   10.41  0.28   3.52
42   MGS  Systolic   2     DSP, DSP          LA     1   9.96   5.58   0.23
43   MGS  Systolic   3     DSP, DSP          LA     1   9.98   3.58   0.43
44   MGS  Systolic   4     DSP, DSP          LA     1   9.97   2.39   0.68

Appendix B

Source Code

Contents
1 DataType.bsv 117
2 Conjugate.bsv 118
3 Double.bsv 119
4 BSVDouble.c 121
5 BSVDouble.h 124
6 GR specific Rotate.bsv 125
7 GR specific LArotation.bsv 126
8 LAtable.bsv 129
9 GR specific Logrotation.bsv 131
10 Exptable.bsv 134
11 Logtable.bsv 136
12 GR Linear specific UnitRow.bsv 138
13 GR Linear specific mkExternal.bsv 140
14 GR Linear specific
mkInternal.bsv 141
15 GR Linear specific QR.bsv 143
16 GR Linear specific mkQR.bsv 144
17 GR Linear specific Memory.bsv 148
18 GR Linear specific States.bsv 150
19 GR Linear specific FixedPointQR.bsv 152
20 GR Linear specific Scemi.bsv 154
21 GR Systolic specific FullRow.bsv 155
22 GR Systolic specific mkFullRow.bsv 156
23 GR Systolic specific mkExternal.bsv 157
24 GR Systolic specific mkInternalRow.bsv 158
25 GR Systolic specific mkInternal.bsv 159
26 GR Systolic specific QR.bsv 161
27 GR Systolic specific mkQR.bsv 162
28 GR Systolic specific FixedPointQR.bsv 164
29 GR Systolic specific mkStreamQR.bsv 166
30 GR Systolic specific Scemi.bsv 167
31 MGS specific BatchAcc.bsv 168
32 MGS specific BatchCS.bsv 169
33 MGS specific BatchProduct.bsv 170
34 MGS specific BatchSub.bsv 171
35 MGS specific mkDot.bsv 172
36 MGS specific mkNorm.bsv 174
37 MGS specific mkOffsetCorrection.bsv 176
38 MGS specific mkVecProd.bsv 179
39 MGS specific SqrtInv.bsv 181
40 MGS specific LASqrtInv.bsv 182
41 MGS specific LogSqrtInv.bsv 184
42 MGS specific NRSqrtInv.bsv 186
43 MGS specific mkDP.bsv 187
44 MGS specific mkTP.bsv 189
45 MGS specific UnitRow.bsv 191
46 MGS specific QR.bsv 193
47 MGS specific mkStreamQR.bsv 194
48 MGS specific Scemi.bsv 196
49 MGS Systolic specific FixedPointQR.bsv 197
50 MGS Systolic specific mkQR.bsv 200
51 MGS Linear specific FixedPointQR.bsv 202
52 MGS Linear specific mkQR.bsv 204
53 Multiplier.bsv 206
54 PipelinedMultiplier.bsv 209
55 GR specific ComplexFixedPointRotation.bsv 212
56 GR Systolic specific StreamQR.bsv 214
57 Divider.bsv 215
58 SquareRoot.bsv 219

1 DataType.bsv

// Author: Sunila Saqib, saqib@mit.edu
// Configuration file.
typedef TMul#(TAdd#(16,16),2) BitLen;   // Multiplier's bit length
typedef 3 Stages;                       // Multiplier's pipeline stages
typedef FixedPoint#(16,16) FP;
typedef Complex#(FP) CP_FP;
typedef 3 Dim;                          // Dimensions of input matrix
typedef 3 PUarrSize;                    // Size of array of processing units
typedef 1024 LAlutSize;                 // Size of LUT for LA based 1/sqrt(X)
typedef 128 ExplutSize;                 // Size of LUT for Log based 1/sqrt(X)
typedef 1024 LoglutSize;                // Size of LUT for Log based 1/sqrt(X)
typedef 4 BitDis;                       // Bit displacement for lookup operation 1
typedef 4 BitDisExp;                    // Bit displacement for lookup operation 2
typedef 1 Depth;                        // Depth of sized FIFOs in dp and tp
typedef 5 NumOfMat;                     // For performance script
typedef 3 RowsPerMat;                   // For performance script

2 Conjugate.bsv

// Author: Sunila Saqib, saqib@mit.edu

typeclass Conjugate#(type data_t);
    function data_t con(data_t x);
endtypeclass

instance Conjugate#(Double);
    function Double con(Double x);
        return x;
    endfunction
endinstance

instance Conjugate#(Real);
    function Real con(Real x);
        return x;
    endfunction
endinstance

instance Conjugate#(Complex#(tnum)) provisos(Arith#(tnum));
    function Complex#(tnum) con(Complex#(tnum) x);
        let y = Complex { rel: x.rel, img: -x.img };
        return y;
    endfunction
endinstance

instance Conjugate#(FixedPoint#(is, fs));
    function FixedPoint#(is, fs) con(FixedPoint#(is, fs) x);
        return x;
    endfunction
endinstance

3 Double.bsv

// Author: Sunila Saqib, saqib@mit.edu

import "BDPI" function Double add (Double a, Double b);
import "BDPI" function Double sub (Double a, Double b);
import "BDPI" function Double minus (Double a);
import "BDPI" function Double divide (Double a, Double b);
import "BDPI" function Double multiply (Double a, Double b);
import "BDPI" function Double absolute (Double a);
import "BDPI" function Double squareroot (Double a);
import "BDPI" function Bool lessthanequal (Double a, Double b);

typedef struct {
    Bit#(64) bits;
} Double deriving (Bits, Eq);

instance RealLiteral#(Double);
    function Double fromReal(Real
x);
        return Double { bits: $realtobits(x) };
    endfunction
endinstance

import "BDPI" dbl_print = function Action dblWrite(Double d);

instance Arith#(Double);
    function Double \+ (Double x, Double y);
        Double result = add(x, y);
        return result;
    endfunction
    function Double \- (Double x, Double y);
        Double result = sub(x, y);
        return result;
    endfunction
    function Double negate(Double x) = minus(x);
    function Double \/ (Double x, Double y);
        Double result = divide(x, y);
        return result;
    endfunction
    function Double \* (Double x, Double y);
        Double result = multiply(x, y);
        return result;
    endfunction
    function Double abs (Double x);
        Double result = absolute(x);
        return result;
    endfunction
endinstance

function Double sqrt(Double x);
    Double result = squareroot(x);
    return result;
endfunction

instance Literal#(Double);
    function Double fromInteger(Integer x);
        return fromReal(fromInteger(x));
    endfunction
endinstance

instance Ord#(Double);
    function Bool \<= (Double x, Double y) = lessthanequal(x, y);
endinstance

instance FShow#(Real);
    function Fmt fshow(Real x);
        match {.n, .f} = splitReal(x);
        return $format(n, ".", trunc(10000*f));
    endfunction
endinstance

4 BSVDouble.c

/* Author: Sunila Saqib, saqib@mit.edu */

#include <stdio.h>
#include "BSVDouble.h"
#include "math.h"

double dbl_unpack(unsigned long long int x) {
    double* xp = (double*)(&x);
    return *xp;
}

unsigned long long int dbl_pack(double x) {
    unsigned long long int* xp = (unsigned long long int*)(&x);
    return *xp;
}

void dbl_print(unsigned long long int d) {
    printf("%0.20f", dbl_unpack(d));
}

double asdouble(long long int x) {
    double* dblptr = (double*)&x;
    return *dblptr;
}

long long int asllint(double x) {
    long long int* lliptr = (long long int*)&x;
    return *lliptr;
}

long long int add (long long int a, long long int b) {
    double ain = asdouble(a);
    double bin = asdouble(b);
    double result = ain + bin;
    long long int result_out = asllint(result);
    return result_out;
}

long long int divide (long long int a, long long int b) {
    double ain = asdouble(a);
    double bin = asdouble(b);
    double result = ain / bin;
    long long int result_out = asllint(result);
    return result_out;
}

long long int absolute (long long int a) {
    double ain = asdouble(a);
    double result = sqrt(ain*ain);   /* abs(ain) */
    long long int result_out = asllint(result);
    return result_out;
}

long long int multiply (long long int a, long long int b) {
    double ain = asdouble(a);
    double bin = asdouble(b);
    double result = ain * bin;
    long long int result_out = asllint(result);
    return result_out;
}

long long int square (long long int a) {   /* power of 2 */
    double ain = asdouble(a);
    double result = ain*ain;   /* (double)pow((double)ain, 2) */
    long long int result_out = asllint(result);
    return result_out;
}

long long int squareroot (long long int a) {
    double ain = asdouble(a);
    double result = (double)sqrt((double)ain);
    long long int result_out = asllint(result);
    return result_out;
}

long long int minus (long long int a) {
    double ain = asdouble(a);
    double result = -ain;
    long long int result_out = asllint(result);
    return result_out;
}

long long int sub (long long int a, long long int b)
void dbl-print(BSVDouble d); #endi f//BSVDOUBLE-H 124 6 GR specific Rotate.bsv /* Author: Sunila Saqib saqibomit.edu */ //interfaces for rotation units typedef struct { tnum x; tnum r; } RotateInput#(type tnum) deriving(Eq, Bits); interface Rotate#(type tnum); interface Put# (RotateInput# (tnum)) request; interface Get#(RotationCS#(tnum)) csout; interface Get#(tnum) rout; endinterf ace typedef struct { tnum c; tnum s; } RotationCS#(type tnum) deriving(Bits, Eq); 125 7 GR specific LArotation.bsv Author: Sunila Saqib saqib~mit.edu //LA based rotation module odule| [m] kLArotation(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,m#(LAtable#(fb, FixedPoint#(is, fs),hight)) mkLUT, Rotate#(Complex#(FixedPoint#(is, fs))) ifc) provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is, TMul#(2, fs))), Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, me)); FIFO#(Complex#(FixedPoint#(is,fs))) rinReg <-mkFIFO(); xinReg <-mkFIF(); FIFO#(Cied~oint#(is,fs))) Reg#(FixedPoint#(is,fs)) offsetReg <-mkRegUo; Reg#(FixedPoint#(is,fs)) iresReg <-mkRegUo; Reg#(FixedPoint#(is,fs)) prodReg <-mkRegU(); LAtable#(fb, FixedPoint#(is,fs),hight) tbl <- mkLUTO; FIFO#(RotationCS#(Complex#(FixedPoint#(is, fs)))) csout-fifo <- mkFIF01(); let csout-g = toGet(csout-fifo); let csout-p = toPut(csout-fifo); FIFO#(Complex#(FixedPoint#(is, fs))) rout-fifo <- mkFIFO(); let rout-g = toGet(rout-fifo); let rout-p = toPut(rout-fifo); Vector#(4,Multiplier#(tnum)) multiplier <replicateM(mkmul()); function Action multiply(a, b, index) = multiplier[index],request put(tuple2(a, b)); Stmt interim = seq while(True) seq action let prodi <- multiplier[Olresponse.get(; let prod2 <- multiplier[1].response.get(); let prod3 <- multiplier[2].response.get(; FixedPoint#(is, fs) ires = prodi + prod2 + prod3; iresReg<=ires; tbl.tableIndex.put(ires); endaction action let tblEntry <- tbl-tableEntry.get(; offsetReg <= tblEntry.offset; Bit#(TSub#(fs, fb)) important = truncate(pack(iresReg)); 4 FixedPoint#(is,fs) diff = 
                    unpack(zeroExtend(important));
                multiply(tblEntry.slope, diff, 0);
            endaction
            action
                let prod <- multiplier[0].response.get();
                prodReg <= prod;
// Get the appropriate linear approximation parameters
// This computes the slope
function FixedPoint#(is, fs) getSlope(Real index2LUT);
    Real i = (-1) * (1 / (2 * pow(index2LUT, 3/2)));
    return fromReal(i);
endfunction

// This computes the offset
function FixedPoint#(is, fs) getOffset(Real index2LUT);
    Real i = 1 / sqrt(index2LUT);
    return fromReal(i);
endfunction

// One entry of the LA LUT - has an "offset" and a "slope"
typedef struct {
    tnum offset;
    tnum slope;
} LinearApproxStruct#(type tnum) deriving(Eq, Bits);

// Structure of the table
typedef Vector#(l, LinearApproxStruct#(FixedPoint#(is, fs)))
    LinearApproxTable#(type l, type is, type fs);

// Generate the linear approximation look-up-table
function LinearApproxTable#(size, is, fs) genLAlut(Integer fBits);
    LinearApproxTable#(size, is, fs) la = newVector;
    Integer tableSize = valueOf(size);
    Integer iterationCount = tableSize;
    Real step = 0;
    for (Integer s = 1; s < iterationCount; s = s+1) begin
        step = fromInteger(s) / (2.0 ** fromInteger(fBits));
        la[s].slope  = getSlope(step);
        la[s].offset = getOffset(step);
    end
    return la;
endfunction

// Interface
interface LAtable#(numeric type fb, type tnum, numeric type tableSize);
    interface Put#(tnum) tableIndex;
    interface Get#(LinearApproxStruct#(tnum)) tableEntry;
endinterface

// Module
module mkLAtable(LAtable#(fb, FixedPoint#(is, fs), tableSize) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
             Add#(b__, 1, TAdd#(is, TMul#(2, fs))));
    let actualTableSize = valueOf(tableSize);
    let fractionBits = valueOf(fb);
    let indexSize = valueOf(TLog#(tableSize));
    LinearApproxTable#(tableSize, is, fs) laLUT = genLAlut(fractionBits);
    FIFO#(LinearApproxStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();

    interface Put tableIndex;
        method Action put(index);
            Bit#(TLog#(tableSize)) indexValue =
                pack(index)[indexSize+(valueOf(fs) - fractionBits)-1 : (valueOf(fs) - fractionBits)];
            outfifo.enq(laLUT[indexValue]);
        endmethod
    endinterface
    interface Get tableEntry = toGet(outfifo);
endmodule

9 GR specific Logrotation.bsv
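The rotation unit in this listing computes the Givens normalization in the log domain: it looks up log2 of the squared norm, halves it with a right shift (square root in the linear domain), negates it (reciprocal), and converts both results back through exponential tables. A compact Python model of that datapath follows (my own illustrative sketch, not thesis code; the rounding step stands in for the fixed-point table indexing):

```python
import math

def log_domain_norms(ires, fbl=8):
    # Linear-to-log lookup, quantized to fbl fractional bits as a stand-in
    # for the fixed-point index used by the log table.
    q = round(ires * 2**fbl) / 2**fbl
    log_sq = math.log2(q)        # log2(r^2 + |x|^2)
    r_new_log = log_sq / 2       # halving in log domain = square root
    r_inv_log = -r_new_log       # negation in log domain = reciprocal
    # Log-to-linear conversion (the two exponential table instances).
    return 2**r_new_log, 2**r_inv_log

r_new, r_inv = log_domain_norms(2.25)
# r_new approximates sqrt(2.25) = 1.5 and r_inv approximates 1/1.5;
# c = r * r_inv and s = x * r_inv then follow with plain multiplies
```

In hardware the halving is literally a one-bit right shift of the table output, and the accuracy is set by the table heights and the fbl/fbe fraction widths.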
/* Author: Sunila Saqib saqib@mit.edu */
// Log domain rotation
module [m] mkLogrotation(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
                         m#(LogTable#(fbl, FixedPoint#(is, fs), hightL)) mkLog,
                         m#(ExpTable#(fbe, FixedPoint#(is, fs), hightE)) mkExp,
                         Rotate#(Complex#(FixedPoint#(is, fs))) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
             Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
             Mul#(2, fs, TAdd#(c__, fs)),
             IsModule#(m, m__));

    FIFO#(Complex#(FixedPoint#(is, fs))) rinReg <- mkFIFO();
    FIFO#(Complex#(FixedPoint#(is, fs))) xinReg <- mkFIFO();
    Reg#(FixedPoint#(is, fs)) prodReg <- mkRegU();
    Integer offset = 1 / (2 ** valueOf(fbe));
    Vector#(1, LogTable#(fbl, FixedPoint#(is, fs), hightL)) logtbl <- replicateM(mkLog());
    Vector#(2, ExpTable#(fbe, FixedPoint#(is, fs), hightE)) exptbl <- replicateM(mkExp());

    FIFO#(RotationCS#(Complex#(FixedPoint#(is, fs)))) csout_fifo <- mkFIFO1();
    let csout_g = toGet(csout_fifo);
    let csout_p = toPut(csout_fifo);
    FIFO#(Complex#(FixedPoint#(is, fs))) rout_fifo <- mkFIFO();
    let rout_g = toGet(rout_fifo);
    let rout_p = toPut(rout_fifo);

    Vector#(3, Multiplier#(FixedPoint#(is, fs))) multiplier <- replicateM(mkmul());
    function Action multiply(a, b, index) =
        multiplier[index].request.put(tuple2(a, b));
    function Action result(Reg#(FixedPoint#(is, fs)) dst, int index);
        action
            let res <- multiplier[index].response.get();
            dst <= res;
        endaction
    endfunction

    Stmt interim = seq
        while (True) seq
            action
                let prod1 <- multiplier[0].response.get();
                let prod2 <- multiplier[1].response.get();
                let prod3 <- multiplier[2].response.get();
                FixedPoint#(is, fs) ires = prod1 + prod2 + prod3;
                logtbl[0].tableIndex.put(ires);
            endaction
            action
                let log_res <- logtbl[0].tableEntry.get();
                let r_new_sqr_log = log_res.logval;
                let r_new_log = r_new_sqr_log >> 1;
                let r_new_inv_log = 0 - r_new_log;
                exptbl[0].tableIndex.put(r_new_log);
                exptbl[1].tableIndex.put(r_new_inv_log);
            endaction
            action
                let r_new <- exptbl[0].tableEntry.get();
                let r_inv <- exptbl[1].tableEntry.get();
                Complex#(FixedPoint#(is, fs)) rout = Complex {
                    rel: r_new.expval, img: 0 };
                rout_p.put(rout);
                prodReg <= r_inv.expval;
            endaction
            action
                let r_inv = prodReg;
                let xin = xinReg.first();
                let rin = rinReg.first();
                multiplier[0].request.put(tuple2(xin.rel, r_inv));
                multiplier[1].request.put(tuple2(xin.img, r_inv));
                multiplier[2].request.put(tuple2(rin.rel, r_inv));
            endaction
            action
                let prod1 <- multiplier[0].response.get();
                let prod2 <- multiplier[1].response.get();
                let prod3 <- multiplier[2].response.get();
                Complex#(FixedPoint#(is, fs)) cout = Complex { rel: prod3, img: 0 };
                Complex#(FixedPoint#(is, fs)) sout = Complex { rel: prod1, img: prod2 };
                csout_p.put(RotationCS { c: cout, s: sout });
                xinReg.deq();
                rinReg.deq();
            endaction
        endseq
    endseq;
    mkAutoFSM(interim);

    interface Put request;
        method Action put(inputs);
            Complex#(FixedPoint#(is, fs)) rin = inputs.r;
            Complex#(FixedPoint#(is, fs)) xin = inputs.x;
            if (rin == Complex{rel:0, img:0} && xin == Complex{rel:0, img:0}) begin
                rout_p.put(0);
                csout_p.put(RotationCS { c: 0, s: 1 });
            end
            else begin
                // (rin*rin) + (xin*con(xin));
                multiply(rin.rel, rin.rel, 0);
                multiply(xin.rel, xin.rel, 1);
                multiply(xin.img, xin.img, 2);
                rinReg.enq(rin);
                xinReg.enq(xin);
            end
        endmethod
    endinterface
    interface Get csout = csout_g;
    interface Get rout = rout_g;
endmodule

10 Exptable.bsv

/* Author: Sunila Saqib saqib@mit.edu */
// Set of modules/functions to generate the log-to-linear
// translation table
// Get the appropriate exponential transformation parameter
// This computes the exponent
function FixedPoint#(is, fs) getExp(Real index2LUT);
    Real i = 2 ** index2LUT;
    return fromReal(i);
endfunction

// One entry of the Exp LUT - has an "exp value"
typedef struct {
    tnum expval;
} ExpStruct#(type tnum) deriving(Eq, Bits);

// Structure of the table
typedef Vector#(l, ExpStruct#(FixedPoint#(is, fs)))
    ExpTableEntries#(type l, type is, type fs);

// Generate the exponential value look-up-table
function ExpTableEntries#(size, is, fs) genExpLUT(Integer fBits);
    ExpTableEntries#(size, is, fs) la = newVector;
    Integer tableSize = valueOf(size);
    Integer
            iterationCount = tableSize;
    Real step = 0;
    for (Integer s = 0; s < iterationCount; s = s+1) begin
        step = fromInteger(s) / (2.0 ** fromInteger(fBits)); // not sure if this works
        la[s].expval = getExp(step);
    end
    return la;
endfunction

function ExpTableEntries#(size, is, fs) genExpLUTneg(Integer fBits);
    ExpTableEntries#(size, is, fs) la = newVector;
    Integer tableSize = valueOf(size);
    Integer iterationCount = tableSize;
    Real step = 0;
    for (Integer s = 0; s < iterationCount; s = s+1) begin
        step = -(fromInteger(s) / (2.0 ** fromInteger(fBits)));
        la[s].expval = getExp(step);
    end
    return la;
endfunction

// Interface
interface ExpTable#(numeric type fb, type tnum, numeric type tableSize);
    interface Put#(tnum) tableIndex;
    interface Get#(ExpStruct#(tnum)) tableEntry;
endinterface

// Module
module mkExpTable(ExpTable#(fb, FixedPoint#(is, fs), tableSize) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
             Add#(b__, 1, TAdd#(is, TMul#(2, fs))));
    let actualTableSize = valueOf(tableSize);
    let fractionBits = valueOf(fb);
    let indexSize = valueOf(TLog#(tableSize));
    ExpTableEntries#(tableSize, is, fs) expLUT = genExpLUT(fractionBits);
    ExpTableEntries#(tableSize, is, fs) expLUTneg = genExpLUTneg(fractionBits);
    FIFO#(ExpStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();

    interface Put tableIndex;
        method Action put(index);
            if (index < 0) begin
                Bit#(TLog#(tableSize)) indexValue =
                    pack(-index)[indexSize+(valueOf(fs) - fractionBits)-1 : (valueOf(fs) - fractionBits)];
                outfifo.enq(expLUTneg[indexValue]);
            end
            else begin
                Bit#(TLog#(tableSize)) indexValue =
                    pack(index)[indexSize+(valueOf(fs) - fractionBits)-1 : (valueOf(fs) - fractionBits)];
                outfifo.enq(expLUT[indexValue]);
            end
        endmethod
    endinterface
    interface Get tableEntry = toGet(outfifo);
endmodule

11 Logtable.bsv

/* Author: Sunila Saqib saqib@mit.edu */
// Set of modules/functions to generate the linear-to-log
// translation table
// Get the appropriate log transformation parameter
// This computes the logval
function FixedPoint#(is, fs) getLog(Real index2LUT);
    Real i = log2(index2LUT);
    return
        fromReal(i);
endfunction

// One entry of the Log LUT - has a "log value"
typedef struct {
    tnum logval;
} LogStruct#(type tnum) deriving(Eq, Bits);

// Structure of the table
typedef Vector#(l, LogStruct#(FixedPoint#(is, fs)))
    LogTableEntries#(type l, type is, type fs);

// Generate the log value look-up-table
function LogTableEntries#(size, is, fs) genLogLUT(Integer fBits);
    LogTableEntries#(size, is, fs) la = newVector;
    Integer tableSize = valueOf(size);
    Integer iterationCount = tableSize;
    Real step = 0;
    for (Integer s = 1; s < iterationCount; s = s+1) begin
        step = fromInteger(s) / (2.0 ** fromInteger(fBits));
        la[s].logval = getLog(step);
    end
    return la;
endfunction

// Interface
interface LogTable#(numeric type fb, type tnum, numeric type tableSize);
    interface Put#(tnum) tableIndex;
    interface Get#(LogStruct#(tnum)) tableEntry;
endinterface

// Module
module mkLogTable(LogTable#(fb, FixedPoint#(is, fs), tableSize) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
             Add#(b__, 1, TAdd#(is, TMul#(2, fs))));
    let actualTableSize = valueOf(tableSize);
    let fractionBits = valueOf(fb);
    let indexSize = valueOf(TLog#(tableSize));
    LogTableEntries#(tableSize, is, fs) logLUT = genLogLUT(fractionBits);
    FIFO#(LogStruct#(FixedPoint#(is, fs))) outfifo <- mkFIFO1();

    interface Put tableIndex;
        method Action put(index);
            Bit#(TLog#(tableSize)) indexValue =
                pack(index)[indexSize+(valueOf(fs) - fractionBits)-1 : (valueOf(fs) - fractionBits)];
            outfifo.enq(logLUT[indexValue]);
        endmethod
    endinterface
    interface Get tableEntry = toGet(outfifo);
endmodule

12 GR Linear specific UnitRow.bsv

/* Author: Sunila Saqib saqib@mit.edu */
// a = number of external nodes
// b = number of internal nodes
// tnum = data type of computation
interface UnitRow#(numeric type a, numeric type b, type tnum);
    interface Vector#(TAdd#(a,b), Put#(tnum)) xin;
    interface Vector#(TAdd#(a,b), Put#(tnum)) rin;
    interface Vector#(b, Put#(RotationCS#(tnum))) csin;
    interface Vector#(b, Get#(tnum)) xout;
    interface Vector#(TAdd#(a,b), Get#(tnum))
        rout;
    interface Vector#(TAdd#(a,b), Get#(RotationCS#(tnum))) csout;
endinterface

module [m] mkUnitRow(m#(External#(tnum)) mkext,
                     m#(Internal#(tnum)) mkint,
                     UnitRow#(a, b, tnum) ifc)
    provisos (IsModule#(m, m__), Bits#(tnum, a__));

    Vector#(a, External#(tnum)) vecExternal <- replicateM( mkext() );
    Vector#(b, Internal#(tnum)) vecInternal <- replicateM( mkint() );

    Vector#(TAdd#(a,b), Put#(tnum)) xins = newVector;
    Vector#(TAdd#(a,b), Put#(tnum)) rins = newVector;
    Vector#(b, Put#(RotationCS#(tnum))) csins = newVector;
    Vector#(b, Get#(tnum)) xouts = newVector;
    Vector#(TAdd#(a,b), Get#(tnum)) routs = newVector;
    Vector#(TAdd#(a,b), Get#(RotationCS#(tnum))) csouts = newVector;

    for (Integer i = 0; i < valueOf(a); i = i+1) begin
        xins[i] = vecExternal[i].xin;
        rins[i] = vecExternal[i].rin;
        routs[i] = vecExternal[i].rout;
    end
    for (Integer i = valueOf(a); i < valueOf(TAdd#(a,b)); i = i+1) begin
        xins[i] = vecInternal[i-valueOf(a)].xin;
        rins[i] = vecInternal[i-valueOf(a)].rin;
        routs[i] = vecInternal[i-valueOf(a)].rout;
    end
    for (Integer i = 0; i < valueOf(a); i = i+1)
        csouts[i] = vecExternal[i].csout;
    for (Integer i = valueOf(a); i < valueOf(TAdd#(a,b)); i = i+1)
        csouts[i] = vecInternal[i-valueOf(a)].csout;
    for (Integer i = 0; i < valueOf(b); i = i+1)
        csins[i] = vecInternal[i].csin;
    for (Integer i = 0; i < valueOf(b); i = i+1)
        xouts[i] = vecInternal[i].xout;

    interface xin = xins;
    interface rin = rins;
    interface csin = csins;
    interface xout = xouts;
    interface rout = routs;
    interface csout = csouts;
endmodule

13 GR Linear specific mkExternal.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface External#(type tnum);
    interface Put#(tnum) xin;
    interface Put#(tnum) rin;
    interface Get#(RotationCS#(tnum)) csout;
    interface Get#(tnum) rout;
endinterface

module [m] mkExternal(m#(Rotate#(tnum)) mkrotate, External#(tnum) ifc)
    provisos(IsModule#(m, m__), Literal#(tnum), Bits#(tnum, a__));
    Rotate#(tnum) rotationUnit <- mkrotate();
    Reg#(Maybe#(tnum)) xinReg <- mkReg(tagged Invalid);
    Reg#(Maybe#(tnum)) rinReg
                              <- mkReg(tagged Invalid);

    rule rotate if (rinReg matches tagged Valid .r &&& xinReg matches tagged Valid .x);
        rotationUnit.request.put(RotateInput {x: x, r: r});
        rinReg <= tagged Invalid;
        xinReg <= tagged Invalid;
    endrule

    interface Put xin;
        method Action put(x) if (xinReg matches tagged Invalid);
            xinReg <= tagged Valid (x);
        endmethod
    endinterface
    interface Put rin;
        method Action put(r) if (rinReg matches tagged Invalid);
            rinReg <= tagged Valid (r);
        endmethod
    endinterface
    interface Get csout = rotationUnit.csout;
    interface Get rout = rotationUnit.rout;
endmodule

14 GR Linear specific mkInternal.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface Internal#(type tnum);
    interface Put#(tnum) xin;
    interface Put#(RotationCS#(tnum)) csin;
    interface Put#(tnum) rin;
    interface Get#(tnum) xout;
    interface Get#(RotationCS#(tnum)) csout;
    interface Get#(tnum) rout;
endinterface

module [m] mkInternal(m#(Multiplier#(tnum)) mkmul, Internal#(tnum) ifc)
    provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
              Conjugate::Conjugate#(tnum), Print#(tnum));
    let xins <- mkFIFO1();
    let rins <- mkFIFO1();
    FIFO#(RotationCS#(tnum)) csins <- mkFIFO1();
    match {.xout_g, .xout_p} <- mkGPFIFO1();
    match {.rout_g, .rout_p} <- mkGPFIFO1();
    match {.csout_g, .csout_p} <- mkGPFIFO1();
    Multiplier#(tnum) multiplier <- mkmul();

    function Action multiply(a, b, index) =
        multiplier.request.put(tuple2(a, b));
    function Action result(Reg#(tnum) dst, int index);
        action
            let res <- multiplier.response.get();
            dst <= res;
        endaction
    endfunction

    let xi  = xins.first();
    let cs  = csins.first();
    let m_r = rins.first();
    Reg#(tnum) cr <- mkRegU();
    Reg#(tnum) sx <- mkRegU();
    Reg#(tnum) cx <- mkRegU();
    Reg#(tnum) sr <- mkRegU();
    Reg#(Bit#(11)) clk <- mkReg(0);
    Reg#(Bool) timeit <- mkReg(False);
    rule tick;
        clk <= clk + 1;
    endrule

    Stmt work = seq
        while (True) seq
            par
                seq
                    multiply(cs.c, m_r, 0);
                    multiply(con(cs.s), xi, 1);
                    multiply(cs.c, xi, 2);
                    action
                        multiply(cs.s, m_r, 3);
                    endaction
                endseq
                seq
                    result(cr, 0);
                    result(sx, 1);
                    result(cx, 2);
                    result(sr, 3);
                endseq
            endpar
            action
                let nr = cr + sx;
                let xo = cx - sr;
                xout_p.put(xo);
                rout_p.put(nr);
                csout_p.put(cs);
                xins.deq();
                csins.deq();
                rins.deq();
            endaction
        endseq
    endseq;
    mkAutoFSM(work);

    interface Put xin;
        method Action put(x);
            xins.enq(x);
        endmethod
    endinterface
    interface Put csin;
        method Action put(cs);
            csins.enq(cs);
        endmethod
    endinterface
    interface Put rin;
        method Action put(r);
            rins.enq(r);
        endmethod
    endinterface
    interface Get xout = xout_g;
    interface Get rout = rout_g;
    interface Get csout = csout_g;
endmodule

15 GR Linear specific QR.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface QR#(numeric type width, type tnum);
    interface Put#(Terminating#(tnum)) xin;
    interface Get#(tnum) rout;
endinterface

16 GR Linear specific mkQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */
// Multiplexers
function tnum multiplexer3(Bit#(2) sel, tnum a, tnum b, tnum c);
    return (sel[1]==0) ? ((sel[0]==0) ? a : b) : (c);
endfunction
function tnum multiplexer2(Bit#(1) sel, tnum a, tnum b);
    return (sel[0]==0) ? a : b;
endfunction

module [m] mkQR(m#(External#(tnum)) mkExt, m#(Internal#(tnum)) mkInt,
                QR#(nTyp, tnum) ifc)
    provisos(IsModule#(m, m__), Literal#(tnum), Bits#(tnum, a__),
             Log#(TDiv#(TMul#(nTyp, TAdd#(nTyp,1)), 2), adsize),
             Div#(nTyp, 2, mTyp), Add#(bTyp, aTyp, mTyp), Add#(aTyp, 0, 1),
             DefaultValue::DefaultValue#(tnum));

    Vector#(nTyp, FIFO#(Terminating#(tnum))) xinFIFO <- replicateM(mkFIFO1());
    Reg#(Bit#(adsize)) incount   <- mkReg(0);
    Reg#(Bit#(adsize)) outcountI <- mkReg(0);
    Reg#(Bit#(adsize)) outcountJ <- mkReg(0);
    /* temporary storage */
    Memory#(mTyp, Bit#(adsize), tnum) mem <- mkMemory();
    FIFO#(Vector#(mTyp, tnum)) currentR <- mkFIFO1();
    FIFO#(Vector#(mTyp, tnum)) routputFIFO <- mkFIFO1();
    Vector#(mTyp, Reg#(RotationCS#(tnum))) csMem <-
        replicateM(mkReg(RotationCS{c:0, s:0}));
    Vector#(bTyp, Reg#(tnum)) xMem <- replicateM(mkReg(0));
    /* flags */
    Reg#(Bit#(adsize)) counter <- mkReg(0);
    Reg#(Bit#(adsize)) prevCounter <- mkReg(0);
    Reg#(Bool) putNext <- mkReg(True);
    Reg#(Bool) acceptinput <- mkReg(True);
    Reg#(Bool)
               resetall <- mkReg(False);
    Reg#(Bool) resetR <- mkReg(False);
    Reg#(Bool) set <- mkReg(False);
    /* sub-modules */
    UnitRow#(aTyp, bTyp, tnum) ur <- mkUnitRow(mkExt, mkInt);
    StateMachine#(nTyp, mTyp, adsize) tbl <- mkStateMachine();

    rule getOutput if (putNext == False);
        /* taking care of r output */
        Vector#(mTyp, tnum) r = newVector;
        for (Integer i = 0; i < valueOf(mTyp); i = i+1)
            r[i] <- ur.rout[i].get();
        if (!resetR) begin
            mem.write.put(MemoryWrite {ad: prevCounter, val: r});
        end
        else begin
            routputFIFO.enq(r);
            Vector#(mTyp, tnum) rReset = replicate(0);
            mem.write.put(MemoryWrite {ad: prevCounter, val: rReset});
        end
        /* taking care of x output */
        Vector#(bTyp, tnum) x = newVector;
        for (Integer i = 0; i < valueOf(bTyp); i = i+1) begin
            x[i] <- ur.xout[i].get();
            if (!resetall) begin
                xMem[i] <= x[i];
            end
            else begin
                xMem[i] <= 0;
            end
        end
        /* taking care of cs output */
        Vector#(mTyp, tnum) c = newVector;
        Vector#(mTyp, tnum) s = newVector;
        for (Integer i = 0; i < valueOf(mTyp); i = i+1) begin
            let cs <- ur.csout[i].get();
            c[i] = cs.c;
            s[i] = cs.s;
            if (!resetall) begin
                csMem[i] <= cs;
            end
            else begin
                csMem[i] <= RotationCS {c: 0, s: 0};
            end
        end
        /* setting flag */
        putNext <= True;
    endrule

    rule putInput if (putNext == True);
        let currentState <- tbl.getState(counter);
        /* putting in r */
        Vector#(mTyp, tnum) rvalue = replicate(0);
        if (!set) begin
            set <= !set;
        end
        else begin
            rvalue = currentR.first();
            currentR.deq();
        end
        for (Integer i = 0; i < valueOf(mTyp); i = i+1) begin
            let val = rvalue[i];
            ur.rin[i].put(val);
        end
        /* putting in x */
        for (Integer i = 0; i < valueOf(mTyp); i = i+1) begin
            tnum xins = fromInteger(0);
            if (i == 0)
                xins = multiplexer2(tpl_1(currentState[i])[1],
                                    xMem[i], xinFIFO[counter].first().data);
            else if (i < valueOf(bTyp))
                xins = multiplexer3(tpl_1(currentState[i]),
                                    xMem[i], xMem[i-1],
                                    xinFIFO[counter].first().data);
            else
                xins = multiplexer2(tpl_1(currentState[i])[1],
                                    xMem[i-1], xinFIFO[counter].first().data);
            ur.xin[i].put(xins);
        end
        /* putting in cs */
        for (Integer i = 0; i < valueOf(bTyp); i = i+1) begin
            RotationCS#(tnum) csvec = multiplexer2(tpl_2(
                currentState[i+valueOf(aTyp)]), csMem[i], csMem[i+1]);
            ur.csin[i].put(csvec);
        end
        /* setting flags */
        /* counter: ranges from 0 to n-1; it represents the rows in
           r-memory. prevCounter: ranges from 0 to n-2; it represents the
           previous row in r-memory (where the output is inserted). */
        prevCounter <= counter;
        let nextCounter = counter + 1;
        if (nextCounter >= fromInteger(valueOf(nTyp))) begin
            nextCounter = 0;
            for (Integer d = 0; d < valueOf(nTyp); d = d+1) begin
                xinFIFO[d].deq();
            end
            let res = xinFIFO[0].first().islast;
            if (res) begin
                resetall <= True;
            end
        end
        counter <= nextCounter;
        putNext <= False;
        mem.read.request.put(nextCounter);
        if (xinFIFO[0].first().islast)
            resetR <= True;
    endrule

    rule getCurrentR;
        let r <- mem.read.response.get();
        currentR.enq(r);
    endrule

    interface Put xin;
        method Action put(xinval) if (acceptinput);
            xinFIFO[incount].enq(xinval);
            if (fromInteger(valueOf(nTyp)) == (incount+1)) begin
                incount <= 0;
                if (xinval.islast == True)
                    acceptinput <= False;
            end
            else
                incount <= incount+1;
        endmethod
    endinterface

    interface Get rout;
        method ActionValue#(tnum) get() if (resetR);
            let val = routputFIFO.first()[outcountJ];
            let ci = outcountI; // 0 to n (rows in r-mem)
            let cj = outcountJ; // 0 to m (columns in r-mem)
            if (fromInteger(valueOf(mTyp)) == (outcountJ+1)) begin
                outcountJ <= 0;
                routputFIFO.deq();
                if (fromInteger(valueOf(nTyp)) == (outcountI+1)) begin
                    outcountI <= 0;
                    acceptinput <= True;
                    resetR <= False;
                    resetall <= False;
                end
                else
                    outcountI <= outcountI+1;
            end
            else begin
                outcountI <= outcountI;
                outcountJ <= outcountJ+1;
            end
            return val;
        endmethod
    endinterface
endmodule

17 GR Linear specific Memory.bsv

/* Author: Sunila Saqib saqib@mit.edu */
/* default value for Complex numbers */
instance DefaultValue#(Complex#(FixedPoint#(is, fs)));
    defaultValue = Complex { rel: FixedPoint {i:0, f:0},
                             img: FixedPoint {i:0, f:0} };
endinstance

/* default value for Fixed-Point numbers */
instance DefaultValue#(FixedPoint#(is, fs));
    defaultValue = FixedPoint {i:0, f:0};
endinstance

/* implementation of Memory starts here */
typedef Vector#(m, dnum) Values#(type m, type dnum);

typedef struct {
    snum ad;
    Values#(m, dnum) val;
} MemoryWrite#(numeric type m, type snum, type dnum) deriving(Eq, Bits);

function BRAMRequest#(snum, dnum) makeRequest(Bool write, snum addr, dnum data)
    provisos(Arith#(snum), Bits#(dnum, a__));
    return BRAMRequest {
        write: write,
        responseOnWrite: False,
        address: addr,
        datain: data
    };
endfunction

interface Memory#(numeric type col, type snum, type dnum);
    // send in the count and get the whole row
    interface Server#(snum, Values#(col, dnum)) read;
    interface Put#(MemoryWrite#(col, snum, dnum)) write;
endinterface

module mkMemory(Memory#(col, snum, tnum))
    provisos(Arith#(snum), Bits#(tnum, a__),
             Bits#(Tuple3#(snum, snum, snum), b__),
             Bits#(Memory::MemoryWrite#(col, snum, tnum), e__),
             Bits#(snum, n__), Literal#(tnum), PrimIndex#(snum, c__));
    BRAMConfigure cfg = defaultValue;
    cfg.allowWriteResponseBypass = False;
    cfg.loadFormat = tagged Hex "bram2.txt";
    Vector#(col, BRAM2Port#(snum, tnum)) duts <- replicateM(mkBRAM2Server(cfg));
    FIFO#(Values#(col, tnum)) val <- mkFIFO1();
    FIFO#(snum) ad <- mkFIFO1();
    FIFO#(tnum) value <- mkFIFO();
    FIFO#(Tuple2#(snum, snum)) address <- mkFIFO();
    FIFO#(snum) addressBeingRead <- mkFIFO();
    FIFO#(MemoryWrite#(col, snum, tnum)) wrt <- mkFIFO1();

    rule writeVal;
        let writeMem = wrt.first();
        wrt.deq();
        let adres = writeMem.ad;
        let valus = writeMem.val;
        for (Integer i = 0; i < valueOf(col); i = i+1) begin
            let value1 = valus[i];
            duts[i].portB.request.put(makeRequest(True, adres, value1));
        end
    endrule

    rule readReq;
        let readAd = ad.first();
        ad.deq();
        Vector#(col, tnum) value = newVector;
        for (Integer i = 0; i < valueOf(col); i = i+1) begin
            let adres = readAd;
            duts[i].portA.request.put(makeRequest(False, adres, value[i]));
        end
    endrule

    rule readRes;
        Values#(col, tnum) values;
        for (Integer i = 0; i < valueOf(col); i = i+1) begin
            values[i] <- duts[i].portA.response.get;
        end
        val.enq(values);
    endrule

    interface Put write = toPut(wrt);
    interface Server read;
        interface Put request = toPut(ad);
        interface Get response = toGet(val);
    endinterface
endmodule

18 GR Linear specific States.bsv

/* Author: Sunila Saqib saqib@mit.edu */
typedef Vector#(m, Tuple3#(Bit#(adsize), Bit#(adsize), Bit#(adsize)))
    State#(type m, type adsize);
typedef Vector#(n, State#(m, adsize))
    StateTable#(type n, type m, type adsize);

function StateTable#(nTyp, mTyp, adsize) getStates();
    State#(mTyp, adsize) row = newVector;
    StateTable#(nTyp, mTyp, adsize) machine = newVector;
    Bit#(adsize) outx;
    Bit#(adsize) outcs;
    Integer c = 0;
    for (Integer i = 1; i <= valueOf(nTyp); i = i+1) begin
        c = 0;
        for (Integer j = 1; j <= valueOf(mTyp); j = j+1) begin
            if (j==i || (i > valueOf(mTyp) && j >= valueOf(mTyp))) begin
                outx = maxBound;
            end
            else begin
                if (j-1 == c)      outx = 0; // right unit
                else if (j-2 == c) outx = 1; // same unit
                c = c+1;
            end
            if (j == 1)                     outcs = 0;
            else if (i > valueOf(nTyp)-j+1) outcs = 1; // same unit
            else                            outcs = 0; // left unit
            row[j-1] = tuple3(outx, outcs, 0);
        end
        machine[i-1] = row;
    end
    return machine;
endfunction

// Generate the state look-up-table
function StateTable#(nTyp, mTyp, adsize) genTable();
    StateTable#(nTyp, mTyp, adsize) la = newVector;
    la = getStates();
    return la;
endfunction

// Interface
interface StateMachine#(numeric type n, numeric type m, numeric type adsize);
    method ActionValue#(State#(m, adsize)) getState(Bit#(adsize) stateCounter);
endinterface

// Module
module mkStateMachine(StateMachine#(nTyp, mTyp, adsize) ifc);
    StateTable#(nTyp, mTyp, adsize) sm = genTable();
    method ActionValue#(State#(mTyp, adsize)) getState(indx);
        let tpl = sm[indx];
        return tpl;
    endmethod
endmodule

19 GR Linear specific FixedPointQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */
(* synthesize *)
module mkPipelinedMultiplierUGDSP(PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
    return m;
endmodule

(* synthesize *)
(* doc = "synthesis attribute mult_style of mkPipelinedMultiplierUGLUT is pipe_lut" *)
module mkPipelinedMultiplierUGLUT(PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
    return m;
endmodule

(* synthesize *)
module mkMultiplierFP16DSP(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
        mkPipelinedMultiplierG(mkPipelinedMultiplierUGDSP)));
    return m;
endmodule

(* synthesize *)
module mkMultiplierFP16LUT(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
        mkPipelinedMultiplierG(mkPipelinedMultiplierUGLUT)));
    return m;
endmodule

(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
    let tbl <- mkLAtable();
    return tbl;
endmodule

(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
    let tbl <- mkLogTable();
    return tbl;
endmodule

(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
    let tbl <- mkExpTable();
    return tbl;
endmodule

(* synthesize *)
module mkExternalFixedPoint(External#(CPFP));
    // a. DSP based
    let mkmul = mkMultiplierFP16DSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    // 1. LA based
    let mkrot = mkLArotation(mkmul, mkLAtableFP);
    // 2. Log based
    // let mkrot = mkLogrotation(mkmul, mkLogtableFP, mkExptableFP);
    // 3. NR based
    // let mkrot = mkComplexFixedPointRotation(mkmul);
    let m <- mkExternal(mkrot);
    return m;
endmodule

(* synthesize *)
module mkInternalFixedPoint(Internal#(CPFP));
    // a. DSP based
    let mkmul = mkMultiplierFP16DSP;
    // b.
    // LUT based
    // let mkmul = mkMultiplierFP16LUT;
    let m <- mkInternal(mkComplexMultiplier(mkmul));
    return m;
endmodule

module [m] mkQRFixedPoint(QR#(width, CPFP))
    provisos(Add#(a__, 1, TDiv#(width, 2)));
    let mkext = mkExternalFixedPoint;
    let mkint = mkInternalFixedPoint;
    let m <- mkQR(mkext, mkint);
    return m;
endmodule

20 GR Linear specific Scemi.bsv

/* Author: Sunila Saqib saqib@mit.edu */
typedef Dim ScemiQRWidth;
typedef CPFP ScemiQRData;
typedef QR#(ScemiQRWidth, ScemiQRData) ScemiQR;

(* synthesize *)
module [Module] mkScemiQR(ScemiQR);
    let m <- mkQRFixedPoint;
    return m;
endmodule

module [Module] mkScemiDut(Clock qrclk, ScemiQR ifc);
    Reset myrst <- exposeCurrentReset();
    Reset qrrst <- mkAsyncReset(1, myrst, qrclk);
    ScemiQR qr <- mkScemiQR(clocked_by qrclk, reset_by qrrst);
    ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst);
    return myqr;
endmodule

module [SceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc);
    SceMiClockConfiguration conf = defaultValue;
    SceMiClockPortIfc clk_port <- mkSceMiClockPort(conf);
    ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk_port);
    Empty xin <- mkPutXactor(qr.xin, clk_port);
    Empty rout <- mkGetXactor(qr.rout, clk_port);
    Empty shutdown <- mkShutdownXactor();
endmodule

(* synthesize *)
module mkTCPBridge();
    Clock myclk <- exposeCurrentClock;
    Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP);
endmodule

21 GR Systolic specific FullRow.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface FullRow#(numeric type width, type tnum);
    interface Vector#(width, Put#(Terminating#(tnum))) xin;
    interface Vector#(TSub#(width, 1), Get#(Terminating#(tnum))) xout;
    interface Vector#(width, Get#(tnum)) r;
endinterface

22 GR Systolic specific mkFullRow.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface FullRow#(numeric type width, type tnum);
    interface Vector#(width, Put#(Terminating#(tnum))) xin;
    interface Vector#(TSub#(width, 1), Get#(Terminating#(tnum))) xout;
    interface Vector#(width, Get#(tnum)) r;
endinterface

module [m] mkFullRow(m#(External#(tnum)) mkext,
                     m#(Internal#(tnum)) mkint,
                     FullRow#(width, tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, a__), Add#(1, b__, width));
    External#(tnum) ex <- mkext;
    InternalRow#(TSub#(width, 1), tnum) intRow <- mkInternalRow(mkint);
    mkConnection(ex.cs, intRow.cs);
    interface Put xin = cons(ex.xin, intRow.xin);
    interface Get xout = intRow.xout;
    interface Get r = cons(ex.r, intRow.r);
endmodule

23 GR Systolic specific mkExternal.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface External#(type tnum);
    interface Put#(Terminating#(tnum)) xin;
    interface Get#(RotationCS#(tnum)) cs;
    interface Get#(tnum) r;
endinterface

module [m] mkExternal(m#(Rotate#(tnum)) mkrotate, tnum diagonalLoad,
                      External#(tnum) ifc)
    provisos(IsModule#(m, m__), Literal#(tnum), Bits#(tnum, a__),
             Print#(tnum));
    Rotate#(tnum) rotationUnit <- mkrotate();
    Reg#(Maybe#(tnum)) r_local_reg <- mkReg(tagged Valid diagonalLoad);
    FIFO#(tnum) r_local <- mkFIFO();

    rule external_node_get_output if (r_local_reg matches tagged Invalid);
        tnum r <- rotationUnit.rout.get();
        r_local_reg <= tagged Valid (r);
    endrule

    Reg#(Bool) dofinish <- mkReg(False);
    rule finish (r_local_reg matches tagged Valid .r &&& dofinish);
        r_local.enq(r);
        r_local_reg <= tagged Valid diagonalLoad;
        dofinish <= False;
    endrule

    interface Put xin;
        method Action put(x) if (r_local_reg matches tagged Valid .r &&& !dofinish);
            rotationUnit.request.put(RotateInput {x: x.data, r: r});
            r_local_reg <= tagged Invalid;
            dofinish <= x.islast;
        endmethod
    endinterface
    interface Get cs = rotationUnit.csout;
    interface Get r = toGet(r_local);
endmodule

24 GR Systolic specific mkInternalRow.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface InternalRow#(numeric type width, type tnum);
    interface Vector#(width, Put#(Terminating#(tnum))) xin;
    interface Put#(RotationCS#(tnum)) cs;
    interface Vector#(width, Get#(Terminating#(tnum))) xout;
    interface Vector#(width, Get#(tnum)) r;
endinterface

module [m] mkInternalRow(m#(Internal#(tnum)) mkint,
                         InternalRow#(width, tnum) ifc)
    provisos (IsModule#(m,
              m__), Bits#(tnum, tnum_sz));
    Vector#(width, Internal#(tnum)) vecInternal <- replicateM( mkint() );
    Vector#(width, Put#(Terminating#(tnum))) xins = newVector;
    Vector#(width, Get#(Terminating#(tnum))) xouts = newVector;
    Vector#(width, Get#(tnum)) routs = newVector;
    for (Integer i = 0; i < valueOf(width); i = i+1) begin
        xins[i]  = vecInternal[i].xin;
        xouts[i] = vecInternal[i].xout;
        routs[i] = vecInternal[i].r;
        if (i+1 < valueOf(width)) begin
            mkConnection(vecInternal[i].csout, vecInternal[i+1].csin);
        end
        else begin
            rule eatcsout (True);
                let foo <- vecInternal[i].csout.get();
            endrule
        end
    end
    Put#(RotationCS#(tnum)) cs_;
    if (valueOf(width) == 0) begin
        cs_ = interface Put;
                  method Action put(x) = noAction;
              endinterface;
    end
    else begin
        cs_ = vecInternal[0].csin;
    end
    interface xin = xins;
    interface cs = cs_;
    interface xout = xouts;
    interface Get r = routs;
endmodule

25 GR Systolic specific mkInternal.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface Internal#(type tnum);
    interface Put#(RotationCS#(tnum)) csin;
    interface Put#(Terminating#(tnum)) xin;
    interface Get#(Terminating#(tnum)) xout;
    interface Get#(RotationCS#(tnum)) csout;
    interface Get#(tnum) r;
endinterface

module [m] mkInternal(m#(Multiplier#(tnum)) mkmul, Internal#(tnum) ifc)
    provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
              Conjugate::Conjugate#(tnum), Print#(tnum));
    let xins <- mkFIFO();
    FIFO#(RotationCS#(tnum)) csins <- mkFIFO1();
    match {.xout_g, .xout_p} <- mkGPFIFO1();
    match {.rout_g, .rout_p} <- mkGPFIFO1();
    match {.csout_g, .csout_p} <- mkGPFIFO1();
    Reg#(tnum) m_r <- mkReg(0);
    Multiplier#(tnum) multiplier <- mkmul();
    let xi = xins.first().data;
    let cs = csins.first();
    function Action multiply(a, b, index) =
        multiplier.request.put(tuple2(a, b));
    function Action result(Reg#(tnum) dst, int index);
        action
            let res <- multiplier.response.get();
            dst <= res;
        endaction
    endfunction
    Reg#(tnum) cr <- mkRegU();
    Reg#(tnum) sx <- mkRegU();
    Reg#(tnum) cx <- mkRegU();
    Reg#(tnum) sr <- mkRegU();

    Stmt work = seq
        while (True) seq
            par
                csout_p.put(cs);
                seq
                    multiply(cs.c, m_r, 0);
                    multiply(con(cs.s), xi, 1);
                    multiply(cs.c, xi, 2);
                    action
                        multiply(cs.s, m_r, 3);
                        csins.deq();
                    endaction
                endseq
                seq
                    result(cr, 0);
                    result(sx, 1);
                    result(cx, 2);
                    result(sr, 3);
                endseq
            endpar
            action
                let nr = cr + sx;
                let xo = cx - sr;
                xout_p.put(Terminating { data: xo,
                                         islast: xins.first().islast });
                if (xins.first().islast) begin
                    m_r <= 0;
                    rout_p.put(nr);
                end
                else begin
                    m_r <= nr;
                end
                xins.deq();
            endaction
        endseq
    endseq;
    mkAutoFSM(work);

    interface Put xin;
        method Action put(x);
            xins.enq(x);
        endmethod
    endinterface
    interface Put csin;
        method Action put(cs);
            csins.enq(cs);
        endmethod
    endinterface
    interface Get xout = xout_g;
    interface Get r = rout_g;
    interface Get csout = csout_g;
endmodule

26 GR Systolic specific QR.bsv

/* Author: Sunila Saqib saqib@mit.edu */
interface QR#(numeric type width, type tnum);
    interface Vector#(width, Put#(Terminating#(tnum))) rowin;
    interface Vector#(width, Get#(tnum)) rowout;
endinterface

27 GR Systolic specific mkQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */
// make QR with width Greater Than ONE
module [m] mkQRgtONE(m#(External#(tnum)) mkext,
                     m#(Internal#(tnum)) mkint,
                     QR#(width, tnum) ifc)
    provisos(Bits#(tnum, tnum_sz), Literal#(tnum), IsModule#(m, m__),
             QRtopModule#(TSub#(width, 1)), Add#(1, b__, width));
    FullRow#(width, tnum) row1 <- mkFullRow(mkext, mkint);
    QR#(TSub#(width, 1), tnum) subQR <- mkQRtopModule(mkext, mkint);
    mkConnection(row1.xout, subQR.rowin);
    Vector#(width, Reg#(Bit#(TAdd#(1, TLog#(width))))) rowsTaken <-
        replicateM(mkReg(0));
    Vector#(TSub#(width, 1), FIFO#(tnum)) subrouts <- replicateM(mkFIFO);
    mkConnection(subQR.rowout, map(toPut, subrouts));
    Vector#(width, Get#(tnum)) routs = newVector;
    for (Integer i = 0; i < valueOf(width); i = i+1) begin
        routs[i] = interface Get
            method ActionValue#(tnum) get();
                if (rowsTaken[i] == fromInteger(valueOf(width)-1))
                    rowsTaken[i] <= 0;
                else
                    rowsTaken[i] <= rowsTaken[i] + 1;
                if (rowsTaken[i] == 0) begin
                    let r <- row1.r[i].get();
                    return r;
                end
            else if (i == 0)
               return 0;
            else begin
               let r <- toGet(subrouts[i-1]).get();
               return r;
            end
         endmethod
      endinterface;
   end
   interface Put rowin = row1.xin;
   interface Get rowout = routs;
endmodule

// make QR with width EQual to ONE
module [m] mkQReqONE(m#(External#(tnum)) mkext, m#(Internal#(tnum)) mkint,
                     QR#(1, tnum) ifc)
   provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
   FullRow#(1, tnum) row1 <- mkFullRow(mkext, mkint);
   interface Put rowin = row1.xin;
   interface Get rowout = row1.r;
endmodule

typeclass QRtopModule#(numeric type width);
   module [m] mkQRtopModule(m#(External#(tnum)) mkext,
                            m#(Internal#(tnum)) mkint,
                            QR#(width, tnum) ifc)
      provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
endtypeclass

instance QRtopModule#(1);
   module [m] mkQRtopModule(m#(External#(tnum)) mkext,
                            m#(Internal#(tnum)) mkint,
                            QR#(1, tnum) ifc)
      provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
      QR#(1, tnum) qrUnit <- mkQReqONE(mkext, mkint);
      return qrUnit;
   endmodule
endinstance

instance QRtopModule#(width)
   provisos (QRtopModule#(TSub#(width,1)), Add#(1, widthm1, width));
   module [m] mkQRtopModule(m#(External#(tnum)) mkext,
                            m#(Internal#(tnum)) mkint,
                            QR#(width, tnum) ifc)
      provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
      QR#(width, tnum) qrUnit <- mkQRgtONE(mkext, mkint);
      return qrUnit;
   endmodule
endinstance

module [m] mkQRtop(m#(External#(tnum)) mkext, m#(Internal#(tnum)) mkint,
                   QR#(width, tnum) ifc)
   provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum),
            QRtopModule#(width));
   QR#(width, tnum) qrUnit <- mkQRtopModule(mkext, mkint);
   return qrUnit;
endmodule

28 GR Systolic specific FixedPointQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

(* synthesize *)
module mkPipelinedMultiplierUG_DSP(PipelinedMultiplier#(Stages, Bit#(BitLen)));
   PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
   return m;
endmodule

(* synthesize *)
(* doc = "synthesis attribute mult_style of mkPipelinedMultiplierUG_LUT is pipe_lut" *)
module mkPipelinedMultiplierUG_LUT(
      PipelinedMultiplier#(Stages, Bit#(BitLen)));
   PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
   return m;
endmodule

(* synthesize *)
module mkPipelinedMultiplierUG_32(PipelinedMultiplier#(Stages, Bit#(BitLen)));
   PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
   return m;
endmodule

(* synthesize *)
module mkMultiplierFP16DSP (Multiplier#(FP));
   let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
               mkPipelinedMultiplierG(mkPipelinedMultiplierUG_32)));
   return m;
endmodule

(* synthesize *)
module mkMultiplierFP16LUT (Multiplier#(FP));
   let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
               mkPipelinedMultiplierG(mkPipelinedMultiplierUG_LUT)));
   return m;
endmodule

(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
   let tbl <- mkLAtable();
   return tbl;
endmodule

(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
   let tbl <- mkLogTable();
   return tbl;
endmodule

(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
   let tbl <- mkExpTable();
   return tbl;
endmodule

(* synthesize *)
module mkExternalFixedPoint(External#(CPFP));
   // a. DSP based
   let mkmul = mkMultiplierFP16DSP;
   // b. LUT based
   // let mkmul = mkMultiplierFP16LUT;
   // 1. LA based
   // let mkrot = mkLArotation(mkmul, mkLAtableFP);
   // 2. Log based
   // let mkrot = mkLogrotation(mkmul, mkLogtableFP, mkExptableFP);
   // 3. NR based
   let mkrot = mkComplexFixedPointRotation(mkmul);
   let m <- mkExternal(mkrot, 0);
   return m;
endmodule

(* synthesize *)
module mkInternalFixedPoint(Internal#(CPFP));
   // a. DSP based
   let mkmul = mkMultiplierFP16DSP;
   // b.
   // LUT based
   // let mkmul = mkMultiplierFP16LUT;
   let m <- mkInternal(mkComplexMultiplier3(mkmul));
   return m;
endmodule

module mkQRFixedPoint(QR#(width, CPFP)) provisos(QRtopModule#(width));
   let mkext = mkExternalFixedPoint;
   let mkint = mkInternalFixedPoint;
   let m <- mkQRtop(mkext, mkint);
   return m;
endmodule

29 GR Systolic specific mkStreamQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// Turn a normal QR implementation into a streaming QR
// implementation.
module [m] mkStreamQR(m#(QR#(width, tnum)) mkqr, StreamQR#(width, tnum) ifc)
   provisos(IsModule#(m, a__), Bits#(tnum, tnum__));
   Reg#(Bit#(TLog#(width))) xcin <- mkReg(0);
   Reg#(Bit#(TLog#(width))) rcout <- mkReg(0);
   QR#(width, tnum) qr <- mkqr();
   interface Put xin;
      method Action put(Terminating#(tnum) x);
         qr.rowin[xcin].put(x);
         if (xcin == fromInteger(valueof(width)-1)) begin
            xcin <= 0;
         end else begin
            xcin <= xcin + 1;
         end
      endmethod
   endinterface
   interface Get rout;
      method ActionValue#(tnum) get();
         tnum r <- qr.rowout[rcout].get();
         if (rcout == fromInteger(valueof(width)-1)) begin
            rcout <= 0;
         end else begin
            rcout <= rcout + 1;
         end
         return r;
      endmethod
   endinterface
endmodule

module mkStreamQRTestFixedPoint (Empty);
   let mkqr = mkQRFixedPoint;
   StreamQR#(Dim, CPFP) qr <- mkStreamQR(mkqr);
   mkStreamQR3Test(qr, 1e-3);
endmodule

30 GR Systolic specific Scemi.bsv

/* Author: Sunila Saqib saqib@mit.edu */

typedef Dim ScemiQRWidth;
typedef CPFP ScemiQRData;
typedef StreamQR#(ScemiQRWidth, ScemiQRData) ScemiQR;

(* synthesize *)
module [Module] mkScemiQR(ScemiQR);
   let m <- mkStreamQR(mkQRFixedPoint);
   return m;
endmodule

module [Module] mkScemiDut(Clock qrclk, ScemiQR ifc);
   Reset myrst <- exposeCurrentReset();
   Reset qrrst <- mkAsyncReset(1, myrst, qrclk);
   ScemiQR qr <- mkScemiQR(clocked_by qrclk, reset_by qrrst);
   ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst);
   return myqr;
endmodule

module [SceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc);
   SceMiClockConfiguration conf = defaultValue;
   SceMiClockPortIfc clk_port <-
      mkSceMiClockPort(conf);
   ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk_port);
   Empty xin <- mkPutXactor(qr.xin, clk_port);
   Empty rout <- mkGetXactor(qr.rout, clk_port);
   Empty shutdown <- mkShutdownXactor();
endmodule

(* synthesize *)
module mkTCPBridge ();
   Clock myclk <- exposeCurrentClock;
   Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP);
endmodule

31 MGS specific BatchAcc.bsv

/* Author: Sunila Saqib saqib@mit.edu */

/* Batch Accumulator : accumulates all the values in a vector
   tnum  : type of data accumulated
   tsize : count of data units accumulated
   mTyp  : count of vectors to be accumulated */
interface BatchAcc#(type tnum, numeric type tsize, numeric type mTyp);
   interface Put#(Vector#(tsize, tnum)) invec;
   interface Get#(tnum) outval;
endinterface

module mkBatchAcc (BatchAcc#(tnum, tsize, mTyp) ifc)
   provisos(Bits#(tnum, a__), Arith#(tnum));
   Bit#(TAdd#(TLog#(mTyp),1)) mType = fromInteger(valueof(mTyp));
   FIFO#(tnum) outputVal <- mkFIFO1();
   Reg#(tnum) sumReg <- mkReg(0);
   Reg#(Bit#(TAdd#(TLog#(mTyp),1))) counter <- mkReg(0);
   interface Put invec;
      method Action put(invector);
         let prev_sum = sumReg;
         let sum = fold(\+ , cons(prev_sum, invector));
         if (counter+1 == mType) begin
            counter <= 0;
            sumReg <= 0;
            outputVal.enq(sum);
         end else begin
            counter <= counter+1;
            sumReg <= sum;
         end
      endmethod
   endinterface
   interface Get outval = toGet(outputVal);
endmodule

32 MGS specific BatchCS.bsv

/* Author: Sunila Saqib saqib@mit.edu */

/* Batch Complex Square : computes individual rel*rel + img*img
   for a vector of complex values.
   tnum  : type of complex data accumulated
   tsize : count of data units in a vector */
interface BatchCS#(type tnum, numeric type tsize);
   method Action put(Vector#(tsize, Complex#(tnum)) invector);
   interface Get#(Vector#(tsize, tnum)) outvec;
endinterface

module [m] mkBatchCS (m#(Multiplier#(tnum)) mkmul, BatchCS#(tnum, tsize) ifc)
   provisos(IsModule#(m, m__), Bits#(tnum, a__), Arith#(tnum));
   Vector#(tsize, Multiplier#(tnum)) relP <-
      replicateM(mkmul()); // product of real parts
   Vector#(tsize, Multiplier#(tnum)) imgP <-
      replicateM(mkmul()); // product of imaginary parts
   FIFO#(Vector#(tsize, Complex#(tnum))) inputVec <- mkFIFO();
   FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO();
   rule cloudblock;
      Vector#(tsize, tnum) outvectop = newVector;
      Vector#(tsize, tnum) outvectop1 = newVector;
      Vector#(tsize, tnum) outvectop2 = newVector;
      for (Integer i = 0; i < valueof(tsize); i = i+1) begin
         outvectop1[i] <- relP[i].response.get();
         outvectop2[i] <- imgP[i].response.get();
      end
      for (Integer j = 0; j < valueof(tsize); j = j+1) begin
         outvectop[j] = outvectop1[j] + outvectop2[j];
      end
      outputVec.enq(outvectop);
   endrule
   method Action put(invector);
      for (Integer i = 0; i < valueof(tsize); i = i+1) begin
         relP[i].request.put(tuple2(invector[i].rel, invector[i].rel));
         imgP[i].request.put(tuple2(invector[i].img, invector[i].img));
      end
   endmethod
   interface Get outvec = toGet(outputVec);
endmodule

33 MGS specific BatchProduct.bsv

/* Author: Sunila Saqib saqib@mit.edu */

/* Batch Product : computes vector product of 2 vectors.
   tnum  : type of data values
   tsize : count of data units in a vector */
interface BatchProduct#(type tnum, numeric type tsize);
   interface Put#(Tuple2#(Vector#(tsize, tnum), Vector#(tsize, tnum))) invec;
   interface Get#(Vector#(tsize, tnum)) outvec;
endinterface

module [m] mkBatchProduct (m#(Multiplier#(tnum)) mkmul,
                           BatchProduct#(tnum, tsize) ifc)
   provisos(IsModule#(m, m__), Bits#(tnum, a__), Arith#(tnum));
   Vector#(tsize, Multiplier#(tnum)) mul <- replicateM(mkmul());
   FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO1();
   rule cloudblock;
      Vector#(tsize, tnum) outvectop = newVector;
      for (Integer i = 0; i < valueof(tsize); i = i+1)
         outvectop[i] <- mul[i].response.get();
      outputVec.enq(outvectop);
   endrule
   interface Put invec;
      method Action put(invector);
         for (Integer i = 0; i < valueof(tsize); i = i+1)
            mul[i].request.put(tuple2(tpl_1(invector)[i], tpl_2(invector)[i]));
      endmethod
   endinterface
   interface Get outvec = toGet(outputVec);
endmodule

34 MGS specific BatchSub.bsv

/* Author: Sunila Saqib saqib@mit.edu */

/* Batch Subtraction : computes element wise difference, vec1 - vec2
   tnum  : type of data values
   tsize : count of data units in a vector */
interface BatchSub#(type tnum, numeric type tsize);
   interface Put#(Tuple2#(Vector#(tsize, tnum), Vector#(tsize, tnum))) invec;
   interface Get#(Vector#(tsize, tnum)) outvec;
endinterface

module mkBatchSub (BatchSub#(tnum, tsize) ifc)
   provisos(Bits#(tnum, a__), Arith#(tnum));
   FIFO#(Vector#(tsize, tnum)) outputVec <- mkFIFO1();
   function tnum subtract(tnum x, tnum y) = x-y;
   interface Put invec;
      method Action put(invector);
         let res = zipWith(subtract, tpl_1(invector), tpl_2(invector));
         outputVec.enq(res);
      endmethod
   endinterface
   interface Get outvec = toGet(outputVec);
endmodule

35 MGS specific mkDot.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// takes in two distinct vectors and computes their dot product
interface Dot#(type itnum, numeric type n, numeric type m);
   // n - length of the vector
   // m items will be handled in one cycle, m copies of hardware
   // n total entries
   // in the vector
   interface Put#(Vector#(n, itnum)) invec1; // complex
   interface Put#(Vector#(n, itnum)) invec2; // complex
   interface Get#(itnum) outval;             // real
endinterface

module [m] mkDot(m#(Multiplier#(tnum)) mkmul, Dot#(tnum, nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
             Conjugate::Conjugate#(tnum), Print#(tnum),
             DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
   // storage
   FIFO#(Vector#(nTyp, tnum)) infifo1 <- mkFIFO();
   FIFO#(Vector#(nTyp, tnum)) infifo2 <- mkFIFO();
   Reg#(Vector#(nTyp, tnum)) inreg1 <- mkRegU();
   Reg#(Vector#(nTyp, tnum)) inreg2 <- mkRegU();
   FIFO#(tnum) outfifo <- mkFIFO();
   // units
   BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
   BatchAcc#(tnum, mTyp, TDiv#(nTyp, mTyp)) bacc <- mkBatchAcc();
   // control logic
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) counter <- mkReg(0);
   Reg#(Bool) processing <- mkReg(False);
   // logic cloud
   rule putting_input; // implicit guard : infifos not empty
      Vector#(nTyp, tnum) invector1 = newVector;
      Vector#(nTyp, tnum) invector2 = newVector;
      Vector#(nTyp, tnum) invec2 = newVector;
      if (!processing) begin
         invector1 = infifo1.first();
         invec2 = infifo2.first();
         invector2 = map(con, invec2);
      end else begin
         invector1 = inreg1;
         invector2 = inreg2;
      end
      Vector#(mTyp, tnum) my_invec_1 = take(invector1);
      Vector#(mTyp, tnum) my_invec_2 = take(invector2);
      bprod.invec.put(tuple2(my_invec_1, my_invec_2));
      let my_newinvector_1 = shiftOutFrom0(defaultValue, invector1, valueof(mTyp));
      let my_newinvector_2 = shiftOutFrom0(defaultValue, invector2, valueof(mTyp));
      inreg1 <= my_newinvector_1;
      inreg2 <= my_newinvector_2;
      if (counter + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         counter <= 0;
         processing <= False;
         infifo1.deq();
         infifo2.deq();
      end else begin
         counter <= counter+1;
         processing <= True;
      end
   endrule
   rule connect_dot_and_acc; // implicit guard : batch product
      // unit generates output
      let products <- bprod.outvec.get();
      bacc.invec.put(products);
   endrule
   rule getting_output; // implicit guard : batch acc unit
      // generates output
      let
      summation <- bacc.outval.get();
      outfifo.enq(summation);
   endrule
   interface Put invec1 = toPut(infifo1);
   interface Put invec2 = toPut(infifo2);
   interface Get outval = toGet(outfifo);
endmodule

36 MGS specific mkNorm.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// computes norm of a vector (dot product with its conjugate)
interface Norm#(type tnum, numeric type n, numeric type m);
   // n - length of the vector
   // m items will be handled in one cycle, m copies of hardware
   // n total entries in the vector
   interface Put#(Vector#(n, tnum)) invec; // complex
   interface Get#(tnum) outval;            // real
endinterface

module [m] mkNorm(m#(Multiplier#(tnum)) mkmul,
                  Norm#(Complex#(tnum), nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Bits#(Complex#(tnum), a__),
             Conjugate::Conjugate#(Complex#(tnum)),
             DefaultValue::DefaultValue#(Complex#(tnum)),
             Add#(mTyp, b__, nTyp));
   // storage
   FIFO#(Vector#(nTyp, Complex#(tnum))) infifo <- mkFIFO();
   Reg#(Vector#(nTyp, Complex#(tnum))) inreg <- mkRegU();
   FIFO#(Complex#(tnum)) outfifo <- mkFIFO();
   // units
   BatchCS#(tnum, mTyp) bprod <- mkBatchCS(mkmul);
   BatchAcc#(tnum, mTyp, TDiv#(nTyp, mTyp)) bacc <- mkBatchAcc();
   // control logic
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) counter <- mkReg(0);
   Reg#(Bool) processing <- mkReg(False);
   // logic cloud
   rule putting_input; // implicit guard : infifo.notEmpty()
      Vector#(nTyp, Complex#(tnum)) invector = newVector;
      if (!processing) begin
         invector = infifo.first();
      end else begin
         invector = inreg;
      end
      Vector#(mTyp, Complex#(tnum)) my_invector = take(invector);
      bprod.put(my_invector); // bprod.invec.put(my_invector);
      let my_newinvector = shiftOutFrom0(defaultValue, invector, valueof(mTyp));
      inreg <= my_newinvector;
      if (counter + 1 == fromInteger(valueof(nTyp)/valueof(mTyp))) begin
         counter <= 0;
         processing <= False;
         infifo.deq();
      end else begin
         counter <= counter+1;
         processing <= True;
      end
   endrule
   rule connect_product_and_acc;
      // implicit guard : bprod completes generating output
      let products <- bprod.outvec.get();
      bacc.invec.put(products);
   endrule
   rule
   getting_output;
      // implicit guard : bacc completes generating output
      let summation <- bacc.outval.get();
      outfifo.enq(cmplx(summation, 0));
   endrule
   interface Put invec = toPut(infifo);
   interface Get outval = toGet(outfifo);
endmodule

37 MGS specific mkOffsetCorrection.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// computes A[i] - R * Q[i]
interface OffsetCorrection#(type itnum, numeric type n, numeric type m);
   interface Put#(Tuple3#(Vector#(n, itnum), itnum, Vector#(n, itnum)))
      invec;                                // Q, R, A
   interface Get#(Vector#(n, itnum)) outvec; // A[i] - R * Q[i]
endinterface

module [m] mkOffsetCorrection(m#(Multiplier#(tnum)) mkmul,
                              OffsetCorrection#(tnum, nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
             Conjugate::Conjugate#(tnum),
             DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
   // storage
   FIFO#(Vector#(nTyp, tnum)) infifoQ <- mkFIFO();
   FIFO#(tnum) infifoR <- mkFIFO();
   FIFO#(Vector#(nTyp, tnum)) infifoA <- mkFIFO();
   FIFO#(Vector#(nTyp, tnum)) outfifo <- mkFIFO();
   Reg#(Vector#(nTyp, tnum)) inregQ <- mkRegU();
   Reg#(Vector#(nTyp, tnum)) inregA <- mkRegU();
   Reg#(Vector#(nTyp, tnum)) outreg <- mkRegU();
   // units
   BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
   BatchSub#(tnum, mTyp) bsub <- mkBatchSub();
   // control logic
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) inputCounter1 <- mkReg(0);
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) inputCounter2 <- mkReg(0);
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) outputCounter <- mkReg(0);
   Reg#(Bool) processing1 <- mkReg(False);
   Reg#(Bool) processing2 <- mkReg(False);
   // logic cloud
   rule putting_input; // implicit guard : infifos.notEmpty
      Vector#(nTyp, tnum) invecQtop = newVector;
      tnum invalRtop = infifoR.first();
      if (!processing1) begin
         invecQtop = infifoQ.first();
      end else begin
         invecQtop = inregQ;
      end
      Vector#(mTyp, tnum) my_Q = take(invecQtop);
      bprod.invec.put(tuple2(my_Q, replicate(invalRtop)));
      let my_newinvector = shiftOutFrom0(defaultValue, invecQtop, valueof(mTyp));
      // shift m elements out from 0 index.
      inregQ <= my_newinvector;
      if (inputCounter1 + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         inputCounter1 <= 0;
         processing1 <= False;
         infifoQ.deq();
         infifoR.deq();
      end else begin
         inputCounter1 <= inputCounter1+1;
         processing1 <= True;
      end
   endrule
   rule connecting_prod_and_sub;
      // implicit guard : batch product unit generates output
      Vector#(nTyp, tnum) invecAtop = newVector;
      if (!processing2) begin
         invecAtop = infifoA.first();
      end else begin
         invecAtop = inregA;
      end
      let products <- bprod.outvec.get();
      Vector#(mTyp, tnum) my_A = take(invecAtop);
      bsub.invec.put(tuple2(my_A, products));
      let my_newinvector = shiftOutFrom0(defaultValue, invecAtop, valueof(mTyp));
      inregA <= my_newinvector;
      if (inputCounter2 + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         inputCounter2 <= 0;
         processing2 <= False;
         infifoA.deq();
      end else begin
         inputCounter2 <= inputCounter2+1;
         processing2 <= True;
      end
   endrule
   rule getting_output; // implicit guard: batch sub unit generates output
      Vector#(mTyp, tnum) res <- bsub.outvec.get();
      Vector#(nTyp, tnum) outputs = outreg;
      Vector#(TAdd#(nTyp, mTyp), tnum) newoutput = append(outreg, res);
      Vector#(nTyp, tnum) newoutreg = drop(newoutput);
      if (outputCounter + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         outputCounter <= 0;
         outfifo.enq(newoutreg);
      end else begin
         outputCounter <= outputCounter+1;
         outreg <= newoutreg;
      end
   endrule
   interface Put invec;
      method Action put(in);
         infifoQ.enq(tpl_1(in));
         infifoR.enq(tpl_2(in));
         infifoA.enq(tpl_3(in));
      endmethod
   endinterface
   interface Get outvec = toGet(outfifo);
endmodule

38 MGS specific mkVecProd.bsv

/* Author: Sunila Saqib saqib@mit.edu */

interface VecProd#(type tnum, numeric type n, numeric type m);
   // n total entries in the vector, m entries will be handled
   // simultaneously
   interface Put#(Tuple2#(Vector#(n, tnum), tnum)) invec;
   interface Get#(Vector#(n, tnum)) outvec;
endinterface

module [m] mkVecProd(m#(Multiplier#(tnum)) mkmul,
                     VecProd#(tnum, nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum,
             a__), Conjugate::Conjugate#(tnum), Print#(tnum),
             DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
   // storage
   FIFO#(Vector#(nTyp, tnum)) infifoQ <- mkFIFO(); // inputs
   FIFO#(tnum) infifoR <- mkFIFO();
   Reg#(Vector#(nTyp, tnum)) inregQ <- mkRegU();   // temporary
   Reg#(Vector#(nTyp, tnum)) outreg <- mkRegU();
   FIFO#(Vector#(nTyp, tnum)) outfifo <- mkFIFO(); // output
   // units
   BatchProduct#(tnum, mTyp) bprod <- mkBatchProduct(mkmul);
   // control logic
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) inputCounter <- mkReg(0);
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) outputCounter <- mkReg(0);
   Reg#(Bool) processing <- mkReg(False);
   // logic cloud
   rule putting_input; // will be implicitly guarded by the infifo.
      Vector#(nTyp, tnum) invecQtop = newVector;
      tnum invalRtop = infifoR.first();
      if (!processing)
         invecQtop = infifoQ.first();
      else
         invecQtop = inregQ;
      Vector#(mTyp, tnum) my_Q = take(invecQtop);
      Vector#(mTyp, tnum) my_R = replicate(invalRtop);
      bprod.invec.put(tuple2(my_Q, my_R));
      let my_newinvector = shiftOutFrom0(defaultValue, invecQtop, valueof(mTyp));
      // shift m elements out from 0 index.
      inregQ <= my_newinvector;
      if (inputCounter + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         inputCounter <= 0;
         processing <= False;
         infifoQ.deq();
         infifoR.deq();
      end else begin
         inputCounter <= inputCounter+1;
         processing <= True;
      end
   endrule
   rule getting_output; // implicit guard: sub produces output
      Vector#(mTyp, tnum) res <- bprod.outvec.get();
      Vector#(nTyp, tnum) outputs = outreg;
      Vector#(TAdd#(nTyp, mTyp), tnum) newoutput = append(outreg, res);
      Vector#(nTyp, tnum) newoutreg = drop(newoutput);
      if (outputCounter + 1 == fromInteger(valueof(TDiv#(nTyp, mTyp)))) begin
         // n should be multiple of m
         outputCounter <= 0;
         outfifo.enq(newoutreg);
      end else begin
         outputCounter <= outputCounter+1;
         outreg <= newoutreg;
      end
   endrule
   interface Put invec;
      method Action put(in);
         infifoQ.enq(tpl_1(in));
         infifoR.enq(tpl_2(in));
      endmethod
   endinterface
   interface Get outvec = toGet(outfifo);
endmodule

39 MGS specific SqrtInv.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// complex square root module
interface SqrtInv#(type tnum);
   interface Put#(tnum) x;   // x
   interface Get#(tnum) xs;  // x square root
   interface Get#(tnum) xsi; // x square root inverse
endinterface

module [m] mkSqrtInvCPFP(m#(SqrtInv#(FixedPoint#(is, fs))) mksi,
                         SqrtInv#(Complex#(FixedPoint#(is, fs))) ifc)
   provisos(IsModule#(m, m__));
   SqrtInv#(FixedPoint#(is, fs)) sq <- mksi;
   interface Put x;
      method Action put(x);
         sq.x.put(x.rel);
      endmethod
   endinterface
   interface Get xs;
      method ActionValue#(Complex#(FixedPoint#(is, fs))) get();
         let xs <- sq.xs.get();
         return cmplx(xs, 0);
      endmethod
   endinterface
   interface Get xsi;
      method ActionValue#(Complex#(FixedPoint#(is, fs))) get();
         let xsi <- sq.xsi.get();
         return cmplx(xsi, 0);
      endmethod
   endinterface
endmodule

40 MGS specific LASqrtInv.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// fixed point square root, sqrt(A) = offset + ((A-A_i)*(slope))
module [m] mkSqrtInvFP(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
                       m#(LAtable#(fb, FixedPoint#(is, fs), tableSize)) mkLUT,
                       SqrtInv#(FixedPoint#(is, fs)) ifc)
   provisos(IsModule#(m, m__));
   FIFO#(FixedPoint#(is, fs)) xfifo <- mkFIFO();
   FIFO#(FixedPoint#(is, fs)) offsetReg <- mkFIFO(); // stage 2
   FIFO#(FixedPoint#(is, fs)) xsififo <- mkFIFO();
   FIFO#(FixedPoint#(is, fs)) xsfifo <- mkFIFO();
   Reg#(FixedPoint#(is, fs)) xsiReg <- mkRegU();
   LAtable#(fb, FixedPoint#(is, fs), tableSize) tbl <- mkLUT();
   Multiplier#(FixedPoint#(is, fs)) multiplier1 <- mkmul;
   Stmt interim = seq
      while (True) seq
         action
            let tblEntry <- tbl.tableEntry.get();
            offsetReg.enq(tblEntry.offset);
            let xin = xfifo.first();
            Bit#(TSub#(fs, fb)) important = truncate(pack(xin));
            FixedPoint#(is, fs) diff = unpack(zeroExtend(important));
            multiplier1.request.put(tuple2(tblEntry.slope, diff));
         endaction
         action
            let prod <- multiplier1.response.get();
            let offset = offsetReg.first();
            offsetReg.deq();
            FixedPoint#(is, fs) temp = offset + prod;
            xsififo.enq(temp);
            xsiReg <= temp;
         endaction
         action
            let xin = xfifo.first();
            xfifo.deq();
            multiplier1.request.put(tuple2(xin, xsiReg));
         endaction
         action
            let prod <- multiplier1.response.get();
            xsfifo.enq(prod);
         endaction
      endseq
   endseq;
   mkAutoFSM(interim);
   interface Put x;
      method Action put(x);
         tbl.tableIndex.put(x);
         xfifo.enq(x);
      endmethod
   endinterface
   interface Get xsi = toGet(xsififo);
   interface Get xs = toGet(xsfifo);
endmodule

41 MGS specific LogSqrtInv.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// fixed point square root
module [m] mkSqrtInvLogFP(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
                          m#(LogTable#(fbl, FixedPoint#(is, fs), hightL)) mkLog,
                          m#(ExpTable#(fbe, FixedPoint#(is, fs), hightE)) mkExp,
                          SqrtInv#(FixedPoint#(is, fs)) ifc)
   provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
            Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, m__));
   Vector#(1, LogTable#(fbl, FixedPoint#(is, fs), hightL)) logtbl <-
      replicateM(mkLog());
   Vector#(2, ExpTable#(fbe, FixedPoint#(is, fs), hightE)) exptbl <-
      replicateM(mkExp());
   FIFO#(FixedPoint#(is, fs)) xs_fifo <- mkFIFO();
   let xs_g = toGet(xs_fifo);
   let xs_p = toPut(xs_fifo);
   FIFO#(FixedPoint#(is, fs)) xsi_fifo <- mkFIFO();
   let xsi_g =
      toGet(xsi_fifo);
   let xsi_p = toPut(xsi_fifo);
   Stmt interim = seq
      while (True) seq
         action
            let log_res <- logtbl[0].tableEntry.get();
            let r_new_sqr_log = log_res.offset; // x + fromInteger(valueof(fbl));
            let r_new_log = r_new_sqr_log >> 1;
            let r_new_inv_log = 0 - r_new_log;
            exptbl[0].tableIndex.put(r_new_log);
            exptbl[1].tableIndex.put(r_new_inv_log);
         endaction
         par
            action
               let r_new <- exptbl[0].tableEntry.get();
               xs_p.put(r_new.offset);
            endaction
            action
               let r_inv <- exptbl[1].tableEntry.get();
               xsi_p.put(r_inv.offset);
            endaction
         endpar
      endseq
   endseq;
   mkAutoFSM(interim);
   interface Put x;
      method Action put(x);
         logtbl[0].tableIndex.put(x);
      endmethod
   endinterface
   interface Get xs = xs_g;
   interface Get xsi = xsi_g;
endmodule

42 MGS specific NRSqrtInv.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// fixed point square root
module [m] mkSqrtInvNRFP(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
                         SqrtInv#(FixedPoint#(is, fs)) ifc)
   provisos(Add#(a__, fs, TMul#(2, fs)), Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
            Mul#(2, fs, TAdd#(c__, fs)), IsModule#(m, m__));
   SquareRoot#(FixedPoint#(is, fs)) sqrt <- mkFixedPointSquareRoot(1);
   Divider#(FixedPoint#(is, fs)) dr <- mkFixedPointDivider(2);
   match {.xs_g, .xs_p} <- mkGPFIFO();
   match {.xsi_g, .xsi_p} <- mkGPFIFO();
   Reg#(Bit#(11)) clk <- mkReg(0);
   Reg#(Bool) timeit <- mkReg(False);
   rule tick;
      clk <= clk + 1;
   endrule
   rule dodivide (True);
      match {.nr, .*} <- sqrt.response.get();
      xs_p.put(nr);
      dr.request.put(tuple2(fromInteger(1), nr));
   endrule
   rule dofinalize (True);
      match {.xsi, .*} <- dr.response.get();
      xsi_p.put(xsi);
   endrule
   interface Put x;
      method Action put(x);
         sqrt.request.put(x);
      endmethod
   endinterface
   interface Get xs = xs_g;
   interface Get xsi = xsi_g;
endmodule

43 MGS specific mkDP.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// boundary unit.
interface DP#(type itnum, numeric type n, numeric type m);
   // n - length of the vector
   // m items will be handled in one cycle, m copies of hardware
   // n total entries in the vector
   interface Put#(Vector#(n, itnum)) invec;
   interface Get#(itnum) rout;             // r
   interface Get#(Vector#(n, itnum)) qout; // q
endinterface

module [m] mkDP(m#(Multiplier#(tnum)) mkmul,
                m#(Norm#(tnum, nTyp, mTyp)) mknorm,
                m#(SqrtInv#(tnum)) mksqrt, DP#(tnum, nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
             Conjugate::Conjugate#(tnum), Print#(tnum),
             DefaultValue::DefaultValue#(tnum), Add#(mTyp, d__, nTyp));
   Integer depth = valueof(Depth);
   FIFOF#(Vector#(nTyp, tnum)) infifo <- mkSizedFIFOF(depth);
   FIFOF#(tnum) rfifo <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) qfifo <- mkSizedFIFOF(depth);
   // units
   Norm#(tnum, nTyp, mTyp) norm <- mknorm;
   SqrtInv#(tnum) si <- mksqrt;
   VecProd#(tnum, nTyp, mTyp) oc <- mkVecProd(mkmul);
   // control logic
   Reg#(Bit#(TLog#(TAdd#(TDiv#(nTyp, mTyp),1)))) counter <- mkReg(0);
   // logic cloud
   rule connect_dot_and_si;
      // implicit guard : bprod completes generating output
      let x <- norm.outval.get();
      si.x.put(x);
   endrule
   rule connect_si_and_out;
      // implicit guards :
      // si generates output, xfifo and infifo not empty
      let xsi <- si.xsi.get();
      let inputvector = infifo.first();
      infifo.deq();
      oc.invec.put(tuple2(inputvector, xsi));
   endrule
   rule getting_output_r_val;
      // implicit guard : si generates xs output
      let r <- si.xs.get();
      rfifo.enq(r);
   endrule
   rule getting_output_q_vec;
      // implicit guard : batch sub unit completes generating output
      let q <- oc.outvec.get();
      qfifo.enq(q);
   endrule
   interface Put invec;
      method Action put(in);
         infifo.enq(in);
         norm.invec.put(in);
      endmethod
   endinterface
   interface Get rout = toGet(rfifo);
   interface Get qout = toGet(qfifo);
endmodule

44 MGS specific mkTP.bsv

/* Author: Sunila Saqib saqib@mit.edu */

interface TP#(type tnum, numeric type n, numeric type m);
   // n total entries in the vector, m entries will be handled simultaneously
   interface
   Put#(Vector#(n, tnum)) invec1;
   interface Put#(Vector#(n, tnum)) invec2;    // q vector
   interface Get#(tnum) rout;
   interface Get#(Vector#(n, tnum)) qout;
   interface Get#(Vector#(n, tnum)) invec2out; // passing the q
endinterface

module [m] mkTP(m#(Multiplier#(tnum)) mkmul, TP#(tnum, nTyp, mTyp) ifc)
   provisos (IsModule#(m, m__), Arith#(tnum), Bits#(tnum, a__),
             Conjugate::Conjugate#(tnum), Print#(tnum),
             DefaultValue::DefaultValue#(tnum), Add#(mTyp, b__, nTyp));
   // storage
   Integer depth = valueof(Depth);
   FIFOF#(Vector#(nTyp, tnum)) infifoQ <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) infifoA <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) infifoQinterim <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) infifoAinterim <- mkSizedFIFOF(depth);
   FIFOF#(tnum) outfifoR <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) outfifoQ <- mkSizedFIFOF(depth);
   FIFOF#(Vector#(nTyp, tnum)) invec2outfifo <- mkBypassFIFOF();
   // units
   Dot#(tnum, nTyp, mTyp) dot <- mkDot(mkmul);
   OffsetCorrection#(tnum, nTyp, mTyp) vecp <- mkOffsetCorrection(mkmul);
   // logic cloud
   rule connecting_dot_and_vecp;
      let dotproduct <- dot.outval.get();
      outfifoR.enq(dotproduct);
      let inQtop = infifoQ.first();
      let inAtop = infifoA.first();
      infifoQ.deq();
      infifoA.deq();
      vecp.invec.put(tuple3(inQtop, dotproduct, inAtop));
   endrule
   rule getting_output; // implicit guard: sub produces output
      let outvec <- vecp.outvec.get();
      outfifoQ.enq(outvec);
   endrule
   interface Put invec1;
      method Action put(in1);
         infifoA.enq(in1);
         dot.invec1.put(in1);
      endmethod
   endinterface
   interface Put invec2;
      method Action put(in2);
         infifoQ.enq(in2);
         dot.invec2.put(in2);
         invec2outfifo.enq(in2);
      endmethod
   endinterface
   interface Get rout = toGet(outfifoR);
   interface Get qout = toGet(outfifoQ);
   interface Get invec2out = toGet(invec2outfifo);
endmodule

45 MGS specific UnitRow.bsv

/* Author: Sunila Saqib saqib@mit.edu */

interface UnitRow#(numeric type tsize, numeric type tdim, type tnum);
   // tsize = size of row, tnum = number type
   // tdim = dimension of the nxn
   // matrix to be decomposed
   interface Vector#(tsize, Put#(Vector#(tdim, tnum))) xin;
   interface Vector#(tsize, Get#(tnum)) rout;
   interface Vector#(TSub#(tsize,1), Get#(Vector#(tdim, tnum))) qout;
endinterface

module [m] mkUnitRow(m#(DP#(tnum, nTyp, mTyp)) mkdp,
                     m#(TP#(tnum, nTyp, mTyp)) mktp,
                     UnitRow#(n, nTyp, tnum) ifc)
   provisos (IsModule#(m, m__), Bits#(tnum, a__));
   // units
   DP#(tnum, nTyp, mTyp) vecDP <- mkdp; // External unit
   Vector#(TSub#(n,1), TP#(tnum, nTyp, mTyp)) vecTP <-
      replicateM( mktp() ); // vector of internal units
   // interface variables
   Vector#(n, Put#(Vector#(nTyp, tnum))) xins = newVector;
   Vector#(n, Get#(tnum)) routs = newVector;
   Vector#(TSub#(n,1), Get#(Vector#(nTyp, tnum))) qouts = newVector; // n for linear case
   // connecting output from internal unit to next internal
   // unit in the row - tunnel for q vector
   for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
      if (i+1 < valueof(TSub#(n,1)))
         mkConnection(vecTP[i].invec2out, vecTP[i+1].invec2);
      else
         rule eatit (True);
            let x <- vecTP[i].invec2out.get();
         endrule
   // connecting output from boundary unit with input of
   // internal unit - q vector
   if (1 < valueof(n))
      mkConnection(vecDP.qout, vecTP[0].invec2);
   else
      rule eatitagain;
         let x <- vecDP.qout.get();
      endrule
   // connecting input x vectors
   xins[0] = vecDP.invec;
   for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
      xins[i+1] = vecTP[i].invec1;
   // connecting output q vectors
   for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
      qouts[i] = vecTP[i].
qout;
   // connecting output r vectors
   routs[0] = vecDP.rout;
   for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1)
      routs[i+1] = vecTP[i].rout;
   // connecting interface variables to the module interfaces
   interface xin = xins;
   interface rout = routs;
   interface qout = qouts;
endmodule

46 MGS specific QR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

interface QR#(numeric type width, numeric type n, type tnum);
   interface Vector#(width, Put#(Vector#(n, tnum))) rowin;
   interface Vector#(width, Get#(tnum)) rowout;
endinterface

47 MGS specific mkStreamQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

// Turn a normal QR implementation into a streaming QR
// implementation.
module [m] mkStreamQR(m#(QR#(width, nn, tnum)) mkqr,
                      StreamQR#(width, tnum) ifc)
   provisos(IsModule#(m, a__), Bits#(tnum, tnum__), Print#(tnum),
            Add#(width, nn, TAdd#(width, nn)),
            DefaultValue::DefaultValue#(tnum), Print::Print#(tnum));
   Reg#(Bit#(TLog#(width))) xcin <- mkReg(0);
   Reg#(Bit#(TLog#(width))) rcout <- mkReg(0);
   Vector#(nn, Vector#(nn, Reg#(tnum))) xins <-
      replicateM(replicateM(mkRegU()));
   Reg#(Bit#(TAdd#(TLog#(nn),1))) i <- mkReg(0);
   Reg#(Bit#(TAdd#(TLog#(nn),1))) j <- mkReg(0);
   let size = valueOf(nn);
   QR#(width, nn, tnum) qr <- mkqr();
   interface Put xin;
      method Action put(Terminating#(tnum) x);
         if (i+1 == fromInteger(size)) begin
            if (j+1 == fromInteger(size))
               j <= 0;
            else
               j <= j+1;
            i <= 0;
         end else begin
            i <= i+1;
         end
         if (j+1 == fromInteger(size)) begin
            Vector#(nn, tnum) column = newVector;
            for (Integer o = 0; o < size-1; o = o + 1)
               column[o] = xins[i][o];
            column[size-1] = x.data;
            qr.rowin[i].put(column);
         end else
            xins[i][j] <= x.
data; endmethod endinterface interface Get rout; method ActionValue#(tnum) get(); tnuim r <- qr.rowout[rcout].get(; if (rcout == fromInteger(valueof(width)-1)) rcout <= 0; else rcout <= rcout + 1; return r; endmethod endinterface 194 o + 1) , endmodule StreamQRTestFixedPoint (Empty); module let mkqr = mkQRCPFP; StreamQR#(Dim,CPFP) qr <- mkStreamQR(mkqr); mkStreamQR3Test(qr, 0.001); endmodule 195 48 MGS specific Scemi.bsv Author: Sunila Saqib saqibomit.edu "/ typedef Dim ScemiQRWidth; typedef CPFP ScemiQRData; typedef StremQR#(ScemiQRWidth, ScemiQRData) ScemiQR; (* synthesize *) module [odule] mkScemiQR(ScemiQR); let m <- mkStreamQR(mkQRCPFP); return m; endmodule module HModule] mkScemiDut(Clock qrclk, ScemiQR ifc); Reset myrst <- exposeCurrentReseto; Reset qrrst <- mkAsyncReset(1, myrst, qrclk); ScemiQR qr <- mkScemiQR(clocked-by qrclk, reset-by qrrst); ScemiQR myqr <- mkSyncStreamQR(qr, qrclk, qrrst); return myqr; endmodule module PSceMiModule] mkSceMiLayer(Clock qrclk, Empty ifc); SceMiClockConfiguration conf = defaultValue; SceMiClockPortIfc clk-port <- mkSceMiClockPort(conf); ScemiQR qr <- buildDut(mkScemiDut(qrclk), clk-port); Empty xin <- mkPutXactor(qr.xin, clk-port); Empty rout <- mkGetXactor(qr.rout, clk-port); Empty shutdown <- mkShutdownXactoro; endmodule (- synthesize *) module kTCPBridge 0; Clock myclk <- exposeCurrentClock; Empty scemi <- buildSceMi(mkSceMiLayer(myclk), TCP); endmodule 196 49 MGS Systolic specific FixedPointQR.bsv Author: Sunila Saqib saqib@mit.edu (* synthesize *) module JPipelinedMultiplierUGDSP (PipelinedMultiplier#(Stages, Bit#(BitLen))); PipelinedMultiplier#(Stages, Bit#(BitLen)) m <mkPipelinedMultiplierUGO; return m; endmodule (* synthesize *) (* doc = "synthesis attribute mult-style of mkPipelinedMultiplierUG-LUT is pipejlut" *) module gPipelinedMultiplierUGLUT (PipelinedMultiplier#(Stages, Bit#(BitLen))); PipelinedMultiplier#(Stages, Bit#(BitLen)) m <mkPipelinedMultiplierUGo; return m; endmodule (* synthesize *) module 
BMultiplierFP-DSP (Multiplier#(FP)); let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultip lier(mkPipelinedMultiplierG(mkPipelinedMultiplierUGDSP))); return m; endmodule (* synthesize *) JultiplierFPLUT (Multiplier#(FP)); module let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultip lier(mkPipelinedMultiplierG(mkPipelinedMultiplierUGLUT))); return m; endmodule (* synthesize *) LAtableFP(LAtable#(BitDis, FP,LAlutSize) ifc); module let tbl <- mkLAtableo; return tbl; endmodule (* synthesize *) module RLogtableFP(LogTable#(BitDis, FP,LoglutSize) ifc); let tbl <- mkLogTableo; return tbl; endmodule 197 (* synthesize *) module 9ExptableFP(ExpTable#(BitDisExp, FP,ExplutSize) ifc); let tbl <- mkExpTableo; return tbl; endmodule (* synthesize *) SqrtInvCPFP (SqrtInv#(CPFP)); module 7/ a. DSP based let mkmul = mkMultiplierFPDSP; // b. LUT based let mkmul = mkMultiplierFP16LUT; // // 1. LA /7 let mksi = mkSqrtInvFP(mkmul,mkLAtableFP); // 2. Log let mksi =mkSqrtInvLogFP(mkmul, 7/ mkLogtableFP,mkExptableFP); // // 3. 
NR let mksi = mkSqrtInvNRFP(mkmul); let m <- mkSqrtInvCPFP(mksi); return m; endmodule (* synthesize *) module JNormCPFP (Norm#(CPFP,Dim,PUarrSize)); let m <- return mkNorm(mkMultiplierFPDSP); m; endmodule (* synthesize *) DPCPFP(DP#(CPFP,Dim,PUarrSize)); module let mkmul = mkComplexMultiplier(mkMultiplierFPDSP); let m <- mkDP(mkmul, mkNormCPFP, mkSqrtInvCPFP); return m; endmodule (* synthesize *) module STPCPFP(TP#(CPFP,Dim,PUarrSize)); let mkmul = mkComplexMultiplier(mkMultiplierFPDSP); let m <- mkTP(mkmul); return m; endmodule QRCPFP(QR#(width,Dim,CPFP)) module provisos(QRtopModule#(width)); let mkdp = mkDP-CPFP; let mktp = mkTPCPFP; m <- mkQRtop(mkdp, let return m; mktp); 198 endmodule 199 50 MGS Systolic specific mkQR.bsv Author: Sunila Saqib saqib~mit.edu //make QR with width Greater Than ONE module [n] QRgtONE(m#(DP#(tnum,nTyp,mTyp)) mkdp, m#(TP#(tnum,nTyp,mTyp)) mktp, QR#(width,nTyp, tnum) ifc) provisos(Bits#(tnum, tnumnsz), Literal#(tnum), IsModule#(m,m__), QRtopModule#(TSub#(width,1)), Add#(1, b__,width)); UnitRow#(width,nTyp, tnum) rowi <- mkUnitRow(mkdp, mktp);//make one full row QR#(TSub#(width, 1),nTyp, tnum) subQR <mkQRtopModule(mkdp, mktp); mkConnection(rowl.qout, subQR.rowin); Vector#(width, Reg#(Bit#(TAdd#(1, TLog#(width))))) rowsTaken <- replicateM(mkReg(O)); Vector#(TSub#(width, 1), FIFO#(tnum)) subrouts <replicateM(mkFIFO); mkConnection(subQR.rowout, map(toPut, subrouts)); Vector#(width, Get#(tnum)) routs = newVector; for (Integer i = 0; i < valueof(width); i = i+1) begin routs[i] = interface Get method ActionValue#(tnum) geto; if (rowsTaken[i] == fromInteger(valueof(width)-1)) rowsTaken[i] <= 0; else begin rowsTaken[i] <= rowsTaken[i] + 1; if(rowsTaken[i] == 0) begin let r <- rowi.rout[i].geto; return r; end else if (i == 0) return 0; else begin let r <- toGet(subrouts[i-11).geto; return r; end end endmethod endinterface; end interface Put rowin = rowi.xin; interface Get rowout = routs; endmodule i mnake QR with width EQual to ONE 
module [m] mkQReqONE(m#(DP#(tnum,nTyp,mTyp)) mkdp,
                     m#(TP#(tnum,nTyp,mTyp)) mktp,
                     QR#(1,nTyp,tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));

    UnitRow#(1,nTyp,tnum) row1 <- mkUnitRow(mkdp, mktp);
    interface rowin = row1.xin;
    interface rowout = row1.rout;
endmodule

typeclass QRtopModule#(numeric type width);
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
                             m#(TP#(tnum,nTyp,mTyp)) mktp,
                             QR#(width,nTyp,tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
endtypeclass

instance QRtopModule#(1);
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
                             m#(TP#(tnum,nTyp,mTyp)) mktp,
                             QR#(1,nTyp,tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));
        QR#(1,nTyp,tnum) qrUnit <- mkQReqONE(mkdp, mktp);
        return qrUnit;
    endmodule
endinstance

instance QRtopModule#(width)
    provisos (QRtopModule#(TSub#(width,1)), Add#(1, width_m1, width));
    module [m] mkQRtopModule(m#(DP#(tnum,nTyp,mTyp)) mkdp,
                             m#(TP#(tnum,nTyp,mTyp)) mktp,
                             QR#(width,nTyp,tnum) ifc)
        provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum));
        QR#(width,nTyp,tnum) qrUnit <- mkQRgtONE(mkdp, mktp);
        return qrUnit;
    endmodule
endinstance

module [m] mkQRtop(m#(DP#(tnum,n,mTyp)) mkdp,
                   m#(TP#(tnum,n,mTyp)) mktp,
                   QR#(width,n,tnum) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz), Literal#(tnum),
             QRtopModule#(width));
    QR#(width,n,tnum) qrUnit <- mkQRtopModule(mkdp, mktp);
    return qrUnit;
endmodule

51 MGS Linear specific FixedPointQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

(* synthesize *)
module mkPipelinedMultiplierUGDSP(PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
    return m;
endmodule

(* synthesize *)
(* doc = "synthesis attribute mult_style of mkPipelinedMultiplierUGLUT is pipe_lut" *)
module mkPipelinedMultiplierUGLUT(PipelinedMultiplier#(Stages, Bit#(BitLen)));
    PipelinedMultiplier#(Stages, Bit#(BitLen)) m <- mkPipelinedMultiplierUG();
    return m;
endmodule

(* synthesize *)
module mkMultiplierFPDSP(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
                 mkPipelinedMultiplierG(mkPipelinedMultiplierUGDSP)));
    return m;
endmodule

(* synthesize *)
module mkMultiplierFPLUT(Multiplier#(FP));
    let m <- mkPipelinedMultiplierFixedPoint(mkDePipelinedMultiplier(
                 mkPipelinedMultiplierG(mkPipelinedMultiplierUGLUT)));
    return m;
endmodule

(* synthesize *)
module mkLAtableFP(LAtable#(BitDis, FP, LAlutSize) ifc);
    let tbl <- mkLAtable();
    return tbl;
endmodule

(* synthesize *)
module mkLogtableFP(LogTable#(BitDis, FP, LoglutSize) ifc);
    let tbl <- mkLogTable();
    return tbl;
endmodule

(* synthesize *)
module mkExptableFP(ExpTable#(BitDisExp, FP, ExplutSize) ifc);
    let tbl <- mkExpTable();
    return tbl;
endmodule

(* synthesize *)
module mkSqrtInvCPFP(SqrtInv#(CPFP));
    // a. DSP based
    let mkmul = mkMultiplierFPDSP;
    // b. LUT based
    // let mkmul = mkMultiplierFP16LUT;
    // 1. LA
    // let mksi = mkSqrtInvFP(mkmul, mkLAtableFP);
    // 2. Log
    // let mksi = mkSqrtInvLogFP(mkmul, mkLogtableFP, mkExptableFP);
    // 3. NR
    let mksi = mkSqrtInvNRFP(mkmul);
    let m <- mkSqrtInvCPFP(mksi);
    return m;
endmodule

(* synthesize *)
module mkNormCPFP(Norm#(CPFP, Dim, PUarrSize));
    let m <- mkNorm(mkMultiplierFPDSP);
    return m;
endmodule

(* synthesize *)
module mkDPCPFP(DP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFPDSP);
    let m <- mkDP(mkmul, mkNormCPFP, mkSqrtInvCPFP);
    return m;
endmodule

(* synthesize *)
module mkTPCPFP(TP#(CPFP, Dim, PUarrSize));
    let mkmul = mkComplexMultiplier(mkMultiplierFPDSP);
    let m <- mkTP(mkmul);
    return m;
endmodule

module mkQRCPFP(QR#(Dim, Dim, CPFP));
    let mkdp = mkDPCPFP;
    let mktp = mkTPCPFP;
    let m <- mkQRtop(mkdp, mktp);
    return m;
endmodule

52 MGS Linear specific mkQR.bsv

/* Author: Sunila Saqib saqib@mit.edu */

module [m] mkQRtop(m#(DP#(tnum,nTyp,mTyp)) mkdp,
                   m#(TP#(tnum,nTyp,mTyp)) mktp,
                   QR#(n,nTyp,tnum) ifc)
    provisos(IsModule#(m, m__), Literal#(tnum), Bits#(tnum, a__),
             Add#(TLog#(n),1,adsize), DefaultValue::DefaultValue#(tnum));

    Vector#(n, FIFO#(Vector#(nTyp,tnum))) xinFIFO <- replicateM(mkFIFO1);
    Vector#(n, FIFO#(tnum)) routFIFO <- replicateM(mkFIFO);
    Vector#(n, Put#(Vector#(nTyp,tnum))) xinPut;
    Vector#(n, Get#(tnum)) routGet;
    Vector#(n, FIFO#(Vector#(nTyp,tnum))) qFIFO <- replicateM(mkFIFO1);
    Reg#(Bit#(adsize)) counter  <- mkReg(0);
    Reg#(Bit#(adsize)) counterR <- mkReg(0);
    Reg#(Bit#(adsize)) counterQ <- mkReg(0);
    UnitRow#(n,nTyp,tnum) ur <- mkUnitRow(mkdp, mktp);
    Reg#(Bool) resetall <- mkReg(False);

    rule get_qout; // implicit guard: ur generates output
        for (Integer i = 0; i < valueof(TSub#(n,1)); i = i+1) begin
            let x <- ur.qout[i].get();
            if (counterQ != fromInteger(valueof(nTyp)-1))
                qFIFO[i].enq(x);
        end
        if (counterQ+1 == fromInteger(valueof(nTyp))) counterQ <= 0;
        else counterQ <= counterQ+1;
    endrule

    rule get_rout;
        Vector#(n, tnum) rs = newVector;
        for (Integer i = 0; i < valueof(n); i = i+1)
            rs[i] <- ur.rout[i].get;
        Vector#(n, tnum) shiftedRout = shiftOutFromN(defaultValue, rs, counterR);
        for (Integer i = 0; i < valueof(n); i = i+1)
            routFIFO[i].enq(shiftedRout[i]);
        if (counterR+1 == fromInteger(valueof(nTyp))) counterR <= 0;
        else counterR <= counterR+1;
    endrule

    rule putInput; // implicit guard: either xinFIFO has new input
        // and counter is 0 (machine = idle) .. or qFIFO has now the output
        // generated by unitrow and counter > 0 (in progress)
        for (Integer i = 0; i < valueof(n); i = i+1) begin
            if (counter == 0) begin
                let xins = xinFIFO[i].first();
                ur.xin[i].put(xins);
                xinFIFO[i].deq();
            end
            else begin
                if (i != fromInteger(valueof(TSub#(n,1)))) begin
                    let xins = qFIFO[i].first();
                    qFIFO[i].deq();
                    ur.xin[i].put(xins);
                end
                else begin
                    Vector#(nTyp,tnum) xins = replicate(0);
                    ur.xin[i].put(xins);
                end
            end
        end
        if (counter+1 == fromInteger(valueof(n))) counter <= 0;
        else counter <= counter+1;
    endrule

    xinPut = map(fifoToPut, xinFIFO);
    routGet = map(fifoToGet, routFIFO);
    interface rowin = xinPut;
    interface rowout = routGet;
endmodule

53 Multiplier.bsv

/* Author: Richard Uhler ruhler@mit.edu
   Revised by: Sunila Saqib saqib@mit.edu */

typedef Server#(Tuple2#(tnum, tnum), tnum) Multiplier#(type tnum);

module [m] mkDePipelinedMultiplier(m#(PipelinedMultiplier#(stages, tnum)) mkmul,
                                   Multiplier#(tnum) ifc)
    provisos(IsModule#(m, m__));
    PipelinedMultiplier#(stages, tnum) m <- mkmul;
    interface Put request = m.request;
    interface Get response = m.response;
endmodule

module [m] mkPipelinedMultiplierFixedPoint(m#(Multiplier#(Bit#(bLen))) mkmul,
                                           Multiplier#(FixedPoint#(is, fs)) ifc)
    provisos(IsModule#(m, m__), Add#(a__, TAdd#(is, fs), bLen));

    Multiplier#(Bit#(bLen)) mul <- mkmul;
    FIFO#(Bool) negative <- mkSizedFIFO(4);

    interface Put request;
        method Action put(Tuple2#(FixedPoint#(is, fs), FixedPoint#(is, fs)) x);
            match {.x0, .x1} = x;
            let s_x = fxptGetInt(x0) < 0;
            let s_y = fxptGetInt(x1) < 0;
            let a = s_x ? -x0 : x0;
            let b = s_y ? -x1 : x1;
            mul.request.put(tuple2(zeroExtend(pack(a)), zeroExtend(pack(b))));
            negative.enq((s_x && !s_y) || (!s_x && s_y));
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(FixedPoint#(is, fs)) get();
            Bit#(bLen) bits <- mul.response.get();
            FixedPoint#(is, fs) rv = unpack(bits[(2*valueof(fs)+valueof(is)-1):valueof(fs)]);
            Bool neg <- toGet(negative).get();
            return (neg ? -rv : rv);
        endmethod
    endinterface
endmodule

// A Multiplexed Multiplier.
module [m] mkMultiplexedMultiplier(m#(Multiplier#(tnum)) mkmul,
                                   Multiplier#(Vector#(n, tnum)) ifc)
    provisos(IsModule#(m, m__), Bits#(tnum, tnum_sz));

    FIFO#(Tuple2#(Vector#(n, tnum), Vector#(n, tnum))) infifo <- mkFIFO1;
    match {.out_g, .out_p} <- mkGPFIFO();
    Reg#(Vector#(n, tnum)) pending <- mkRegU();
    Multiplier#(tnum) multiplier <- mkmul;

    Reg#(Bit#(TAdd#(1, TLog#(n)))) inloc <- mkReg(0);
    rule domultiply (True);
        let a = tpl_1(infifo.first)[inloc];
        let b = tpl_2(infifo.first)[inloc];
        multiplier.request.put(tuple2(a, b));
        if (inloc + 1 == fromInteger(valueof(n))) begin
            infifo.deq();
            inloc <= 0;
        end
        else inloc <= inloc + 1;
    endrule

    Reg#(Bit#(TAdd#(1, TLog#(n)))) outloc <- mkReg(0);
    rule getresult (True);
        let res <- multiplier.response.get();
        let npending = pending;
        npending[outloc] = res;
        if (outloc + 1 == fromInteger(valueof(n))) begin
            outloc <= 0;
            out_p.put(npending);
        end
        else begin
            outloc <= outloc + 1;
            pending <= npending;
        end
    endrule

    interface Put request = toPut(infifo);
    interface Get response = out_g;
endmodule

module [m] mkComplexMultiplier3(m#(Multiplier#(tnum)) mkmul,
                                Multiplier#(Complex#(tnum)) ifc)
    provisos(Arith#(tnum), Bits#(tnum, tnum__), IsModule#(m, m__));

    Vector#(3, Multiplier#(tnum)) multiplier <- replicateM(mkmul);

    interface Put request;
        method Action put(Tuple2#(Complex#(tnum), Complex#(tnum)) x);
            match {.a, .b} = x;
            Vector#(3, tnum) as = ?;
            as[0] = a.rel - a.img;
            as[1] = b.rel - b.img;
            as[2] = b.rel + b.img;
            Vector#(3, tnum) bs = ?;
            bs[0] = b.img;
            bs[1] = a.rel;
            bs[2] = a.img;
            for (Integer i = 0; i < 3; i = i+1)
                multiplier[i].request.put(tuple2(as[i], bs[i]));
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Complex#(tnum)) get();
            Vector#(3, tnum) z = replicate(0);
            for (Integer i = 0; i < 3; i = i+1)
                z[i] <- multiplier[i].response.get();
            return cmplx(z[0] + z[1], z[0] + z[2]);
        endmethod
    endinterface
endmodule

54 PipelinedMultiplier.bsv

/* Author: Richard Uhler ruhler@mit.edu
   Revised by: Sunila Saqib saqib@mit.edu */

interface PipelinedMultiplier#(numeric type stages, type tnum);
    interface Put#(Tuple2#(tnum, tnum)) request;
    interface Get#(tnum) response;
endinterface

// Implementation of an unsigned multiplier which should be
// inferred as pipelined by the xilinx synthesis tools.
// Methods are not guarded. The response to a request will be
// available exactly stages cycles after the request is made,
// and will only be available for that one cycle.
// If you don't get the response on time, it will be dropped.
module mkPipelinedMultiplierUG(PipelinedMultiplier#(stages, Bit#(t)))
    provisos (Add#(sm1, 1, stages));

    Reg#(Bit#(t)) a <- mkRegU();
    Reg#(Bit#(t)) b <- mkRegU();
    Vector#(sm1, Reg#(Bit#(t))) shiftregs <- replicateM(mkRegU());

    (* fire_when_enabled *)
    (* no_implicit_conditions *)
    rule multiplyandshift (True);
        shiftregs[0] <= a * b;
        for (Integer i = 1; i < valueof(sm1); i = i+1) begin
            shiftregs[i] <= shiftregs[i-1];
        end
    endrule

    interface Put request;
        method Action put(Tuple2#(Bit#(t), Bit#(t)) operands);
            a <= tpl_1(operands);
            b <= tpl_2(operands);
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Bit#(t)) get();
            return shiftregs[valueof(sm1)-1];
        endmethod
    endinterface
endmodule

// Provide a semi-safe, semi-guarded interface to an unguarded
// multiplier.
// The get method is not enabled until a value is available to
// be gotten, but you must take the result as soon as it is
// available, otherwise it will be lost.
// To use this safely, ensure the rule which calls
// response.get fires when enabled (use the fire_when_enabled
// attribute), has no explicit condition, and has only
// response.get.ready as the implicit conditions.
// The unguarded multiplier module should be a module
// synthesized with (* synthesize *).
module [m] mkPipelinedMultiplierSG(m#(PipelinedMultiplier#(stages, tnum)) mkmul,
                                   PipelinedMultiplier#(stages, tnum) ifc)
    provisos(IsModule#(m, a__));

    PipelinedMultiplier#(stages, tnum) mul <- mkmul();
    PulseWire incoming <- mkPulseWire();
    Vector#(stages, Reg#(Bool)) valids <- replicateM(mkReg(False));

    (* fire_when_enabled *)
    (* no_implicit_conditions *)
    rule shift (True);
        valids[0] <= incoming;
        for (Integer i = 1; i < valueof(stages); i = i+1) begin
            valids[i] <= valids[i-1];
        end
    endrule

    interface Put request;
        method Action put(Tuple2#(tnum, tnum) operands);
            mul.request.put(operands);
            incoming.send();
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(tnum) get() if (valids[valueof(stages)-1]);
            tnum x <- mul.response.get();
            return x;
        endmethod
    endinterface
endmodule

// Provide a safe, guarded interface to an unguarded multiplier.
// The unguarded multiplier module should be a module
// synthesized with (* synthesize *).
module [m] mkPipelinedMultiplierG(m#(PipelinedMultiplier#(stages, tnum)) mkmul,
                                  PipelinedMultiplier#(stages, tnum) ifc)
    provisos(Bits#(tnum, tnum_sz), IsModule#(m, b__));

    PipelinedMultiplier#(stages, tnum) mul <- mkPipelinedMultiplierSG(mkmul);
    FIFOF#(tnum) results <- mkGSizedFIFOF(True, False, valueof(stages) + 1);
    Counter#(TAdd#(1, TLog#(stages))) pending <- mkCounter(0);

    (* fire_when_enabled *)
    rule takeresult (True);
        tnum x <- mul.response.get();
        results.enq(x);
    endrule

    interface Put request;
        method Action put(Tuple2#(tnum, tnum) operands)
                if (pending.value() < fromInteger(valueof(stages)));
            pending.up();
            mul.request.put(operands);
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(tnum) get();
            pending.down();
            results.deq();
            return results.first();
        endmethod
    endinterface
endmodule

55 GR specific ComplexFixedPointRotation.bsv

/* Author: Richard Uhler ruhler@mit.edu */

// This is a naive implementation. It has the following stages:
// 1. a = r.r*r.r + x.r*x.r + x.i*x.i  where r.i is assumed to be 0
// 2. r' = sqrt(a) using a fixed point square root.
// 3. c = r.r/r', s.r = x.r/r', s.i = x.i/r'  using fixed point divides
module [m] mkComplexFixedPointRotation(m#(Multiplier#(FixedPoint#(is, fs))) mkmul,
                                       Rotate#(Complex#(FixedPoint#(is, fs))) ifc)
    provisos(Add#(a__, fs, TMul#(2, fs)),
             Add#(b__, 1, TAdd#(is, TMul#(2, fs))),
             Mul#(2, fs, TAdd#(c__, fs)),
             IsModule#(m, m__));

    match {.in_g, .in_p} <- mkGPFIFO();
    match {.csout_g, .csout_p} <- mkGPFIFO();
    Multiplier#(Vector#(3, FixedPoint#(is, fs))) mul <- mkMultiplexedMultiplier(mkmul);
    SquareRoot#(FixedPoint#(is, fs)) sqrt <- mkFixedPointSquareRoot(1);
    Divider#(FixedPoint#(is, fs)) dr  <- mkFixedPointDivider(2);
    Divider#(FixedPoint#(is, fs)) dxr <- mkFixedPointDivider(2);
    Divider#(FixedPoint#(is, fs)) dxi <- mkFixedPointDivider(2);
    match {.xr_g, .xr_p} <- mkGPFIFO();
    GetPut#(Complex#(FixedPoint#(is, fs))) rout_gp <- mkGPFIFO();
    match {.rout_g, .rout_p} = rout_gp;

    rule domultiply (True);
        RotateInput#(Complex#(FixedPoint#(is, fs))) i <- in_g.get();
        xr_p.put(i);
        Vector#(3, FixedPoint#(is, fs)) as = ?;
        as[0] = i.r.rel; as[1] = i.x.rel; as[2] = i.x.img;
        Vector#(3, FixedPoint#(is, fs)) bs = ?;
        bs[0] = i.r.rel; bs[1] = i.x.rel; bs[2] = i.x.img;
        mul.request.put(tuple2(as, bs));
    endrule

    rule dosqrt (True);
        let z <- mul.response.get();
        sqrt.request.put(z[0] + z[1] + z[2]);
    endrule

    rule dodivide (True);
        match {.nr, .*} <- sqrt.response.get();
        let i <- xr_g.get();
        dr.request.put(tuple2(i.r.rel, nr));
        dxr.request.put(tuple2(i.x.rel, nr));
        dxi.request.put(tuple2(i.x.img, nr));
        rout_p.put(cmplx(nr, 0));
    endrule

    rule dofinalize (True);
        match {.cout, .*}  <- dr.response.get();
        match {.srout, .*} <- dxr.response.get();
        match {.siout, .*} <- dxi.response.get();
        csout_p.put(RotationCS {c: cmplx(cout, 0), s: cmplx(srout, siout)});
    endrule

    interface Put request = in_p;
    interface Get rout = rout_g;
    interface Get csout = csout_g;
endmodule

56 GR Systolic specific StreamQR.bsv

/* Author: Richard Uhler ruhler@mit.edu */

// Stream QR
// An implementation of QR with an interface where each element
// of the matrix is input one at a time, from first row
// to last, and within each row from first column to last.
// Every input belonging to the last row of a matrix should be
// annotated as such.
interface StreamQR#(numeric type width, type tnum);
    interface Put#(Terminating#(tnum)) xin;
    interface Get#(tnum) rout;
endinterface

// Cross Clock Domain StreamQR
//   sqr  - stream qr in the source clock domain
//   sclk - the source clock
//   srst - the source reset
// Implements a StreamQR in the current clock domain.
module mkSyncStreamQR(StreamQR#(w, t) sqr, Clock sclk, Reset srst,
                      StreamQR#(w, t) ifc)
    provisos(Bits#(t, t_sz));

    Clock myclk <- exposeCurrentClock();
    Reset myrst <- exposeCurrentReset();
    SyncFIFOIfc#(Terminating#(t)) xi <- mkSyncFIFO(2, myclk, myrst, sclk);
    SyncFIFOIfc#(t) ro <- mkSyncFIFO(2, sclk, srst, myclk);
    mkConnection(toGet(xi), sqr.xin);
    mkConnection(sqr.rout, toPut(ro));
    interface Put xin = toPut(xi);
    interface Get rout = toGet(ro);
endmodule

57 Divider.bsv

// The MIT License
//
// Copyright (c) 2010 Massachusetts Institute of Technology
//
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to
// whom the Software is furnished to do so, subject to the
// following conditions:
//
// The above copyright notice and this permission notice shall
// be included in all copies or substantial portions of the
// Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
// KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
// WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
// PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
// OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
// OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
// OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.
//
// Author: Richard Uhler ruhler@mit.edu

typedef Server#(Tuple2#(word, word), Tuple2#(word, word)) Divider#(type word);

// Unsigned division
// Input: n, d
// Output: q, r where n = d*q + r.
// This implementation uses the non-restoring algorithm (or so
// I'm told).
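As a cross-check on the non-restoring algorithm mentioned above, the per-bit recurrence that the hardware iterates can be modeled in software. This is an illustrative Python sketch only (the function name and bit manipulation are ours, not part of the thesis sources); it mirrors the shift/add-or-subtract step and the final remainder correction:

```python
def nonrestoring_divide(n, d, ws):
    """Software model of a ws-bit non-restoring divider: returns (q, r) with n = d*q + r."""
    x, p = n, 0                      # x: dividend/quotient shift register, p: partial remainder
    mask = (1 << ws) - 1
    for _ in range(ws):
        top_x = (x >> (ws - 1)) & 1  # next dividend bit, MSB first
        if (p >> (ws - 1)) & 1:      # partial remainder negative: shift in bit, add divisor back
            p = (((p << 1) | top_x) + d) & mask
        else:                        # partial remainder non-negative: shift in bit, subtract
            p = (((p << 1) | top_x) - d) & mask
        top_p = (p >> (ws - 1)) & 1
        x = ((x << 1) | (1 - top_p)) & mask  # quotient bit is the complement of sign(p)
    if (p >> (ws - 1)) & 1:          # final correction if the remainder ended up negative
        p = (p + d) & mask
    return x, p

# e.g. 100 / 7 with an 8-bit datapath -> quotient 14, remainder 2
q, r = nonrestoring_divide(100, 7, 8)
```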
// itersPerCycle is an integer specifying how many iterations
// of the algorithm should be performed each clock cycle.
// This number should divide the bit width evenly.
module mkNonRestoringDivider(Integer itersPerCycle,
                             Divider#(Bit#(ws)) ifc)
    provisos(Add#(a__, 1, ws));

    if (valueof(ws) % itersPerCycle != 0) begin
        error("itersPerCycle must evenly divide operand bit width");
    end

    FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) incoming <- mkFIFO();
    FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) outgoing <- mkFIFO();
    Reg#(Bool) busy <- mkReg(False);
    Reg#(Bit#(ws)) xReg <- mkRegU();
    Reg#(Bit#(ws)) dReg <- mkRegU();
    Reg#(Bit#(ws)) pReg <- mkRegU();
    Reg#(Bit#(TAdd#(1, TLog#(ws)))) iReg <- mkRegU();

    rule start (!busy);
        match {.ix, .id} <- toGet(incoming).get();
        busy <= True;
        dReg <= id;
        xReg <= ix;
        pReg <= 0;
        iReg <= 0;
    endrule

    // Return the most significant bit of a bit vector
    function Bit#(1) top(Bit#(n) x) = x[valueof(n)-1];

    // Return all but the most significant bit of a bit vector
    function Bit#(TSub#(n,1)) rest(Bit#(n) x) = x[valueof(n)-2:0];

    // Returns new x, p after a single iteration of the
    // division algorithm.
    function Tuple2#(Bit#(ws), Bit#(ws)) iterate(Bit#(ws) x, Bit#(ws) p, Bit#(ws) d);
        if (top(p) == 1) begin
            p = ((p << 1) | zeroExtend(top(x))) + d;
        end
        else begin
            p = ((p << 1) | zeroExtend(top(x))) - d;
        end
        x = (x << 1) | zeroExtend(~top(p));
        return tuple2(x, p);
    endfunction

    rule doiterate (busy);
        Bit#(ws) x = xReg;
        Bit#(ws) p = pReg;
        for (Integer i = 0; i < itersPerCycle; i = i+1) begin
            let iout = iterate(x, p, dReg);
            x = tpl_1(iout);
            p = tpl_2(iout);
        end
        if (iReg + fromInteger(itersPerCycle) == fromInteger(valueof(ws))) begin
            if (top(p) == 1) p = p + dReg;
            outgoing.enq(tuple2(x, p));
            busy <= False;
        end
        else begin
            xReg <= x;
            pReg <= p;
            iReg <= iReg + fromInteger(itersPerCycle);
        end
    endrule

    interface Put request = toPut(incoming);
    interface Get response = toGet(outgoing);
endmodule

// Fixed Point divider
module mkFixedPointDivider(Integer itersPerCycle,
                           Divider#(FixedPoint#(iw, fw)) ifc)
    provisos(Add#(a__, TAdd#(iw, fw), TAdd#(iw, TMul#(2, fw))),
             Add#(b__, 1, TAdd#(iw, TMul#(2, fw))));

    // We use the integer division algorithm to do fixed point
    // division. If you pack a fixed point number x,
    // you get a new number which is x * 2^fw.
    // If you unpack a number y into a fixed point number, you
    // get a fixed point number which is y / 2^fw.
    // Let the inputs be two fixed point numbers a, b. We want
    // to generate the fixed point number a / b.
    // We pack both numbers, multiply the numerator by 2^fw,
    // perform integer division and get the integer:
    //   (a * 2^fw) * 2^fw / (b * 2^fw) = (a/b) * 2^fw
    // Simply unpack and we get our fixed point a/b.
    Divider#(Bit#(TAdd#(iw, TMul#(2, fw)))) div <- mkNonRestoringDivider(itersPerCycle);
    FIFO#(Bool) negate <- mkFIFO();

    interface Put request;
        method Action put(Tuple2#(FixedPoint#(iw, fw), FixedPoint#(iw, fw)) x);
            match {.a, .b} = x;
            Bool neg = False;
            if (pack(fxptGetInt(a))[valueof(iw)-1] == 1'b1) begin
                neg = !neg;
                a = -a;
            end
            if (pack(fxptGetInt(b))[valueof(iw)-1] == 1'b1) begin
                neg = !neg;
                b = -b;
            end
            negate.enq(neg);
            Bit#(TAdd#(iw, fw)) aa = pack(a);
            Bit#(TAdd#(iw, TMul#(2, fw))) aaa = zeroExtend(aa);
            aaa = aaa << valueof(fw);
            Bit#(TAdd#(iw, fw)) bb = pack(b);
            Bit#(TAdd#(iw, TMul#(2, fw))) bbb = zeroExtend(bb);
            div.request.put(tuple2(aaa, bbb));
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Tuple2#(FixedPoint#(iw, fw), FixedPoint#(iw, fw))) get();
            match {.q, .r} <- div.response.get();
            Bit#(TAdd#(iw, fw)) qb = q[valueof(iw)+valueof(fw)-1:0];
            FixedPoint#(iw, fw) qf = unpack(qb);
            if (negate.first()) qf = -qf;
            negate.deq();
            return tuple2(qf, ?);
        endmethod
    endinterface
endmodule

58 SquareRoot.bsv

// The MIT License
// Copyright (c) 2010 Massachusetts Institute of Technology
// (Full license text identical to Divider.bsv above.)
//
// Author: Richard Uhler ruhler@mit.edu

typedef Server#(word, Tuple2#(word, word)) SquareRoot#(type word);

// Square root
// Input: x
// Output: q, r where x = q*q + r.
// Algorithm developed based on description at
// http://www.itl.nist.gov/div897/sqg/dads/HTML/squareRoot.html
// itersPerCycle specifies how many iterations of the algorithm
// to perform each clock cycle. This should divide evenly half
// the bitwidth of the operand.
module mkSquareRoot(Integer itersPerCycle,
                    SquareRoot#(Bit#(ws)) ifc);

    if (valueof(ws) % (2 * itersPerCycle) != 0) begin
        error("itersPerCycle must evenly divide half the bitwidth");
    end

    FIFO#(Bit#(ws)) incoming <- mkFIFO();
    FIFO#(Tuple2#(Bit#(ws), Bit#(ws))) outgoing <- mkFIFO();
    Reg#(Bool) busy <- mkReg(False);
    Reg#(Bit#(ws)) qReg <- mkRegU();
    Reg#(Bit#(ws)) rReg <- mkRegU();
    Reg#(Bit#(ws)) xReg <- mkRegU();
    Reg#(Bit#(TLog#(ws))) iReg <- mkRegU();

    rule start (!busy);
        let xi <- toGet(incoming).get();
        busy <= True;
        xReg <= xi;
        rReg <= 0;
        qReg <= 0;
        iReg <= 0;
    endrule

    // Get the top 2 bits of a bit vector
    function Bit#(2) top2(Bit#(n) a) = a[valueof(n)-1:valueof(n)-2];

    // Perform a single iteration of the square root algorithm.
    // Returns new r, q, x
    function Tuple3#(Bit#(ws), Bit#(ws), Bit#(ws)) iterate(Bit#(ws) r, Bit#(ws) q, Bit#(ws) x);
        Bit#(TAdd#(2, ws)) d = {q, 2'b01};
        Bit#(TAdd#(2, ws)) r2 = {r, top2(x)};
        if (d > r2) begin
            q = q << 1;
            r = r2[valueof(ws)-1:0];
        end
        else begin
            q = (q << 1) | 1;
            Bit#(TAdd#(2, ws)) diff = r2 - d;
            r = diff[valueof(ws)-1:0];
        end
        return tuple3(r, q, x << 2);
    endfunction

    rule doiterate (busy);
        Bit#(ws) r = rReg;
        Bit#(ws) q = qReg;
        Bit#(ws) x = xReg;
        for (Integer i = 0; i < itersPerCycle; i = i+1) begin
            let iout = iterate(r, q, x);
            r = tpl_1(iout);
            q = tpl_2(iout);
            x = tpl_3(iout);
        end
        if (iReg + fromInteger(itersPerCycle) == fromInteger(valueof(ws)/2)) begin
            outgoing.enq(tuple2(q, r));
            busy <= False;
        end
        else iReg <= iReg + fromInteger(itersPerCycle);
        rReg <= r;
        qReg <= q;
        xReg <= x;
    endrule

    interface Put request = toPut(incoming);
    interface Get response = toGet(outgoing);
endmodule

// Fixed Point square root
module mkFixedPointSquareRoot(Integer itersPerCycle,
                              SquareRoot#(FixedPoint#(iw, fw)) ifc)
    provisos(Add#(a__, TAdd#(iw, fw), TAdd#(iw, TMul#(2, fw))));

    // We use the integer square root algorithm to do fixed
    // point square root. If you pack a fixed point number x,
    // you get a new number which is x * 2^fw.
    // If you unpack a number y into a fixed point number, you
    // get a fixed point number which is y / 2^fw.
    // Let the input be a fixed point number x. We want to
    // generate the fixed point number sqrt(x).
    // We pack x, multiply by 2^fw, perform integer square
    // root, and get the integer:
    //   sqrt((x * 2^fw) * 2^fw) = sqrt(x) * 2^fw
    // Simply unpack and we get our fixed point sqrt(x).
    SquareRoot#(Bit#(TAdd#(iw, TMul#(2, fw)))) sqrt <- mkSquareRoot(itersPerCycle);

    interface Put request;
        method Action put(FixedPoint#(iw, fw) x);
            Bit#(TAdd#(iw, fw)) xx = pack(x);
            Bit#(TAdd#(iw, TMul#(2, fw))) xxx = zeroExtend(xx);
            xxx = xxx << valueof(fw);
            sqrt.request.put(xxx);
        endmethod
    endinterface
    interface Get response;
        method ActionValue#(Tuple2#(FixedPoint#(iw, fw), FixedPoint#(iw, fw))) get();
            match {.q, .r} <- sqrt.response.get();
            Bit#(TAdd#(iw, fw)) qb = q[valueof(iw)+valueof(fw)-1:0];
            FixedPoint#(iw, fw) qf = unpack(qb);
            return tuple2(qf, ?);
        endmethod
    endinterface
endmodule
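The bit-pair square-root loop and the 2^fw scaling trick described in the comments above can be cross-checked with a small software model. This is an illustrative Python sketch only (the names are ours, not part of the thesis sources); it follows the same trial value {q, 2'b01} and conditional subtraction as the hardware iterate function:

```python
def int_sqrt(x, ws):
    """Software model of the bit-pair square root: returns (q, r) with x = q*q + r."""
    q, r = 0, 0
    mask = (1 << ws) - 1
    for _ in range(ws // 2):
        top2 = (x >> (ws - 2)) & 0b11  # consume the next two most significant bits of x
        x = (x << 2) & mask
        d = (q << 2) | 0b01            # trial subtrahend {q, 2'b01}
        r2 = (r << 2) | top2
        if d > r2:                     # trial subtraction would go negative: q gets a 0 bit
            q = q << 1
            r = r2
        else:                          # subtraction fits: q gets a 1 bit
            q = (q << 1) | 1
            r = r2 - d
    return q, r

# Fixed-point use, as in the comments: scale x by 2^(2*fw), take the integer
# root, and reinterpret the result with fw fractional bits.
fw = 4
q, _ = int_sqrt(int(2.25 * (1 << (2 * fw))), 12)  # sqrt(2.25) in a Q(*, 4) format
# q / 2**fw recovers 1.5, matching sqrt((x * 2^fw) * 2^fw) = sqrt(x) * 2^fw
```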