10/06/2010

MIDTERM PRESENTATION
SUB-NYQUIST SYSTEM OPTIMIZATION

By: Daniel Barsky, Natalie Pistunovich
Supervisors: Rolf Hilgendorf, Inna Rivkin

AGENDA
Overview
Project objectives
Hardware
Introduction to CAD tools
Detailed data flow through the system
Points of possible improvement:
Hardware utilization summary
Latencies summary
Implementation
Architecture
Algorithm
Selected improvement to be implemented
Timeline
Future plans
Project status
Gantt chart

PROJECT OBJECTIVES
In the short run: optimize the algorithm to use minimal hardware, so that it fits on 2 FPGA chips, while maintaining minimum latency.
In the long run: determine an optimal architecture to be implemented on chip (ASIC).

HARDWARE
GIDEL PROCStar III card
  4 x Altera Stratix III FPGA
  Maximum 1 GB DDR DRAM
Altera Stratix-III EP3SE260
  255K Logic Elements
  768 18x18 bit multipliers*
  Max. frequency: ~300MHz
* In FIR mode

INTRODUCTION TO CAD TOOLS
To get acquainted with the different CAD tools in use, we constructed a model design and ran it through the entire flow, until it was burned onto the card and run there.

OVERVIEW – DATA FLOW (NORMAL MODE)
Incoming samples: arrive at 60MHz.
Expander: samples are filtered and decimated to 12 channels of 20MHz and sent to the Memory and the Q-Frame. In Iteration Mode, samples are further filtered and decimated to 12 channels of 2MHz in each iteration and sent to the CTF.
Memory: stores samples for later reconstruction, each with the appropriate support index.
Q-Frame (CTF): collects 70 samples, calculates the Q-Frame and sends it to the OMP. In Iteration Mode, a Q-Frame is constructed, and a support is calculated and accumulated, in each iteration.
SCD: checks for a significant change in the support; if one is detected, it initiates the calculation of a new support.
OMP (CTF): calculates the support from the Q-Frame.
Then, it sends the support to the Pseudo-Inverse.
Samples are also sent to the SCD to check for a change in the support.
Pseudo-Inverse (DSP): recovers the columns of the support from matrix A, constructs their pseudoinverse and sends it to the Reconstruction.
Reconstruction & Support Change Detection (DSP): reconstructs data from the input samples using the pseudoinverse.

EXPANDER – NORMAL MODE
[Block diagram: Analog System + A/D feeding 60 MHz, 12-bit inputs spanning -30 MHz to 30 MHz; each input is split into bands 1-3 and low-pass filtered (LPF) into slices spanning -10 MHz to 10 MHz. Outputs: 20 MHz samples to the Memory and to the DSP, 12 samples of 20MHz each to the CTF.]
The expander (master) sends 12 20MHz slices to the CTF (slave) each cycle.
The expander sends new 20MHz slices to the Memory and to the DSP each cycle.

EXPANDER – ITERATION MODE
[Block diagram: same structure as in normal mode, followed by a further LPF stage that reduces each -10 MHz to 10 MHz slice to a slice spanning -1 MHz to 1 MHz. Annotations on the filter stages: 3x80x(20/180) and 10x40x(2/180). Outputs: 20 MHz samples to the Memory and to the DSP, 12 samples of 2MHz each to the CTF.]
The expander (master) sends 12 2MHz slices to the CTF (slave) each cycle.
Once the CTF requests new ones, the expander changes and sends them.
The expander sends new 20MHz slices to the Memory and to the DSP each cycle.

EXPANDER - SOURCE UTILIZATION
[Formulas: number of multipliers in Normal mode and in Iteration mode as a function of the operating frequency Foperating]
Multipliers    @120 MHz    @180 MHz
Normal         496         336
Iteration      672         459

EXPANDER - LATENCIES
Normal: $240/3 + 1 = 81$ cycles @ 60MHz
Iterations: $(240/3 + 1)$ cycles @ 60MHz $+\ (400/10 + 1)$ cycles @ 20MHz $= 81 + 3 \times 41 = 204$ cycles @ 60MHz

Cycles       Normal    Iterations
@60MHz       81        204
@120MHz      162       408
@180MHz      243       612
Time (us)    1.35      3.4

CTF – Q-FRAME CONSTRUCTION
Constructs the Q-Frame for the support calculation, and sends it to the Q-Frame Block.
[Block diagram: Input Channels From Expander, Conversion To Complex, Vector Multiplier, Q-Frame Memory (Mem A 5kbit, Mem B 5kbit), Controller, Support Accumulator. Signals: Q-Frame entries To OMP (12x12x18 bit complex), Support Vector From OMP (12 bit), Support Length Vector To DSP, Support Indices To DSP. Other bus widths shown: 3x2x18 bit, 4 bit, 7x12 bit.]

CTF – Q-FRAME – VECTOR MULTIPLIER
Receives a vector of 12 18-bit complex samples from the Expander (Y[1..12]).
Calculates $Q^{i}_{12\times 12} = Y_{12\times 1}\, Y^{H}_{1\times 12}$ in 2 clock cycles.
[Figure: the Vector Multiplier takes Y1..Y12 and Y1^H..Y12^H as inputs.]

CTF – Q-FRAME – VECTOR MULTIPLIER
On the 1st cycle, calculates and stores the first 3 columns.
Requires: 33 complex 18x18 multipliers (12 + 11 + 10 products, consistent with computing only one triangle of the Hermitian matrix).
[Figure: y1..y12 multiplied by y1^H..y3^H; the entries Q1,1..Q12,3 are stored as Columns 1-3 of the Memory Bank, which holds Columns 1-12.]

CTF – Q-FRAME – VECTOR MULTIPLIER
On the 2nd cycle, calculates and stores the last 9 columns.
Requires: 45 complex 18x18 multipliers (9 + 8 + ... + 1 products).
[Figure: y1..y12 multiplied by y1^H..y12^H; the entries Q1,1..Q12,12 fill Columns 1-12 of the Memory Bank.]

CTF – Q-FRAME - SUMMARY
Uses 45 18x18 bit Complex Multipliers (45 DSP Half-Blocks)
Latency:
Normal Mode: 70 (number of samples per frame) x 1 / (input rate)
Iteration Mode: 70 (number of samples per frame) x 1 / (input rate)
Normal Mode input rate of 20 MHz: $70 \times \frac{1}{20\,\text{MHz}} = 3.5$ usec
Iteration Mode input rate of 2MHz: $70 \times \frac{1}{2\,\text{MHz}} = 35$ usec
Independent of system clock frequency!

CTF – SUPPORT CALCULATION
Calculates the signal's support from the Q-Frame using the Orthogonal Matching Pursuit (OMP) algorithm, over several iterations.
[Block diagram: Support Calculation block containing the OMP, Matrix Multiplier, Support Merge and A Matrix Memory. Input: Q-Frame entries from Q-Frame (12x12x18 bit complex). Output: Support Vector to Q-Frame (12 bit).]

CTF – SUPPORT CALCULATION - OMP
Initialization: the Q-Frame is loaded into the residual matrix
1 cycle
[Figure: Q-Frame copied into the Residual Matrix.]

CTF – SUPPORT CALCULATION - OMP
Phase 1: Projection
101 cycles
144 18x18 Complex multipliers
[Figure: Z is obtained by multiplying A^H with the Residual.]

CTF – SUPPORT CALCULATION - OMP
Phase 2: Energy Calculation, Find Maximum Energy & Update Support
101 cycles
12 18x18 Complex multipliers
[Figure: for each of Z1..Z101 the energy |Zi|^2 is calculated from Zi and Zi^H; the maximum energy updates the Current Support.]

CTF – SUPPORT CALCULATION - OMP
Phase 4: Vector Orthogonalization
Number of cycles depends on the iteration (on the i-th iteration: 2i cycles)
12 18x18 Complex Multipliers
[Figure: the column of A selected by the Current Support is orthogonalized against the Previous Orthogonal Vectors Wj, giving Vsupport.]

CTF – SUPPORT CALCULATION - OMP
Phase 5: Vector Normalization
2 cycles + (square root calculation time)
12 18x18 Complex Multipliers
$W_{support} = \frac{1}{\lVert V_{support} \rVert_{2}}\, V_{support}$, with $\lVert V_{support} \rVert_{2}^{2} = V_{support}^{H} V_{support}$
Wsupport is added to the Previous Orthogonal Vectors.

CTF – SUPPORT CALCULATION - OMP
Phase 6: Residual Matrix Update
14 cycles
144 18x18 Complex Multipliers
[Figure: the Residual is updated using Wsupport and Wsupport^H.]

CTF – SUPPORT CALCULATION - OMP
Phase 7: Residual Matrix Energy Calculation & Stopping Condition Check
13 cycles
12 18x18 Complex Multipliers
Flow: Residual, Calculate Column Energy, Calculate Overall Energy, then the Energy is checked against the
threshold.

CTF – OMP - SUMMARY
Uses 144 18x18 bit Complex Multipliers (144 DSP Half-Blocks)
Latency:
Normal Mode: ~1100 clock cycles
  6.1 usec at 180MHz
  9.1 usec at 120MHz
Iteration Mode: ~2560 clock cycles per iteration
  14.2 usec per iteration at 180MHz
  21.3 usec per iteration at 120MHz
  (this latency is hidden within the Q-Frame construction latency of the next iteration, which is 35 usec per iteration)

DSP – THE FLOW
[Block diagram: the Memory and the Expander supply Samples Y to the DSP; the CTF supplies the Support, and Matrix A comes from External memory; the DSP contains the Pseudo Inverse and the Support Change detector and outputs the Reconstructed signal.]

DSP – QR DECOMPOSITION
$A_{m\times n} \rightarrow Q_{m\times m}$ (Matrix A and the Support are the inputs to the QR Decomposition)
261 cycles
51 multipliers, 1 sqrt

DSP – QR DECOMPOSITION
$Q^{T} \times A = R$
12 cycles
144 multipliers

DSP - RINV
$R \rightarrow R^{-1}$
156 cycles
1 multiplier, 1 divide

DSP – PSEUDO INVERSE
$\begin{bmatrix} R^{-1} & 0 \end{bmatrix} \times Q^{T} = A^{\dagger}$
12 cycles
144 multipliers

DSP - RECONSTRUCTION
$A^{\dagger} \times Y = Z$ (Y samples are read from the Memory)
1 cycle @ 20MHz
144 multipliers

DSP – SUPPORT CHANGE DETECTOR
[Block diagram: the Pseudo Inverse supplies A† to the Support Change detector, which evaluates $A^{\dagger} \times Y = Z$ and raises "Support changed".]

DSP - SUMMARY
             Multipliers    Cycles    us
@120MHz      144            441       3.6
@180MHz      144            441       2.45
(441 cycles = 261 + 12 + 156 + 12)

THE SYSTEM TODAY – STATUS
FPGA utilization: FPGA 1: 73%, FPGA 2: 98%, FPGA 3: 75%
[Block diagram: Memory, Expander, CTF (Q-Frame, OMP), DSP (A†), Support Change Detector, Controller.]
Timeline, from a new incoming sample to a sample ready for reconstruction:
Expander Delay: 1.3usec
Q-Frame Delay: 3.5usec
OMP Delay: 6usec
Pseudo-Inverse Delay: 2.4usec
Reconstruction Delay

POSSIBLE IMMEDIATE IMPROVEMENTS
Use a Matrix Multiplication Unit
Extend the Q-Frame calculation
Reconstruction using the Matrix Multiplication Unit
[Block diagram: Memory, Expander, CTF (Q-Frame, OMP), DSP, Support Change Detector, Controller; annotation: 1 divide.]

SYSTEM ARCHITECTURE – OUR SUGGESTION
FPGA#1 – Expander
FPGA#2 – Matrix Multiplication
[Block diagram: Memory, Expander, Controller, Matrix multiplication unit, CTF (Q-Frame, OMP), DSP (Pseudo-Inverse, Reconstruction & Support Change Detection).]
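
The cycle counts and delays quoted on the Expander, Q-Frame and DSP slides can be cross-checked with a few lines of arithmetic. The Python sketch below is only an illustration: all function and variable names are hypothetical, and the split of the Expander latency into 240/3 + 1 cycles at 60 MHz plus 400/10 + 1 cycles at 20 MHz is an assumed reading of the latency slide, chosen because it reproduces the quoted totals of 81 and 204 cycles.

```python
# Cycle counts as quoted on the DSP slides. The dictionary layout and every
# name in this sketch are illustrative assumptions, not part of the design.
DSP_STAGE_CYCLES = {
    "qr_decomposition": 261,  # A -> Q
    "r_from_qt_a": 12,        # Q^T x A = R
    "r_inverse": 156,         # R -> R^-1
    "pseudo_inverse": 12,     # [R^-1 0] x Q^T = A-dagger
}


def cycles_to_usec(cycles, clock_mhz):
    """Convert a cycle count at clock_mhz (in MHz) to microseconds."""
    return cycles / clock_mhz


def expander_cycles_at_60mhz(iteration_mode):
    """Expander latency expressed in 60 MHz cycles (assumed formula)."""
    first_stage = 240 // 3 + 1        # 81 cycles, counted at 60 MHz
    if not iteration_mode:
        return first_stage
    second_stage = 400 // 10 + 1      # 41 cycles, counted at 20 MHz
    return first_stage + second_stage * (60 // 20)


def qframe_latency_usec(samples_per_frame, input_rate_mhz):
    """Q-Frame latency: it depends only on the input rate, not on the clock."""
    return samples_per_frame / input_rate_mhz


if __name__ == "__main__":
    dsp_total = sum(DSP_STAGE_CYCLES.values())
    print("DSP cycles:", dsp_total)
    for clock in (120, 180):
        print(f"DSP latency @ {clock} MHz: {cycles_to_usec(dsp_total, clock)} usec")
    for mode in (False, True):
        cycles = expander_cycles_at_60mhz(mode)
        print("Expander cycles @ 60 MHz:", cycles, "=", cycles_to_usec(cycles, 60), "usec")
    print("Q-Frame, normal mode:", qframe_latency_usec(70, 20), "usec")
    print("Q-Frame, iteration mode:", qframe_latency_usec(70, 2), "usec")
```

Keeping each stage as a separate entry makes it easy to see which stage dominates (here the QR decomposition and the inversion of R) when weighing the improvements listed on the preceding slides.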

PLANS FOR THE FUTURE
Consider rank-1 updates for a change in the support.
Consider changing the QR decomposition algorithm in the DSP: Householder, modified Gram-Schmidt.
Consider another decomposition in place of QR: LQ, SVD, etc.
Consider another MP algorithm in place of OMP: BMP, Convex Optimization, etc.

PROJECT STATUS
Progress axis: DONE, PRESENT, FUTURE
System Analysis:
  Studying the system's algorithm
  Understanding algorithm implementation
  Analyzing hardware usage & latency
  Locating points of possible optimization
Current System Simulation:
  Creating entire current system test environment
  Simulating entire current system
System Optimization:
  Selecting optimizations to be implemented
  Implementing optimizations
  Simulating optimized system

GANTT CHART
[Gantt chart. Time axis: 14/02/2010, 06/03/2010, 26/03/2010, 15/04/2010, 05/05/2010, 25/05/2010, 14/06/2010, 04/07/2010, 24/07/2010. Tasks, in order:]
Learning the system
Understanding EXP algorithm
Understanding CTF algorithm
Understanding DSP algorithm
Understanding SCD algorithm
Understanding algorithm for the whole system
Preparing characterization presentation
Understanding EXP implementation
Understanding CTF implementation
Understanding DSP implementation
Understanding SCD implementation
Understanding implementation for the whole system
Mapping out the data flow through the system
Learning CAD systems
Characterizing optimal architecture outline
Midterms
Preparing midterm presentation
Presentation for Yonina
Constructing full system logical test environment
Full system logic simulation
Exam period
Preparing final A presentation
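
The DSP slides build the pseudoinverse as R^-1 times Q^T and then reconstruct Z from Y, and the future plans name modified Gram-Schmidt as a candidate QR algorithm. The sketch below walks through that chain in software under several assumptions that are not taken from the slides: it is plain Python rather than a hardware description, it uses a thin QR (Q of size m x n) instead of the full m x m Q, it uses the conjugate transpose in place of Q^T because the samples are complex, and it assumes A has full column rank. All function names are hypothetical.

```python
def mgs_qr(a):
    """Thin QR by modified Gram-Schmidt.

    a is an m x n matrix given as a list of rows, assumed to have full
    column rank. Returns (q_cols, r): q_cols holds the n orthonormal
    columns of Q (each of length m), r is n x n upper triangular.
    """
    m, n = len(a), len(a[0])
    cols = [[complex(a[i][j]) for i in range(m)] for j in range(n)]
    r = [[0j] * n for _ in range(n)]
    q_cols = []
    for j in range(n):
        norm = sum(abs(x) ** 2 for x in cols[j]) ** 0.5
        r[j][j] = complex(norm)
        q_j = [x / norm for x in cols[j]]
        q_cols.append(q_j)
        # Remove the q_j component from every remaining column right away;
        # this immediate update is what makes the scheme "modified".
        for k in range(j + 1, n):
            proj = sum(qx.conjugate() * cx for qx, cx in zip(q_j, cols[k]))
            r[j][k] = proj
            cols[k] = [cx - proj * qx for cx, qx in zip(cols[k], q_j)]
    return q_cols, r


def upper_triangular_inverse(r):
    """Invert an upper triangular matrix by back substitution."""
    n = len(r)
    inv = [[0j] * n for _ in range(n)]
    for j in range(n):                  # solve R x = e_j, column by column
        for i in range(j, -1, -1):      # entries below row j stay zero
            rhs = 1.0 if i == j else 0.0
            acc = sum(r[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = (rhs - acc) / r[i][i]
    return inv


def pseudo_inverse(a):
    """A-dagger = R^-1 x Q^H, an n x m matrix."""
    q_cols, r = mgs_qr(a)
    r_inv = upper_triangular_inverse(r)
    n, m = len(r), len(a)
    return [[sum(r_inv[i][k] * q_cols[k][t].conjugate() for k in range(n))
             for t in range(m)] for i in range(n)]


def reconstruct(a_dagger, y):
    """Z = A-dagger x Y for a single sample vector y."""
    return [sum(row[t] * y[t] for t in range(len(y))) for row in a_dagger]


if __name__ == "__main__":
    a = [[1, 2j], [0, 1], [1j, 1]]      # toy 3 x 2 matrix, not system data
    z_true = [1 + 2j, -0.5j]
    y = [sum(a[i][k] * z_true[k] for k in range(2)) for i in range(3)]
    print(reconstruct(pseudo_inverse(a), y))
```

Back substitution uses one divide per computed entry of R^-1, which matches the flavor of the "1 multiplier, 1 divide" resource note on the RINV slide, although the actual hardware schedule is not described in the slides.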