Mid Semester Presentation

SUB-NYQUIST SYSTEM OPTIMIZATION
MIDTERM PRESENTATION
10/06/2010

By: Daniel Barsky, Natalie Pistunovich
Supervisors: Rolf Hilgendorf, Inna Rivkin

AGENDA

- Overview
- Project objectives
- Hardware
- Introduction to CAD tools
- Detailed data flow through the system
- Points of possible improvement
- Hardware utilization summary
- Latencies summary
- Implementation
- Architecture
- Algorithm
- Selected improvement to be implemented
- Timeline
- Future plans
- Project status
- Gantt chart

PROJECT OBJECTIVES

- In the short run, optimize the algorithm to use minimal hardware, so that it fits on 2 FPGA chips while maintaining minimum latency.
- In the long run, determine an optimal architecture to be implemented on chip (ASIC).

HARDWARE

GIDEL PROCStar III card:
- 4 x Altera Stratix III FPGA
- Maximum 1 GB DDR DRAM

Altera Stratix-III EP3SE260:
- 255K Logic Elements
- 768 18x18 bit multipliers*
- Max. frequency: ~300 MHz

* In FIR mode

INTRODUCTION TO CAD TOOLS

- To get acquainted with the different CAD tools in use, we constructed a model design and ran it through the entire flow, until it was burned onto the card and run on it.

OVERVIEW – DATA FLOW (NORMAL MODE)

- Incoming samples arrive at 60 MHz.
- Expander: samples are filtered and decimated to 12 channels of 20 MHz and sent to the Memory and the Q-Frame. In iteration mode, samples are further filtered and decimated to 12 channels of 2 MHz each iteration and sent to the CTF.
- Memory: stores samples for later reconstruction, each with the appropriate support index.
- Q-Frame (CTF): collects 70 samples, calculates the Q-Frame and sends it to the OMP. In iteration mode, a Q-Frame is constructed and a support is calculated and accumulated for each iteration.
- SCD: checks for a significant change in the support; if one is detected, it initiates the calculation of a new support.
- OMP (CTF): calculates the support from the Q-Frame.
  Then it sends the support to the Pseudo-Inverse. Samples are also sent to the SCD to check for a change in the support.
- Pseudo-Inverse (DSP): recovers the columns of the support from matrix A, constructs their pseudoinverse and sends it to the Reconstruction.
- Reconstruction & Support Change Detection (DSP): the Reconstruction reconstructs data from the input samples using the pseudoinverse.

EXPANDER – NORMAL MODE

[Figure: expander block diagram, normal mode. The analog system + A/D delivers 12-bit inputs at 60 MHz, each spanning -30 MHz to 30 MHz. LPFs split every input into slices 1 to 3, each spanning -10 MHz to 10 MHz (20 MHz samples). Outputs: 12 samples of 20 MHz each to the CTF, 20 MHz samples to the Memory and to the DSP.]

- The expander (master) sends 12 20 MHz slices to the CTF (slave) each cycle.
- The expander sends new 20 MHz slices to the Memory and to the DSP each cycle.

EXPANDER – ITERATION MODE

[Figure: expander block diagram, iteration mode. Same first stage as in normal mode (60 MHz 12-bit inputs, LPFs, slices 1 to 3 of -10 MHz to 10 MHz), labelled 3x80x(20/180). A second LPF stage, labelled 10x40x(2/180), narrows each slice to -1 MHz to 1 MHz. Outputs: 12 samples of 2 MHz each to the CTF, 20 MHz samples to the Memory and to the DSP.]

- The expander (master) sends 12 2 MHz slices to the CTF (slave) each cycle.
- Once the CTF requests new slices, the expander switches and sends them.
- The expander sends new 20 MHz slices to the Memory and to the DSP each cycle.

EXPANDER - SOURCE UTILIZATION

[Formulas for the normal and iteration multiplier counts are garbled in the extracted text. Surviving terms: 4, 2, 3, 3, 80, 2, 20 and F_operating for normal mode; additionally 2, 1, 10, 40, 2 and F_operating for iteration mode.]

Multipliers    120 MHz    180 MHz
Normal         496        336
Iteration      672        459

EXPANDER - LATENCIES

- Normal: $1 + \frac{240}{3} = 81$ cycles @ 60 MHz
- Iterations: $\left(1 + \frac{240}{3}\right)$ cycles @ 60 MHz $+ \left(1 + \frac{400}{10}\right)$ cycles @ 20 MHz $= 204$ cycles @ 60 MHz

Cycles       Normal    Iterations
@ 60 MHz     81        204
@ 120 MHz    162       408
@ 180 MHz    243       612
usec         1.35      3.4

CTF – Q-FRAME CONSTRUCTION

- Constructs the Q-Frame for the support calculation and sends it to the OMP.

[Figure: Q-Frame block. Labels: Input Channels From Expander, Conversion To Complex, Vector Multiplier, Q-Frame Memory (Mem A 5 kbit, Mem B 5 kbit), Controller, Q-Frame entries To OMP (12x12x18 bit complex), Support Vector From OMP (12 bit), Support Accumulator, Support Length Vector To DSP, Support Indices To DSP. Bus widths shown: 3x2x18 bit, 12 bit, 4 bit, 7x12 bit.]

CTF – Q-FRAME – VECTOR MULTIPLIER

- Receives a vector of 12 18-bit complex samples from the Expander (Y[1..12]).
- Calculates $Q^{i}_{12 \times 12} = Y_{12 \times 1} \cdot Y^{H}_{1 \times 12}$ in 2 clock cycles.
- On the 1st cycle, calculates and stores the first 3 columns. Requires 33 18x18 complex multipliers.

[Figure: the Vector Multiplier multiplies $y_1, \dots, y_{12}$ by $y_1^H, y_2^H, y_3^H$, producing $Q_{1,1} \dots Q_{12,3}$, which are stored in columns 1 to 3 of the Memory Bank.]

- On the 2nd cycle, calculates and stores the last 9 columns. Requires 45 18x18 complex multipliers.

[Figure: the full 12x12 product of $y_1, \dots, y_{12}$ and $y_1^H, \dots, y_{12}^H$, with entries $Q_{1,1} \dots Q_{12,12}$ stored in columns 1 to 12 of the Memory Bank.]

CTF – Q-FRAME - SUMMARY

- Uses 45 18x18 bit complex multipliers (45 DSP Half-Blocks).
- Latency (number of samples per frame times the input period):
  - Normal mode: $70 \times \frac{1}{20\ \text{MHz}} = 3.5\ \mu\text{s}$
  - Iteration mode: $70 \times \frac{1}{2\ \text{MHz}} = 35\ \mu\text{s}$
- The latency is independent of the system clock frequency.

CTF – SUPPORT CALCULATION

- Calculates the signal's support from the Q-Frame using several iterations of the Orthogonal Matching Pursuit (OMP) algorithm.

[Figure: Support Calculation block. Labels: Q-Frame entries from Q-Frame (12x12x18 bit complex), OMP, Matrix Multiplier, A Matrix Memory, Support Merge, Support Vector to Q-Frame (12 bit).]

CTF – SUPPORT CALCULATION - OMP

Phase numbers are given as they appear on the slides.

- Initialization: the Q-Frame is loaded into the residual matrix. 1 cycle.
- Phase 1: Projection. 101 cycles, 144 18x18 complex multipliers. [Figure labels: Residual, $A^H$, A, Z.]
- Phase 2: Energy calculation, finding the maximum energy and updating the support. 101 cycles, 12 18x18 complex multipliers. [Figure labels: $Z_1 \dots Z_{101}$, $Z_1^H$, $|Z_1|^2$, Maximum Energy, Current Support.]
- Phase 4: Vector orthogonalization. The number of cycles depends on the iteration (2i cycles on the i-th iteration), 12 18x18 complex multipliers. [Figure labels: Previous Orthogonal Vectors, A, $W_j$, $V_{support}$, Current Support.]
- Phase 5: Vector normalization, $W_{support} = \frac{1}{\|V_{support}\|_2} V_{support}$. 2 cycles plus the square root calculation time, 12 18x18 complex multipliers.
- Phase 6: Residual matrix update. 14 cycles, 144 18x18 complex multipliers. [Figure labels: $W_{support}$, $W_{support}^H$, Residual.]
- Phase 6: Residual matrix energy calculation and stopping condition check. 13 cycles, 12 18x18 complex multipliers. The column energies and the overall residual energy are calculated, and the residual energy is compared against a threshold.

CTF – OMP - SUMMARY

- Uses 144 18x18 bit complex multipliers (144 DSP Half-Blocks).
- Latency:
  - Normal mode: ~1100 clock cycles (6.1 usec at 180 MHz, 9.1 usec at 120 MHz).
  - Iteration mode: ~2560 clock cycles per iteration (14.2 usec per iteration at 180 MHz, 21.3 usec per iteration at 120 MHz). This latency is hidden within the Q-Frame construction latency of the next iteration, which is 35 usec per iteration.

DSP – THE FLOW

[Figure: DSP data flow. Labels: Memory (samples Y), Expander (samples Y), CTF (Support), External memory (Matrix A), Pseudo Inverse, Support Change Detector, Reconstructed signal.]

DSP – QR DECOMPOSITION

- Inputs: the support and matrix A. The QR decomposition of $A_{m \times n}$ produces $Q_{m \times m}$.
- 261 cycles, 51 multipliers, 1 sqrt.

DSP – QR DECOMPOSITION

- $Q^T \times A = R$
- 12 cycles, 144 multipliers.

DSP - RINV

- Inverts the triangular matrix R to obtain $R^{-1}$.
- 156 cycles, 1 multiplier, 1 divide.

DSP – PSEUDO INVERSE

- $\begin{bmatrix} R^{-1} & 0 \end{bmatrix} \times Q^T = A^{\dagger}$
- 12 cycles, 144 multipliers.

DSP - RECONSTRUCTION

- $A^{\dagger} \times Y = Z$, with the samples Y read from the Memory.
- 1 cycle @ 20 MHz, 144 multipliers.

DSP – SUPPORT CHANGE DETECTOR

[Figure: the Support Change Detector receives $A^{\dagger}$ from the Pseudo Inverse, computes $A^{\dagger} \times Y = Z$ and raises "Support changed".]

DSP - SUMMARY

Multipliers: 144

             Cycles    usec
@ 120 MHz    441       3.68
@ 180 MHz    441       2.45

THE SYSTEM TODAY – STATUS

FPGA utilization:

FPGA 1    FPGA 2    FPGA 3
73%       98%       75%

[Figure: current system. Labels: Memory, Expander, Controller, CTF (Q-Frame, OMP), DSP ($A^{\dagger}$).]

Timeline, from a new incoming sample to a sample ready for reconstruction:

Expander delay                                   1.3 usec
Q-Frame delay                                    3.5 usec
OMP delay                                        6 usec
Pseudo-Inverse delay                             2.4 usec
Reconstruction delay, Support Change Detector    (no value shown)

POSSIBLE IMMEDIATE IMPROVEMENTS

- Use a Matrix Multiplication Unit.
- Extend the Q-Frame calculation.
- Perform the reconstruction using the Matrix Multiplication Unit.

[Figure: Memory, Expander, Controller, CTF (Q-Frame, OMP), DSP (1 divide, Support Change Detector).]

SYSTEM ARCHITECTURE – OUR SUGGESTION

- FPGA#1 – Expander: Expander, Controller.
- FPGA#2 – Matrix Multiplication: Matrix multiplication unit, CTF (Q-Frame, OMP), DSP (Pseudo-Inverse, Reconstruction & Support Change Detection).
- Memory.

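
The cycle counts and latencies quoted in the Expander, Q-Frame and DSP summaries above can be cross-checked with a few lines of arithmetic. The following Python sketch is only an illustration: the helper name `cycles_to_usec` and the variable names are assumptions introduced here, and the numbers are the ones quoted on the slides.

```python
# Cross-check of the latency figures quoted on the slides.
# All names below are illustrative; they are not taken from the project.


def cycles_to_usec(cycles: float, clock_mhz: float) -> float:
    """Convert a cycle count at a clock given in MHz to microseconds."""
    return cycles / clock_mhz


# Expander, normal mode: 1 + 240/3 cycles at 60 MHz.
expander_normal = 1 + 240 // 3

# Expander, iteration mode: the normal-mode cycles plus 1 + 400/10 cycles
# at 20 MHz; one 20 MHz cycle lasts three 60 MHz cycles.
expander_iteration = expander_normal + (1 + 400 // 10) * (60 // 20)

# Q-Frame: 70 samples per frame divided by the input rate in MHz.
qframe_normal_usec = 70 / 20.0
qframe_iteration_usec = 70 / 2.0

# DSP: QR decomposition, Q^T x A, R inversion and the pseudo-inverse product.
dsp_cycles = 261 + 12 + 156 + 12

if __name__ == "__main__":
    print("Expander normal   :", expander_normal, "cycles @ 60 MHz,",
          cycles_to_usec(expander_normal, 60), "usec")
    print("Expander iteration:", expander_iteration, "cycles @ 60 MHz,",
          cycles_to_usec(expander_iteration, 60), "usec")
    print("Q-Frame           :", qframe_normal_usec, "usec (normal),",
          qframe_iteration_usec, "usec (iteration)")
    for clock in (120, 180):
        print("DSP @", clock, "MHz      :", dsp_cycles, "cycles,",
              round(cycles_to_usec(dsp_cycles, clock), 3), "usec")
```

Because a cycle count divided by a clock in MHz gives microseconds directly, the same helper can be reused for the OMP figures (~1100 and ~2560 cycles) at either clock.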
PLANS FOR THE FUTURE

- Consider rank-1 updates for a change in the support.
- Consider changing the QR decomposition algorithm in the DSP: Householder -> modified Gram-Schmidt.
- Consider another decomposition: QR -> LQ, SVD, etc.
- Consider another MP algorithm: OMP -> BMP, convex optimization, etc.

PROJECT STATUS

- DONE: System Analysis
  - Studying the system's algorithm
  - Understanding algorithm implementation
  - Analyzing hardware usage & latency
  - Locating points of possible optimization
- PRESENT: Current System Simulation
  - Creating entire current system test environment
  - Simulating entire current system
- FUTURE: System Optimization
  - Selecting optimizations to be implemented
  - Implementing optimizations
  - Simulating optimized system

GANTT CHART

[Gantt chart; time axis from 14/02/2010 to 24/07/2010 in 20-day steps.]

- Learning the system
- Understanding EXP algorithm
- Understanding CTF algorithm
- Understanding DSP algorithm
- Understanding SCD algorithm
- Understanding algorithm for the whole system
- Preparing characterization presentation
- Understanding EXP implementation
- Understanding CTF implementation
- Understanding DSP implementation
- Understanding SCD implementation
- Understanding implementation for the whole system
- Mapping out the data flow through the system
- Learning CAD systems
- Characterizing optimal architecture outline
- Midterms
- Preparing midterm presentation
- Presentation for Yonina
- Constructing full system logical test environment
- Full system logic simulation
- Exam period
- Preparing final A presentation
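
The modified Gram-Schmidt option named in the future plans, combined with the R inversion and pseudo-inverse steps of the DSP chain, can be sketched in software as follows. This is a minimal floating-point illustration in plain Python, assuming a matrix with full column rank; it uses the thin form $A^{\dagger} = R^{-1} Q^H$ and says nothing about the 18-bit fixed-point hardware. The function names `mgs_qr` and `pseudo_inverse` are hypothetical.

```python
# Pseudo-inverse through a modified Gram-Schmidt QR decomposition.
# Illustrative sketch only: names and data layout are assumptions.


def mgs_qr(a):
    """Thin QR of an m x n matrix (list of rows) with full column rank.

    Returns (q_cols, r): q_cols holds the n orthonormal columns of Q,
    r is the n x n upper-triangular factor.
    """
    m, n = len(a), len(a[0])
    v = [[complex(a[i][j]) for i in range(m)] for j in range(n)]  # columns
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for j in range(n):
        norm = sum(abs(x) ** 2 for x in v[j]) ** 0.5
        r[j][j] = norm
        qj = [x / norm for x in v[j]]
        q_cols.append(qj)
        # Modified Gram-Schmidt: remove the q_j component from every
        # remaining column immediately, using the already updated columns.
        for k in range(j + 1, n):
            r[j][k] = sum(x.conjugate() * y for x, y in zip(qj, v[k]))
            v[k] = [y - r[j][k] * x for x, y in zip(qj, v[k])]
    return q_cols, r


def pseudo_inverse(a):
    """Return the n x m pseudo-inverse by solving R X = Q^H."""
    m, n = len(a), len(a[0])
    q_cols, r = mgs_qr(a)
    qh = [[x.conjugate() for x in col] for col in q_cols]  # n x m
    x = [[0j] * m for _ in range(n)]
    # Back substitution, from the last row of R upwards.
    for i in range(n - 1, -1, -1):
        for c in range(m):
            acc = qh[i][c] - sum(r[i][k] * x[k][c] for k in range(i + 1, n))
            x[i][c] = acc / r[i][i]
    return x


if __name__ == "__main__":
    a = [[1, 0], [1, 1], [0, 1]]
    p = pseudo_inverse(a)
    # For a full-column-rank matrix, pinv(A) x A is the identity.
    for i in range(2):
        print([round(abs(sum(p[i][k] * a[k][j] for k in range(3))), 6)
               for j in range(2)])
```

Back substitution is used instead of forming $R^{-1}$ explicitly, which keeps the sketch short; a hardware design along the lines of the DSP slides would compute $R^{-1}$ as a separate step.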
