Mid Semester Presentation

10/06/2010
MIDTERM PRESENTATION
By:
Supervisors:
Daniel Barsky
Rolf Hilgendorf
Natalie Pistunovich
Inna Rivkin
SUB-NYQUIST
SYSTEM OPTIMIZATION
AGENDA
Overview
  Project objectives
  Hardware
  Introduction to CAD tools
Detailed Data Flow through the system
  Hardware utilization summary
  Latencies summary
Points of possible improvement:
  Implementation
  Architecture
  Algorithm
Selected improvement to be implemented
Timeline
  Future plans
  Project status
  Gantt Chart
PROJECT OBJECTIVES
In the short run, optimize the algorithm to use minimal hardware, in order to fit on 2 FPGA chips, while maintaining minimum latency
In the long run, determine an optimal architecture to be implemented on chip (ASIC)

HARDWARE
GIDEL PROCStar III card
  4 x Altera Stratix III FPGA
  1 GB DDR DRAM
Altera Stratix-III EP3SE260
  255K Logic Elements
  Maximum 768 18x18 bit multipliers*
  Max. frequency - ~300MHz
* In FIR mode
INTRODUCTION TO CAD TOOLS

In order to get acquainted with the different CAD tools in use, we have constructed a model design and run it through the entire process, until it is burned and run on the card
OVERVIEW – DATA FLOW (NORMAL MODE)
Incoming samples at 60MHz
Expander: Samples are filtered and decimated to 12 channels of 20MHz, then sent to Memory & Q-Frame. In Iteration Mode, samples are further filtered and decimated to 12 channels of 2MHz each iteration & sent to the CTF.
Memory: Stores samples for later reconstruction, each with the appropriate support index.
CTF, Q-Frame: Collects 70 samples, calculates the Q-Frame and sends it to OMP. In iteration mode, a Q-Frame is constructed, and a support is calculated and accumulated for each iteration.
CTF, OMP: Calculates the support from the Q-Frame, then sends it to the Pseudo Inverse.
DSP, Pseudo Inverse: Recovers the columns of the support from matrix A, constructs their pseudoinverse & sends it to the Reconstruction.
DSP, Reconstruction & Support Change Detection: Reconstruction reconstructs data from input samples using the pseudoinverse. Samples are also sent to the SCD to check for a change in the support. The SCD checks for a significant change in the support; if one is detected, it initiates calculating a new one.
EXPANDER –NORMAL MODE
[Block diagram: the Analog System + A/D delivers 12 bit samples at 60 MHz; LPF stages split the -30 MHz to 30 MHz band into sub-bands of -10 MHz to 10 MHz. Outputs: 12 samples of 20MHz each to the CTF, and the 20 MHz sample to the Memory and the DSP.]
The expander (master) sends 12 20MHz slices to the CTF (slave) each cycle
The expander sends new 20MHz slices to the Memory and to the DSP each cycle
EXPANDER – ITERATION MODE
[Block diagram: the Analog System + A/D delivers 12 bit samples at 60 MHz; LPF stages split the -30 MHz to 30 MHz band into sub-bands of -10 MHz to 10 MHz, and further LPF stages narrow these to -1 MHz to 1 MHz. Diagram annotations: 3x80x(20/180) and 10x40x(2/180). Outputs: 12 samples of 2MHz each to the CTF (with a "new" request line), and the 20 MHz sample to the Memory and the DSP.]
The expander (master) sends 12 2MHz slices to the CTF (slave) each cycle
Once the CTF requests for new, the expander changes and sends
The expander sends new 20MHz slices to the Memory and to the DSP each cycle
EXPANDER - SOURCE UTILIZATION
Normal: multiplier count as a function of $F_{operating}$, built from the factors 4, 2, 3, 3, 80, $20 / F_{operating}$ and 2
Iterations: the same factors as Normal mode, together with the factors 2, 1, 10, 40 and $2 / F_{operating}$
Multipliers   120 MHz   180 MHz
Normal        496       336
Iteration     672       459
EXPANDER - LATENCIES

Normal: $1 + \frac{240}{3} = 81$ cycles @ 60MHz
Iterations: $\underbrace{\left(1 + \frac{240}{3}\right)}_{@\,60\,\text{MHz}} + \underbrace{\left(1 + \frac{400}{10}\right)}_{@\,20\,\text{MHz}} = 204$ cycles @ 60 MHz
Cycles      Normal   Iterations
@60MHz      81       204
@120MHz     162      408
@180MHz     243      612
Time (us)   1.35     3.4
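The latency figures above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the slide's terms mean one 81-cycle stage at 60 MHz (1 + 240/3) and, in iteration mode, an additional 41-cycle stage at 20 MHz (1 + 400/10), each 20 MHz cycle counting as three 60 MHz cycles.

```python
def expander_latency_cycles(mode: str) -> int:
    """Expander latency in 60 MHz cycles (assumed reading of the slide)."""
    first_stage = 1 + 240 // 3               # 81 cycles at 60 MHz
    if mode == "normal":
        return first_stage
    if mode == "iteration":
        second_stage = 1 + 400 // 10         # 41 cycles at 20 MHz
        return first_stage + second_stage * (60 // 20)  # rescaled to 60 MHz cycles
    raise ValueError(f"unknown mode: {mode}")


def scale_cycles(cycles_at_60mhz: int, clock_mhz: int) -> int:
    """Same wall-clock time expressed in cycles of a faster system clock."""
    return cycles_at_60mhz * clock_mhz // 60


def latency_us(cycles_at_60mhz: int) -> float:
    """Wall-clock latency in microseconds (cycles divided by MHz)."""
    return cycles_at_60mhz / 60.0
```

The wall-clock latency does not change with the system clock here; only the cycle count scales, which is why the table lists a single time row.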
CTF – Q-FRAME CONSTRUCTION
Constructs the Q-Frame for the support
calculation, and sends it to the Q-Frame Block
[Block diagram: Input Channels From Expander (3x2x18 bit) -> Conversion To Complex -> Vector Multiplier -> Q-Frame Memory (Mem A and Mem B, 5kbit each; 12x12x18 bit complex) -> Q-Frame entries To OMP. A Controller coordinates the blocks. The Support Vector From OMP (12 bit) enters the Support Accumulator, which outputs the Support Length and the Support Indices Vector To DSP. Remaining bus labels: 4 bit, 7x12 bit.]
CTF – Q-FRAME – VECTOR MULTIPLIER


Receives a vector of 12 18-bit complex samples from the Expander (Y[1..12])
Calculates $Q^{i}_{12 \times 12} = Y_{12 \times 1} \cdot Y^{H}_{1 \times 12}$ in 2 clock cycles
[Diagram: Vector Multiplier taking $Y_1, \dots, Y_{12}$ and $Y_1^{H}, \dots, Y_{12}^{H}$]
CTF – Q-FRAME – VECTOR MULTIPLIER


On the 1st cycle, calculates and stores the first 3 columns
Requires: 33 18x18 bit Complex Multipliers
[Diagram: the Vector Multiplier combines $y_1, \dots, y_{12}$ with $y_1^{H}, y_2^{H}, y_3^{H}$ to produce $Q_{1,1}, \dots, Q_{12,3}$, which are stored in Columns 1 to 3 of the Memory Bank (Columns 1 to 12)]
CTF – Q-FRAME – VECTOR MULTIPLIER


On the 2nd cycle, calculates and stores the last 9 columns
Requires: 45 18x18 bit Complex Multipliers
[Diagram: the Vector Multiplier combines $y_1, \dots, y_{12}$ with $y_1^{H}, \dots, y_{12}^{H}$ to produce the entries $Q_{i,j}$ of the 12x12 matrix; the Memory Bank holds Columns 1 to 12]
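One way to read the multiplier counts on these two slides: 33 = 12 + 11 + 10 and 45 = 9 + 8 + ... + 1, which is what results if only the entries on or below the diagonal are multiplied and the rest are filled in by conjugate symmetry. The slides do not state this, so the sketch below treats it as an assumption; the function name and the list-of-lists layout are illustrative only.

```python
def outer_product_two_passes(y):
    """Build Q = y * y^H in two passes, multiplying only entries with row >= column.

    Returns the full matrix and the number of complex products used per pass.
    """
    n = len(y)
    q = [[0j] * n for _ in range(n)]
    products_per_pass = []
    # Pass 1 covers the first 3 columns, pass 2 the remaining ones.
    for columns in (range(0, 3), range(3, n)):
        count = 0
        for j in columns:
            for i in range(j, n):
                q[i][j] = y[i] * y[j].conjugate()   # one complex multiplication
                q[j][i] = q[i][j].conjugate()       # mirrored entry, no multiplier
                count += 1
        products_per_pass.append(count)
    return q, products_per_pass
```

For a 12-element input the two passes use 33 and 45 products, so a design sized for the larger pass needs 45 complex multipliers.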
CTF – Q-FRAME - SUMMARY
Uses 45 18x18 bit Complex Multipliers (45 DSP Half-Blocks)
Latency:
  Normal Mode: $70 \text{ (samples per frame)} \times \frac{1}{20\,\text{MHz}} \text{ (input rate)} = 3.5\ \mu\text{sec}$
  Iteration Mode: $70 \text{ (samples per frame)} \times \frac{1}{2\,\text{MHz}} \text{ (input rate)} = 35\ \mu\text{sec}$
  Independent of system clock frequency!
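The frame latency quoted here depends only on how long it takes 70 samples to arrive. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
SAMPLES_PER_FRAME = 70  # number of samples per Q-Frame, from the slide


def q_frame_latency_us(input_rate_mhz: float) -> float:
    """Time to collect one frame in microseconds (samples / MHz = us).

    The system clock does not appear: the frame is complete only when the
    last sample has arrived, however fast the multipliers run.
    """
    return SAMPLES_PER_FRAME / input_rate_mhz


normal_us = q_frame_latency_us(20.0)     # normal mode, 20 MHz input rate
iteration_us = q_frame_latency_us(2.0)   # iteration mode, 2 MHz input rate
```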
CTF – SUPPORT CALCULATION

Calculates the signal's support from the Q-Frame using the Orthogonal Matching Pursuit algorithm, over several iterations
[Block diagram: Q-Frame entries from Q-Frame (12x12x18 bit complex) enter the Support Calculation block, which contains the OMP, the Matrix Multiplier, the Support Merge and the A Matrix Memory; the Support Vector (12 bit) is returned to Q-Frame]
CTF – SUPPORT CALCULATION - OMP

Initialization: the Q-Frame is loaded into the residual matrix
  1 cycle
[Diagram: Q-Frame -> Residual Matrix]
CTF – SUPPORT CALCULATION - OMP

Phase 1: Projection
  101 cycles
  144 18x18 Complex multipliers
[Diagram: $Z = A^{H} \times \text{Residual}$]
CTF – SUPPORT CALCULATION - OMP

Phase 2: Energy Calculation, Find Maximum Energy & Update Support
  101 cycles
  12 18x18 Complex multipliers
[Diagram: for each row $Z_i$ of $Z$ ($i = 1, \dots, 101$) the energy $|Z_i|^2 = Z_i Z_i^{H}$ is calculated; the index with Maximum Energy updates the Current Support]
CTF – SUPPORT CALCULATION - OMP

Phase 4: Vector Orthogonalization
  Number of cycles depends on the iteration (2i cycles on the i-th iteration)
  12 18x18 Complex Multipliers
[Diagram: the column of A selected by the Current Support is orthogonalized against the Previous Orthogonal Vectors $W_j$, giving $V_{support}$]
CTF – SUPPORT CALCULATION - OMP

Phase 5: Vector Normalization
  2 cycles + (square root calculation time)
  12 18x18 Complex Multipliers
[Diagram: $W_{support} = \frac{1}{\sqrt{V_{support}^{H} V_{support}}} \, V_{support}$, added to the Previous Orthogonal Vectors]
CTF – SUPPORT CALCULATION - OMP

Phase 6: Residual Matrix Update
  14 cycles
  144 18x18 Complex Multipliers
[Diagram: the Residual is updated using $W_{support}$ and $W_{support}^{H}$]
CTF – SUPPORT CALCULATION - OMP

Phase 7: Residual Matrix Energy Calculation & Stopping Condition Check
  13 cycles
  12 18x18 Complex Multipliers
[Diagram: Calculate Column Energy -> Calculate Overall Energy -> compare the Residual energy against the threshold]
CTF – OMP - SUMMARY


Uses 144 18x18 bit Complex Multipliers (144 DSP Half-Blocks)
Latency:
  Normal Mode: ~1100 Clock Cycles
    6.1 usec at 180MHz
    9.1 usec at 120MHz
  Iteration Mode: ~2560 Clock Cycles per iteration
    14.2 usec per iteration at 180MHz
    21.3 usec per iteration at 120MHz
    (latency is contained in Q-Frame construction latency for the next iteration, which is 35 usec per iteration)
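The phases on the OMP slides (projection, energy and maximum search, orthogonalization, normalization, residual update, stopping check) can be sketched as a plain software loop. This is only an illustration under stated assumptions: the function names, the list-of-columns representation, `threshold` and `max_iters` are hypothetical, the orthogonalization uses modified Gram-Schmidt, the selected columns are assumed linearly independent, and `max_iters` must not exceed the number of columns of A. It does not model the hardware data path, the 18-bit fixed-point arithmetic or the cycle counts.

```python
import math


def conj_dot(u, v):
    """Inner product u^H v of two complex vectors."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def omp_support(a_cols, q_frame_cols, threshold, max_iters):
    """Return the sorted support found by a simultaneous-OMP style loop.

    a_cols: columns of the matrix A; q_frame_cols: columns of the Q-Frame.
    """
    residual = [col[:] for col in q_frame_cols]   # initialization: residual <- Q-Frame
    support, basis = [], []
    for _ in range(max_iters):
        # Projection and energy: energy of each row of Z = A^H * residual.
        energies = [sum(abs(conj_dot(a, r)) ** 2 for r in residual) for a in a_cols]
        candidates = (k for k in range(len(a_cols)) if k not in support)
        best = max(candidates, key=energies.__getitem__)
        support.append(best)                       # update support
        # Orthogonalization against the previous orthogonal vectors W_j.
        v = a_cols[best][:]
        for w in basis:
            c = conj_dot(w, v)
            v = [vi - c * wi for vi, wi in zip(v, w)]
        # Normalization: W = V / ||V||_2.
        norm = math.sqrt(conj_dot(v, v).real)
        w_new = [vi / norm for vi in v]
        basis.append(w_new)
        # Residual update: remove the component along W from every column.
        for r in residual:
            c = conj_dot(w_new, r)
            r[:] = [ri - c * wi for ri, wi in zip(r, w_new)]
        # Stopping condition: overall residual energy against the threshold.
        if sum(abs(x) ** 2 for r in residual for x in r) <= threshold:
            break
    return sorted(support)
```

The residual is stored column by column so that the update step touches each column independently, which loosely mirrors the slides' split into column energy and overall energy.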

DSP – THE FLOW
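[Block diagram: the DSP receives Samples Y from the Memory and from the Expander, the Support from the CTF, and Matrix A from External memory; it contains the Pseudo Inverse and the Support Change detector, and outputs the Reconstructed signal]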
DSP – QR DECOMPOSITION
[Diagram: the Support selects columns of Matrix A to form $A_{m \times n}$, which the QR Decomposition turns into $Q_{m \times m}$]
261 cycles
51 multipliers, 1 sqrt
DSP – QR DECOMPOSITION
$Q^{T} \times A = R$
12 cycles
144 multipliers
DSP - RINV
$R^{-1}$: inverse of the upper-triangular matrix R (zeros below the diagonal)
156 cycles
1 multiplier, 1 divide
DSP – PSEUDO INVERSE
$\begin{bmatrix} R^{-1} & 0 \end{bmatrix} \times Q^{T} = A^{\dagger}$
12 cycles
144 multipliers
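The chain on the DSP slides (QR decomposition, inversion of the triangular factor, multiplication by the transposed Q) can be sketched in plain Python. The sketch makes several assumptions not taken from the slides: it uses the thin form $A^{\dagger} = R^{-1} Q^{H}$, which is equivalent to padding $R^{-1}$ with zeros and using the full Q; it uses modified Gram-Schmidt for the QR step, since the slides do not name the algorithm used in the DSP; and all function names are hypothetical. It assumes A has full column rank.

```python
import math


def qr_gram_schmidt(cols):
    """Thin QR of a matrix given as a list of columns (modified Gram-Schmidt)."""
    n = len(cols)
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for j, a in enumerate(cols):
        v = list(a)
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qi.conjugate() * vi for qi, vi in zip(q, v))
            v = [vi - r[i][j] * qi for vi, qi in zip(v, q)]
        r[j][j] = math.sqrt(sum(abs(vi) ** 2 for vi in v))
        q_cols.append([vi / r[j][j] for vi in v])
    return q_cols, r


def invert_upper_triangular(r):
    """Invert an upper-triangular matrix by back-substitution, column by column."""
    n = len(r)
    inv = [[0j] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1 / r[j][j]
        for i in range(j - 1, -1, -1):
            s = sum(r[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = -s / r[i][i]
    return inv


def pseudo_inverse(cols):
    """Return A_dagger = R^-1 * Q^H as a list of rows."""
    q_cols, r = qr_gram_schmidt(cols)
    r_inv = invert_upper_triangular(r)
    m, n = len(cols[0]), len(cols)
    return [[sum(r_inv[i][k] * q_cols[k][row].conjugate() for k in range(n))
             for row in range(m)]
            for i in range(n)]
```

Multiplying the result by the original matrix should give the identity, which is a convenient sanity check for any alternative decomposition.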
DSP - RECONSTRUCTION
$A^{\dagger} \times Y = Z$, where Y are the samples read from the Memory
1 cycle @ 20MHz
144 multipliers
DSP – SUPPORT CHANGE DETECTOR
[Diagram: inside the DSP, the Support Change detector works on $A^{\dagger} \times Y = Z$ and signals "Support changed" to the Pseudo Inverse]
DSP - SUMMARY
           Multipliers   Cycles   us
@120MHz    144           441      3.6
@180MHz                  441      2.45
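The 441-cycle total appears to be the sum of the four pseudo-inverse stages on the preceding DSP slides. Reconstruction, which runs at 20MHz, is assumed to be excluded. Under that assumption:

```latex
\begin{align*}
N_{\text{DSP}} &= \underbrace{261}_{\text{QR}} + \underbrace{12}_{Q^{T} A = R}
                + \underbrace{156}_{R^{-1}} + \underbrace{12}_{A^{\dagger}} = 441 \text{ cycles} \\
t_{120} &= \frac{441}{120\,\text{MHz}} = 3.675\,\mu\text{s}, \qquad
t_{180} = \frac{441}{180\,\text{MHz}} = 2.45\,\mu\text{s}
\end{align*}
```

The 3.6 us entry for 120MHz matches this value truncated to one decimal place.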
THE SYSTEM TODAY – STATUS
FPGA utilization: FPGA 1: 73%, FPGA 2: 98%, FPGA 3: 75%
Blocks: Memory, Expander, CTF (Q-Frame, OMP), DSP ($A^{\dagger}$, Support Change Detector), Controller
Timeline:
  New Incoming Sample
  Expander Delay: 1.3usec
  Q-Frame Delay: 3.5usec
  OMP Delay: 6usec
  Pseudo-Inverse Delay: 2.4usec
  Reconstruction Delay
  Sample ready for reconstruction
POSSIBLE IMMEDIATE IMPROVEMENTS



Use Matrix Multiplication Unit
Extend Q-Frame Calculation
Reconstruction using Matrix Multiplication Unit
[Block diagram: Memory, Expander, CTF (Q-Frame, OMP), DSP (Support Change Detector), Controller; annotation: 1 divide]
SYSTEM ARCHITECTURE – OUR SUGGESTION
Memory
FPGA#1 – Expander: Expander
FPGA#2 - Matrix Multiplication: Controller, Matrix multiplication unit, CTF (Q-Frame, OMP), DSP (Pseudo-Inverse, Reconstruction & Support Change Detection)
PLANS FOR THE FUTURE
Consider rank-1 updates for a change in the support
Consider changing the QR decomposition algorithm in the DSP:
  Householder
  modified Gram-Schmidt
Consider another decomposition in place of QR:
  LQ, SVD, etc.
Consider another MP algorithm in place of OMP:
  BMP, Convex Optimization, etc.
PROJECT STATUS

System Analysis: DONE
  Studying the system's algorithm
  Understanding algorithm implementation
  Analyzing hardware usage & latency
  Locating points of possible optimization
Current System Simulation: PRESENT
  Creating Entire Current System test environment
  Simulating entire current system
System Optimization: FUTURE
  Selecting optimizations to be implemented
  Implementing optimizations
  Simulating optimized system
GANTT CHART
14/02/2010 06/03/2010 26/03/2010 15/04/2010 05/05/2010 25/05/2010 14/06/2010 04/07/2010 24/07/2010
Learning the system
Understanding EXP algorithm
Understanding CTF algorithm
Understanding DSP algorithm
Understanding SCD algorithm
Understanding algorithm for the whole system
Preparing characterization presentation
Understanding EXP implementation
Understanding CTF implementation
Understanding DSP implementation
Understanding SCD implementation
Understanding implementation for the whole system
Mapping out the data flow through the system
Learning CAD systems
Characterizing optimal architecture outline
Midterms
Preparing midterm presentation
Presentation for Yonina
Constructing full system logical test environment
Full system logic simulation
Exam period
Preparing final A presentation