ppt - University of Michigan

advertisement
A System Solution for HighPerformance, Low Power SDR
Yuan Lin1, Hyunseok Lee1, Yoav Harel1, Mark Woh1,
Scott Mahlke1, Trevor Mudge1 and Krisztian Flautner2
1Advanced
Computer Architecture Laboratory
University of Michigan
2ARM, Ltd.
SDR Design Challenges:

Hardware design challenges




High computational throughput (~40 Gops)
Low power consumption (~200mW)
Meet real-time requirements
DSP programming support


System-level development
 Inter-algorithm communication
Algorithm-level development
 Efficient DSP representations
2
Advanced Computer Architecture Laboratory
University of Michigan
SDR Benchmark
Design & Analysis
W-CDMA Protocol: 2Mbps
Transmitter
modulator
LPF-Tx
scrambler
spreader
Interleaver
Channel
encoder
U
p
p
e
r
la
y
e
rs
F
ro
n
te
n
d
demodulator
Receiver
searcher
c
o
m
b
in
e
r
LPF-Rx
descrambler
despreader
..
.
descrambler
deinteleaver
Channel
decoder
(turbo/viterbi)
despreader
4
Advanced Computer Architecture Laboratory
University of Michigan
W-CDMA Characteristics






Plenty of vector parallelism
8 & 16-bit DSP algorithms
Multiplication is not dominant
No floating-point operation
Small instruction/data memory
Has periodic real-time tasks
5
Advanced Computer Architecture Laboratory
University of Michigan
802.11a Protocol: 24Mbps
Transmitter
modulator
LPF-Tx
Preamble
Insertion
IFFT
QAM
Mapping
Interleaver
Channel
encoder
U
p
p
e
r
la
y
e
rs
F
ro
n
te
n
d
demodulator
FFT
Frequency
Equalization
Receiver
QAM
demapping
deinteleaver
LPF-Rx
Frequency, Time
Synchronization
Channel
Estimation
Channel
decoder
(viterbi)
6
Advanced Computer Architecture Laboratory
University of Michigan
802.11a Characteristics

Similar to W-CDMA




Plenty of vector parallelism
No floating-point operation
Small instruction/data memory
Different from W-CDMA



Mostly 16-bit DSP algorithms
Multiplication is more dominant
No periodic real-time tasks
7
Advanced Computer Architecture Laboratory
University of Michigan
SDR Processor
Architecture Design
System Architecture Design Tradeoffs
PE
alu alu alu alu alu alu alu alu
Uniprocessor
alu alu alu alu alu alu alu alu
alu alu alu alu
PE
PE
alu alu alu alu
alu alu alu alu
alu alu alu alu
Inter-processor
Network
alu alu alu alu
PE
PE
alu alu alu alu
alu alu alu alu
alu alu alu alu
Coarse grain
processors
Amortized fetch
Inter-processor
Communication
Fine grain
processors
Intra-processor
Communication
PE PE PE PE PE PE PE PE
alu
alu
alu
alu
alu
alu
alu
alu
Inter-processor
Network
PE PE PE PE PE PE PE PE
alu
alu
alu
alu
alu
alu
alu
alu
9
Advanced Computer Architecture Laboratory
University of Michigan
System Architecture Design Tradeoffs
Number of Processing Elements x SIMD width
For W-CDMA 2Mbps (51.2GOP/sec) 90nm 1V @400MHz
300
4x32
250
2x64
1x128
1x256
(50% Utilization)
8x16
MOPS/mW
200
150
100
Coarse grain processors
Fine grain processors
PE
PE
PE
PE alu
PEaluPEaluPEaluPE
alu alu
alu PE
alu PE
alu alu alu alu Uniprocessor
alu alu alu alu
Inter-processor
PE
Inter-processor
Network
16x8
32x4
alu alu alu alu alu alu alu alu
Network
PE
PE
alu
alu
alu
alu alu
alu
alualu
alualu
alu
PE
PE alu
PEalu
PEalu
PE
PE
PEaluPE
64x2
alu
alu
alu
alu
alu
alu
alu
alu
alu
50
128x1
Intra-processor shuffle network
critical delay path
0
ALU dominated
0
critical delay path
50
100
150
200
250
critical
delay
path
300
SIMD width (arithmetic units)
10
Advanced Computer Architecture Laboratory
University of Michigan
System Architecture Design
Controller
ALU
MEM
MEM
Memory
Global
Memory
Global
Memory
SoC
Interface
SoC
Interface
SoC
Interface
Interconnect
SoC
Interface
SoC
Interface
PE
SoC
Interface
PE
PE
Scalar
Memory
SIMD
Memory
Scalar
Memory
SIMD
Memory
Scalar
Memory
SIMD
Memory
Scalar
RF
SIMD RF
Scalar
RF
SIMD RF
Scalar
RF
SIMD RF
Scalar
ALU
SIMD ALU
Scalar
ALU
SIMD ALU
Scalar
ALU
SIMD ALU
Scalar
Pipeline
SIMD
Pipeline
Scalar
Pipeline
SIMD
Pipeline
Scalar
Pipeline
SIMD
Pipeline
12
Advanced Computer Architecture Laboratory
University of Michigan
PE Design
(Area < 1mm2 Power<50mW)
PE
SIMD Unit
SIMD
Memory
4 KB
32x8bit
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
Data
Shuffle
Network
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
16bit
16x16bit
RegFile
16bit
16bit
16bit ALU
DMA
Scalar
Memory
4KB
16bit
Scalar Unit
13
Advanced Computer Architecture Laboratory
University of Michigan
Mapping DSP Algorithms: Filters
z-1
In
b
Out
spread Vin, Sin
shift z, z, up
mac z, Vin, Sin
z-1
In
b3
b2
Z-1
b1
Z-1
b0
Z-1
Out
14
Advanced Computer Architecture Laboratory
University of Michigan
Mapping DSP Algorithm: Filter
PE
SIMD Unit
SIMD
Memory
4 KB
spread Vin, Sin
shift z, z, up
mac z, Vin, Sin
32x8bit
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
Data
Shuffle
Network
8bit
16x8bit
RegFile
8bit
8bit
8bit ALU
16bit
16x16bit
RegFile
16bit
16bit
16bit ALU
DMA
Scalar
Memory
4KB
16bit
Scalar Unit
15
Advanced Computer Architecture Laboratory
University of Michigan
Efficient Design






Wide SIMD width
Small register file with minimum ports
Small memories
Narrow system BUS
Data-path optimized for 8bits
Vector shuffle reduce memory ports
16
Advanced Computer Architecture Laboratory
University of Michigan
Transmitter
modulator
LPF-Tx
LPF-Tx
scrambler
scrambler
spreader
spreader
Channel
Channel
encoder
encoder
Interleaver
Interleaver
U
p
p
e
r
la
y
e
rs
F
ro
n
te
n
d
demodulator
Receiver
searcher
searcher
descrambler
descrambler
PE
PN Code
Code
PN
TX/RX
TX/RX
23Mops
Mops
23
Misc. Control
Control
Misc.
Mops
11Mops
Buffer
Buffer
(10
Bytes)
(10 Bytes)
Power
Power
Control
Control
15Kops
Kops
15
Deinterleaver
Deinterleaver
16 Mops
16
despreader
..
.
descrambler
descrambler
ARM
c
o
m
b
in
e
r
LPF-Rx
LPF-Rx
despreader
PE
Buffer
Buffer
(1280 Bytes)
(1280
Bytes)
LPF-Rx
22LPF-Rx
182Mops
Mops
182
Channel
Channel
decoder
decoder
(turbo/viterbi)
(turbo/viterbi)
deinteleaver
deinteleaver
PE
Buffer
Buffer
(2560
Bytes)
(2560 Bytes)
Searcher
Searcher
200Mops
Mops
200
PE
Buffer
Buffer
(1024
Bytes)
(1024 Bytes)
Turbo Encoder
Turbo Encoder
2 Mops
2 Mops
Interleaver
Interleaver
2 Mops
2 Mops
Spreader
Spreader
4.7 Mops
4.7 Mops
Scrambler
Scrambler
8.6 Mops
8.6 Mops
4
LPF-Rx
4 LPF-Rx
307
Mops
307 Mops
Buffer
Buffer
(1024
Bytes)
(1024 Bytes)
Turbo Decoder
Decoder
Turbo
324 Mops
Mops
324
Buffer
Buffer
(1360
Bytes)
(1360 Bytes)
Descrambler
Descrambler
22.5Mops
Mops
22.5
Despreader
Despreader
11.3Mops
Mops
11.3
Global
Memory
FIFOQueue
Queue
FIFO
(12.5KBytes)
KBytes)
(12.5
Buffer
Buffer
(20KBytes)
KBytes)
(20
Buffer
Buffer
(20KBytes)
KBytes)
(20
Combiner
Combiner
Mops
33Mops
WCDMA Receiver
WCDMA Transmitter
18
Advanced Computer Architecture Laboratory
University of Michigan
Transmitter
modulator
LPF-Tx
LPF-Tx
scrambler
scrambler
spreader
spreader
Channel
Channel
encoder
encoder
Interleaver
Interleaver
U
p
p
e
r
la
y
e
rs
F
ro
n
te
n
d
demodulator
Receiver
searcher
searcher
c
o
m
b
i
n
e
r
LPF-Rx
LPF-Rx
descrambler
descrambler
despreader
..
.
descrambler
descrambler
Channel
Channel
decoder
decoder
(turbo/viterbi)
(turbo/viterbi)
deinteleaver
deinteleaver
despreader
MAC
Frontend A/D
ARM
PE
PN Code
Code
PN
TX/RX
TX/RX
23Mops
Mops
23
Misc. Control
Control
Misc.
Mops
11Mops
Buffer
Buffer
(10
Bytes)
(10 Bytes)
Power
Power
Control
Control
15Kops
Kops
15
Deinterleaver
Deinterleaver
16 Mops
16
PE
Buffer
Buffer
(1280 Bytes)
(1280
Bytes)
LPF-Rx
22LPF-Rx
182Mops
Mops
182
PE
Buffer
Buffer
(2560
Bytes)
(2560 Bytes)
Searcher
Searcher
200Mops
Mops
200
PE
Buffer
Buffer
(1024
Bytes)
(1024 Bytes)
Turbo Encoder
Turbo Encoder
2 Mops
2 Mops
Interleaver
Interleaver
2 Mops
2 Mops
Spreader
Spreader
4.7 Mops
4.7 Mops
Scrambler
Scrambler
8.6 Mops
8.6 Mops
4
LPF-Rx
4 LPF-Rx
307
Mops
307 Mops
Buffer
Buffer
(1024
Bytes)
(1024 Bytes)
Turbo Decoder
Decoder
Turbo
324 Mops
Mops
324
Buffer
Buffer
(1360
Bytes)
(1360 Bytes)
Descrambler
Descrambler
22.5Mops
Mops
22.5
MAC
Despreader
Despreader
11.3Mops
Mops
11.3
Global
Memory
FIFOQueue
Queue
FIFO
(12.5KBytes)
KBytes)
(12.5
Buffer
Buffer
(20KBytes)
KBytes)
(20
Buffer
Buffer
(20KBytes)
KBytes)
(20
Combiner
Combiner
Mops
33Mops
WCDMA Receiver
WCDMA Transmitter
Frontend D/A
Advanced Computer Architecture Laboratory
University of Michigan
19
802.11a PE Mapping
802.11a Receiver
ARM
PE
Deinterleaver
60 Mops
PE
Buffer
(2048 Bytes)
FIR (Rx)
320 Mops
Buffer
(1024 Bytes)
PE
Buffer
(2048 Bytes)
Interplator
250 Mops
FFT
120 Mops
PE
Buffer
(2048 Bytes)
Freq. Eq.
120 Mops
Buffer
(2048 Bytes)
Viterbi Dec.
398 Mops
QAM Demod
2 Mops
Misc. Control
80 Mops
Interleaver
60 Mops
FIR (Tx)
320 Mops
Buffer
(1024 Bytes)
QAM Mod
2 Mops
IFFT
120 Mops
Buffer
(30 KBytes)
Buffer
(30 KBytes)
Sync.
20 Mops
Buffer
(2048 Bytes)
Global
Memory
Buffer
(22048 Bytes)
Channel Enc.
20 Mops
Buffer
(30 KBytes)
Buffer
(30 KBytes)
Preamble Ins.
50 Mops
802.11a Transmitter
20
Advanced Computer Architecture Laboratory
University of Michigan
Power Results
400
350
250
WCDMA
200
802.11a
150
100
50
To
ta
l
th
er
s
O
ce
ss
or
In
te
r-p
ro
em
Bu
s
or
y
M
G
lo
ba
lM
O
PE
PE

AR
th
er
s
U
AL
PE
R
M
em
or
y
eg
ist
er
Fi
le
0
PE
Power (mW)
300
Configuration




4 PEs, 1 ARM (Cortex M3) controller
Global scratchpad memory (64Kb)
90nm (1V @ 400 MHZ),
Advanced Computer Architecture Laboratory
of Michigan
SynthesizedUniversity
conservatively
21
Area Results
3.5
2.5
2
1.5
1
0.5
ta
l
To
er
s
O
th
lM
em
er
-p
or
ro
y
ce
ss
or
Bu
s
ba
AR
M
In
t
PE
G
lo
M
em
or
Re
y
gi
st
er
Fi
le
PE
AL
U
PE
O
th
er
s
0
PE
mm^2 (90nm)
3
22
Advanced Computer Architecture Laboratory
University of Michigan
SDR Programming
Language Support
Software Development Flow
Programmer
Defined
Matlab-Simulink/C
Floating Point
Algorithm Prototyping
Algorithm-level
Design Flow
System-Level
Design Flow
Timing
Requirements
C/SPEX
Fixed Point
Algorithm Kernel
Implementations
IO/
Control
Code
C/SPEX
System
Description
Timing
Requirements
Timing/Machine
Independent
asm code
Memory
Access
Patterns
System
Description
asm code
Compiler
Generated
Controller
asm
code
PE
asm
code
c
24
Advanced Computer Architecture Laboratory
University of Michigan
SPEX (Signal Processing EXtension)


Implemented as a library extension to C
System-level development



Support concurrent DSP kernel function definitions
Channel variables for inter-kernel communications
Algorithm-level development



Native vector & matrix variables
Explicit DSP variable attribute definition
Native vector & matrix operations
25
Advanced Computer Architecture Laboratory
University of Michigan
SPEX Overview
SPEX
DSP kernel
extension
DSP variable
extension
DSP
variable
attributes
Mode
Bitwidth
Type
Scalar
object
SIMD
object
Concurrent
variable
object
DSP
variable
operations
DSP
variable
objects
Arith. &
logic
op.
Predication
Permutation
Kernel
object
Channel
object
Concurrent
variable
operations
Concurrent
kernel
mangement
Channel
communication
operations
26
Advanced Computer Architecture Laboratory
University of Michigan
SPEX Example Code: Viterbi ACS
Concurrent DSP
kernel definitions
void* acs(void*) {
/* variable declaration */
saturated char<64> metrics1, metrics2;
saturated char<64> states;
saturated char<64> t1, t2;
while (!viterbi.stop()) {
/* receiving data from BMC */
metrics1 = bmc_to_acs.receive();
metrics2 = bmc_to_acs.receive();
/* add */
metrics1 += states;
metrics2 += states;
/* compare and select */
t1 = (metrics1(0,2,62),metrics2(0,2,62));
t2 = (metrics1(1,2,63),metrics2(1,2,63));
states(t1<t2) = t1;
states(t1>=t2) = t2;
/* sending data to TB */
acs_to_tb.send(states);
}
}
Advanced Computer Architecture Laboratory
University of Michigan
27
SPEX Example Code: Viterbi ACS
void* acs(void*) {
/* variable declaration */
saturated char<64> metrics1, metrics2;
saturated char<64> states;
saturated char<64> t1, t2;
Native SIMD variable
definition with explicit
attributes
while (!viterbi.stop()) {
/* receiving data from BMC */
metrics1 = bmc_to_acs.receive();
metrics2 = bmc_to_acs.receive();
SPEX variable supports
1. saturated/overflow
2. various variable bit-width
3. vector & matrices
/* add */
metrics1 += states;
metrics2 += states;
/* compare and select */
t1 = (metrics1(0,2,62),metrics2(0,2,62));
t2 = (metrics1(1,2,63),metrics2(1,2,63));
states(t1<t2) = t1;
states(t1>=t2) = t2;
/* sending data to TB */
acs_to_tb.send(states);
}
}
Advanced Computer Architecture Laboratory
University of Michigan
28
SPEX Example Code: Viterbi ACS
void* acs(void*) {
/* variable declaration */
saturated char<64> metrics1, metrics2;
saturated char<64> states;
saturated char<64> t1, t2;
Inter-kernel
communication through
channel operations
while (!viterbi.stop()) {
/* receiving data from BMC */
metrics1 = bmc_to_acs.receive();
metrics2 = bmc_to_acs.receive();
/* add */
metrics1 += states;
metrics2 += states;
Channel types:
1. FIFO queue
2. Broadcast queue
3. Sync/control channel
4. Random-read FIFO queue
/* compare and select */
t1 = (metrics1(0,2,62),metrics2(0,2,62));
t2 = (metrics1(1,2,63),metrics2(1,2,63));
states(t1<t2) = t1;
states(t1>=t2) = t2;
/* sending data to TB */
acs_to_tb.send(states);
}
}
Advanced Computer Architecture Laboratory
University of Michigan
29
SPEX Example Code: Viterbi ACS
void* acs(void*) {
/* variable declaration */
saturated char<64> metrics1, metrics2;
saturated char<64> states;
saturated char<64> t1, t2;
while (!viterbi.stop()) {
/* receiving data from BMC */
metrics1 = bmc_to_acs.receive();
metrics2 = bmc_to_acs.receive();
SPEX vector operations
Supports
(Matlab-like C code)
/* add */
metrics1 += states;
metrics2 += states;
1. SIMD arithmetic
operations
2. SIMD permutation
3. SIMD predication
/* compare and select */
t1 = (metrics1(0,2,62),metrics2(0,2,62));
t2 = (metrics1(1,2,63),metrics2(1,2,63));
states(t1<t2) = t1;
states(t1>=t2) = t2;
/* sending data to TB */
acs_to_tb.send(states);
}
}
Advanced Computer Architecture Laboratory
University of Michigan
30
Summary
-
Hardware & software solutions for SDR
-
Hardware
-
-
4 dual-issue asymmetric SIMD processing elements
Consumes 200~300mW for 90nm
Meets the performance requirements for WCDMA &
802.11a
Software
-
SPEX provides efficient DSP algorithm and system
implementation
31
Advanced Computer Architecture Laboratory
University of Michigan
Questions?
32
Advanced Computer Architecture Laboratory
University of Michigan
Download