HPEC and HPC via FPGAs - Utah State University

HPEC using FPGAs
Challenges and Benefits

2
Utah State University
Cache Valley, 90 miles north of Salt Lake City
David Sant Engineering Innovation Building
3
Agenda
 On-board computing for Spacecraft
 A primer on FPGAs (5 slides)
 HPEC using FPGAs (26 slides)
 The Polymorphic Systolic Array Framework
 Improving productivity
 Enabling real time and responsive reconfiguration
 Future technologies for FPGAs
 Acknowledgements
4
On-board Computing
 Civilian and Military space missions getting more complex
   Need to support several types of data from several types of sensors
 Missions will require spacecraft computer to be more responsive
   Need for In-situ data processing (signal processing)
   Not just compression, but data analysis, decision making etc.
 Power budget, form factors of spacecraft computer extremely tight
   State of the art RadHard microprocessor from BAE systems or RISC processor?
   Aging workhorse, time to upgrade big time
5
So, what do we upgrade to?
 Commodity Microprocessors
 Cell, GPU, Many/Multi core
 Very powerful
 Blows out the power budget
 RadHard parts need to be custom ordered
 Commodity DSP chips
 Good as long as you stick to just one chip
 RadHard parts can be custom ordered
 Commodity Reconfigurable chips
 FPGAs (field programmable gate arrays)
 Can perform like a custom silicon chip
 Best performance/power ratios
 RadHard parts already available with steady roadmap from Xilinx
6
Programming perspective
 Microprocessors: Frozen pizza
 DSP chips: Take ‘n’ bake
 FPGAs: Raw ingredients (optimistic view point)
7
Quick Primer on FPGAs
 Mixture of blocks on a die
 Some dedicated
 DSP (MAC units)
 PPC (optional)
 RAM
 Some programmable
 Look Up Tables (LUT)
 Gazillions of network switches
 Hidden
 Special circuit
 ICAP (internal configuration access port)
8
Simple View of Programming an FPGA
All computations are assumed to be based on Boolean Logic
So,
Problem solving concept => algorithms
Algorithms => Discrete set of simple tasks (add/multiply…)
Simple tasks => A set of Boolean functions talking to each other
Boolean function => simple manipulation of 1 and 0 bits
Each bit is stored in a small memory cell (SRAM)
An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic
[Figure: NMOS transistor]
9
Programming an FPGA
 Each Look Up Table (LUT) has a unique mailing address
 16 bits go into each Look Up Table (LUT)
 Each routing switch has a unique mailing address
 One bit for each switch
 The executable for an FPGA is a sequence of bits that have to be delivered precisely to each LUT and Switch Box
 This binary/executable is called “Configuration Bitstream” or
simply “Bitstream”
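As a rough illustration of the "16 bits go into each LUT" point, the sketch below treats a 4-input LUT as a 16-entry truth table held in one integer. The names `make_lut_init` and `lut_eval`, and the mapping of inputs a, b, c, d to address bits 0 to 3, are assumptions made for this example and are not taken from any vendor tool.

```python
def make_lut_init(func):
    """Build the 16 configuration bits of a 4-input LUT from a Boolean function."""
    init = 0
    for index in range(16):
        # The four inputs form the 4-bit address of one SRAM cell (assumed ordering).
        a = (index >> 0) & 1
        b = (index >> 1) & 1
        c = (index >> 2) & 1
        d = (index >> 3) & 1
        if func(a, b, c, d):
            init |= 1 << index
    return init


def lut_eval(init, a, b, c, d):
    """Look up the stored bit addressed by the four inputs."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (init >> index) & 1


# Two example "programs" for the same physical LUT.
AND4 = make_lut_init(lambda a, b, c, d: a & b & c & d)
XOR4 = make_lut_init(lambda a, b, c, d: a ^ b ^ c ^ d)
print(f"AND4 = 0x{AND4:04X}, XOR4 = 0x{XOR4:04X}")
```

The same lookup hardware computes either function; only the 16 stored bits differ, which is what the bitstream delivers.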
10
Programming an FPGA

 Programming the FPGA is like having a Mailman deliver bits to each address correctly
   Slow process
 But a Bitstream is slightly more complex
   Each FPGA is like a Country (has a unique code)
   A “Bitstream” before entering the chip has to undergo security clearance (CRC or cyclic redundancy check)
   Port of Entry = ICAP
 FPGA addresses are hierarchical (state, county, city, suburb, house address)
   Term used for encoding all this overhead is “Frame Address”
   All this address stuff is overhead
 Actual useful stuff is inside the mail envelope
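To make the hierarchical-address idea concrete, here is a minimal sketch that packs a multi-level address into one word and unpacks it again. The field names and bit widths in `FIELDS` are illustrative assumptions only; they are not the real frame address layout of any particular device.

```python
from collections import namedtuple

# Hypothetical layout: (field name, bit width), most significant field first.
FIELDS = (("block_type", 3), ("row", 5), ("column", 8), ("minor", 6))
FrameAddress = namedtuple("FrameAddress", [name for name, _ in FIELDS])


def pack(addr):
    """Concatenate the fields into a single address word."""
    word = 0
    for (name, width), value in zip(FIELDS, addr):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split an address word back into its fields."""
    values = []
    for _, width in reversed(FIELDS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return FrameAddress(*reversed(values))


addr = FrameAddress(block_type=0, row=2, column=17, minor=5)
word = pack(addr)
print(hex(word), unpack(word))
```

Like a postal address, each field narrows the destination; the payload (the "mail envelope") is carried separately from this addressing overhead.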
11
So what does a real
configured/programmed FPGA look like?
Before Programming
Nice clean plate
Empty LUTs, Switches….
After Programming
Messy plate of spaghetti
Configured LUTs, Switches….
All those green things are wires that have been set up to carry data between LUTs, FFs etc…
12
High Performance Embedded Computing
(HPEC) using FPGAs

Signal processing algorithms
 Wildly useful and hence widely used
 Computationally quite parallel/pipeline-amenable
 Proven to be accelerate-able by Systolic Array designs on FPGAs
 The Good of FPGAs:
 FPGAs claim to have orders of magnitude performance advantage over
DSP chips (www.xilinx.com www.altera.com)
 They can be reconfigured partially and dynamically
 The Bad (no, the Ugly):
 Productivity is the biggest barrier
 The number of signal processing folks willing to adopt FPGAs is small and
stagnant
 Partial dynamic reconfiguration is very slow compared to processing
speeds
13
Elaborating the Good of FPGAs:
Extreme DSP computing
14
Elaborating the Good of FPGAs:
Partial Dynamic Reconfiguration
At some point in time……
[Figure: one FPGA holding four parallel processing circuits for application α and seven parallel processing circuits for application β]
Because we did not have enough space on the chip to support high levels of parallelism for both applications, or there was a power budget we couldn’t satisfy
Abruptly… say we need to quickly increase parallelism support for application α (from four circuits to five), at the cost of taking away parallelism support for the other application β (from seven circuits to six)
Can we dynamically reconfigure the chip, without disturbing the execution of either application?
And do it fast enough?
Remember, programming the FPGA is a very very very slow process RELATIVE to execution speeds of applications
15
Productivity
 It’s a funny thing in the FPGA world
 FPGA programmers are essentially VLSI design guys
 They don’t buy $5K parts to get average performance
 Every clock cycle is precious
 Every LUT/FF/MAC/BRAM is precious
 They don’t adopt new programming languages in a hurry
 They love to have full control over every operation
16
Productivity, so what does it mean?
 Wants an entire system on FPGA modeled, performance
predicted, designed, implemented, debugged, verified,
guaranteed timing closure, low power, high throughput….
 Done really really fast, just like software
 And then wants to make some minor changes and do it quickly
all over again, just like software…
17
Why can’t new designs be compiled,
loaded onto FPGAs and tested super fast?

Need to look at traditional design flow
1. Hardware-Software partition (quick)
2. Create macro and micro architectures for hardware portion (a month, two months..)
3. Write bug free VHDL/Verilog code for architectures (a few months)
4. Synthesize, translate, map, place and route (5 to 15 hours)
5. Simulate
    If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
    Test again.
    If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro architecture change, go back to step 2
8. Good luck trying to finish your project on time and budget
9. This will still not get you a dynamically reconfigurable design
18
One way to Improve Productivity
 Stick to the traditional design flow as much as possible
 FPGA users are once bitten twice shy
 Very conservative and believe in the existing flow
 But introduce structure into the flow, i.e. physical structure,
macro-architecture structure
 Make Partial Dynamic Reconfiguration (PDR) almost automatic
 FPGA designers are not conversant with PDR designs
19
Augmented Design Flow: Exclusively for
Signal Processing Algorithms

 Hardware-Software Partitioning (just a concept and specific to an application)
 Structured Macro-architecture via Floor Planning
   Generic structure applicable to many algorithms
 Structured Micro-architecture design
   Project, Schedule data flow model of Sig. Proc. Kernel onto things called Sockets of Macro-architecture
   Well understood process
 Embed dynamic reconfiguration capability
   New technology
   Works in tandem with Macro-architecture
 Code, Synthesize….
 Test on chip
Structured Macro-architecture

Some important Terms/Elements:



Socket: A physical region on the FPGA chip reserved by designer to be loaded
with/configured with a PE. This is also called a Partial Reconfiguration Region (PRR)
Switch Box: A circuit that makes the array of Sockets re-partition-able
PE/Processing Element: A circuit/bitstream to implement a signal processing kernel’s
systolic array data-flow functionality. To activate a socket, a PE must be loaded into it
21
Socket/PRR: Under the Hood
Yellow box: A socket/PRR
It contains BRAMs, MACs and LUTs/FFs
(purple and blue/green/black stuff)
If you want to dynamically
reconfigure the parallelism of Systolic
Arrays on an FPGA:
All PRRs must be created with
identical resources of MACs, BRAMs,
LUTs, FFs.
Physical fabric of Virtex SX 35 FPGA
Switch Box: Stuff that makes the Array of
Sockets Re-partition-able
Simple circuit
Need to set mux sel lines & fifo controls
Resides in static region on FPGA
Change SB connections to change
partitioning of sockets/PRRs between
systolic array kernels’ nodes
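One way to picture re-partitioning is as a chain of sockets with a switch box between each neighbouring pair: a switch box either chains the two sockets into the same systolic array or splits them into different arrays. The sketch below is a toy model under that assumption; the function name `switch_settings` and the labels "chain" and "split" are hypothetical and do not describe the actual mux select or FIFO control encoding.

```python
def switch_settings(partition):
    """Derive one setting per switch box from a partition of the socket chain.

    partition: list of (kernel name, number of sockets) in chain order.
    Returns a list with one entry per switch box between adjacent sockets.
    """
    # Expand the partition into the owner of each socket along the chain.
    owners = [kernel for kernel, count in partition for _ in range(count)]
    # Neighbours with the same owner are chained; otherwise the array is split there.
    return ["chain" if left == right else "split"
            for left, right in zip(owners, owners[1:])]


before = switch_settings([("EKF", 3), ("DWT", 2)])
after = switch_settings([("EKF", 2), ("DWT", 3)])
print(before)
print(after)
```

In this model, moving one socket from one kernel to the other only changes the position of the single "split" boundary, which is why the switch box itself can stay a simple circuit in the static region.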
23
Ok, time to port Macro-architecture
Framework onto Chip
What really happened when we tried it
Virtex 4 SX 35
Static region
(luminescent green stuff)
•Microprocessor
•Switch Boxes
•Cache
•Controller
PRRs/Sockets
(white boxes)
•To be filled with Systolic
Array Processing Elements
25
Now to the Micro-architecture…
First, Hardware Software Partitioning
Example: Extended
Kalman Filter (EKF). A
critical navigation
algorithm and a nasty
signal processing kernel.
All stuff with rounded
edges are tasks that can
change based on physics
of the problem. So put it
all in software
(Microblaze).
All else is consistent and
so put them in hardware
(PolySAF)
26
Designing/Deriving the Processing
Element: Example EKF
Works on Faddeev Algorithm to compute
Schur complement
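For readers unfamiliar with it, the Faddeev algorithm obtains D + C·A⁻¹·B by stacking the blocks as [[A, B], [−C, D]] and running Gaussian elimination until the lower-left block is zero; the lower-right block then holds the result. The sketch below is a plain software rendering of that idea using exact fractions so the answer is easy to check. It says nothing about how the processing element in this work schedules the operations, and the function name `faddeev` is chosen for illustration.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B via elimination on [[A, B], [-C, D]]."""
    n = len(A)
    top = [[Fraction(x) for x in ra + rb] for ra, rb in zip(A, B)]
    bottom = [[Fraction(-x) for x in rc] + [Fraction(x) for x in rd]
              for rc, rd in zip(C, D)]
    M = top + bottom
    for k in range(n):
        # Pick a non-zero pivot from the rows of A only, so the result is unchanged.
        pivot = next((r for r in range(k, n) if M[r][k] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        M[k], M[pivot] = M[pivot], M[k]
        # Eliminate column k from every row below, including the [-C, D] rows.
        for i in range(k + 1, len(M)):
            factor = M[i][k] / M[k][k]
            if factor != 0:
                M[i] = [x - factor * y for x, y in zip(M[i], M[k])]
    # The lower-right block is now the Schur-complement-style result.
    return [row[n:] for row in M[n:]]


result = faddeev(A=[[2, 1], [1, 3]], B=[[1], [2]], C=[[1, 0]], D=[[0]])
print(result)
```

Choosing B and C as identity matrices and D as zero turns the same routine into a matrix inverse, which is one reason the algorithm suits Kalman filter updates.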
27
One of the many possible ways
Port
28
Code, Synthesize, …Optimize
 Port: Code, synthesize, Translate, Map, Place and Route
 For One Socket/PRR (just a few days worth of work)
 Move Nets around to meet timing: Manually pick up a wire in this small bowl of spaghetti of wires, and move it around.
   Nuisance of a task, but necessary
   But you need to do it only in one PRR (just a few hours worth of work)
 Copy Locally optimized bitstream/circuit of the one PRR to all PRRs
 Automatically obtain Global Timing closure for the PolySAF
 If Microprocessor, Cache are retained for multiple designs, then global
timing closure for whole chip is also automatically gifted to you
29
Have we answered the Productivity
problem?
Time to Grade the Approach

Need to look at traditional design flow
1. Hardware-Software partition (quick)
2. Create macro and micro architectures for hardware portion (a month, two months..)
    Applicable to a wide range of Sig. Proc. Algorithms
3. Write bug free VHDL/Verilog code for architectures (a few months)
    Reuse most of the macro structure and code only for one PRR
4. Synthesize, translate, map, place and route (5 to 15 hours)
    Do for only one PRR
5. Simulate
    If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
    Test again.
    If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro architecture change, go back to step 3
8. Good luck trying to finish your project on time and budget
30
Want the details, the math, the
algorithms etc?
 Read this paper



A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan. 2010.
Author preprint available online at Reconfigurable Computing Group: www.usu.edu/rcg
31
Now, onto Partial Dynamic
Reconfiguration in the PolySAF
Start: 3 nodes EKF, 2 nodes DWT
Detach Socket: 2 nodes EKF, 2 nodes DWT
Reconfigure
Reset new PRR
Re-attach: 2 nodes EKF, 3 nodes DWT
DWT: discrete wavelet transform. The kernel used in JPEG 2000 image compression
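The detach, reconfigure, reset and re-attach sequence can be traced with a small bookkeeping model. This is only a sketch of the ordering of steps and of how many nodes each kernel has attached at each point; `move_socket` is a hypothetical helper, it updates the `owners` list in place, and no real reconfiguration interface is being modelled.

```python
def move_socket(owners, index, new_kernel):
    """Yield (step, attached node counts per kernel) while re-targeting one socket."""
    attached = [True] * len(owners)

    def counts():
        result = {}
        for kernel, is_attached in zip(owners, attached):
            if is_attached:
                result[kernel] = result.get(kernel, 0) + 1
        return result

    yield "start", counts()
    attached[index] = False        # detach: the socket leaves its systolic array
    yield "detach", counts()
    owners[index] = new_kernel     # reconfigure: load the other kernel's PE
    yield "reconfigure", counts()
    yield "reset", counts()        # reset the newly loaded PRR before use
    attached[index] = True         # re-attach: the socket joins the other array
    yield "re-attach", counts()


sockets = ["EKF", "EKF", "EKF", "DWT", "DWT"]
trace = list(move_socket(sockets, index=2, new_kernel="DWT"))
for step, node_counts in trace:
    print(f"{step:12s} {node_counts}")
```

The untouched sockets keep their counts throughout, which mirrors the requirement that neither application stops executing while one socket changes hands.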
32
How to Physically Reconfigure PRR?
 Known Methods
33
Comparison of all known options
Best known technique: from Microsoft Research Labs (2008) eMIPS project
Too Slow, Too expensive (hogs up valuable on-chip BRAMs)
34
Embedding Dynamic Reconfiguration into
the System
 Active Bitstream (PRR) to PRR: Hardware Circuit
[Figure: FPGA block diagram. Labels: ARC, ICAP wrapper, ICAP, snoop, active bitstream, PRR (source), PRR (destination), PRR (destination)]
35
Accelerated Relocation Circuit (ARC)
 Manipulate Frame addresses
   FAR is Frame address register
   Lots of unnecessary overhead can be avoided
 No need for CRC processing
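The core idea of relocating by manipulating frame addresses can be sketched as follows: read frames from the source PRR, rewrite only their address fields by the offset between the two regions, and leave the payload untouched. The `Frame` fields and the (row, column) origin convention are assumptions made for this illustration; the real ARC operates on the device's frame address register in hardware.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    row: int         # hypothetical address field
    column: int      # hypothetical address field
    minor: int       # frame index within a column (hypothetical)
    payload: bytes   # configuration data, copied unchanged


def relocate(frames, src_origin, dst_origin):
    """Return copies of the frames with addresses shifted to the destination PRR."""
    d_row = dst_origin[0] - src_origin[0]
    d_col = dst_origin[1] - src_origin[1]
    return [replace(f, row=f.row + d_row, column=f.column + d_col) for f in frames]


source_prr = [Frame(row=0, column=4, minor=m, payload=bytes([m])) for m in range(3)]
relocated = relocate(source_prr, src_origin=(0, 4), dst_origin=(1, 10))
for frame in relocated:
    print(frame)
```

Because the destination has identical resources to the source, the payload is valid as-is; only the addressing changes, and nothing in this path needs a CRC to be recomputed.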
36
Results: reconfiguration times in milliseconds
Test circuits: RFT cases and PolySAF node

Test circuit            LUT    FF  DSP  BRAM  Bytes  Frames    ARC  BiRF* same  BiRF* BRAM  MS* same  MS* opp
FSA_no_DSP              486   273    0     0  31159     195   0.48        84.7          14      3.38     8.86
DSA_no_DSP              438   273    0     0  30693     195   0.48        83.4          14      3.33     8.73
Matrix_Mult_no_DSP     1234   988    0     0  68469     432   1.07       186.1          30      7.42    19.47
FSA_with_DSP            423   216    1     0  32349     195   0.48        87.9          15      3.50     9.20
DSA_with_DSP            375   216    1     0  32349     195   0.48        89.8          15      3.58     9.20
Matrix_Mult_with_DSP    502   466    8     0  65261     432   1.07       177.3          29      7.07    18.56
DCT                    1419  1636    8     8  44397     540   1.34      120.64          22      4.81    12.62
CSC                     318   438    1    12  17313     301   0.74       47.04           9      1.87     4.92
DWT                     940   389    0     4  47897     303   0.75       130.2          21      5.19    13.62

LUT, FF, DSP, BRAM: resources. Bytes: bitstream size. Frames: number of frames.
ARC: Same Side/Opp Side. BiRF*: IEEE TVLSI 2009 (Same side, BRAM). MS*: Microsoft Tech. Report 2008 (Same Side, Opp Side).
All systems run @ 100 MHz
* Estimated values for state of the art competing technologies
Footprint of ARC:
1064 LUTs, 638 FFs and 1 BRAM
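A quick sanity check on the ARC column: dividing each reported ARC time by the frame count gives a nearly constant cost per frame, which suggests the relocation time scales with the number of frames moved. The sketch below uses only (frames, milliseconds) pairs from the results and the stated 100 MHz clock; the conversion to clock cycles is simple arithmetic, not a measured figure.

```python
CLOCK_MHZ = 100  # "All systems run @ 100 MHz"

# (number of frames, ARC reconfiguration time in ms), taken from the results.
ARC_RESULTS = [(195, 0.48), (432, 1.07), (540, 1.34), (301, 0.74), (303, 0.75)]

# Time per frame in microseconds for each result.
per_frame_us = [ms * 1000.0 / frames for frames, ms in ARC_RESULTS]

for (frames, ms), us in zip(ARC_RESULTS, per_frame_us):
    cycles = us * CLOCK_MHZ
    print(f"{frames:4d} frames, {ms:5.2f} ms -> {us:.2f} us/frame (~{cycles:.0f} cycles)")
```

Every entry lands between 2.4 and 2.5 microseconds per frame, so a rough estimate for a new PRR is its frame count multiplied by about 2.5 microseconds.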
37
Next steps…
Improve, Formalize and Collaborate

 Performance prediction Model
   Predict how big circuit will be, how it will perform using Excel and Matlab
   Big leap in productivity
 Arithmetic Precision manipulation is extraordinarily powerful when it comes to FPGAs
   If the right non-IEEE precision can be chosen for a Sig. Proc. App., then you can save medium to massive amounts of area, power in the circuit mapped onto the FPGA
   Great opportunity for Small Satellites
 Efficient communication between Microprocessor and PolySAF via threads
 Validate and brutally test this on a large number of algorithms (FFTs, Filters, Hyperspectral processing…..)
   NASA can help with this
   Technology is attractive for software defined radios, precision navigation…
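On the arithmetic precision point, the trade-off can be explored in software before committing to a circuit. The sketch below assumes fixed-point as the non-IEEE format and shows how the worst-case rounding error shrinks as fractional bits are added; `quantize` is a hypothetical helper, and the link from word length to area and power is not modelled here.

```python
def quantize(x, frac_bits):
    """Round x to the nearest fixed-point value with the given fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


value = 0.1
for frac_bits in (8, 12, 16):
    approx = quantize(value, frac_bits)
    error = abs(approx - value)
    bound = 2.0 ** -(frac_bits + 1)   # rounding error is at most half an LSB
    print(f"{frac_bits:2d} fractional bits: {approx:.8f}  error {error:.2e}  bound {bound:.2e}")
```

If an application tolerates the error at 8 or 12 fractional bits, the datapath can be narrower than an IEEE single-precision unit, which is where the area and power savings would come from.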
38
Kaleidoscope: Future of FPGA
 Near term
 Maybe better tools to program and debug FPGAs?
 Mentor’s Catapult, AutoESL compiler, Synfora compiler….
 Maybe some sort of standardization in FPGA programming
 Hopefully DARPA HPCS program will produce something
 Longer term (Revolutionary things to come)
 Vertically Integrated FPGA + DRAM on a single chip
 1000x improvement in performance/watt
 Visit Micron Research Center at USU to learn more
 www.usu.edu/mrc
39
Acknowledgements
 Joe Bredekamp and the NASA AISR program
 Applied Information Systems Research
 Funding from NASA is valuable
 Focused research
 Want my technology to be adopted for real missions
 Xilinx and Mentor Graphics (donated > $ 100K worth software)
 My Grad Students