FPGAs for the Masses:
Hardware Acceleration without Hardware Design
David B. Thomas
dt10@doc.ic.ac.uk
Contents
• Motivation for hardware acceleration
– Increase performance, reduce power
– Types of hardware accelerator
• Research achievements
– Accelerated Finance research group
– Research direction and publications
• Highlighted contribution: Contessa
– Domain specific language for Monte-Carlo
– Push-button compilation to hardware
• Conclusion
Motivation
• Increasing demand for High Performance Computing
– Everyone wants more compute-power
– Finer time-steps; larger data-sets; better models
• Decreasing single-threaded performance
– Emphasis on multi-core CPUs and parallelism
– Do computational biologists need to learn PThreads?
• Increasing focus on power and space
– Boxes are cheap: 16 node clusters are very affordable
– Where do you put them? Who is paying for power?
• How can we use hardware acceleration to help?
Types of Hardware Accelerator
• GPU : Graphics Processing Unit
– Many-core - 30 SIMD processors per device
– High bandwidth, low complexity memory – no caches
• MPPA : Massively Parallel Processor Array
– Grid of simple processors – 300 tiny RISC CPUs
– Point-to-point connections on 2-D grid
• FPGA : Field Programmable Gate Array
– Fine-grained grid of logic and small RAMs
– Build whatever you want
Hardware Advantages: Performance
[Bar chart: speed-up relative to CPU. CPU 1x, GPU 10x, MPPA 1x, FPGA 31x]
• More parallelism - more performance
• GPU: 30 cores, 16-way SIMD
• MPPA: 300 tiny RISC cores
• FPGA: hundreds of parallel functional units
A Comparison of CPUs, GPUs, FPGAs, and MPPAs for Random Number Generation,
D. Thomas, L. Howes, and W. Luk, In Proc. of FPGA (To Appear), 2009
Hardware Advantages: Power
[Bar chart: speed-up and power efficiency relative to CPU. Speed-up: CPU 1x, GPU 10x, MPPA 1x, FPGA 31x. Efficiency: CPU 1x, GPU 9x, MPPA 18x, FPGA 175x]
• GPU: 1.2GHz - same power as CPU
• MPPA: 300MHz - Same performance as CPU, but 18x less power
• FPGA: 300MHz - faster and less power
A Comparison of CPUs, GPUs, FPGAs, and MPPAs for Random Number Generation,
D. Thomas, L. Howes, and W. Luk, In Proc. of FPGA (To Appear), 2009
FPGA Accelerated Applications
• Finance
– 2006: Option pricing: 30x vs CPU
– 2007: Multivariate Value-at-Risk: 33x vs quad CPU
– 2008: Credit-risk analysis: 60x vs quad CPU
• Bioinformatics
– 2007: Protein Graph Labelling: 100x vs quad CPU
• Neural Networks
– 2008: Spiking Neural Networks: 4x vs quad CPU, 1.1x vs GPU
All with less than a fifth of the power
Problem: Design Effort
[Chart: relative performance (0.25x to 256x) vs design-time (hours to a year) for CPU code: scripted, compiled, vectorised and multi-threaded versions]
• Researchers love scripting languages: Matlab, Python, Perl
– Simple to use and understand, lots of libraries
– Easy to experiment and develop promising prototype
• Eventually prototype is ready: need to scale to large problems
– Need to rewrite prototype to improve performance: e.g. Matlab to C
– Simplicity of prototype is hidden by layers of optimisation
Problems: Design Effort
[Chart: relative performance vs design-time, adding the GPU path: compiled CPU code, then C to GPU, reorganisation, and memory optimisation]
• GPUs provide a somewhat gentle learning curve
– CUDA and OpenCL almost allow compilation of ordinary C code
• User must understand GPU architecture to maximise speed-up
– Code must be radically altered to maximise use of functional units
– Memory structures and accesses must map onto physical RAM banks
• We are asking the user to learn about things they don’t care about
Problems: Design Effort
[Chart: relative performance vs design-time, adding the FPGA path: initial design, then parallelisation and clock-rate optimisation, compared with the CPU and GPU curves]
• FPGAs provide large speed-up and power savings – at a price!
– Days or weeks to get an initial version working
– Multiple optimisation and verification cycles to get high performance
• Too risky and too specialised for most users
– Months of retraining for an uncertain speed-up
• Currently only used in large projects, with dedicated FPGA engineer
Goal: FPGAs for the Masses
• Accelerate niche applications with limited user-base
– Don’t have to wait for traditional “heroic” optimisation
• Single-source description
– The prototype code is the final code
• Encourage experimentation
– Give users freedom to tweak and modify
• Target platforms at multiple scales
– Individual user; Research group; Enterprise
• Use domain specific knowledge about applications
– Identify bottlenecks: optimise them
– Identify design patterns: automate them
– Don’t try to do general purpose “C to hardware”
Accelerated Finance Research Project
• Independent sub-group in Computer Systems section
– EPSRC project: 3 years, £670K
• “Optimising hardware acceleration for financial computation”
– Team of four: Me, Wayne Luk, 2 PhD students
• Active engagement with financial institutes
– Six month feasibility study for Morgan Stanley
– PhD student funded by J. P. Morgan
• Established a lead in financial computing using FPGAs
– 7 journal papers, 17 refereed conference papers
– Book chapter in “GPU Gems 3”
Finance: Increasing Automation
• Increasing levels of automation: Application-Specific Custom Design → Manual Design Pattern → Semi-Automated Design Pattern → Automated Design Tool (Contessa)
• Publications along the way:
– Hardware architectures for Monte-Carlo based financial simulations, D. Thomas, J. Bower, W. Luk, Proc. FPT, 2006
– A Reconfigurable Simulation Framework for Financial Computation, J. Bower, D. Thomas, et al., Proc. Reconfig, 2006
– Automatic Generation and Optimisation of Reconfigurable Financial Monte-Carlo Simulations, D. Thomas, J. Bower, W. Luk, Proc. ASAP, 2007
– Credit Risk Modelling using Hardware Accelerated Monte-Carlo Simulation, D. Thomas, W. Luk, Proc. FCCM, 2008
– A Domain Specific Language for Reconfigurable Path-based Monte Carlo Simulations, D. Thomas, W. Luk, Proc. FPT, 2007
• Case studies: Simple Monte-Carlo option pricing; Path-dependent option pricing; Correlated Value-at-Risk; Discrete-event Credit-Risk models; Irregular Monte-Carlo
• Further directions: Target GPUs, MPPAs, ...; Dynamic Memory Support; Variance Reduction; Quasi Monte-Carlo; Numerical Solutions
Finance: Increasing Performance
• Case studies: Simple Monte-Carlo option pricing; Path-dependent option pricing; Correlated Value-at-Risk; Discrete-event Credit-Risk models; Irregular Monte-Carlo; Variance Reduction; Quasi Monte-Carlo; Numerical Solutions
• Optimisations (with publications):
– Uniform RNGs: High quality uniform random number generation through LUT optimised linear recurrences, D. Thomas, W. Luk, Proc. FPT, 2005
– Arbitrary distribution RNGs: Efficient Hardware Generation of Random Variates with Arbitrary Distributions, D. Thomas, W. Luk, Proc. FCCM, 2006
– Multivariate Gaussian RNGs: Sampling from the Multivariate Gaussian Distribution using Reconfigurable Hardware, D. Thomas, W. Luk, Proc. FCCM, 2007
– Exponential RNGs: Sampling from the Exponential Distribution using Independent Bernoulli Variates, D. Thomas, W. Luk, Proc. FPL, 2008
– Statistical accumulators: Estimation of Sample Mean and Variance for Monte-Carlo Simulations, D. Thomas, W. Luk, Proc. FPT, 2008
– Binomial and trinomial trees; numerical integration; finite difference methods: Exploring reconfigurable architectures for financial computation, Q. Jin, D. Thomas, W. Luk, B. Cope, Proc. ARC, 2007
Contessa: Overall Goals
• Language for Monte-Carlo applications
• One description for all platforms
– FPGA family independent
– Hardware accelerator card independent
• “Good” performance across all platforms
• No hardware knowledge needed
• Quick to compile
• It Just Works: no verification against software
FPGA : Field Programmable Gate Array
• Grid of logic gates
– No specific function
– Connect as needed
• Allocate logic
– Adders, multipliers, RAMs
• Area = performance
– Make the most of it
– Fixed-size grid
• Pipelining is key
– Lots of registers in logic
– Pipeline for high clock rate
• Multi-cycle feedback paths
– Floating-point: 10+ cycles (see the sketch below)
[Diagram: fixed-size grid of unconfigured cells (?), progressively allocated to adders (+), multipliers (×) and RAMs]
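To make the pipelining argument concrete, here is a minimal C sketch of my own (not from the talk), assuming a hypothetical 10-cycle floating-point adder: a single thread whose next add depends on the previous result finishes only one add every 10 cycles, while many independent threads can launch one add per cycle and keep every pipeline stage busy.

#include <stdio.h>

#define FP_LATENCY 10   /* assumed pipeline depth of one floating-point adder */

/* One thread with a loop-carried dependency: each add must wait for the
   previous result to leave the pipeline, so throughput is one add per
   FP_LATENCY cycles. */
static long cycles_single_thread(long adds)
{
    return adds * FP_LATENCY;
}

/* Many independent threads: a new add enters the pipeline every cycle, so
   after an initial fill of FP_LATENCY cycles one result emerges per cycle. */
static long cycles_many_threads(long adds)
{
    return FP_LATENCY + adds - 1;
}

int main(void)
{
    long adds = 1000000;
    printf("single thread: %ld cycles\n", cycles_single_thread(adds));
    printf("many threads : %ld cycles\n", cycles_many_threads(adds));
    return 0;
}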
Contessa: Basic Ideas
• Contessa: Pure functional high-level language
– Variables can only be assigned once
– No shared mutable state
• Continuation-based control-flow
– Iteration through recursion
– Functions do not return: no stack
• Syntax driven compilation to FPGA
– No side-effects: maximise thread-level parallelism
– Thread-level parallelism allows deep pipelines
– Deep pipelines allow high clock rate - high performance
• Hardware independent
– No explicit timing or parallelism information
– No explicit binding to hardware resources
• Familiar semantics
– Looks like C
– Behaves like C
– No surprises for user
• Built-in primitives
– Random numbers
– Statistical accumulators
– Map to FPGA-optimised functional units
• Restricts choices
– User can’t write poorly-performing code
– e.g. Just-in-time random number generation
• Straight to hardware
– No hardware hints
– Push a button
Example Contessa program (a price simulation that switches between stable and volatile market regimes):

parameter(float, VOLATILE_ENTER);
parameter(int, MAX_D);             // Remaining parameters elided.
accumulator(float, price);         // Price at end of simulations.

// This block is the arity-0 entry point for all threads.
void init()
{ stable(1, S_INIT); }             // Start threads in the stable block.

// Stable market: step price forward for each day in the simulation.
void stable(int d, float s)
{
  if(d==MAX_D){
    price += s;                    // Accumulate final price of simulation.
    return;                        // Exit thread with nullary return.
  }else if(unifrnd()>VOLATILE_ENTER){
    volatile(d+1, 0, VOL_INIT, s); // Simulate volatile day.
  }else{
    float ns=s*lognrnd(STABLE_MU, STABLE_SIGMA);
    stable(d+1, ns);               // Simulate stable day in one step.
  }
}

// Volatile market: step price in small increments through the day.
void volatile(int dinc, float t, float v, float s)
{
  if(t>MAX_T){
    stable(dinc, s);               // End of day, so return to stable phase.
  }else{
    float nv=sqrt(v+unifrnd());                 // New volatility...
    float ns=s*lognrnd(VOL_MU, VOL_SIGMA*nv);   // ...and new price.
    volatile(dinc, t+exprnd(), nv, ns);         // Loop.
  }
}
• Each function is a pipeline
– Parameters are inputs
– Function calls are outputs
– Can be very deep pipelines
– Floating-point: 100+ cycles
• Function call = continuation
– Tuple of target + arguments
– Completely captures thread
– Can be queued and routed
• Massively multi-threaded
– Threads are cheap to create
– Route threads like packets
– Queue threads in FIFOs (see the sketch below)
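The following is a minimal C sketch of my own (the struct layout, queue size and helper names are hypothetical, not Contessa's implementation) showing why a continuation can be queued and routed like a packet: it is just a record naming the next function plus that function's arguments, so it completely captures a thread with no stack.

#include <stdio.h>
#include <string.h>

/* A continuation: which function the thread enters next, plus its arguments.
   This completely captures the thread's state - there is no stack. */
enum target { T_STABLE, T_VOLATILE };

struct continuation {
    enum target next;                              /* function entered next */
    union {
        struct { int d; float s; } stable;         /* stable(d, s)          */
        struct { int dinc; float t, v, s; } vol;   /* volatile(dinc,t,v,s)  */
    } args;
};

/* A bounded FIFO of continuations: in hardware, a small RAM-based queue in
   front of each function's pipeline. */
#define QUEUE_SIZE 64
struct fifo {
    struct continuation slots[QUEUE_SIZE];
    int head, tail, count;
};

static int fifo_push(struct fifo *q, struct continuation c)
{
    if (q->count == QUEUE_SIZE) return 0;          /* full: stall the producer */
    q->slots[q->tail] = c;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->count++;
    return 1;
}

static int fifo_pop(struct fifo *q, struct continuation *c)
{
    if (q->count == 0) return 0;                   /* empty: nothing to run */
    *c = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->count--;
    return 1;
}

int main(void)
{
    struct fifo to_stable;
    memset(&to_stable, 0, sizeof to_stable);

    /* init(): "calling" stable(1, S_INIT) just builds and queues a continuation. */
    struct continuation c = { .next = T_STABLE, .args.stable = { 1, 100.0f } };
    fifo_push(&to_stable, c);

    /* A scheduler pops continuations and feeds them to the right pipeline. */
    while (fifo_pop(&to_stable, &c))
        printf("dispatch thread to stable(d=%d, s=%f)\n",
               c.args.stable.d, c.args.stable.s);
    return 0;
}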
Convert Functions to Pipelines
void step(int t, float s)
{
  float ds=s+rand();
  step(t+1, ds);
}
[Diagram: step() as a pipeline. Inputs t and s enter; rand, add and inc units occupy stages across cycles -1 to 2; the updated t' and s' leave the pipeline as the arguments of the next call to step]
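As a rough software model of this transformation (my own illustration, not the compiler's output; the 3-stage depth is assumed), the pipeline holds one thread per stage: every cycle a new (t, s) pair enters, everything shifts down one stage, and the thread leaving the final stage has the body of step applied before being fed back as the next call.

#include <stdio.h>
#include <stdlib.h>

#define DEPTH 3                      /* assumed pipeline depth of step() */

struct thread { int t; float s; int valid; };

/* One clock cycle of the step() pipeline: everything shifts down one stage,
   a new thread enters stage 0, and the thread leaving the last stage has the
   function body applied: t' = t + 1, s' = s + rand(). */
static struct thread pipeline_cycle(struct thread stages[DEPTH],
                                    struct thread incoming)
{
    struct thread leaving = stages[DEPTH - 1];
    for (int i = DEPTH - 1; i > 0; i--)
        stages[i] = stages[i - 1];
    stages[0] = incoming;
    if (leaving.valid) {
        leaving.t += 1;                                  /* inc unit        */
        leaving.s += (float)rand() / (float)RAND_MAX;    /* rand + add unit */
    }
    return leaving;
}

int main(void)
{
    struct thread stages[DEPTH] = { {0, 0.0f, 0} };
    /* Feed a new thread every cycle: the pipeline works on DEPTH threads at once. */
    for (int cycle = 0; cycle < 10; cycle++) {
        struct thread in = { cycle, 0.0f, 1 };
        struct thread out = pipeline_cycle(stages, in);
        if (out.valid)
            printf("cycle %d: thread with t'=%d s'=%f leaves\n", cycle, out.t, out.s);
    }
    return 0;
}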
Nested Loops
void outer(...)
{
if(...)
outer(...);
else if(...)
inner(...);
else
acc();
}
void inner(...)
{
if(...)
inner(...);
else
outer(...);
}
[Diagram: function network. init feeds outer through a queue; outer and inner feed each other through queues; outer feeds acc]
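For readers who think in imperative code, the mutual recursion above is how Contessa expresses a loop nest; a rough C equivalent of my own (with concrete loop bounds standing in for the elided "..." tests) is:

#include <stdio.h>

/* outer() calling itself is the outer loop; inner() calling itself is the
   inner loop; inner() calling outer() ends one pass of the inner loop; and
   outer() calling acc() ends the thread with the accumulated result. */
int main(void)
{
    float result = 0.0f;
    for (int i = 0; i < 4; i++) {          /* outer(...) -> outer(...)       */
        for (int j = 0; j < 3; j++) {      /* outer -> inner, inner -> inner */
            result += (float)(i * j);      /* inner body                     */
        }                                  /* inner(...) -> outer(...)       */
    }
    printf("result = %f\n", result);       /* outer(...) -> acc()            */
    return 0;
}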
Replicating Bottleneck Functions
void init()
{
step(...);
}
void step(...)
{
if(...)
step(...);
else
acc(...);
}
void acc(...)
{
  ...
}
[Diagram: function network. init feeds two replicated copies of step through queues; both copies feed acc]
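A minimal sketch of my own (the dispatcher, thread counts and names are hypothetical) of why replication is cheap: because each continuation is self-contained, a dispatcher in front of the bottleneck function can deal threads out to either copy of step, for example round-robin, and the copies never need to coordinate.

#include <stdio.h>

#define COPIES 2    /* step() is the bottleneck, so instantiate it twice */

/* A thread's state on entry to step(): just its arguments. */
struct continuation { int iter; float value; };

/* Each copy of step() is an independent pipeline; here we only count how
   many threads each copy has been given. */
static int work_done[COPIES];

static void step_copy(int copy, struct continuation c)
{
    (void)c;                 /* a real copy would run step()'s body here */
    work_done[copy]++;
}

int main(void)
{
    /* Dispatcher: deal threads out to the copies round-robin. Since each
       continuation is self-contained, any copy can take any thread. */
    for (int thread = 0; thread < 10; thread++) {
        struct continuation c = { 0, (float)thread };
        step_copy(thread % COPIES, c);
    }
    for (int copy = 0; copy < COPIES; copy++)
        printf("step copy %d handled %d threads\n", copy, work_done[copy]);
    return 0;
}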
Contessa: User Experience
• True push-button compilation
– Hardware usage is transparent to user
– High-level source code through to hardware
• Progressive optimisation: speed vs startup delay
– Interpreted: immediate
– Software: 1-10 seconds
– Hardware: 10-15 minutes
– No alterations to source code
• Speedup of 15x to 60x over software
– Greater speedup in more computationally intensive apps.
• Power reduction of 75x to 300x over software
– i.e. the 15x-60x speedup at roughly a fifth of the power; 300MHz FPGA vs 3GHz CPU
Contessa: Future Work
• Scaling across multiple FPGAs
– Easy to move thread-states over high-speed serial links
– Automatic load-balancing
• Move threads between FPGA and CPU/GPU
– Some functions are infrequently used: place in CPU
– Threads move seamlessly back and forth
• Automatic optimisation of function network
– Replication of bottleneck functions
– Place lightly loaded functions in slower clock domains
• Allow more general computation
– Fork/join semantics
– Dynamic data structures
Conclusion
• Goal: hardware acceleration of applications
– Increase performance, reduce power
– Make hardware acceleration more widely available
• Achievements: accelerated finance on FPGAs
– Three year EPSRC project: 25 papers (so far)
– Speedups of 100x over quad CPU, using less power
– Domain specific language for financial Monte-Carlo
• Future: ease-of-use and generality
– Target more platforms, hybrids: CPU+FPGA+GPU
– DSLs for other domains: bioinformatics, neural nets