A Case for FAME: FPGA Architecture Model Execution

Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanovic, David Patterson
The Parallel Computing Lab, UC Berkeley
ISCA ’10
A Brief History of Time
• Hardware prototyping initially popular for architects
• Prototyping each point in a design space is expensive
• Simulators became the popular, cost-effective alternative
• Software Architecture Model Execution (SAME) simulators most popular
• SAME performance scaled with uniprocessor performance scaling
The Multicore Revolution
• Abrupt change to multicore architectures
• HW, SW systems larger, more complex
• Timing-dependent nondeterminism
• Dynamic code generation
• Automatic tuning of app kernels
• We need more simulation cycles than ever
The Multicore Simulation Gap
• As the number of cores increases exponentially, time to model a target cycle increases accordingly
• SAME is difficult to parallelize because of cycle-by-cycle interactions
• Relaxed simulation synchronization may not work
• Must bridge the simulation gap
One Decade of SAME
            Median Instructions      Median    Median Instructions
            Simulated/Benchmark      #Cores    Simulated/Core
ISCA 1998   267M                     1         267M
ISCA 2008   825M                     16        100M

• The effect is dramatically shorter (~10 ms) simulation runs
FAME: FPGA Architecture Model Execution
• The SAME approach provides inadequate simulation throughput and latency
• Need a fundamentally new strategy to maximize useful experiments per day
• Want the flexibility of SAME and the performance of hardware
• Ours: FPGA Architecture Model Execution (FAME)
• (cf. SAME, Software Architecture Model Execution)
• Why FPGAs?
• FPGA capacity scaling with Moore’s Law
• Now can fit a few cores on die
• Highly concurrent programming model with cheap synchronization
Non-FAME: FPGA Computers
• FPGA Computers: using FPGAs to build a production computer
• RAMP Blue (UCB 2006)
• 1008 MicroBlaze cores
• No MMU, message passing only
• Requires lots of hardware
  – 21 BEE2 boards (full rack) / 84 FPGAs
• RTL directly mapped to FPGA
• Time-consuming to modify
• Cool, useful, but not a flexible simulator
FAME: System Simulators in FPGAs
[Figure: two different target machines, Target System A (a few cores with private I$/D$ caches and DRAM) and Target System B (more cores with private I$/D$ caches, L2$ banks, a shared L2$/interconnect, and DRAM), are both modeled by the same host system, the FAME simulator.]
A Vast FAME Design Space
• FAME design space even larger than SAME’s
• Three dimensions of FAME simulators
• Direct or Decoupled: does one host cycle model one target cycle?
• Full RTL or Abstract RTL?
• Host Single-threaded or Host Multi-threaded?
• See paper for a FAME taxonomy!
FAME Dimension 1: Direct vs. Decoupled
• Direct FAME: compile target RTL to FPGA
• Problem: common ASIC structures map poorly to FPGAs
• Solution: resource-efficient multi-cycle FPGA mapping
• Decoupled FAME: decouple host cycles from target cycles (see the sketch below)
• Full RTL still modeled, so timing accuracy still guaranteed
[Figure: a target system register file with four read ports (Rd1-Rd4) and two write ports (W1, W2) is emulated by a decoupled host implementation: a register file with two read ports and one write port, plus an FSM that spreads each target cycle across multiple host cycles.]
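To make the decoupled mapping concrete, here is a minimal SystemVerilog sketch of the idea in the figure; the module and signal names are ours, not taken from RAMP Gold. A target register file with four read ports and two write ports is emulated by a host register file that performs only two reads and one write per host cycle, so a small FSM spends two host cycles per target cycle.

// Hypothetical sketch of a decoupled host register file: the target machine's
// 4-read/2-write register file is modeled with only 2 reads and 1 write per
// host cycle, so one target cycle takes two host cycles.
module decoupled_regfile #(parameter NREGS = 32, W = 32) (
  input  logic          clk, rst,
  // Target-cycle interface (R1-R4, W1-W2, Rd1-Rd4 in the figure)
  input  logic [4:0]    raddr [4],
  input  logic [4:0]    waddr [2],
  input  logic [W-1:0]  wdata [2],
  input  logic          wen   [2],
  output logic [W-1:0]  rdata [4],
  output logic          target_cycle_done  // high when a full target cycle has been modeled
);
  logic [W-1:0] mem [NREGS];
  logic         phase;              // FSM: which half of the target cycle we are in

  always_ff @(posedge clk) begin
    if (rst) phase <= 1'b0;
    else     phase <= ~phase;
  end

  // Each host cycle performs two of the four target reads and one of the two
  // target writes; the FSM phase selects which half is serviced.
  always_ff @(posedge clk) begin
    rdata[{phase, 1'b0}] <= mem[raddr[{phase, 1'b0}]];   // Rd1 (phase 0) or Rd3 (phase 1)
    rdata[{phase, 1'b1}] <= mem[raddr[{phase, 1'b1}]];   // Rd2 (phase 0) or Rd4 (phase 1)
    if (wen[phase]) mem[waddr[phase]] <= wdata[phase];   // W1 (phase 0) or W2 (phase 1)
  end

  assign target_cycle_done = phase;  // target time advances every other host cycle
endmodule

On a real FPGA host the storage would typically map to duplicated block RAM or LUT RAM; the point of the sketch is only that host cycles are decoupled from target cycles while the target's full RTL semantics are preserved.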
FAME Dimension 2: Full RTL vs. Abstract RTL
• Decoupled FAME models the full RTL of the target machine
• Don’t have full RTL in the initial design phase
• Full RTL is too much work for design space exploration
• Abstract FAME: model the target RTL at a high level
• For example, split timing and functional models (à la SAME)
• Also enables runtime parameterization: run different simulations without re-synthesizing the design (see the sketch below)
• Advantages of Abstract FAME come at a cost: model verification
• Timing of the abstract model is not guaranteed to match the target machine
[Figure: the target RTL is abstracted into a split functional model and timing model.]
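As one illustration of runtime parameterization in an Abstract FAME split, here is a minimal, hypothetical SystemVerilog sketch (module and signal names are ours): the timing model charges each memory access a latency taken from a runtime-writable register, so a different target memory system can be evaluated without re-synthesizing the FPGA design.

// Hypothetical abstract timing model for memory accesses: the functional model
// reports that an access happened; this module charges cfg_latency target
// cycles for it. cfg_latency is a runtime-writable register, so target memory
// latency can be changed without re-synthesizing the design.
module abstract_mem_timing #(parameter CNT_W = 16) (
  input  logic             clk, rst,
  input  logic             access_valid,   // functional model issued a memory access
  input  logic [CNT_W-1:0] cfg_latency,    // runtime parameter: target cycles per access
  output logic             stall           // hold the functional model while time is charged
);
  logic [CNT_W-1:0] remaining;

  always_ff @(posedge clk) begin
    if (rst)
      remaining <= '0;
    else if (access_valid && remaining == '0)
      remaining <= cfg_latency;            // start charging the configured latency
    else if (remaining != '0)
      remaining <= remaining - 1'b1;       // one target cycle of delay per step
  end

  assign stall = (remaining != '0);        // timing model decides when the functional model may advance
endmodule

Because this timing model is abstract rather than the target's actual RTL, its predictions must still be validated against the intended target machine, which is the verification cost noted above.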
FAME Dimension 3: Single- or Multi-threaded Host
• Problem: can’t fit a big manycore on an FPGA, even abstracted
• Problem: long host latencies reduce utilization
• Solution: host-multithreading (see the sketch below)
[Figure: a target model of four CPUs is mapped onto a multithreaded emulation engine on the FPGA: a single hardware pipeline (I$, IR, register read, execute, D$) with multiple copies of per-CPU state such as PCs and GPRs.]
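Below is a minimal, hypothetical SystemVerilog sketch of host multithreading (module and signal names are ours): a single physical fetch stage serves many target cores, each core's PC is a separate copy of state, and a round-robin thread counter selects which target core uses the pipeline on each host cycle.

// Hypothetical host-multithreaded fetch stage: NTHREADS target cores share one
// hardware pipeline; only the per-core state (here, the PC) is replicated.
module mt_fetch #(parameter NTHREADS = 64, XLEN = 32) (
  input  logic                        clk, rst,
  output logic [$clog2(NTHREADS)-1:0] tid,      // target core issued this host cycle
  output logic [XLEN-1:0]             fetch_pc  // PC presented to the shared I$ port
);
  logic [XLEN-1:0] pc [NTHREADS];               // one PC per target core

  always_ff @(posedge clk) begin
    if (rst) begin
      tid <= '0;
      for (int i = 0; i < NTHREADS; i++) pc[i] <= '0;
    end else begin
      tid     <= tid + 1'b1;                    // round-robin over target cores
      pc[tid] <= pc[tid] + 4;                   // sequential fetch; branches/redirects omitted
    end
  end

  assign fetch_pc = pc[tid];
endmodule

Interleaving target cores this way also helps hide long host latencies, which is the utilization problem named above.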
Metrics besides Cycles: Power, Area, Cycle Time
• FAME simulators determine how many cycles a program takes to run
• Computing Power/Area/Cycle Time: SAME old story
• Push target RTL through a VLSI flow
• Analytical or empirical models
• Collecting event stats for model inputs is much faster than with SAME (see the sketch below)
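As a sketch of what collecting event stats can look like on a FAME host (a hypothetical module, not RAMP Gold's actual counter hardware): a hardware counter tallies an architectural event such as cache misses at full simulation speed, and the total is read out afterwards as an input to an analytical or empirical power/area model.

// Hypothetical per-core event counter: counts a modeled event (e.g. cache
// misses) as the simulation runs; software reads the total afterwards and
// feeds it into an analytical power or energy model.
module event_counter #(parameter W = 48) (
  input  logic         clk, rst,
  input  logic         event_fire,   // pulses when the modeled event occurs
  input  logic         clear,        // software-triggered reset of the count
  output logic [W-1:0] count         // exposed to the host over a control interface
);
  always_ff @(posedge clk) begin
    if (rst || clear)    count <= '0;
    else if (event_fire) count <= count + 1'b1;
  end
endmodule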
RAMP Gold: A Multithreaded FAME Simulator
• Rapid, accurate simulation of manycore architectural ideas using FPGAs
• Initial version models 64 cores of SPARC v8 with a shared memory system on a $750 board
• Hardware FPU, MMU, boots OS

                   Cost            Performance (MIPS)   Simulations per day
Simics (SAME)      $2,000          0.1 - 1              1
RAMP Gold (FAME)   $2,000 + $750   50 - 100             100
RAMP Gold Target Machine
[Figure: the target machine is 64 SPARC V8 cores, each with private I$ and D$, connected by a shared L2$/interconnect to DRAM.]
RAMP Gold Model
[Figure: the 64-core target (cores with private I$/D$, shared L2$/interconnect, DRAM) maps onto a timing model pipeline holding timing state and a functional model pipeline holding architectural state.]
• SPARC V8 ISA
• One-socket manycore target
• Split functional/timing model, both in hardware
  – Functional model: executes the ISA
  – Timing model: captures pipeline timing detail
• Host multithreading of both functional and timing models
• Functional-first, timing-directed
• Built for Xilinx Virtex-5 systems
[RAMP Gold, DAC ’10]
Case Study: Manycore OS Resource Allocation
• Spatial resource allocation in a manycore system is hard
• Combinatorial explosion in the number of apps and number of resources
• Idea: use predictive models of app performance to make it easier on the OS
• HW partitioning for performance isolation (so models still work when apps run together)
• Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
• Simulation-bound: 8.3 CPU-years for Simics!
• See paper for app modeling strategy details
Case Study: Manycore OS Resource Allocation
[Figure: normalized runtime (0 to 4) of the worst, chosen, and best schedules for the Synthetic Only and PARSEC Small workloads.]
Case Study: Manycore OS Resource Allocation
[Figure: normalized runtime (0 to 4) of the worst, chosen, and best schedules for the Synthetic Only, PARSEC Small, and PARSEC Large workloads.]
• The technique appears to perform very well for synthetic or reduced-input workloads, but is lackluster in reality!
RAMP Gold Performance
• FAME (RAMP Gold) vs. SAME (Simics) performance
• PARSEC parallel benchmarks, large input sets
• >250x faster than a full-system simulator for a 64-core target system
[Figure: geometric-mean speedup of RAMP Gold over Simics (0 to 300) versus the number of target cores (4, 8, 16, 32, 64), for three Simics configurations: functional only, functional + cache/memory (g-cache), and functional + cache/memory + coherency (GEMS).]
Researcher Productivity is Inversely Proportional to Latency
• Simulation latency is even more important than throughput
• How long before the experimenter gets feedback?
• How many experimenter-days are wasted if there was an error in the experimental setup?

       Median Latency (days)   Maximum Latency (days)
FAME   0.04                    0.12
SAME   7.50                    33.20
Fallacy: FAME is too hard
• FAME simulators more complex, but not greatly so
• Efficient, complete SAME simulators also quite complex
• Most experiments only need to change the timing model
• RAMP Gold’s timing model is only 1000 lines of SystemVerilog
• Modeled Globally Synchronized Frames [Lee08] in 3 hours & 100 LOC
• Corollary fallacy: architects don’t need to write RTL
• We design hardware; we shouldn’t be scared of HDL
Fallacy: FAME Costs Too Much
• Running SAME in the cloud (EC2) is much more expensive!
• FAME: 5 XUP boards at $750 ea.; $0.10 per kWh
• SAME: EC2 Medium-High instances at $0.17 per hour

       Runtime (hours)   Cost for first experiment   Cost for next experiment   Carbon offset (trees)
FAME   257               $3,750                      $10                        0.1
SAME   73,000            $12,500                     $12,500                    55.0

• Are architects good stewards of the environment?
• SAME uses the energy of 45 seconds of the Gulf oil spill!
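A quick arithmetic check on the table, using only the rates listed on this slide: 73,000 hours at $0.17 per instance-hour is roughly $12,400, matching the recurring SAME cost per experiment, while the FAME $3,750 is the one-time purchase of 5 boards at $750 each, so a repeat FAME experiment costs only the ~$10 of electricity. The runtime ratio, 73,000 / 257 ≈ 284, is also consistent with the >250x speedup reported earlier.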
Fallacy: statistical sampling will save us
• Sampling may not make sense for multiprocessors
• Timing is now architecturally visible
• May be OK for transactional workloads
• Even if sampling is appropriate, runtime dominated by functional warming => still need FAME
• FAME simulator ProtoFlex (CMU) originally designed for this purpose
• Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless
Challenge: Simulator Debug Loop can be Longer
• Takes 2 hours to push RAMP Gold through the CAD tools
• Software RTL simulation to debug the simulator is also very slow
• SAME debug loop only minutes long
• But sheer speed of FAME eases some tasks
• Try debugging and porting a complex parallel program in SAME
Challenge: FPGA CAD Tools
• Compared to ASIC tools, FPGA tools are immature
• Encountered 84 formally-tracked bugs developing RAMP Gold
• Including several in the formal verification tools!!
• By far FAME’s biggest barrier
• (Help us, industry!)
• On the bright side, the more people using FAME, the better
When should Architects still use SAME?
• SAME still appropriate in some situations
• Pure functional simulation
• ISA design
• Uniprocessor pipeline design
• FAME necessary for manycore research with modern applications
Conclusions
• FAME uses FPGAs to build simulators, not computers
• FAME works, it’s fast, and we’re using it
• SAME doesn’t cut it, so use FAME!
• Thanks to the entire RAMP community for contributions to the FAME methodology
• Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support
• RAMP Gold source code is available: http://ramp.eecs.berkeley.edu/gold