PPT - RAMP

Usability Challenges for RAMP2
Eric Chung
James C. Hoe
Computer Architecture Lab at Carnegie Mellon
1
ProtoFlex in a nutshell
• FPGA-accelerated full-system simulation by virtualization
1 – Hybrid Full-System Simulation
2 – Multiprocessor Host Interleaving
[Figure: (1) CPU and memory labeled as common-case behaviors, devices labeled as uncommon behaviors; (2) target processors (P) shown alongside 4-way P blocks with memory]
2
In the beginning . . .
3
Then came . . .
Screenshot callouts:
• Simulated target console window
• Host command-line for breakpoints, introspection, modification
• Inspect/modify registers
• Create checkpoint
• Undo the last instruction!
4
Now “they” want . . .
Execute ‘exc_callback’ every time a CPU hits an exception
Print out exception name and triggering PC
• Using SW simulator, takes 5 lines of Python
– Body of callback runs arbitrary instrumentation code
How would you do this in FPGAs?
5
What else could “they” want?
• Interaction with virtual display of target system
• Fully deterministic and controllable execution
• Command-line control and scripting capabilities
• API for state inspection/modification
• Modularity features for adding/changing components
• Checkpoint save/restore
• Host-target communication (e.g., for bootstrapping)
• Full-system I/O capabilities (e.g., OS)
• Target resource virtualization
6
Outline
• Introduction
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
7
Practical Feature Development
• Porting simulator features to FPGAs is not easy
– RTL modification is almost always required
– Unlike SW, state in an FPGA is not easy to inspect/modify
(but this is required in most cases)
• Goal: make feature porting easier!
– With minimum FPGA expertise
8
Example
Execute ‘exc_callback’ every time a CPU hits an exception
Print out exception name and triggering PC
• Using SW simulator, takes 5 lines of Python
– Body of callback runs arbitrary instrumentation code
How would we implement in FPGAs?
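A minimal sketch of the software-simulator side of this example, assuming a toy simulator rather than the Simics API. `ToySimulator`, `on_exception`, and `raise_exception` are hypothetical names used only for illustration; the slide's actual five lines of Python are not reproduced here.

```python
from typing import Callable, List

# Hypothetical toy simulator; this is NOT the Simics API.
ExceptionCallback = Callable[[int, str, int], None]


class ToySimulator:
    """Keeps a list of callbacks and invokes them on every exception."""

    def __init__(self) -> None:
        self._callbacks: List[ExceptionCallback] = []

    def on_exception(self, callback: ExceptionCallback) -> None:
        self._callbacks.append(callback)

    def raise_exception(self, cpu: int, name: str, pc: int) -> None:
        # In a real simulator the CPU model would trigger this internally.
        for callback in self._callbacks:
            callback(cpu, name, pc)


log: List[str] = []


def exc_callback(cpu: int, name: str, pc: int) -> None:
    # The body is free to run arbitrary instrumentation code.
    line = f"cpu{cpu}: {name} at PC=0x{pc:x}"
    log.append(line)
    print(line)


sim = ToySimulator()
sim.on_exception(exc_callback)
sim.raise_exception(0, "data_access_exception", 0x10004000)
```

The point of the sketch is how little the user writes in software: one registration call and a callback body.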
9
How to implement in FPGA?
• Necessary steps
– Modify RTL of FPGA soft core to monitor exceptions
(add bits to pipeline stages, modify decoder)
– Collect PC register during exceptions into trace buffer
– Simulate, debug, synthesize, place + route
– Collect/compress traces from multiple CPU cores
(possibly across multiple FPGAs)
– Decompress/post-process traces and print
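The host-side half of these steps (collect, compress, decompress, print) can be sketched as follows. The record layout, the exception-code table, and the use of zlib are all assumptions made for illustration; the slides do not specify a trace format or a compression scheme.

```python
import struct
import zlib
from typing import Dict, List, Tuple

# Assumed record layout (hypothetical): CPU id (1 byte),
# exception code (1 byte), triggering PC (8 bytes), big-endian.
RECORD = struct.Struct(">BBQ")
Record = Tuple[int, int, int]

# Illustrative code-to-name table; not taken from the slides.
EXC_NAMES: Dict[int, str] = {0x30: "data_access_exception", 0x10: "illegal_instruction"}


def pack_trace(records: List[Record]) -> bytes:
    """Serialize and compress records, as the collection side might."""
    raw = b"".join(RECORD.pack(*r) for r in records)
    return zlib.compress(raw)


def unpack_trace(blob: bytes) -> List[Record]:
    """Decompress and decode fixed-size records on the host."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, off) for off in range(0, len(raw), RECORD.size)]


def format_trace(records: List[Record]) -> List[str]:
    """Post-process records into printable lines."""
    return [
        f"cpu{cpu}: {EXC_NAMES.get(code, 'unknown')} at PC=0x{pc:x}"
        for cpu, code, pc in records
    ]


trace = [(0, 0x30, 0x10004000), (3, 0x10, 0x7F001234)]
lines = format_trace(unpack_trace(pack_trace(trace)))
for line in lines:
    print(line)
```

Everything before this stage (RTL changes, trace buffers, synthesis, place and route) still has to happen in hardware, which is where most of the effort goes.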
Can we reduce effort for
RAMP developers?
10
Justifying the hardware
• For some efforts, RTL change unavoidable
– Ex: redesign memory subsystem, change # cores
• But for other things, can we do better?
– Instrumentation example from earlier?
(print the PC during exceptions)
– Testing a new instruction?
– Inspecting a few CPU registers?
11
Observation
• Only frequent uses of a given hardware
modification benefit from FPGA speedup
• Can we relegate infrequent events to software?
• Examples
– Instrumenting rare events (e.g. exceptions)
– Monitoring/analyzing subset of instruction traces
– Periodic sampling of counters
– Monitor range of ‘watched’ memory addresses
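A sketch of the fast-path/slow-path split for the watched-address example. The cheap range compare stands in for what would stay in hardware, and the handler stands in for software that runs only on the rare hits. The address range and function names are hypothetical.

```python
from typing import Callable, List, Tuple

WATCH_LO, WATCH_HI = 0x1000, 0x1FFF  # hypothetical watched range (inclusive)


def run_accesses(addresses: List[int], slow_path: Callable[[int, int], None]) -> int:
    """Check every access on the fast path; call software only on watched hits.

    Returns the number of slow-path invocations.
    """
    slow_calls = 0
    for index, addr in enumerate(addresses):
        # Fast path: a cheap range compare, standing in for hardware logic.
        if WATCH_LO <= addr <= WATCH_HI:
            slow_path(index, addr)  # infrequent: arbitrary software handler
            slow_calls += 1
    return slow_calls


hits: List[Tuple[int, int]] = []
accesses = [0x0040, 0x1008, 0x8000, 0x2000, 0x1FFF]
slow_calls = run_accesses(accesses, lambda i, a: hits.append((i, a)))
print(slow_calls, [(i, hex(a)) for i, a in hits])
```

If the slow path fires on only a small fraction of accesses, its cost barely affects overall simulation speed, which is the observation the slide makes.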
12
Outline
• Usability Challenges
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
13
Case Study: ProtoFlex Monitoring
• Our objective:
– Diagnose an ‘anomaly’ while running commercial
apps in BlueSPARC simulator*
• Requirements:
– At runtime, get names of processes running on CPUs
– Extract/verify user- and kernel-level stack traces
– WITHOUT modifications to target workload or OS
*BlueSPARC is our 16-CPU full-system FPGA-based simulator
14
Technique Used: Whitebox Tool
• Whitebox Profiling
– Input: real-time traces from full-system simulation
– Output: human-readable stack traces and visualization
– Simics tool authored by Mike Ferdman & Brian Gold
– Fewer than 300 lines of Simics Python
[Figure: chart titled “DB2 2-CPU-1CL (CPU 0 - Server)” over 0 to 2B cycles. Legend: unknown, tpcc, sh, sched, nscd, fsflush, db2sysc, db2set, db2fmp, db2fmd, db2fmcd, db2fm, db2bp, automountd. Other label: % user.]
15
Supporting Whitebox in ProtoFlex
• Basic whitebox technique:
– Simulation runtime is periodically halted
– Various registers are first checked
– Virtual-to-physical translations used to locate key data structures
– Physical memory reads are used to extract kernel state
• Naïve solution
– Add state machine to FPGA soft core to perform the steps
– Works but inflexible; may require significant HW changes
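The basic whitebox steps above (check a register, translate, read physical memory) can be sketched against a toy target. `ToyTarget`, the `thread_ptr` register, the page size, and the 16-byte name field are all assumptions for illustration; real kernel data structures are considerably more involved.

```python
from typing import Dict

PAGE_SIZE = 0x2000  # assumed 8 KB pages for the toy model


class ToyTarget:
    """Hypothetical stand-in for a halted simulated CPU plus memory."""

    def __init__(self, registers: Dict[str, int], page_table: Dict[int, int], memory: bytes) -> None:
        self.registers = registers
        self.page_table = page_table  # virtual page number -> physical page number
        self.memory = memory          # flat physical memory image

    def read_register(self, name: str) -> int:
        return self.registers[name]

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.page_table[vpn] * PAGE_SIZE + offset

    def read_memory(self, paddr: int, length: int) -> bytes:
        return self.memory[paddr:paddr + length]


def current_process_name(target: ToyTarget) -> str:
    # 1) Check a register assumed to point at the current thread structure.
    thread_va = target.read_register("thread_ptr")
    # 2) Translate the virtual address to a physical one.
    thread_pa = target.translate(thread_va)
    # 3) Read physical memory; a NUL-padded 16-byte name at offset 0 is assumed.
    raw = target.read_memory(thread_pa, 16)
    return raw.split(b"\0", 1)[0].decode("ascii")


image = bytearray(2 * PAGE_SIZE)
image[0x2100:0x2100 + 7] = b"db2sysc"
target = ToyTarget({"thread_ptr": 5 * PAGE_SIZE + 0x100}, {5: 1}, bytes(image))
print(current_process_name(target))
```

Only three primitives are needed here, and none of them modifies the target workload or OS.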
Is there an easier way?
16
Solution: Hybrid Simulation for Monitoring
• ProtoFlex hybrid simulation
– Recall in ProtoFlex: CPU pipeline implements only subset of
instructions; nearby hard core simulates ISA remainder
[Figure: Virtex II Pro 70 with a 16-way pipeline and a PowerPC hard core on a processor bus, an interface to the 2nd FPGA (memory), and Ethernet]
PowerPC simulates unimplemented SPARC instructions. Operation is called a ‘Transplant’.
Transplants can be used for monitoring!
17
Transplants for Monitoring
1) ‘Simulation’ engine periodically requests transplant to PowerPC
2) PowerPC performs ‘inspection’ by requesting register/memory state from engine
[Figure: Virtex II Pro 70 with a 16-way pipeline and a PowerPC hard core on a processor bus, an interface to the 2nd FPGA (memory), and Ethernet]
Inspection/monitoring code written in C language
transplant() {
…
read_register(…)
translate(…)
read_memory(…)
…
}
18
Tradeoffs
• Advantages
– SW approach to flexibly monitor events of interest
– For rare events, performance impact negligible
– Validate instrumentation idea before building in HW
• Disadvantages
– If events occur too frequently, must accelerate in HW
– How to know which HW interfaces to provide?
– How to know which events to monitor?
– How to scale to multiple engines?
19
Designing the HW/SW Interface
• In our design, we needed new interfaces between
ProtoFlex engine & PowerPC
– Engine can issue memory requests on behalf of PowerPC
– Engine can issue TLB translations on behalf of PowerPC
→ Still required HW modification!
• For general-purpose monitoring, what interfaces are needed?
20
Designing the HW/SW Interface
• Existing simulators are a good place to look
– E.g., Simics provides a library of over 100 API calls
used for inspection/modification/monitoring
• Example API methods:
– read_register(), write_register(), translate(), etc.
• Also over 50 unique ‘event’ types in Simics:
– Used to trigger monitoring callback functions
– Ex: exceptions, watched memory locations, etc.
Build these APIs into RAMP?
21
Addressing Scalability Challenges
• In BlueSPARCv1.0, only 1 centrally-located core
– How to scale monitoring up to tens or hundreds of cores?
• Can we disable ½ of the host cores?
– To monitor the other half
– And provide general-purpose instrumentation / monitoring?
– Compile Simics API calls into distributed kernels that run on
cores in monitoring mode
[Figure: array of eight CPUs]
22
Outline
• Usability Challenges
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
23
Closing Thoughts
• Attention to user and developer usability is critical for
practical RAMP adoption
– Goal: minimize FPGA expertise required when possible
• For users, provide familiar SW-simulation interface
• For developers, provide general-purpose monitoring that
is programmable, comprehensive, and scalable
24
Ongoing Work at CMU
• BlueSPARC simulator (ProtoFlex)
– Currently supports subset of Simics user interface
– Supports general-purpose software programmable monitoring
– Virtual console/GFX supported via hybrid simulation
• Still many challenges left
– Not all Simics commands map easily to FPGA
– Execution is non-deterministic
– Checkpoint generation/loading works (but very slow)
– No ‘Undo-ing’ instructions
– Fine-grained ‘stepping’ for large-scale configurations
– Minor monitoring changes still require re-synthesis
Release planned for 2009
25
Thanks! Any questions?
echung@ece.cmu.edu
http://www.ece.cmu.edu/~protoflex
COME SEE
OUR DEMO!
Acknowledgements
We would like to thank our colleagues in
the RAMP and TRUSS projects.
26
BACKUP
27
Typical Simulator ‘Must-Haves’
• Features commonly available in simulators today:
– Interaction with virtual display of target system
– Fully deterministic and controllable execution
– Command-line control and scripting capabilities
– API for state inspection/modification
– Modularity features for adding/changing components
– Checkpoint save/restore
– Host-target communication (e.g., for bootstrapping)
– Full-system I/O capabilities (e.g., OS)
– Target resource virtualization
28
Software Usage Example
Screenshot callouts:
• Simulated target console window
• Host command-line for breakpoints, introspection, modification
• Inspect/modify registers
• Create checkpoint
• Undo the last instruction!
29
Bringing SW features to RAMP
• Can users with no knowledge of FPGAs use
RAMP out-of-the-box?
• The litmus test
– A user working through a simulator front-end cannot tell
whether the back-end is FPGAs or software
30
Closing Thought: Unification
• Common UI to ‘unify’ simulators and FPGAs
– Ex: use ‘Simics’ front-end, back-end is either FPGAs
or SW (ProtoFlex has limited form of this)
– Avoid reinventing API/interface; users already familiar
• Benefits of interoperability:
– Gentle transition of whole generation of full-system
simulation users to RAMP
– Support legacy scripts, workloads, configurations
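One way to picture the unification idea is a single interface with interchangeable back ends. The sketch below is an assumption-laden illustration: `Backend`, `SoftwareBackend`, `FakeFpgaBackend`, and the `RD`/`WR` command strings are hypothetical and do not describe the ProtoFlex or Simics implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class Backend(ABC):
    """Common interface that scripts are written against."""

    @abstractmethod
    def read_register(self, cpu: int, name: str) -> int: ...

    @abstractmethod
    def write_register(self, cpu: int, name: str, value: int) -> None: ...


class SoftwareBackend(Backend):
    """Plain in-memory register file, standing in for a SW simulator."""

    def __init__(self) -> None:
        self._regs: Dict[Tuple[int, str], int] = {}

    def read_register(self, cpu: int, name: str) -> int:
        return self._regs.get((cpu, name), 0)

    def write_register(self, cpu: int, name: str, value: int) -> None:
        self._regs[(cpu, name)] = value


class FakeFpgaBackend(Backend):
    """Stands in for an FPGA link; records the commands it would send."""

    def __init__(self) -> None:
        self._regs: Dict[Tuple[int, str], int] = {}
        self.commands: List[str] = []

    def read_register(self, cpu: int, name: str) -> int:
        self.commands.append(f"RD {cpu} {name}")
        return self._regs.get((cpu, name), 0)

    def write_register(self, cpu: int, name: str, value: int) -> None:
        self.commands.append(f"WR {cpu} {name} {value}")
        self._regs[(cpu, name)] = value


def bump_pc(backend: Backend, cpu: int) -> int:
    """A 'legacy script' written only against the common interface."""
    pc = backend.read_register(cpu, "pc")
    backend.write_register(cpu, "pc", pc + 4)
    return backend.read_register(cpu, "pc")


fpga = FakeFpgaBackend()
print(bump_pc(SoftwareBackend(), 0), bump_pc(fpga, 1), fpga.commands)
```

The script `bump_pc` runs unchanged on either back end, which is the property that would let existing scripts and configurations carry over.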
31
Other Simulation Features
• How to provide full-system checkpoints?
– Must save/restore CPU/memory/device states
– But can’t just quickly dump/load 64GB of memory!
• Supporting ‘pause’ and ‘rewind’ in HW
• Deterministic/controllable execution
• ‘Instantaneous’ inspection/modification of
distributed CPU/Mem/Device state
32
Case Study: ProtoFlex WhiteBox
• Our goals
– Profile IBM DB2/TPCC
– Identify which processes are executing on each CPU at
fine-grained intervals (1000s of instructions)
• Technique
– Periodically suspend simulator then access kernel
data structures (in known physical memory locations)
– Extract process information from kernel
33
Tools for Visualization/Monitoring
• How to build tools that can make sense out of
the behavior of 1000 concurrent threads?
• Dataflow visualization
– E.g., Data flow tomography [Sherwood08]
• Performance monitoring
– E.g., Estimate multi-core cache miss rates
• Black-box program profiling
– E.g., Invisible kernel introspection
34
How to instrument in a practical way?
• E.g., adding new counter to CPU
• Can we have our cake and eat it too?
– SW-like programming abstraction
– Without resynthesizing, while keeping FPGA speeds
35
Example 1: Real-time Cache Models
• Generate cache model performance in real time
• Applications:
– Generating cache state checkpoints
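A minimal sketch of a cache model fed by an address stream, assuming a direct-mapped organization with made-up parameters. The slides do not say which cache organization is modeled, so `DirectMappedCache` and its sizes are illustrative only.

```python
from typing import List, Optional


class DirectMappedCache:
    """Toy direct-mapped cache model that tracks tags and hit/miss counts."""

    def __init__(self, num_lines: int = 4, line_bytes: int = 64) -> None:
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags: List[Optional[int]] = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Update the model for one access; return True on a hit."""
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill the line on a miss
        self.misses += 1
        return False

    def miss_rate(self) -> float:
        total = self.hits + self.misses
        return self.misses / total if total else 0.0

    def checkpoint(self) -> List[Optional[int]]:
        """Snapshot of the tag array, i.e. the cache state at this point."""
        return list(self.tags)


cache = DirectMappedCache()
for addr in [0x000, 0x004, 0x040, 0x100, 0x000]:
    cache.access(addr)
print(cache.miss_rate(), cache.checkpoint())
```

Running the model alongside the simulation yields both a running miss rate and a tag-array snapshot that could seed a later detailed simulation.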
36
Example 2: Black-Box Profiling
• Used in ProtoFlex to profile black-box
commercial workloads (e.g., IBM DB2, Oracle)
37
Life-cycle of simulation (approximately)
[Figure: cycle with stages Great Idea, Design, Implement + Instrument, Simulate, Measure, Publish]
38
Back-of-the-envelope calculation
• Let’s calculate opportunity cost of HW-simulation
• Assumptions
– Only goal is to measure given metric (e.g., IPC)
– Don’t care about prototyping
• 12 hours to design, simulate, P&R
– 12 hours = 12 x 3600 s x 1 KIPS ≈ 43M instructions
– On a cluster of 100, can simulate ~4B instructions
using detailed timing models in 24 hours
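The arithmetic can be checked directly. The sketch assumes the 1 KIPS rate applies to each cluster node as well; the slide does not state the per-node rate for the cluster case, so treat that as an assumption.

```python
SECONDS_PER_HOUR = 3600
SIM_RATE_IPS = 1_000   # 1 KIPS per software simulator instance (from the slide)
BUILD_HOURS = 12       # time to design, simulate, place and route
CLUSTER_NODES = 100


def instructions(hours: int, nodes: int = 1, rate_ips: int = SIM_RATE_IPS) -> int:
    """Instructions simulated in `hours` on `nodes` machines at `rate_ips` each."""
    return hours * SECONDS_PER_HOUR * rate_ips * nodes


one_node = instructions(BUILD_HOURS)
cluster = instructions(BUILD_HOURS, CLUSTER_NODES)
print(f"{one_node / 1e6:.1f}M instructions on one node")
print(f"{cluster / 1e9:.2f}B instructions on {CLUSTER_NODES} nodes")
```

Under these assumptions, one node simulates about 43M instructions in 12 hours and 100 nodes simulate about 4.3B in the same window.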
40
The FPGA Usability Challenge
• Despite impressive proof-of-concepts, FPGAs
still not widely adopted in arch community
• FPGAs are not user-friendly
– Simulators easier to modify/use
• Lack of instant gratification slows productivity
– How long to build and run ‘Hello World’ on FPGA?
41
Usability of FPGAs
User Class | Usage Description | Required FPGA expertise
Casual User | Parallel programming on new architectures | Low/None?
Serious User | Use predefined target machine; want to tweak HW parameters; requires inspection/changes to system state | Low/Med?
Casual Developer | Large changes to architecture or components; monitoring tools to inspect low-level info | Med/High?
Serious Developer | Build new components or special-purpose processing elements from scratch | High?
Required expertise should be minimized when possible
42
The FPGA Usability Challenge
• Challenges for Users
– Typical simulation features missing or hard-to-build
– Low runtime visibility into FPGA HW
• Challenges for Developers
– Even mundane tasks require RTL design/debugging
– Long synthesis turnaround times (up to hours/days)
– Must learn new languages, (buggy) tools
How to improve usability with RAMP2?
43
Closing the Usability Gap
• Ideally want to provide:
– Fast ‘SW’ simulation interface for casual/serious users
– Fast programming abstractions for casual developers without re-synthesizing designs
• Without sacrificing (too much) FPGA performance
44
The Ultimate Productivity Killer
[Figure: FPGA build flow: RTL design capture (code for describing HW), then Synthesis, then Map/translate + Place & Route, then Bitstream generation + Download, illustrated with a processor pipeline diagram. Times shown: 5-45 min, 15-45 min, 5 min.]
Cost of a mistake or forgetting to add
something? Priceless.
45