FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency

advertisement
FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency
Estimator for NoC Modeling in Full-System Simulations
Michael K. Papamichael, James C. Hoe, Onur Mutlu
papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu
Computer Architecture Lab at
Our work was supported by NSF. We thank Xilinx
and Bluespec for their FPGA and tool donations.
5/3/2011
Simulation in Computer Architecture

Slow for large-scale multiprocessor studies
 Full-system fidelity + long benchmarks
How can we make it faster?

Speed, accuracy, flexibility trade-off
Full-system simulators sacrifice
accuracy for speed and flexibility

Accuracy
Speed
Flexibility
Accelerate simulation with FPGAs
Can simulate up to millions of gates
Orders of magnitude simulation speedup

CALCM Computer Architecture Lab at
Carnegie Mellon
2
The FIST Project

Explores fast NoC models for full-system simulations
FPGA-friendly, but avoid direct implementation
 Low error, many topologies, >10M packets/sec


Simpler requirements of full-system simulation

Estimate packet latencies, capture high-order effects
R
P
L2
L1
R
P
L2
L1
4x4
5x5
P
L2
L1
R
FPGA
(Xilinx LX110T)
R
P
L2
L1
FPGA area requirements
for state-of-the-art mesh NoC*
CALCM
*NoC RTL from http://nocs.stanford.edu/router.html
3
Computer Architecture Lab at Carnegie Mellon
FIST Approach
View NoC as set of routers/links
 Abstract router into black-box
 Represent by load-delay curves

R
R
R
R
R
Specific to each router configuration and traffic pattern
R
N
R
N
R
N
R
N
R
N
Latency

R
R
N
State
Logic
Buffers
R
Load
R
N
R
N
R
N
N
CALCM Computer Architecture Lab at
Carnegie Mellon
4
FIST Approach
Treat each hop as a set of load-delay curves


Trade-off between model complexity and fidelity
Keep track of load at each node

To track router load monitor traffic over window of time
R
N
R
N
R
N
R
N
R
N
R
N
Latency

R
N
R
N
Load
R
N
CALCM Computer Architecture Lab at
Carnegie Mellon
5
FIST in Action

Route packet from source to destination


Determine routers that will be traversed
Sum up the delays for each traversed router

Index load-delay curves using current load at each router
R
N
S
R
N
R
N
N
R
N
R
N
R
R
N
R
N
D
packet
delay
R
N
CALCM Computer Architecture Lab at
Carnegie Mellon
6
Outline
 Introduction to FIST
 FIST-based Network Models
 Evaluation
 Related Work & Conclusions
CALCM Computer Architecture Lab at
Carnegie Mellon
Outline
 Introduction to FIST
 FIST-based Network Models
 Evaluation
 Related Work & Conclusions
CALCM Computer Architecture Lab at
Carnegie Mellon
Putting FIST Into Context
Detailed network models



Cycle-accurate network simulators (e.g. BookSim)
Analytical network models
Typically study networks under synthetic traffic patterns
Updated Curves
Use Curves
Train Curves

Feedback

Network models within full-system simulators



Model network within a broader simulated system
Assign delay to each packet traversing the network
Traffic generated by real workloads
CALCM Computer Architecture Lab at
Carnegie Mellon
9
Offline and Online FIST

Offline FIST



Detailed network simulator generates curves offline
Can use synthetic or actual workload traffic
Load curves into FIST and run experiment
Detailed
Network Model

Online FIST (tolerates dynamic changes in network behavior)



Initialization of curves same as offline
Periodically run detailed network simulator on the side
Compare accuracy and, if necessary, update curves
Provide feedback and receive updated curves
Detailed
Network Model
CALCM Computer Architecture Lab at
Carnegie Mellon
10
Online Training in Action
Example with no initial training
50
45
40
35
30
25
20
15
10
5
0
…
After Training
…
Actual Latency
Estimated Latency
Before Training
…
0
66
132
198
264
330
396
462
1884
1950
2016
2082
2148
2214
2280
2346
2412
2478
Latency

El
Ellapsed
Cycles
1000s)
El
Elapsed
cycles
(in(in
1000s)
CALCM Computer Architecture Lab at
Carnegie Mellon
FIST Applicability

“FIST-Friendly” Networks



Exhibit stable, predictable behavior as load fluctuates
Actual traffic similar to training traffic
FIST Limitations


Depends on fidelity, representativeness of training models
Higher loads and large buffers can limit FIST’s accuracy




High network load  increased packet latency variance
Large buffers  increased range of observed packet latencies
Cannot capture fine-grain packet interactions
Cannot replace cycle-accurate detailed network models
FIST only as good as its training data
CALCM Computer Architecture Lab at
Carnegie Mellon
12
Applying FIST to NoCs
NoCs affected by on-chip limitations and scarce resources

Employ simple routing algorithms


Operate at low loads



Usually simple deterministic routing
NoCs usually over-provisioned to handle worst-case
Have been observed to operate at low injection rates
Small buffers


On-chip abundance of wires reduces buffering requirements
Amount of buffering in NoCs is limited or even eliminated
NoCs are “FIST-Friendly”
CALCM Computer Architecture Lab at
Carnegie Mellon
13
Outline
 Introduction to FIST
 FIST-based Network Models
 Evaluation
 Related Work & Conclusions
CALCM Computer Architecture Lab at
Carnegie Mellon
FIST Implementations

Software Implementation of FIST (written in C++)


Implements online and offline FIST models
Hardware Implementation (written in Bluespec)


Precisely replicates software-based FIST
Block diagram of architecture
Packet
Descriptors
Src
Dest
Size
Router Elements
Routing Logic
Pick
routers
Calculate
Delay
Load Tracker
Curve
BRAM
Handle
load tracking
& delay queries
CALCM Computer Architecture Lab at
Packet
Delays
Partial Delays
Carnegie Mellon
15
Peeking Under The Hood

Tracking Latency
T
7
6
5
4
3
2
S
H
D
R0

R2
Store-and-forward
0
R0
R1
R2

R1
H
5
2
3
4
5
6
10
7
15
20
25
30
T
H
2
3
4
5
6
7
T
H
2
3
4
5
6
7
T
Packet Latency = 30
R0 Latency = 9
R1 Latency = 14
R2 Latency = 7
Wormhole-routed
0
R0
R1
R2
H
5
2
3
4
10
15
5
6
H
2
3
7
20
30
T
4
H
Injection latency
25
5
6
2
3
7
T
4
5
T
Packet Latency = 30
R0 Latency =
R1 Latency =
R2 Latency =
?
Traversal latencies
Use separate injection and traversal latency curves per router

Similar issues arise for load tracking & dynamic training
CALCM Computer Architecture Lab at
Carnegie Mellon
Methodology

Examined online and offline FIST models


Network and system configuration




Consists of control, data and coherence packets
Offline and Online FIST models with two curves per router



26 SPEC CPU2006 benchmarks of varying network intensity
8 SPLASH-2 and 2 PARSEC workloads
Traffic generated by cache misses


4x4, 8x8, 16x16 wormhole-routed mesh
Each network node hosts core+coherent L1 and a slice of L2
Multiprogrammed and multithreaded workloads


Replaced cycle-accurate NoC model in tiled CMP simulator
Curves represent injection and traversal latency at each router
Initial training using uniform random synthetic traffic
Please see paper for more details!
CALCM Computer Architecture Lab at
Carnegie Mellon
17
Accuracy Results (offline)

8x8 mesh using FIST offline model

Average Latency and Aggregate IPC Error
Latency/IPC Error (in %)
15%
10%
Latency Error
IPC Error
IPC Error < 4%
5%
0%
-5%
-10%
Latency Error < 8%
-15%
MP (Low)
MP (Med)
CALCM Computer Architecture Lab at
MP (High)
Carnegie Mellon
MT (SPL/PAR)
18
Accuracy Results (online)

8x8 mesh using FIST online model
Error(in %)
LatencyError
Latency/IPC

Average Latency and Aggregate IPC Error
0.1
10%
0.05
5%
Latency Error
IPC Error
0%0
-5%
-0.05
-10%
-0.1
MP (Low)
MP (Med)
MP (High)
MT (SPL/PAR)
Both Latency and IPC Error below 3%
CALCM Computer Architecture Lab at
Carnegie Mellon
19
What about a very simple model?

8x8 mesh using hop-based model
1 How
0.8
80%
Error(in %)
Latency
Latency/IPC
Error
atency Error
0.6
C Error
60%
does simple network model affect high-order results?
Latency Error
Latency Error
IPC Error
IPC Error
0.4
40%
0.2
20%
0
0%
FIST models always within this range
-0.2
-20%
-0.4
-40%
-0.6
-10%
-60%
-0.8
-80%
MP (Low)
MP (Med)
MP (High)
MT (SPL/PAR)
Very high error for both latency and IPC!
CALCM Computer Architecture Lab at
Carnegie Mellon
20
Performance Results
Qualitative comparison
Speed (packets/sec)

100M
10M
1M
100K
10K
HW FIST
SW FIST
SW NoC Sims
(e.g. BookSim)
1K
Complexity / Level of Detail


SW-based speedup results for 16x16 mesh
 Offline FIST: 43x
 Online FIST: 18x
HW-based speedup (offline): ~3-4 orders of magnitude
CALCM Computer Architecture Lab at
Carnegie Mellon
21
Hardware Implementation Results

FPGA resource usage & clock frequency


Different mesh configurations
Xilinx Virtex-5 LX155T FPGA
FIST Model
Size
4x4
8x8
12x12
16x16
20x20
FPGA Area
4%
15%
34%
60%
94%
Freq.
380 MHz
263 MHz
250 MHz
214 MHz
200 MHz
Direct Implementation
FPGA Area
Freq.
61%
130 MHz
Will
not
fit
-
FIST can scale to large NoCs with many routers
CALCM Computer Architecture Lab at
Carnegie Mellon
Outline
 Introduction to FIST
 FIST-based Network Models
 Evaluation
 Related Work & Conclusions
CALCM Computer Architecture Lab at
Carnegie Mellon
Related Work

Vast body of work on network modeling


Abstract network modeling



Analytical models, hardware prototyping, etc.
Performance vs. accuracy trade-off studies [Burger 95]
Load-delay curve representation of network [Lugones 09]
FPGAs for network modeling



Cycle-accurate fidelity at the cost of limited scalability
Time-multiplexing can help with scalability [Wang 10]
But still suffer from high implementation complexity
CALCM Computer Architecture Lab at
Carnegie Mellon
Conclusions & Future Directions
Conclusions


Full-system simulators can tolerate small inaccuracies
FIST can provide fast SW- or HW-based NoC models



SW model provides 18x-43x average speedup w/ <2% error
HW model can scale to 100s routers with >1000x speedup
NoCs within a CMP are “FIST-friendly”

But not all networks good candidates for FIST modeling
Future Directions


FPGA-friendly NoC models at multiple levels of fidelity
Configurable generation of hardware NoC models
CALCM Computer Architecture Lab at
Carnegie Mellon
Thanks!
Questions?
CALCM Computer Architecture Lab at
Carnegie Mellon
Download