Diamonds are a Memory Controller’s Best Friend*
Dennis Abts
Google
Natalie Enright Jerger
University of Toronto
John Kim
KAIST
Dan Gibson
Univ of Wisconsin
Mikko Lipasti
Univ of Wisconsin
*Also known as: Achieving Predictable Performance through Better Memory Controller
Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Executive Summary
• On what tiles should memory controllers reside?
– Three-tiered simulation approach
• Heuristic-guided search
• Detailed network simulation
• Full-system simulation
• Diamond MC placement works well for on-chip
meshes and tori
– Diamonds minimize maximum channel load
– Diamonds deliver lower and more predictable
runtimes
Background
• Diverse on-chip communication
– Cache-to-cache
– LD/ST to Memory
– Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
– Pins available for memory not rising as fast: Memory
bandwidth becomes more precious
– Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
– Commonly employ on-chip meshes or tori
The Problem
• What Memory Controller placement is best
overall?
– Flip-chip packaging allows flexible escape routes
– n tiles and m ports:
• Don’t worry, there are only $\binom{n}{m}$ configurations!
• Slight simplification: assume $n = k^2$ and $m = 2k$
– What are the characteristics of the best
configuration?
• Performance: Low runtime for a set of objective workloads
• Throughput: Low latency as a function of offered load
• Fairness: Similar (low) average memory latency across all
nodes.
• Predictability: Low latency and runtime variance
Baseline Placement: row0_7
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
– Tilera’s Tile64
• 64 cores, 4 MCs (4 ports each, top/bottom of chip)
– Intel TeraFLOPs
• 80 cores, 2 MCs (8 ports each, top/bottom of chip)
• X-dimension traffic encounters congestion on rows with memory controllers
Three-Tiered Approach
[Figure: the three simulation tiers (Link Contention Simulation, Detailed Network Simulation, Full System), annotated with "More Detail", "Shorter Runtimes", and "More Runs".]
Tier 0.5: Exhaustive Search
• It turns out $\binom{k^2}{2k}$ is tractable for k<7
– (At least on the link contention simulator – only
3,268,760 possibilities for k=5)
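To get a feel for how quickly the search space grows, the count can be computed directly. This is a minimal sketch under the slide's simplification (n = k^2 tiles, m = 2k ports); the function name `placements` is illustrative only.

```python
from math import comb  # available in Python 3.8+

def placements(k):
    """Number of ways to choose 2k memory-controller tiles out of k*k tiles."""
    return comb(k * k, 2 * k)

# Print the size of the search space for a few mesh radices.
for k in range(2, 9):
    print(k, placements(k))
```

For k=5 this gives the 3,268,760 possibilities quoted above.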
Patterns Emerge!
Another Contender
Tier 1: Heuristic-Guided Search
• k>6: Intractable to search all configurations
– Use search heuristics and random search
• Genetic Algorithm:
– Represent designs as a population of strings (Bit
Vectors)
– Generate new designs by combining members of the
population via genetic crossover (Bit Selection)
– Occasionally, mutate new population members (Swap
adjacent bits)
– Reduce population size by removing least-fit
members – Survival of the Fittest
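The steps above can be sketched roughly as follows. This is not the authors' implementation: the repair step that keeps exactly m ports after crossover, the population size, the mutation rate, and the stand-in fitness `crowding` (most MCs in any single row or column, lower is better) are all assumptions made for illustration. In the study the fitness would come from the link contention simulator.

```python
import random

def swap_adjacent(v, i):
    """Mutation: swap bits i and i+1 of bit vector v."""
    if (v >> i & 1) != (v >> (i + 1) & 1):
        v ^= 0b11 << i
    return v

def crossover(a, b, rng, n_bits, m):
    """Bit selection: each bit comes from parent a or b, chosen at random."""
    mask = rng.getrandbits(n_bits)
    child = (a & mask) | (b & ~mask)
    # Repair (an assumption, not from the slides): keep exactly m set bits.
    ones = [i for i in range(n_bits) if child >> i & 1]
    zeros = [i for i in range(n_bits) if not child >> i & 1]
    rng.shuffle(ones)
    rng.shuffle(zeros)
    while len(ones) > m:
        child &= ~(1 << ones.pop())
    while len(ones) < m:
        i = zeros.pop()
        child |= 1 << i
        ones.append(i)
    return child

def crowding(v, k):
    """Stand-in fitness (hypothetical): most MCs in any one row or column."""
    rows, cols = [0] * k, [0] * k
    for i in range(k * k):
        if v >> i & 1:
            rows[i // k] += 1
            cols[i % k] += 1
    return max(rows + cols)

def evolve(k, m, fitness, generations=200, pop_size=30, seed=0):
    """Steady-state GA: breed one child, then drop the least-fit member."""
    rng = random.Random(seed)
    n = k * k
    pop = [sum(1 << i for i in rng.sample(range(n), m)) for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(pop, 2)
        child = crossover(a, b, rng, n, m)
        if rng.random() < 0.1:  # occasional mutation
            child = swap_adjacent(child, rng.randrange(n - 1))
        pop.append(child)
        pop.sort(key=fitness)   # lower fitness value = fitter
        pop.pop()               # survival of the fittest
    return pop[0]

best = evolve(8, 16, lambda v: crowding(v, 8))
print(hex(best))
```

Swapping adjacent bits preserves the number of set bits, so mutation never changes the number of ports; only crossover needs the repair step.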
Genetic MC Placement
0x00AA550000AA5500
0x0000FF0000FF0000
0x00AAF00000F25100
Mutate
0x00AAF00000F25080
Link Contention Results k=8
Max Channel Load
Config.    Mesh    Torus
row0_7     13.5    9.25
X          8.93    7.72
Diamond    8.90    7.72
• GA Selected Diamond as
most fit solution for 8x8
– Minimizes MCs in a single
row/column
– Spreads DOR load
Sanity Check: GA also prefers
Diamond for 4x4, 5x5, and 6x6
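A toy version of a link contention calculation shows why spreading MCs across rows and columns helps. It is a sketch under stated assumptions, not the simulator behind the table: every tile sends one unit of request traffic split evenly over all MCs, every MC returns an equal-sized reply, both use X-then-Y dimension-order routing on a mesh, and the diamond coordinates below are one plausible diamond shape rather than the exact placement from the study. Its numbers will not match the table.

```python
from collections import defaultdict

def xy_route(src, dst):
    """Yield the directed links on the X-then-Y path from src to dst."""
    x, y = src
    dx, dy = dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny

def max_channel_load(k, mcs):
    """Largest load on any directed link of a k x k mesh."""
    load = defaultdict(float)
    share = 1.0 / len(mcs)  # each tile spreads its traffic over all MCs
    for x in range(k):
        for y in range(k):
            for mc in mcs:
                for link in xy_route((x, y), mc):   # request
                    load[link] += share
                for link in xy_route(mc, (x, y)):   # reply
                    load[link] += share
    return max(load.values())

# Baseline: MCs along the top and bottom rows.
row0_7 = [(x, 0) for x in range(8)] + [(x, 7) for x in range(8)]

# Assumed diamond: two MCs per row and per column.
diamond = []
for y in range(8):
    d = y if y < 4 else 7 - y
    diamond += [(3 - d, y), (4 + d, y)]

print(max_channel_load(8, row0_7), max_channel_load(8, diamond))
```

In this model the row0_7 placement funnels all reply traffic through the X links of two rows, while the diamond spreads it over all eight.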
Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events
(buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform
random traffic
• Measure latency vs. offered load
Parameter             Value
Router latency        1 cycle (aggressive)
Inter-router delay    1 cycle
Buffers               32 flits per port
Packet size           Request: 1 flit; Reply: 4 flits
Virtual channels      4 (XY-YX routing)
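A Bernoulli injection process can be sketched as one independent coin flip per core per cycle. The code below only generates the injection events; it does not model routers, buffers, or links. The function name, the `dests` parameter (for example, the MC tiles), and expressing the rate in packets per core per cycle are assumptions for illustration.

```python
import random

def bernoulli_injections(n_cores, cycles, rate, dests, seed=0):
    """Return (cycle, src, dst) events for an open-loop traffic source."""
    rng = random.Random(seed)
    events = []
    for cycle in range(cycles):
        for src in range(n_cores):
            # One Bernoulli trial per core per cycle, independent of
            # network state (that independence is what makes it open-loop).
            if rng.random() < rate:
                dst = rng.choice(dests)  # uniform random destination
                events.append((cycle, src, dst))
    return events

events = bernoulli_injections(64, 1000, 0.2, dests=list(range(16)))
print(len(events) / (64 * 1000))  # observed rate, close to 0.2
```

Sweeping `rate` and measuring packet latency in a network model at each point would produce a latency vs. offered load curve.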
Open-Loop Results
[Figure: latency (cycles) vs. offered load (flits/cycle) for the row0_7, row2_5, Diamond, and X placements.]
Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time
– Models MSHRs
• Uniform Random requests, and real request
streams with ‘hot spot’ behavior
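The closed-loop idea for a single processor can be sketched as below: a new operation is issued only when fewer than r are outstanding, so the issue rate depends on how fast replies return. This is a toy model under stated assumptions. The `latency_of` callback is hypothetical and stands in for the network and memory round trip, and contention between processors is not modeled.

```python
import heapq

def completion_time(n_ops, r, latency_of):
    """Time for one processor to finish n_ops with at most r outstanding."""
    now = 0
    issued = 0
    in_flight = []  # min-heap of reply arrival times
    while issued < n_ops or in_flight:
        # Issue while an MSHR-like slot is free.
        while issued < n_ops and len(in_flight) < r:
            heapq.heappush(in_flight, now + latency_of(issued))
            issued += 1
        # Advance time to the next reply, which frees a slot.
        now = heapq.heappop(in_flight)
    return now

# Fixed 5-cycle round trip: more outstanding operations hide more latency.
print(completion_time(10, 1, lambda i: 5))
print(completion_time(10, 2, lambda i: 5))
```

With a fixed latency the completion time falls as r grows. In a real network, higher r also raises contention, which is why the placement affects the completion-time distribution.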
Closed-Loop Results
[Figure: number of processors vs. completion time for the Diamond and row0_7 placements.]
Full System Results
[Figure: average network latency (cycles) for a request to reach a memory controller vs. standard deviation, plotted per workload (JBB, WEB, TPC-W, TPC-H, TPC-W+H) for the row0_7 and Diamond placements.]
Diamond placement yields lower latency and lower latency variance.
Conclusion
• MC Placement Matters!
– Diamond reduces contention, improves latency, and
reduces latency/runtime variance
– X does fairly well