Diamonds are a Memory Controller’s Best Friend*

Dennis Abts (Google)
Natalie Enright Jerger (University of Toronto)
John Kim (KAIST)
Dan Gibson (Univ of Wisconsin)
Mikko Lipasti (Univ of Wisconsin)

*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.

Executive Summary
• On what tiles should memory controllers reside?
  – Three-tiered simulation approach
    • Heuristic-guided search
    • Detailed network simulation
    • Full-system simulation
• Diamond MC placement works well for on-chip meshes and tori
  – Diamonds minimize maximum channel load
  – Diamonds deliver lower and more predictable runtimes

Background
• Diverse on-chip communication
  – Cache-to-cache
  – LD/ST to Memory
  – Off-chip traffic (e.g., I/O)
• Processors/chip on the rise
  – Pins available for memory not rising as fast: Memory bandwidth becomes more precious
  – Reality: Many Cores, Few Memory Controllers
• Tiled architectures gaining popularity
  – Commonly employ on-chip meshes or tori

The Problem
• What Memory Controller placement is best overall?
  – Flip-chip packaging allows flexible escape routes
  – n tiles and m ports:
    • Don’t worry, there are only C(n, m) (“n choose m”) configurations!
    • Slight Simplification: Assume n = k^2 and m = 2k
  – What are the characteristics of the best configuration?
    • Performance: Low runtime for a set of objective workloads
    • Throughput: Low latency as a function of offered load
    • Fairness: Similar (low) average memory latency across all nodes.
    • Predictability: Low latency and runtime variance

Baseline Placement: row0_7
• Ports to MCs located at top and bottom of chip
• Conceptually similar to real parts:
  – Tilera’s Tile64
    • 64 cores, 4 MCs (4 ports each, top/bottom of chip)
  – Intel TeraFLOPs
    • 80 cores, 2 MCs (8 ports each, top/bottom of chip)
[Figure annotation: X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers]

Three-Tiered Approach
[Figure: three simulation tiers (Link Contention Simulation, Detailed Network Simulation, Full System) arranged along two axes labeled “More Detail” and “Shorter Runtimes, More Runs”]

Tier 0.5: Exhaustive Search
• It turns out C(k^2, 2k) is tractable for k<7
  – (At least on the link contention simulator: only 3,268,760 possibilities for k=5)

Patterns Emerge!

Another Contender

Tier 1: Heuristic-Guided Search
• k>6: Intractable to search all configurations
  – Use search heuristics and random search
• Genetic Algorithm:
  – Represent designs as a population of strings (Bit Vectors)
  – Generate new designs by combining members of the population via genetic crossover (Bit Selection)
  – Occasionally, mutate new population members (Swap adjacent bits)
  – Reduce population size by removing least-fit members
  – Survival of the Fittest

Genetic MC Placement
0x00AA550000AA5500
0x0000FF0000FF0000
0x00AAF00000F25100
Mutate
0x00AAF00000F25080

Link Contention Results (k=8)
Config.    Max Channel Load
           Mesh     Torus
row0_7     13.5     9.25
X          8.93     7.72
Diamond    8.90     7.72
• GA Selected Diamond as most fit solution for 8x8
  – Minimizes MCs in a single row/column
  – Spreads DOR load
Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6

Network Simulation: Open-Loop Evaluation
• Detailed simulation of all network events (buffers, links, etc.)
• Cores are Bernoulli injection processes, uniform random traffic
• Measure latency vs.
offered load

Parameters            Values
Router latency        1 cycle (aggressive)
Inter-router delay    1 cycle
Buffers               32 flits per port
Packet size           Request: 1 flit; Reply: 4 flits
Virtual channels      4 (XY-YX routing)

Open-Loop Results
[Figure: latency (cycles) vs. offered load (flits/cycle) for row0_7, row2_5, Diamond, and X]

Closed-Loop Evaluation
• Each processor executes N memory operations
• Up to r operations outstanding at a time
  – Models MSHRs
• Uniform Random requests, and real request streams with ‘hot spot’ behavior

Closed-Loop Results
[Figure: number of processors vs. completion time for Diamond and row0_7]

Full System Results
[Figure: average network latency (cycles) for request to memory controller vs. standard deviation, for JBB, WEB, TPC-W+H, TPC-W, and TPC-H under row0_7 and Diamond]
Diamond placement yields lower latency and lower latency variance.

Conclusion
• MC Placement Matters!
  – Diamond reduces contention, improves latency, and reduces latency/runtime variance
  – X does fairly well
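
Note on search-space size: the configuration counts quoted on the “The Problem” and “Tier 0.5: Exhaustive Search” slides can be checked with a few lines of Python. This is a minimal sketch that assumes the deck’s simplification (n = k^2 tiles, m = 2k ports); the function name `num_placements` is hypothetical and not taken from the original work.

```python
from math import comb


def num_placements(k: int) -> int:
    """Number of ways to choose 2k memory-controller tiles in a k x k array.

    Assumes the simplification from the slides: n = k^2 tiles, m = 2k ports.
    """
    return comb(k * k, 2 * k)


# Print the search-space size for a few array widths.
for k in range(4, 9):
    print(f"k={k}: {num_placements(k):,} placements")
```

For k=5 this gives the 3,268,760 possibilities quoted on the slide; the count grows quickly with k, which is consistent with the move to heuristic search for k>6.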
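
Note on the genetic operators: the “Tier 1” and “Genetic MC Placement” slides describe designs as bit vectors, combined by crossover and occasionally mutated by swapping adjacent bits. The sketch below illustrates that idea for 64-bit placements. The swap mutation follows the slide; the crossover shown here is an assumption chosen only because it keeps the number of memory controllers fixed, and it is not claimed to be the authors’ “Bit Selection” operator. All function names are hypothetical.

```python
import random


def popcount(x: int) -> int:
    """Count the set bits (memory-controller tiles) in a placement."""
    return bin(x).count("1")


def swap_adjacent_bits(design: int, low_bit: int) -> int:
    """Swap bit `low_bit` with bit `low_bit + 1`.

    The number of set bits is unchanged, so the MC count is preserved.
    """
    a = (design >> low_bit) & 1
    b = (design >> (low_bit + 1)) & 1
    if a != b:
        design ^= 0b11 << low_bit  # flipping both bits swaps them
    return design


def crossover(p1: int, p2: int, num_mcs: int, rng: random.Random) -> int:
    """Assumed crossover (not from the slides).

    Keeps every tile both parents agree on, then fills the remaining
    slots at random from tiles used by exactly one parent.
    Both parents are expected to have exactly `num_mcs` bits set.
    """
    child = p1 & p2
    differing = p1 ^ p2
    candidates = [i for i in range(64) if (differing >> i) & 1]
    rng.shuffle(candidates)
    for i in candidates[: num_mcs - popcount(child)]:
        child |= 1 << i
    return child


rng = random.Random(0)
child = crossover(0x00AA550000AA5500, 0x0000FF0000FF0000, 16, rng)
print(hex(child), popcount(child))
print(hex(swap_adjacent_bits(0x00AAF00000F25100, 7)))
```

Swapping bits 7 and 8 of 0x00AAF00000F25100 yields 0x00AAF00000F25080, which matches the mutation step shown on the slide.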
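
Note on maximum channel load: the “Link Contention Results” slide ranks placements by maximum channel load under dimension-order routing (DOR). The following is a simplified sketch of such a metric, not the authors’ simulator. It assumes a k x k mesh, X-then-Y routing, request traffic only, and each tile spreading one unit of traffic evenly over all MC tiles. Reply traffic, tori, and the normalization behind the slide’s numbers are not modeled, so absolute values are not expected to match the table. The names `xy_route` and `max_channel_load` are hypothetical.

```python
from collections import Counter
from itertools import product


def xy_route(src, dst):
    """Yield the directed links (from_tile, to_tile) on the X-then-Y path."""
    x, y = src
    dx, dy = dst
    while x != dx:  # travel along the X dimension first
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:  # then along the Y dimension
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny


def max_channel_load(k, mcs):
    """Largest per-link load for request traffic in a k x k mesh (k >= 2).

    Every tile sends one unit of traffic, split evenly across `mcs`,
    a list of (x, y) tiles that host memory-controller ports.
    """
    load = Counter()
    share = 1.0 / len(mcs)
    for src in product(range(k), repeat=2):
        for mc in mcs:
            for link in xy_route(src, mc):
                load[link] += share
    return max(load.values())


# Baseline from the slides: MC ports along the top and bottom rows of an 8x8 mesh.
row0_7 = [(x, 0) for x in range(8)] + [(x, 7) for x in range(8)]
print(max_channel_load(8, row0_7))
```

Other placements can be compared by passing a different list of (x, y) tiles; a fitness function for the genetic search could, under the same assumptions, simply be the negative of this value.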