Multicore Architectures
Michael Gerndt

Development of Microprocessors
• (figure © Intel) Transistor capacity doubles every 18 months

Development of Microprocessors
• Moore's Law
  – estimated to hold for at least the next 10 years
  – but: transistor count ≠ power
• How to use the transistor resources?
• Better execution core
  – enhance pipelining, superscalarity, …
  – better vector processing (SIMD, like MMX/SSE)
  – problem: gap to memory speed
• Larger caches
  – improve memory access speed
• More execution cores
  – problem: gap to memory speed
• …

Development of Microprocessors
• Objective for manufacturers
  – as much profit as possible: sell processors …
  – customers only buy when applications run faster
  – increase CPU power
• How to increase CPU power
  – higher clock rate
  – more parallelism
    – Instruction Level Parallelism (ILP)
    – Thread Level Parallelism (TLP)

Development of Microprocessors
• Higher clock rates
  – increase power consumption
    – proportional to f and U² (P ∝ f · U²)
    – higher frequency requires higher voltage
    – small structures: energy loss through leakage
  – increase heat output and cooling requirements
  – limit chip size (speed of light)
  – at fixed technology (e.g. 60 nm)
    – fewer transistor levels per pipeline stage are possible
    – more, simplified pipeline stages (P4: >30 stages)
    – higher penalty for pipeline stalls (on conflicts, e.g. branch misprediction)

Development of Microprocessors
• More parallelism
  – increased bit width (now: 64-bit architectures)
    – SIMD
  – Instruction Level Parallelism (ILP)
    – exploits parallelism found in an instruction stream
    – limited by data/control dependencies
    – can be increased by speculation
    – average ILP in typical programs: 6-7
    – modern superscalar processors cannot get much better …

Development of Microprocessors
• More parallelism
  – Thread Level Parallelism (TLP)
    – hardware multithreading (e.g. SMT: Hyper-Threading)
      – better exploitation of superscalar execution units
    – multiple cores
      – legacy software must be parallelized (see the sketch below)
      – a challenge for the whole software industry
      – Intel moved into the tools business
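To make the TLP point concrete, here is a minimal sketch of parallelizing a legacy serial loop with POSIX threads. It is not taken from the lecture; the thread count, array size and all identifiers are illustrative assumptions. Each thread works on its own chunk of the iteration space and writes a private partial result, which the main thread combines.

```c
/* Hypothetical TLP sketch: a serial array sum split across POSIX threads.
 * NTHREADS, N and all names are assumptions, not from the lecture.
 * Compile e.g. with: gcc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N (1L << 20)

static double data[N];
static double partial_sum[NTHREADS];
static long   ids[NTHREADS];

static void *worker(void *arg)
{
    long id    = *(long *)arg;
    long chunk = N / NTHREADS;
    long begin = id * chunk;
    long end   = (id == NTHREADS - 1) ? N : begin + chunk;

    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial_sum[id] = s;          /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    /* one software thread per core (or per hardware thread) */
    for (long t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial_sum[t];
    }
    printf("sum = %f\n", total);
    return 0;
}
```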
Multicore Architectures
• SMPs on a single chip: Chip Multi-Processors (CMP)
• Advantages
  – efficient exploitation of the available transistor budget
  – improves throughput and speed of parallelized applications
  – allows tight coupling of cores
    – better communication between cores than in an SMP
    – shared caches
  – low power consumption
    – low clock rates
    – idle cores can be suspended
• Disadvantages
  – only improves the speed of parallelized applications
  – increased gap to memory speed

Multicore Architectures
• Design decisions
  – homogeneous vs. heterogeneous
    – specialized accelerator cores: SIMD, GPU operations, cryptography, DSP functions (e.g. FFT), FPGA (programmable circuits)
  – access to memory
    – own memory area (distributed memory)
    – via the cache hierarchy (shared memory)
  – connection of cores: internal bus / crossbar
  – cache architecture

Multicore Architectures: Examples
• (figure) Homogeneous with shared caches and crossbar
• (figure) Heterogeneous with caches, local store and ring bus

Shared Cache Design
• (figure) Traditional design: multiple single-core processors with a shared cache off-chip
• (figure) Multicore architecture: shared caches on-chip

Shared Caches: Advantages
• No coherence protocol needed at the shared cache level
• Lower communication latency
• Processors with overlapping working sets
  – one processor may prefetch data for the other
  – smaller cache size needed
  – better usage of loaded cache lines before eviction (spatial locality)
  – less congestion on the limited memory connection
• Dynamic sharing
  – if one processor needs less space, the other can use more
• Avoidance of false sharing (see the sketch below)

Shared Caches: Disadvantages
• Multiple CPUs → higher requirements
  – higher bandwidth
  – cache should be larger (larger → higher latency)
• Hit latency is higher due to the switch logic above the cache
• More complex design
• One CPU can evict data of another CPU
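To illustrate the false-sharing point, here is a minimal sketch, not from the lecture: two threads increment independent counters. If the counters sit in the same cache line, the line ping-pongs between the cores' private caches; padding each counter to a full line avoids this. The 64-byte line size and all identifiers are assumptions.

```c
/* Hypothetical false-sharing sketch: per-thread counters padded to a
 * cache-line boundary so they do not share a line. A 64-byte cache
 * line is assumed; names are illustrative.
 * Compile e.g. with: gcc -O2 -pthread counters.c */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];  /* keep each counter on its own line */
};

static struct padded_counter counter[2];

static void *bump(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->value++;               /* no false sharing thanks to padding */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, &counter[0]);
    pthread_create(&t1, NULL, bump, &counter[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counter[0].value, counter[1].value);
    return 0;
}
```

Removing the `pad` member keeps the program correct but typically makes it noticeably slower on private-cache designs, since the shared line bounces between cores; on a shared cache this penalty largely disappears, which is the advantage listed above.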
Multicore Processors
• SUN
  – UltraSparc IV / IV+
    – dual core
    – 2x multithreaded per core
  – UltraSparc T1 (Niagara)
    – 8 cores
    – 4x multithreaded per core
    – one FPU shared by all cores
    – low power
  – UltraSparc T2 (Niagara 2)

Intel Itanium 2 Dual Core (Montecito)
• Two Itanium 2 cores
• Multi-threading (2 threads per core)
  – simultaneous multi-threading for memory hierarchy resources
  – temporal multi-threading for core resources
  – besides the end of a time slice, an event, typically an L3 cache miss, can trigger a thread switch
• Caches (private to each core)
  – L1D 16 KB, L1I 16 KB
  – L2D 256 KB, L2I 1 MB
  – L3 9 MB
• 1.7 billion transistors

Itanium 2 Dual Core
• (figure)

Intel Core Duo
• 2 mobile-optimized execution cores
• No multi-threading
• Cache hierarchy
  – private 32 KB L1I and L1D per core
  – shared 2 MB L2 cache
  – provides efficient data sharing between both cores
• Power reduction
  – some power states can be entered individually by each core
  – Deeper Sleep and Enhanced Deeper Sleep states only for the whole die
  – Dynamic Cache Sizing feature
    – flushes the entire cache
    – this enables Enhanced Deeper Sleep with a lower voltage that does not guarantee cache integrity
• 151 million transistors

IBM Cell
• IBM, Sony, Toshiba
• Playstation 3 (Q1 2006)
• 256 GFlops
• Only ~30 W at 3 GHz
• The whole PS3 costs only $300-400
• http://www-128.ibm.com/developerworks/power/library/pa-cellperf

Cell: Architecture
• 9 parallel processors, specialized for different tasks
• 1 large PPE + 8 SPEs (Synergistic Processing Elements)

Cell: SPE (Synergistic Processing Element)
• 128 registers
• 128-bit SIMD
• single-threaded
• 256 KByte local store (local memory, not a cache)
• DMA engines execute memory transfers
• simple ISA: less functionality to save chip area
• the limitations can become a problem if memory access is too slow
• 25.6 GFlops single precision for multiply-add operations

Intel Westmere EX
• Processor of the fat node of SuperMUC @ LRZ
• 2.4 GHz, 9.6 Gflop/s per core, 96 Gflop/s per socket
• 10 hyperthreaded cores, i.e. two logical cores each
• Caches
  – 32 KB L1, private
  – 256 KB L2, private
  – 30 MB L3, shared
• 2.9 billion transistors
• Xeon E7-4870 (2.4 GHz, 10 cores, 30 MByte L3)

NUMA
• On-chip NUMA
  – L3 cache organized in 10 slices
  – interconnection via a bidirectional ring bus
  – 10-way physical address hashing to avoid hot spots; can handle five parallel cache requests per clock cycle
  – the mapping algorithm is not disclosed, no migration support
• Off-chip NUMA
  – glueless combination of up to 8 sockets into an SMP
  – 4 QuickPath Interconnect (QPI) interfaces
  – 2 on-chip memory controllers

Cache Coherency
• Cbox
  – connects a core to the ring bus and one memory bank
  – responsible for processor reads/writes/writebacks and external snoops, and for returning cached data to the core and the QuickPath agents
  – the distribution of physical addresses is determined by a hash function
• Sbox
  – caching agent
  – each Sbox is associated with 5 Cboxes

Cache Coherency
• Bbox
  – home agent
  – responsible for cache coherency of the cache lines in its memory; keeps track of the Cbox replies to coherence messages
• Directory Assisted Snoopy (DAS)
  – keeps a state per cache line (I: idle, no remote sharers; R: may be present on a remote socket; E/D: owned by the I/O hub)
  – a line in the I state can be forwarded without waiting for snoop replies

Summary
• High frequency → high power consumption
• Trend towards multiple cores on a chip
• Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …
• Problem: memory latency and bandwidth (see the sketch below)
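To give the closing point about memory bandwidth a concrete shape, here is a minimal measurement sketch. It is not part of the lecture; the array size, the simple copy loop and all names are assumptions. It streams through arrays larger than a typical last-level cache (e.g. the 30 MB L3 mentioned above) and reports the achieved bandwidth, which on a multicore chip is shared by all cores.

```c
/* Hypothetical sketch: estimate sustainable memory bandwidth with a
 * plain array copy. Array sizes exceed typical last-level caches so
 * the traffic really goes to main memory. Sizes and names are
 * illustrative, not taken from the lecture.
 * Compile e.g. with: gcc -O2 bw.c */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 23)   /* 8 Mi doubles = 64 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b) return 1;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        b[i] = a[i];               /* one read + one write per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gbytes = 2.0 * N * sizeof(double) / 1e9;   /* read + write traffic */
    /* print b[0] as well so the copy cannot be optimized away */
    printf("b[0]=%.1f  copy: %.2f GB in %.3f s -> %.2f GB/s\n",
           b[0], gbytes, secs, gbytes / secs);

    free(a);
    free(b);
    return 0;
}
```

Running such a copy with one thread and then with one thread per core makes the shared-resource problem visible: the per-core bandwidth drops as more cores compete for the same memory connection.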