18-742 Spring 2011
Parallel Computer Architecture
Lecture 7: Symmetric Multi-Core
Prof. Onur Mutlu
Carnegie Mellon University

Research Project
Submit your proposal via Blackboard by midnight today.

Reviews
Due Tuesday (Jan 25):
Papamarcos and Patel, “A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories,” ISCA 1984.
Kelm et al., “Cohesion: A Hybrid Memory Model for Accelerators,” ISCA 2010.
Due Friday (Jan 28):
Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010.

Review: Multi-Core
Idea: Put multiple processors on the same die.
Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area.
What else could you do with the die area you dedicate to multiple processors?
Have a bigger, more powerful core
Have larger caches in the memory hierarchy
Simultaneous multithreading
Integrate platform components on chip (e.g., network interface, memory controllers)

Review: Large Superscalar vs. Multi-Core
Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.

Review: Comparison Points (figure)

Review: Multi-Core vs. Large Superscalar
Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads → reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance (parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand

Review: Large Superscalar vs. Multi-Core
Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
Technology push
Instruction issue queue size limits the cycle time of the superscalar, out-of-order processor → diminishing performance
Quadratic increase in complexity with issue width (see the back-of-the-envelope sketch at the end of these review slides)
Large, multi-ported register files needed to support large instruction windows and issue widths → reduced frequency or longer RF access → diminishing performance
Application pull
Integer applications: little parallelism?
FP applications: abundant loop-level parallelism
Others (transaction processing, multiprogramming): CMP is a better fit

Review: Why Multi-Core?
Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance when there is a single thread
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files and larger issue width (and their associated costs) to support many threads → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system reduces both single-thread and parallel application performance

Review: Why Multi-Core?
Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)
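Before leaving the review: the “quadratic increase in complexity with issue width” point above can be made concrete with a back-of-the-envelope count of wakeup comparators in the issue queue. The proportional window scaling and the constants below are illustrative assumptions, not figures from Olukotun et al.

```python
# Back-of-the-envelope sketch (illustrative assumptions, not data from the
# Olukotun et al. paper): wakeup logic in an out-of-order issue queue compares
# each completing result tag against both source tags of every waiting entry.
# If window size is scaled roughly in proportion to issue width, the number of
# tag comparators grows quadratically with issue width.

def wakeup_comparators(issue_width, window_entries, src_tags_per_entry=2):
    """Tag comparators active per cycle: (result buses) x (entries) x (source tags)."""
    return issue_width * window_entries * src_tags_per_entry

# Illustrative design points: window scaled at ~16 entries per unit of issue width.
for width in (2, 4, 8, 16):
    window = 16 * width
    print(f"issue width {width:2d}, window {window:3d} entries -> "
          f"{wakeup_comparators(width, window):5d} comparators")
# Output grows roughly quadratically: 128, 512, 2048, 8192. Wider wakeup/select
# logic lengthens the critical path, which is the cycle-time argument above.
```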
Why Multi-Core?
Alternative: More scalable superscalar, out-of-order engines
Clustered superscalar processors (with multithreading)
+ Simpler to design than a monolithic superscalar; more scalable than simultaneous multithreading (less resource sharing)
+ Can improve both single-thread and parallel application performance
- Diminishing performance returns on a single thread: clustering reduces IPC compared to a monolithic superscalar. Why?
- Parallel performance limited by shared fetch bandwidth
- Difficult to design

Why Multi-Core?
Alternative: Traditional symmetric multiprocessors
+ Smaller die size (for the same processing core)
+ More memory bandwidth (no pin bottleneck)
+ Fewer shared resources → less contention between threads
- Long latencies between cores (need to go off chip) → shared data accesses limit performance → parallel application scalability is limited
- Worse resource efficiency due to less sharing → worse power/energy efficiency

Why Multi-Core?
Other alternatives?
Dataflow?
Vector processors (SIMD)?
Integrating DRAM on chip?
Reconfigurable logic? (general purpose?)

Review: Multi-Core Alternatives
Bigger, more powerful single core
Bigger caches
(Simultaneous) multithreading
Integrate platform components on chip instead
More scalable superscalar, out-of-order engines
Traditional symmetric multiprocessors
Dataflow?
Vector processors (SIMD)?
Integrating DRAM on chip?
Reconfigurable logic? (general purpose?)
Other alternatives?

Piranha Chip Multiprocessor
Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000.
An early example of a symmetric multi-core processor
Large-scale server based on CMP nodes
Designed for commercial workloads

Commercial Workload Characteristics
Memory system is the main bottleneck
Very poor instruction-level parallelism (ILP) with existing techniques
Very high CPI
Execution time dominated by memory stall times (see the illustrative CPI sketch below)
Instruction stalls as important as data stalls
Fast/large L2 caches are critical
Frequent hard-to-predict branches
Large L1 miss ratios
Small gains from wide-issue out-of-order techniques
No need for floating point and multimedia units
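To see why “execution time dominated by memory stall times” leaves little room for wide-issue techniques, here is the standard CPI decomposition with illustrative numbers. The miss rate and miss penalty below are assumptions for the sketch, not measurements from the Piranha paper.

```python
# Illustrative only: the standard CPI decomposition, showing how memory stalls
# can dominate commercial (OLTP-like) workloads. All numbers are assumptions
# for this sketch, not measurements from the Piranha (ISCA 2000) paper.

def effective_cpi(base_cpi, l2_mpki, l2_miss_penalty_cycles):
    """CPI = base CPI + (L2 misses per instruction) x (miss penalty in cycles)."""
    return base_cpi + (l2_mpki / 1000.0) * l2_miss_penalty_cycles

base_cpi = 1.0     # ideal CPI of a simple in-order core
l2_mpki  = 15.0    # assumed L2 misses per 1000 instructions
penalty  = 150     # assumed cycles to main memory

cpi = effective_cpi(base_cpi, l2_mpki, penalty)
print(f"effective CPI = {cpi:.2f}, "
      f"{(cpi - base_cpi) / cpi:.0%} of cycles spent in memory stalls")
# -> effective CPI = 3.25, 69% of cycles in memory stalls. Making the core
#    wider mainly shrinks the small base-CPI term, which is why fast/large L2
#    caches (and exploiting TLP) matter more than wide-issue OoO techniques here.
```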
Piranha Processing Node
(Next few slides from Luiz Barroso’s ISCA 2000 presentation of “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing”; the node block diagram is built up one component at a time.)
Alpha core: 1-issue, in-order, 500 MHz
L1 caches: I & D, 64 KB, 2-way
Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
L2 cache: shared, 1 MB, 8-way
Memory controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
Protocol engines (HE & RE): programmable, 1K instructions, even/odd interleaving
System interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth (4 links @ 8 GB/sec)

Inter-Node Coherence Protocol Engine (figure)

Piranha System (figure)

Piranha I/O Node (figure)

Sun Niagara (UltraSPARC T1)
Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005.

Niagara Core
4-way fine-grain multithreaded, 6-stage, single-issue, in-order
Round-robin thread selection (unless a thread is stalled on a cache miss); a minimal thread-selection sketch follows the CMT discussion below
Shared FP unit among cores

Niagara Design Point
Also designed for commercial applications

Sun Niagara II (UltraSPARC T2)
8 SPARC cores, 8 threads/core, 8 pipeline stages
16 KB I$ per core, 8 KB D$ per core
FP, graphics, and crypto units per core
4 MB shared L2, 8 banks, 16-way set associative
4 dual-channel FBDIMM memory controllers
x8 PCI-Express @ 2.5 Gb/s
Two 10G Ethernet ports @ 3.125 Gb/s

Chip Multithreading (CMT)
Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005.
Idea: Chip multiprocessor where each core is multithreaded
Niagara 1/2: fine-grain multithreading
IBM POWER5: simultaneous multithreading
Motivation: Tolerate memory latency better
A simple core stays idle on a cache miss
Multithreading enables tolerating cache miss latency when there is TLP

CMT (CMP + MT) vs. CMP
Advantages of adding multithreading to each core
+ Better memory latency tolerance when there are enough threads
+ Fine-grained multithreading can simplify core design (no need for branch prediction, dependency checking)
+ Potentially better utilization of core, cache, and memory resources
+ Shared instructions and data among threads are not replicated
+ When one thread is not using a resource, another can
Disadvantages
- Reduced single-thread performance (a thread does not have the core and L1 caches to itself)
- More pressure on the shared resources (cache, off-chip bandwidth) → more resource contention
- Applications with limited TLP do not benefit
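As referenced in the Niagara Core slide, here is a minimal sketch of fine-grained, round-robin thread selection that skips threads stalled on cache misses. The Thread class and the four-thread setup are assumptions for illustration, not Sun’s implementation.

```python
# Minimal, illustrative model of Niagara-style fine-grained multithreading
# (not Sun's implementation): each cycle, issue from the next thread in
# round-robin order, skipping any thread stalled on an outstanding cache miss.

from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    stalled_until: int = 0    # cycle at which this thread's cache miss resolves

def select_thread(threads, last_tid, cycle):
    """Round-robin over hardware threads, skipping threads waiting on misses."""
    n = len(threads)
    for offset in range(1, n + 1):
        t = threads[(last_tid + offset) % n]
        if cycle >= t.stalled_until:    # thread is ready to issue
            return t.tid
    return None                         # all threads stalled -> pipeline idles

# Usage with 4 hardware threads (as in a Niagara core); thread 1 is waiting
# on a long cache miss and is skipped until it resolves.
threads = [Thread(i) for i in range(4)]
threads[1].stalled_until = 50
last = 3
for cycle in range(4):
    sel = select_thread(threads, last, cycle)
    if sel is not None:
        last = sel
    print(f"cycle {cycle}: issue from thread {sel}")
# Prints threads 0, 2, 3, 0: the stalled thread's issue slots are filled by
# others, which is how multithreading tolerates miss latency when there is TLP.
```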
Sun ROCK
Chaudhry et al., “Rock: A High-Performance Sparc CMT Processor,” IEEE Micro 2009.
Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun’s ROCK Processor,” ISCA 2009.
Goals:
Maximize throughput when threads are available
Boost single-thread performance when threads are not available and on cache misses
Ideas:
Runahead on a cache miss → ahead thread executes miss-independent instructions, behind thread executes the miss-dependent instructions
Branch prediction (gshare)

Sun ROCK
16 cores, 2 threads per core (fewer threads than Niagara 2)
4 cores share a 32 KB instruction cache
2 cores share a 32 KB data cache
2 MB L2 cache (smaller than Niagara 2)

Runahead Execution (I)
A simple pre-execution method for prefetching purposes
Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
When the oldest instruction is a long-latency cache miss:
Checkpoint architectural state and enter runahead mode
In runahead mode:
Speculatively pre-execute instructions
The purpose of pre-execution is to generate prefetches
L2-miss dependent instructions are marked INV and dropped
Runahead mode ends when the original miss returns:
Checkpoint is restored and normal execution resumes
(A minimal code sketch of this mechanism appears at the end of these notes.)

Runahead Execution (II)
(Timeline figure, “Small Window” vs. “Runahead”: with a small window, execution computes, stalls on Load 1’s miss, computes, then stalls again on Load 2’s miss. With runahead, the processor pre-executes during Load 1’s miss and prefetches Load 2, so both loads hit afterwards and the second stall disappears → saved cycles.)

Runahead Execution (III)
Advantages
+ Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in
Disadvantages
-- Extra executed instructions
Limitations
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available memory-level parallelism
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro, Jan/Feb 2006.

Performance of Runahead Execution
(Bar chart: micro-operations per cycle for four configurations (no prefetcher/no runahead, prefetcher only (baseline), runahead only, prefetcher + runahead) on the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS benchmark suites and their average; per-suite improvement labels include 12%, 13%, 15%, 16%, 22%, 35%, and 52%.)

Sun ROCK Cores
Load miss in the L1 cache starts parallelization using 2 HW threads
Ahead thread:
Checkpoints state and executes speculatively
Instructions independent of the load miss are speculatively executed
Load miss(es) and dependent instructions are deferred to the behind thread
Behind thread:
Executes deferred instructions and re-defers them if necessary
Memory-level parallelism (MLP): run ahead on a load miss and generate additional load misses
Instruction-level parallelism (ILP): ahead and behind threads execute independent instructions from different points in the program in parallel

ROCK Pipeline (figure)

More Powerful Cores in Sun ROCK
Advantages
+ Higher single-thread performance (MLP + ILP)
+ Better cache miss tolerance → can reduce on-chip cache sizes
Disadvantages
- Bigger cores → fewer cores → lower parallel throughput (in terms of threads). How about each thread’s response time?
- More complex than Niagara cores (but simpler than conventional out-of-order execution) → longer design time?

More Powerful Cores in Sun ROCK
Chaudhry talk, Aug 2008.

More Powerful Cores in Sun ROCK
Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun’s ROCK Processor,” ISCA 2009.
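As promised in the Runahead Execution slides, here is a minimal sketch of the runahead mechanism. The trace format, the cache model, and the assumption that a prefetch turns the later access into a hit are simplifications for illustration; this is not the HPCA 2003 or ROCK hardware.

```python
# Minimal, illustrative sketch of runahead execution (not the HPCA 2003 or
# ROCK hardware). Assumed trace format: (op, dst, srcs, addr) with op in
# {"load", "alu"}. Runahead pre-executes past a long-latency miss purely to
# generate prefetches; results produced in runahead mode are thrown away.

INV = "INV"  # marks register values that depend on an outstanding L2 miss

def run_with_runahead(trace, l2_hits):
    regs, prefetches = {}, []
    for i, (op, dst, srcs, addr) in enumerate(trace):
        if op == "load" and addr not in l2_hits:
            checkpoint = dict(regs)             # checkpoint architectural state
            regs[dst] = INV                     # value of the missing load is unknown
            # --- runahead mode: speculatively pre-execute younger instructions ---
            for op2, dst2, srcs2, addr2 in trace[i + 1:]:
                if any(regs.get(s) == INV for s in srcs2):
                    regs[dst2] = INV            # L2-miss dependent: mark INV, drop
                    # (this is why dependent cache misses cannot be prefetched)
                elif op2 == "load" and addr2 not in l2_hits:
                    prefetches.append(addr2)    # independent miss: useful prefetch
                    l2_hits.add(addr2)          # assume prefetch fills the cache in time
                    regs[dst2] = INV
                else:
                    regs[dst2] = "pre-executed" # values not modeled in this sketch
            # --- original miss returns: restore checkpoint, resume normal mode ---
            regs = checkpoint
            l2_hits.add(addr)                   # the original miss is now serviced
        # normal-mode execution of instruction i is not modeled further here
    return prefetches

# Usage: Load 2's address is discovered during runahead on Load 1's miss, so it
# is prefetched and hits when normal execution reaches it (the "saved cycles"
# in the Runahead Execution (II) timeline).
trace = [("load", "r1", [], 0x100),             # Load 1: L2 miss -> enter runahead
         ("alu",  "r2", ["r1"], None),          # depends on the miss -> dropped
         ("load", "r3", [], 0x200)]             # independent -> prefetched
print(run_with_runahead(trace, l2_hits=set()))  # -> [512] (address 0x200)
```

ROCK’s simultaneous speculative threading extends this pattern: instead of simply dropping the INV-marked (deferred) instructions, a second hardware thread re-executes them once the miss data returns, recovering ILP in addition to the MLP that runahead exposes.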