Computer Architecture, Fall 2006
Lecture 30: CMPs & SMTs
Adapted from Mary Jane Irwin (www.cse.psu.edu/~mji)
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Review: Multiprocessor Basics
- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

                                                   # of processors
  Communication model    Message passing           8 to 2048
                         Shared address, NUMA      8 to 256
                         Shared address, UMA       2 to 64
  Physical connection    Network                   8 to 256
                         Bus                       2 to 36

CMP: Multiprocessors on One Chip
- By placing multiple processors, their memories, and the interconnection network all on one chip, the latencies of chip-to-chip communication are drastically reduced.
- ARM multicore (MPCore)
  - Configurable between 1 and 4 symmetric CPUs, each with its own L1$s and CPU interface
  - Interrupt Distributor with a configurable number of hardware interrupts, private (per-CPU) IRQs, and per-CPU aliased peripherals on a private peripheral bus
  - Snoop Control Unit connecting the CPUs to a primary AXI R/W 64-b bus, an optional second AXI R/W 64-b bus, and an I & D CCB 64-b bus

Multithreading on a Chip
- Find a way to "hide" true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of the stalling instructions.
- Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor.
  - The processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer per thread
  - The caches, TLBs, BHT, and BTB can be shared (although the miss rates may increase if they are not sized accordingly)
  - Memory can be shared through the virtual memory mechanisms
  - Hardware must support efficient thread context switching

Types of Multithreading on a Chip
- Fine-grain – switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads); the processor must be able to switch threads on every clock cycle (a small simulation sketch of this select policy follows the Niagara example below)
  - Advantage – can hide throughput losses that come from both short and long stalls
  - Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
- Coarse-grain – switch threads only on costly stalls (e.g., L2 cache misses)
  - Advantages – thread switching doesn't have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage – limited in its ability to overcome throughput losses because of pipeline start-up costs: the pipeline must be flushed and refilled on every thread switch

Multithreaded Example: Sun's Niagara (UltraSPARC T1)
- Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)

                      Ultra III                Niagara
  Data width          64-b                     64-b
  Clock rate          1.2 GHz                  1.0 GHz
  Cache (I/D/L2)      32K/64K/(8M external)    16K/8K/3M
  Issue rate          4 issue                  1 issue
  Pipe stages         14 stages                6 stages
  BHT entries         16K x 2-b                None
  TLB entries         128I/512D                64I/64D
  Memory BW           2.4 GB/s                 ~20 GB/s
  Transistors         29 million               200 million
  Power (max)         53 W                     <60 W

- [Block diagram: eight 4-way MT SPARC pipes connected through a crossbar to a 4-way banked L2$, the memory controllers, and shared I/O functions]
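The following is a minimal sketch, not from the slides, of the fine-grain thread-select policy described above: each cycle a round-robin selector picks the next thread that is not stalled, and an issue slot is wasted only when every thread is stalled. The four-thread count, the stall durations, and the pretend cache-miss pattern are illustrative assumptions rather than Niagara's actual behavior.

```c
/* Minimal sketch (assumed values, not Niagara's): a cycle-by-cycle model of
 * fine-grain multithreading. Each cycle the thread-select logic rotates
 * round-robin over 4 hardware threads and issues from the first one that is
 * not stalled; if every thread is stalled, the issue slot is lost.
 */
#include <stdio.h>

#define NTHREADS 4
#define NCYCLES  16

int main(void) {
    int stall[NTHREADS]  = {0, 3, 0, 5}; /* remaining stall cycles per thread (illustrative) */
    int issued[NTHREADS] = {0};          /* instructions issued per thread */
    int last = NTHREADS - 1;             /* last thread that issued */

    for (int cycle = 0; cycle < NCYCLES; cycle++) {
        int chosen = -1;
        /* Round-robin: start after the last issuer, skip stalled threads. */
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (stall[t] == 0) { chosen = t; break; }
        }
        if (chosen >= 0) {
            issued[chosen]++;
            last = chosen;
            /* Pretend every 4th instruction of a thread misses the cache. */
            if (issued[chosen] % 4 == 0) stall[chosen] = 3;
            printf("cycle %2d: issue from thread %d\n", cycle, chosen);
        } else {
            printf("cycle %2d: all threads stalled, slot wasted\n", cycle);
        }
        for (int t = 0; t < NTHREADS; t++)   /* pending stalls drain each cycle */
            if (stall[t] > 0) stall[t]--;
    }
    return 0;
}
```

Coarse-grain multithreading would instead keep issuing from one thread until it hit a long stall, and would then pay a pipeline flush-and-refill penalty before another thread could issue, which is exactly the trade-off the slide above describes.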
Niagara Integer Pipeline
- Cores are simple (single-issue, 6-stage, no branch prediction), small, and power-efficient.
- [Pipeline diagram: Fetch (I$, ITLB, instruction buffers x4, PC logic x4), Thread Select (thread-select muxes and thread-select logic driven by instruction type, cache misses, traps & interrupts, and resource conflicts), Decode (register files x4), Execute (ALU, multiplier, shifter, divider), Memory (D$, DTLB, store buffers x4), and Writeback to the crossbar interface. From MPR, Vol. 18, #9, Sept. 2004]

Simultaneous Multithreading (SMT)
- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled (superscalar) processor to exploit both program ILP and thread-level parallelism (TLP)
  - Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the ILP they have)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependencies among them
    - Need separate rename tables (ROBs) for each thread
    - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
- Intel's Pentium 4 SMT is called hyperthreading
  - Supports just two threads (doubles the architecture state)

Threading on a 4-way SS Processor Example
- [Figure: issue slots (horizontal) versus time (vertical) for threads A-D under coarse MT, fine MT, and SMT. Coarse MT runs one thread until a costly stall, fine MT interleaves threads cycle by cycle, and SMT can fill the slots of a single cycle with instructions from different threads]

Multicore: Xbox 360 "Xenon" Processor
- Goal: provide game developers with a balanced and powerful platform
- Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  - 165M transistors total
  - 3.2 GHz, near-POWER ISA
  - 2-issue, 21-stage pipeline, with 128 128-bit registers
  - Weak branch prediction – supported by software hinting
  - In-order execution
  - Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of everything else
- An ATI-designed 500 MHz GPU with 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Xenon Diagram
- [Block diagram: Core 0/1/2, each with L1 I$ and D$, sharing a 1MB UL2; XMA decoder; memory controllers MC0/MC1 to 512MB DRAM; BIU/IO interface to the SMC and the GPU (3D core with 10MB EDRAM, analog chip, video out); I/O: DVD, HDD port, front USBs (2), wireless MU ports (2 USBs), rear USB (1), Ethernet, IR, audio out, flash, systems control]

The PS3 "Cell" Processor Architecture
- A non-SMP architecture
  - 234M transistors @ 4 GHz
  - 1 Power Processing Element (PPE), 8 "Synergistic" (SIMD) Processing Elements (SPEs)
  - 512KB L2 $
  - A massively high-bandwidth (200GB/s) bus connects it to everything else
- The PPE is strangely similar to one of the Xenon cores
  - Almost identical, really; slight ISA differences, and fine-grained MT instead of real SMT
- The real differences lie in the SPEs (21M transistors each)
  - An attempt to "fix" the memory latency problem by giving each processor complete control over its own 256KB "scratchpad" (14M transistors, direct mapped for low latency)
  - 4 vector units per SPE, 1 of everything else (7M transistors)

How to Make Use of the SPEs
- (a rough C sketch of the offload pattern the SPEs are built around appears at the end of these notes)

What About the Software?
- The Cell uses a special IBM "Hypervisor"
  - Like an OS for OSs
  - Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
- Software must be specially coded to run well
  - The single PPE will quickly be bogged down
  - Must make use of the SPEs wherever possible
  - This isn't easy, by any standard
- What about Microsoft?
  - The development suite identifies which 6 threads you're expected to run
  - Four of them are DirectX-based and handled by the OS
  - Functionally, only two threads need to be written

Next Lecture and Reminders
- Next lecture: Review for the Final
- Reminders: the Final is Tuesday, December 12, from 8:00-9:50 AM in ITT 322
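Returning to the Cell slides above, the sketch below shows the general offload pattern the SPEs are built around: stream data through a small, software-managed local store in tiles instead of relying on a hardware cache. This is plain portable C, not Cell SDK code; memcpy() stands in for the SPE's DMA engine, LOCAL_STORE_BYTES mirrors the 256KB scratchpad mentioned in the slides, and scale_tile() is a hypothetical stand-in for whatever SIMD kernel an SPE would actually run.

```c
/* Minimal sketch of scratchpad-style offload (not Cell SDK code).
 * memcpy() models DMA transfers, local_store[] models the SPE's
 * 256KB software-managed local memory, and scale_tile() models the
 * vector kernel that runs out of that local memory.
 */
#include <stdio.h>
#include <string.h>

#define LOCAL_STORE_BYTES (256 * 1024)
/* Use half the local store for data, leaving room for code and stack. */
#define TILE_FLOATS (LOCAL_STORE_BYTES / sizeof(float) / 2)

static float local_store[TILE_FLOATS];   /* models the SPE scratchpad */

static void scale_tile(float *tile, size_t n, float k) {
    for (size_t i = 0; i < n; i++)        /* the "vector unit" work */
        tile[i] *= k;
}

/* Process a large array in local-store-sized tiles: DMA in, compute, DMA out. */
void offload_scale(float *data, size_t n, float k) {
    for (size_t done = 0; done < n; done += TILE_FLOATS) {
        size_t chunk = (n - done < TILE_FLOATS) ? (n - done) : TILE_FLOATS;
        memcpy(local_store, data + done, chunk * sizeof(float)); /* "DMA" in  */
        scale_tile(local_store, chunk, k);                       /* compute   */
        memcpy(data + done, local_store, chunk * sizeof(float)); /* "DMA" out */
    }
}

int main(void) {
    static float a[1 << 20];              /* 1M floats, much larger than the local store */
    for (size_t i = 0; i < sizeof a / sizeof a[0]; i++) a[i] = 1.0f;
    offload_scale(a, sizeof a / sizeof a[0], 2.0f);
    printf("a[0] = %.1f, a[last] = %.1f\n", a[0], a[sizeof a / sizeof a[0] - 1]);
    return 0;
}
```

The point of the pattern is that the programmer, not the hardware, decides what lives in the local store and when it moves, which is why the slides stress that Cell software must be specially coded to run well.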