Multithreaded Processors
Dezső Sima
Spring 2007 (Ver. 2.1)
Dezső Sima, 2007

Overview
• 1. Introduction
• 2. Overview of multithreaded cores
• 3. Thread scheduling
• 4. Case examples
• 4.1. Coarse grained multithreaded cores
• 4.2. Fine grained multithreaded cores
• 4.3. SMT cores

1. Introduction

1. Introduction (1)

Aim of multithreading: to raise performance (beyond superscalar or EPIC execution) by introducing and utilizing finer grained parallelism than multitasking at execution.

Thread: a flow of control (in superscalars: the dynamic sequence of instructions to be executed, managed as an entity during instruction scheduling for dispatch or issue).

1. Introduction (2)

Process / thread management example: in sequential programming a single process (P1) runs; in multitasked programming the processes P1, P2, P3 are created and started by fork(), exec() or CreateProcess(); in multithreaded programming the threads T1...T6 of a process are created by CreateThread() or fork() and synchronized by join().

Figure 1.1: Principle of sequential-, multitasked- and multithreaded programming

1. Introduction (3)

Main features of multithreading

Threads
• belong to the same process,
• usually share a common address space (otherwise multiple address translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously (i.e. overlapped by time sharing) or in parallel), depending on the implementation of multithreading.

Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.

Basic thread states
• thread program state (state of the ISA), including: PC, FX/FP architectural registers, state registers,
• thread microstate (supplementary state of the microarchitecture), including: rename register mappings, branch history, ROB etc.

1. Introduction (4)

Implementation of multithreading (while executing multithreaded apps/OSs)

Software multithreading: execution of multithreaded apps/OSs on a single threaded processor simultaneously (i.e. by time sharing); multiple threads are maintained simultaneously by the OS (multithreaded OSs). Fast context switching between threads is required.

Hardware multithreading: execution of multithreaded apps/OSs on a multithreaded processor concurrently; multiple threads are maintained concurrently by the processor (multithreaded processors).

1. Introduction (5)

Multithreaded processors
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing): two or more cores on a chip, each with its own L2/L3, attached to L3/memory.
• Multithreaded cores: a single MT core with its L2/L3, attached to L3/memory.

1. Introduction (6)

Requirement of software multithreading

Maintaining multiple thread program states concurrently by the OS, including: PC, FX/FP architectural registers, state registers.

Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including: PC, FX/FP architectural registers, state registers.
• Maintaining multiple thread microstates, pertaining to: rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as: the instruction buffer, the store queue, and, in case of merged architectural and rename registers, appropriately large register file sizes (FX/FP) etc.

Options to provide multiple states
• implementing individual per thread structures, like 2 or 4 sets of FX registers,
• implementing tagged structures, like a tagged ROB, a tagged buffer etc.

1. Introduction (7)

                          Multicore processors   Multithreaded cores
Additional complexity     ~ (60 – 80) %          ~ (2 – 10) %
Additional gain
(in gen. purp. apps)      ~ (60 – 80) %          ~ (0 – 30) %
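The thread management primitives of Figure 1.1 map directly onto the POSIX threads API of the "Unix w/Posix" systems listed below. A minimal, self-contained C sketch (the worker function and the thread count are illustrative only):

#include <pthread.h>
#include <stdio.h>

/* Worker executed by each thread; all threads share the process address space. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = {1, 2};

    /* corresponds to CreateThread() in Figure 1.1 */
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &id[i]);

    /* join(): wait for both threads to terminate */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}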
1. Introduction (8)

Multithreaded OSs
• Windows NT
• OS/2
• Unix w/Posix
• most OSs developed from the '90s on

1. Introduction (9)

Contrasting the execution models:

• Sequential programs: a single process on a single processor; sequential execution; key issue: the sequential bottleneck (no issues with parallel execution).
• Multitasked programs: multiple processes on a single processor using time sharing; multiple programs with quasi-parallel execution; private address spaces; key issue: solutions for fast context switching.
• Multithreaded programs, software multithreading: multithreaded software on a single threaded processor using time sharing; quasi-parallel execution of threads; threads share the process address space; thread context switches needed; key issue: thread state management and context switching.
• Multithreaded programs, hardware multithreading on a multithreaded core: simultaneous execution of threads; threads share the address space; no thread context switches needed (except coarse grained MT); key issue: thread scheduling.
• Multithreaded programs, hardware multithreading on a multicore processor: true parallel execution of threads; threads share the address space; no thread context switches needed; key issue: intra-core communication.

Figure 1.2: Contrasting sequential-, multitasked- and multithreaded execution (1)

1. Introduction (10)

Contrasting the execution models from the software side:

• Sequential programs: legacy OS support; no API level support; performance level: low.
• Multitasked programs: traditional Unix; process life cycle management API; performance level: low-medium.
• Multithreaded programs, software implementation: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix); process and thread management API, explicit threading API, OpenMP; performance level: high.
• Multithreaded programs, hardware multithreading on a multithreaded core: most modern OSs (as above); same APIs; performance level: higher.
• Multithreaded programs, hardware multithreading on a multicore processor: most modern OSs (as above); same APIs; performance level: highest.

Figure 1.3: Contrasting sequential-, multitasked- and multithreaded execution (2)

2. Overview of multithreaded cores

2. Overview of multithreaded cores (1)

Intel's multithreaded desktop families:
• SCMT: 11/02 Pentium 4 (Northwood B), 130 nm/146 mm2, 55 mtrs./82 W, 2-way MT; 02/04 Pentium 4 (Prescott), 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT
• DCMT: 5/05 Pentium EE 840 (Smithfield), 90 nm/2*103 mm2, 230 mtrs./130 W, 2-way MT/core; 1/06 Pentium EE 955/965 (Presler), 65 nm/2*81 mm2, 2*188 mtrs./130 W, 2-way MT/core

Figure 2.1: Intel's multithreaded desktop families

2. Overview of multithreaded cores (2)

Intel's multithreaded Xeon DP families:
• SCMT: 2/02 Pentium 4 (Prestonia-A), 130 nm/146 mm2, 55 mtrs./55 W, 2-way MT; 11/03 Pentium 4 (Irwindale-A), 130 nm/135 mm2, 169 mtrs./110 W, 2-way MT; 6/04 Pentium 4 (Nocona), 90 nm/112 mm2, 125 mtrs./103 W, 2-way MT
• DCMT: 10/05 Xeon DP 2.8 (Paxville DP), 90 nm/2*135 mm2, 2*169 mtrs./135 W, 2-way MT/core; 6/06 Xeon 5000 (Dempsey), 65 nm/2*81 mm2, 2*188 mtrs./95/130 W, 2-way MT/core

Figure 2.2: Intel's multithreaded Xeon DP families
2. Overview of multithreaded cores (3)

Intel's multithreaded Xeon MP families:
• SCMT: 3/02 Pentium 4 (Foster-MP), 180 nm/n/a, 108 mtrs./64 W, 2-way MT; 3/04 Pentium 4 (Gallatin), 130 nm/310 mm2, 178/286 mtrs./77 W, 2-way MT; 3/05 Pentium 4 (Potomac), 90 nm/339 mm2, 675 mtrs./95/129 W, 2-way MT
• DCMT: 11/05 Xeon 7000 (Paxville MP), 90 nm/2*135 mm2, 2*169 mtrs./95/150 W, 2-way MT/core; 8/06 Xeon 7100 (Tulsa), 65 nm/435 mm2, 1328 mtrs./95/150 W, 2-way MT/core

Figure 2.3: Intel's multithreaded Xeon MP families

2. Overview of multithreaded cores (4)

Intel's multithreaded EPIC based server family:
• DCMT: 7/06 Itanium 9x00 (Montecito), 90 nm/596 mm2, 1720 mtrs./104 W, 2-way MT/core

Figure 2.4: Intel's multithreaded EPIC based server family

2. Overview of multithreaded cores (5)

IBM's multithreaded server families:
• SCMT: 2000 RS64 IV (SStar), 180 nm/n/a, 44 mtrs./n/a, 2-way MT; 2006 Cell BE PPE, 90 nm/221* mm2, 234* mtrs./95* W, 2-way MT (*: entire processor)
• DCMT: 5/04 POWER5, 130 nm/389 mm2, 276 mtrs./80 W (est.), 2-way MT/core; 10/05 POWER5+, 90 nm/230 mm2, 276 mtrs./70 W, 2-way MT/core; 2007 POWER6, 65 nm/341 mm2, 750 mtrs./~100 W, 2-way MT/core

Figure 2.5: IBM's multithreaded server families

2. Overview of multithreaded cores (6)

Sun's and Fujitsu's multithreaded server families:
• 8CMT: 11/2005 UltraSPARC T1 (Niagara), 90 nm/379 mm2, 279 mtrs./63 W, 4-way MT/core; 2007 UltraSPARC T2 (Niagara II), 65 nm/342 mm2, 72 W (est.), 8-way MT/core
• DCMT: 2007 APL SPARC64 VI (Olympus), 90 nm/421 mm2, 540 mtrs./120 W, 2-way MT/core
• QCMT: 2008 APL SPARC64 VII (Jupiter), 65 nm/464 mm2, ~120 W, 2-way MT/core

Figure 2.6: Sun's and Fujitsu's multithreaded server families

2. Overview of multithreaded cores (7)

RMI's multithreaded XLR family (scalar RISC):
• 8CMT: 5/05 XLR 5xx, 90 nm/~220 mm2, 333 mtrs./10-50 W, 4-way MT/core

Figure 2.7: RMI's multithreaded XLR family (scalar RISC)

2. Overview of multithreaded cores (8)

DEC's/Compaq's multithreaded processor:
• SCMT: Alpha 21464 (V8), scheduled for 2003, 130 nm/n/a, 250 mtrs./10-50 W, 4-way MT; cancelled 6/2001

Figure 2.8: DEC's/Compaq's multithreaded processor

2. Overview of multithreaded cores (9)

Multithreaded cores grouped by the underlying core type:
• Scalar core(s): IBM RS64 IV (2000) (SStar), single-core/2T; SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T; RMI XLR 5xx (2005), 8 cores/4T
• Superscalar core(s): Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-); DEC 21464 (2003), single-core/4T; IBM POWER5 (2005), dual-core/2T; PPE of Cell BE (2006), single-core/2T; Fujitsu SPARC64 VI / VII, dual-core/quad-core/2T
• VLIW core(s): SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006), dual-core/2T

3. Thread scheduling

3. Thread scheduling (1)

Thread scheduling in software multithreading on a traditional superscalar processor: the dispatch slots are filled from Thread1 until a context switch, then from Thread2. The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).

Figure 3.1: Thread scheduling assuming software multithreading on a 4-way superscalar processor

3. Thread scheduling (2)

Thread scheduling in multicore processors (CMPs): both superscalar cores execute different threads independently.

Figure 3.2: Thread scheduling in a dual core processor
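What the software context switch of Figure 3.1 amounts to can be sketched at user level with the POSIX ucontext API (an illustrative sketch only: a real OS switch also saves the FP and state registers, i.e. the full thread program state described in the Introduction):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t ctx_main, ctx_thread;

static void thread_body(void)
{
    printf("thread: running, now yielding\n");
    swapcontext(&ctx_thread, &ctx_main);   /* save our state, restore main's */
    printf("thread: resumed\n");
}

int main(void)
{
    static char stack[64 * 1024];

    getcontext(&ctx_thread);
    ctx_thread.uc_stack.ss_sp = stack;
    ctx_thread.uc_stack.ss_size = sizeof stack;
    ctx_thread.uc_link = &ctx_main;        /* resume main when the body returns */
    makecontext(&ctx_thread, thread_body, 0);

    swapcontext(&ctx_main, &ctx_thread);   /* context switch: main -> thread */
    printf("main: thread yielded\n");
    swapcontext(&ctx_main, &ctx_thread);   /* switch back to finish the thread */
    printf("main: done\n");
    return 0;
}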
3. Thread scheduling (3)

Thread scheduling in multithreaded cores: coarse grained MT.

3. Thread scheduling (4)

In a coarse grained multithreaded core the dispatch/issue slots are filled from Thread1 until a context switch, then from Thread2; the threads are switched by means of rapid, HW-supported context switches.

Figure 3.3: Thread scheduling in a 4-way coarse grained multithreaded processor

3. Thread scheduling (5)

Coarse grained MT cores:
• Scalar based: IBM RS64 IV (2000) (SStar), single-core/2T
• VLIW based: SUN MAJC 5200 (2000), quad-core/4T (dedicated use); Intel Montecito (2006), dual-core/2T

3. Thread scheduling (6)

Thread scheduling in multithreaded cores: coarse grained MT, fine grained MT.

3. Thread scheduling (7)

In a fine grained multithreaded core the hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.

Figure 3.4: Thread scheduling in a 4-way fine grained multithreaded processor

3. Thread scheduling (8)

Fine grained MT cores:
• Round robin selection policy, scalar based: SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• Priority based selection policy, superscalar based: PPE of Cell BE (2006), single-core/2T

3. Thread scheduling (9)

Thread scheduling in multithreaded cores: coarse grained MT, fine grained MT, simultaneous MT (SMT).

3. Thread scheduling (10)

In an SMT core, available instructions from multiple threads (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.

SMT: proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).

Figure 3.5: Thread scheduling in a 4-way simultaneous multithreaded processor

3. Thread scheduling (11)

SMT cores:
• Superscalar based: Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-); DEC 21464 (2003), single-core/4T (cancelled in 2001); IBM POWER5 (2005), dual-core/2T
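The difference between the policies can be made concrete with a toy dispatch loop: in an SMT core the dispatch/issue slots of a single cycle are filled from several ready threads, whereas a fine grained core takes all slots of a cycle from one selected thread. A sketch, in which the slot count, the readiness figures and the priority order are all assumed for illustration:

#include <stdio.h>

#define SLOTS   4
#define THREADS 4

int main(void)
{
    /* ready[t] = instructions thread t can offer this cycle (assumed) */
    int ready[THREADS]  = {2, 1, 3, 0};
    int issued[THREADS] = {0};

    /* Fill one cycle's slots from the highest-priority ready threads
       (thread 0 = highest). A fine grained core would instead take all
       SLOTS instructions of the cycle from a single selected thread. */
    int slots = SLOTS;
    for (int t = 0; t < THREADS && slots > 0; t++) {
        int take = ready[t] < slots ? ready[t] : slots;
        issued[t] = take;
        slots -= take;
    }
    for (int t = 0; t < THREADS; t++)
        printf("thread %d: %d slot(s) this cycle\n", t, issued[t]);
    return 0;
}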
4. Case examples

4.1. Coarse grained multithreading
4.2. Fine grained multithreading
4.3. SMT multithreading

4.1 Coarse grained multithreaded processors

4.1.1. IBM RS64 IV
4.1.2. SUN MAJC 5200
4.1.3. Intel Montecito

4.1. Coarse grained multithreaded processors

Thread scheduling in multithreaded cores: coarse grained MT, fine grained MT, simultaneous MT (SMT).

4.1.1. IBM RS64 IV (1)

Microarchitecture: 4-way superscalar, dual-threaded. Used in IBM's iSeries and pSeries commercial servers. Optimized for commercial server workloads, such as on-line transaction processing, Web serving, ERP (Enterprise Resource Planning).

Characteristics of server workloads:
• large working sets,
• poor locality of references and frequently occurring task switches,
• high cache miss rates,
• memory bandwidth and latency strongly limit performance.

Consequences:
• need for wide instruction and data fetch bandwidth,
• need for large L1 caches,
• use of multithreading to hide memory latency.

4.1.1. IBM RS64 IV (2)

Main microarchitectural features of the RS64 IV to support commercial workloads:
• 128 KB L1 D$ and L1 I$,
• instruction fetch width: 8 instr./cycle,
• dual-threaded core.

4.1.1. IBM RS64 IV (3)

IERAT: effective to real address translation cache (2x64 entries); 6XX bus.

Figure 4.1.1: Microarchitecture of IBM's RS64 IV
Source: Borkenhagen J. M. et al., „A multithreaded PowerPC processor for commercial servers", IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

4.1.1. IBM RS64 IV (4)

Multithreading policy (strongly simplified)

Coarse grained MT with two threads: a foreground thread and a background thread. The foreground thread executes until a long latency event, such as a cache miss or an IERAT miss, occurs. Subsequently, a thread switch is performed and the background thread begins to execute. After the long latency event has been serviced, a thread switch back to the foreground thread occurs.

Both single threaded and multithreaded modes of execution are supported. Threads can be allocated different priorities by explicit instructions.

Implementation of multithreading

Dual architectural states are maintained for:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode regs., such as the MSR (machine state reg.),
• status and control regs., such as the thread priority.

Each thread executes in its own effective address space (an unusual feature of multithreaded cores), so units used for address translation need to be duplicated, such as the SRs (Segment Registers).

A Thread Switch Buffer holds up to 8 instructions from the background thread, to shorten context switching by eliminating the latency of the I$.

Die area additionally needed for multithreading: ~ +5 %.

4.1.1. IBM RS64 IV (5)

Figure 4.1.2: Thread switch on data cache miss in IBM's RS64 IV
Source: Borkenhagen J. M. et al., „A multithreaded PowerPC processor for commercial servers", IBM J. Res. Develop., Vol. 44, No. 6, Nov. 2000, pp. 885-898

4.1.2. SUN MAJC 5200 (1)

Aim: dedicated use, high-end graphics, networking with wire-speed computational demands.

Microarchitecture:
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.),
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.

Each processor is a 4-wide VLIW and can be 4-way multithreaded.

4.1.2. SUN MAJC 5200 (2)

Figure 4.1.3: General view of SUN's MAJC 5200
Source: "MAJC Architecture Tutorial," Whitepaper, Sun Microsystems, Inc.

4.1.2. SUN MAJC 5200 (3)

Figure 4.1.4: The principle of private, unified register files associated with each FU
Source: "MAJC Architecture Tutorial," Whitepaper, Sun Microsystems, Inc.

4.1.2. SUN MAJC 5200 (4)

Threading

Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun); 4-way multithreading is implemented by executing each thread on one of the 4 FUs.

Thread switch: following a cache miss, the processor saves the thread state and begins to process the next thread.

Example: comparison of program execution without and with multithreading on a 4-wide VLIW. The considered program
• consists of 100 instructions,
• executes on average 2.5 instrs./cycle,
• incurs a cache miss after every 20 instructions,
• the latency of serving a cache miss is 75 cycles.

4.1.2. SUN MAJC 5200 (5)

Figure 4.1.5: Execution with subsequent cache misses in a single threaded processor
Source: "MAJC Architecture Tutorial," Whitepaper, Sun Microsystems, Inc.

4.1.2. SUN MAJC 5200 (6)

Figure 4.1.6: Execution with subsequent cache misses in SUN's MAJC 5200
Source: "MAJC Architecture Tutorial," Whitepaper, Sun Microsystems, Inc.
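Working through the numbers of the example (assuming, as the figures suggest, that the compute phases of the four threads can overlap the misses of the others): one thread computes 20 instructions in 20/2.5 = 8 cycles and then stalls for 75 cycles, i.e. 20 instructions per 83 cycles (IPC ≈ 0.24), whereas four threads switched on each miss deliver 4 × 20 = 80 instructions in roughly the same 83 cycles (IPC ≈ 0.96), i.e. about a fourfold throughput gain. The same arithmetic in a few lines of C:

#include <stdio.h>

int main(void)
{
    const double instrs = 20.0, ipc = 2.5, miss_latency = 75.0;
    const int threads = 4;

    double compute = instrs / ipc;            /* 8 cycles per miss interval    */
    double single  = compute + miss_latency;  /* 83 cycles for 20 instructions */

    /* 4 threads: while one thread waits 75 cycles, the other three run
       3*8 = 24 cycles; one round takes max(single, threads*compute) cycles. */
    double round = (threads * compute > single) ? threads * compute : single;

    printf("single thread IPC: %.2f\n", instrs / single);                   /* ~0.24 */
    printf("%d-thread IPC:     %.2f\n", threads, threads * instrs / round); /* ~0.96 */
    return 0;
}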
4.1.3. Intel Montecito (1)

Aim: high end servers.

Main differences between the Itanium 2 and Montecito:
• split L2 caches,
• larger unified L3 cache,
• duplicated architectural states for the FX/FP registers and the branch and predicate registers; a next address register is maintained per thread,
• (Foxton technology for power management/frequency boost, planned but not implemented).

Additional support for dual-threading (duplicated microarchitectural states):
• the branch prediction structures provide thread tagging,
• per thread return address stacks,
• per thread ALATs (Advanced Load Address Table).

Additional core area needed for multithreading: ~ 2 %.

4.1.3. Intel Montecito (2)

Figure 4.1.7: Microarchitecture of Intel's Itanium 2
Source: McNairy, C., „Itanium 2", IEEE Micro, Vol. 23, No. 2, March/April 2003, pp. 44-55

4.1.3. Intel Montecito (3)

Figure 4.1.8: Microarchitecture of Intel's Montecito (ALAT: Advanced Load Address Table)
Source: McNairy, C., „Montecito", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 10-20

4.1.3. Intel Montecito (4)

Thread switches: 5 event types cause thread switches, such as L3 cache misses or programmed switch hints. Total switch penalty: 15 cycles.

Example of thread switching: if the control logic detects that a thread does not make progress, a thread switch is initiated.

4.1.3. Intel Montecito (5)

Figure 4.1.9: Thread switch in Intel's Montecito vs. single thread execution
Source: McNairy, C., „Montecito", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 10-20

4.2 Fine grained multithreaded processors

4.2.1. SUN UltraSPARC T1
4.2.2. PPE of Cell BE

4.2. Fine grained multithreaded processors

Thread scheduling in multithreaded cores: coarse grained MT, fine grained MT, simultaneous MT (SMT).

4.2.1. SUN UltraSPARC T1 (1)

Aim: commercial server applications, such as
• web servicing,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).

Characteristics of commercial server applications:
• large working sets,
• poor locality of memory references,
• high cache miss rates,
• low prediction accuracy for data dependent branches.

Memory latency strongly limits performance; multithreading is used to hide memory latency.

4.2.1. SUN UltraSPARC T1 (2)

Structure:
• 8 scalar cores, 4-way multithreaded each,
• all 32 threads share an L2 cache of 3 MB, built up of 4 banks,
• 4 memory channels with on-chip DDR2 memory controllers.

It runs under Solaris.

4.2.1. SUN UltraSPARC T1 (3)

Figure 4.2.1: Block diagram of SUN's UltraSPARC T1
Source: Kongetira P. et al., „Niagara", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 21-29

4.2.1. SUN UltraSPARC T1 (5)

Figure 4.2.2: SUN's UltraSPARC T1 chip
Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf

4.2.1. SUN UltraSPARC T1 (6)

Processor Elements (Sparc pipes):
• scalar FX-units with a 6-stage pipeline,
• all Processor Elements share a single FP-unit.

Each thread of a Processor Element has its private:
• PC-logic,
• register file,
• instruction buffer,
• store buffer.

There is no thread switch penalty.

Figure 4.2.3: Microarchitecture of the core of SUN's UltraSPARC T1
Source: Kongetira P. et al., „Niagara", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 21-29
4.2.1. SUN UltraSPARC T1 (11)

Thread switch: threads are switched on a per cycle basis.

Selection of threads: in the thread select pipeline stage,
• the thread select multiplexer selects a thread from the set of available threads in each clock cycle and issues the subsequent instruction of this thread from the instruction buffer into the pipeline for execution, and
• fetches the following instruction of the same thread into the instruction buffer.

Thread selection policy: least recently used.

Threads become unavailable due to:
• long-latency instructions, such as loads, branches, multiplies, divides,
• pipeline stalls because of cache misses, traps or resource conflicts.

4.2.1. SUN UltraSPARC T1 (12)

Figure 4.2.5: Microarchitecture of the core of SUN's UltraSPARC T1
Source: Kongetira P. et al., „Niagara", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 21-29

4.2.1. SUN UltraSPARC T1 (14)

Example 1: all 4 threads are available.

Figure 4.2.6: Thread switch in SUN's UltraSPARC T1 when all threads are available
Source: Kongetira P. et al., „Niagara", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 21-29

4.2.1. SUN UltraSPARC T1 (15)

Example 2: only 2 threads are available, with speculative execution of the instructions following a load. (Data referenced by a load instruction arrive in the 3rd cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the next two cycles.)

4.2.1. SUN UltraSPARC T1 (16)

Figure 4.2.7: Thread switch in SUN's UltraSPARC T1 when only two threads are available (thread t0 issues a ld instruction and becomes unavailable for two cycles; the add instruction from thread t0 is speculatively switched into the pipeline assuming a cache hit)
Source: Kongetira P. et al., „Niagara", IEEE Micro, Vol. 25, No. 2, March/April 2005, pp. 21-29
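The select stage described above can be mimicked with a small simulation (a sketch: the least-recently-used tie-break and the two-cycle load shadow follow the slides, while the assumption that every fourth selected instruction is a load is purely illustrative):

#include <stdio.h>

#define THREADS 4
#define CYCLES  12

int main(void)
{
    int busy[THREADS] = {0};      /* cycles until a thread is available again */
    int last_used[THREADS] = {0}; /* cycle in which a thread was last selected */

    for (int cyc = 1; cyc <= CYCLES; cyc++) {
        int pick = -1;
        for (int t = 0; t < THREADS; t++) {
            if (busy[t] > 0) { busy[t]--; continue; }
            /* least recently used among the available threads */
            if (pick < 0 || last_used[t] < last_used[pick])
                pick = t;
        }
        if (pick >= 0) {
            last_used[pick] = cyc;
            /* assume every 4th cycle issues a load: the thread then becomes
               unavailable for the next two cycles (the cache hit case) */
            if (cyc % 4 == 0)
                busy[pick] = 2;
            printf("cycle %2d: issue from thread %d\n", cyc, pick);
        } else {
            printf("cycle %2d: no thread available\n", cyc);
        }
    }
    return 0;
}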
4.2.2. Cell BE

Overview of the Cell BE
Processor components
Multithreading the PPE
Programming models
Implementation of the Cell BE

Overview of the Cell BE (1)

Cell BE

Objective: speeding up game/multimedia apps. Used in the PlayStation 3 (PS3) and in the QS20 Blade Server. Goal: 100 times the PS2 performance.

History (the Cell BE is a collaborative effort of Sony, IBM and Toshiba):
• Summer 2000: high level architectural discussions
• End 2000: architectural concept
• March 2001: Design Center opened in Austin, TX
• Spring 2004: single Cell BE operational
• Summer 2004: 2-way SMP operational
• Febr. 2005: first technical disclosures
• Oct. 2005: Mercury announces Cell Blade
• Nov. 2005: Open Source SDK & Simulator published
• Febr. 2006: IBM announces Cell Blade QS20

Cell BE at NIK: May 2007: a QS20 arrives at NIK within IBM's loan program.

Overview of the Cell BE (2)

Main functional units of the Cell BE:
• 9 cores: the PPE (Power Processing Element), a dual threaded, dual issue 64-bit PowerPC compliant processor, and 8 SPEs (Synergistic Processing Elements), single threaded, dual issue 128-bit SIMD processors,
• the EIB (Element Interconnect Bus), an on-chip interconnection network,
• the MIC (Memory Interface Controller), a memory controller supporting dual Rambus XDR channels, and
• the BIC (Bus Interface Controller), which interfaces the Rambus FlexIO bus.

Overview of the Cell BE (3)

SPE: Synergistic Processing Element; SPU: Synergistic Processor Unit; SXU: Synergistic Execution Unit; LS: Local Store of 256 KB; SMF: Synergistic Mem. Flow Unit; EIB: Element Interface Bus; PPE: Power Processing Element; PPU: Power Processing Unit; PXU: POWER Execution Unit; MIC: Memory Interface Contr.; BIC: Bus Interface Contr.; XDR: Rambus DRAM

Figure 4.2.8: Block diagram of the Cell BE [4.2.2.1]

Overview of the Cell BE (4)

Unique features of the Cell BE

a) Heterogeneous MCP, rather than a symmetrical MCP (as usual implementations are).

The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.

The SPEs
• are optimized to run compute intensive SIMD apps.,
• usually operate under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory, including the memory mapped I/O space,
• can be programmed in C/C++.

Contrasting the PPE and the SPEs:
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute-intensive tasks and slower at task switching.

Overview of the Cell BE (5)

b) The SPEs have an unusual storage architecture:
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
  o they fetch instructions from their private LS, and
  o their load/store instructions access their LS rather than the main store,
• the LS has no associated cache,
• SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS, and DMA commands can be batched (up to 16 commands).

Overview of the Cell BE (6)

Although the PPE and the SPEs have coherent access to main memory, the Cell BE is not a traditional shared-memory multiprocessor, as the SPEs operate in connection with their LSs rather than with the main memory.

Processor components of the Cell BE (1)

PPE (Power Processing Element) [4.2.2.2]
• fully compliant 64-bit Power processor (Architecture Specification 2.02),
• fc = 3.2 GHz (11 FO4 design, 23 pipeline stages),
• dual-issue, in-order, two-way (fine grained) multithreaded core,
• conventional cache architecture: 32 KB I$, 32 KB D$, 512 KB unified L2.

Processor components of the Cell BE (2)

Instruction Unit, FX Execution Unit, Vector Scalar Unit (Vector/Media Execution)

Figure 4.2.9: Main functional units of the PPE [4.2.2.3]

Processor components of the Cell BE (3)

Main components of the PPE

IU (Instruction Unit)
• predecodes instructions while loading them from the L2 cache to the L1 cache,
• fetches 4 instructions per cycle, alternating between the two threads, from the L1 instruction cache into two instruction buffers (one per thread),
• dispatches instructions from the two instruction buffers to the shared decode, dependency checking and issue pipeline according to the thread scheduling rules.
Microcode Engine

Instructions that are either difficult to implement in hardware or are rarely used can be split into a few simple PowerPC instructions, which are stored in a ROM (such as the load string or several Condition Register (CR) instructions). Most microcoded instructions are split into two or three microcoded instructions. The Microcode Engine inserts microcoded instructions from one thread into the instruction flow with a delay of 11 clock cycles. The Microcode Engine stalls dispatching from the instruction buffers until the last microcode of the microcoded instruction has been dispatched. The next dispatch cycle belongs to the thread that did not invoke the Microcode Engine.

Shared decode, dependency checking and issue pipeline

Receives the dispatched instructions (up to two in each cycle, from the same thread), decodes them, checks for dependencies, and issues the instructions for execution according to the issue rules.

Processor components of the Cell BE (4)

XU (FX Execution Unit)
• 32x64-bit register file/thread,
• FXU (FX Unit), LSU (L/S Unit), BRU (Branch Unit),
• per thread branch prediction (6-bit global history, 4 K x 2-bit history table).

VSU (Vector Scalar Unit)
• VMX/FPU issue queue, also called the VSU (Vector-Scalar Unit) issue queue (two entries),
• VMX (Vector-Media Execution Unit), also called the VXU (Vector Execution Unit):
  o 32x128-bit vector register file/thread,
  o simple, complex, permute and single-precision FP subunits,
  o 128-bit SIMD instructions with varying data width (2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit, 128x1-bit),
• FPU (FP Unit):
  o 32x64-bit register file/thread,
  o 10-stage double precision pipeline.

Processor components of the Cell BE (5)

Basic operation of the PPE

Instruction fetch
• Instruction fetch operates autonomously, in order to keep each thread's instruction buffer full with useful instructions that are likely to be needed.
• 4 instr./cycle are fetched, strictly alternating between the two threads, from the L1 I$ into the private instruction buffers of the threads.
• The fetch address is taken from the Instruction Fetch Address Register associated with each thread (IFAR0, IFAR1). The IFARs are distinct from the Program Counters (PCs) associated with the threads: the PCs track the actual program flow, while the IFARs track the predicted instruction execution flow.
• Accessing the taken path after a predicted-taken branch requires 8 cycles.

Processor components of the Cell BE (6)

Instruction dispatch
• Moves up to two instructions, either from one of the instruction buffers or from the Microcode Engine (complex instructions), to the shared decode, dependency check and issue pipeline.
• Instruction dispatch is governed by the dispatch rules (thread scheduling rules).
• The dispatch rules take into account thread priority and stall conditions.
• Each pipeline stage beyond the dispatch point contains instructions from one thread only.

Instruction decode and dependency checking
• Decodes up to two instructions from the same thread in each cycle and checks for dependencies.

Processor components of the Cell BE (7)

Pipeline stages and units: IFAR: Instr. Fetch Addr. Register; IC: instruction cache; IB: instruction buffer; ibuf: instruction buffer; ID: instruction decode; IS: instruction issue; IU: Instruction Unit; VSU: Vector Scalar Unit; VXU: Vector Execution Unit; FPU: FP Execution Unit; BRU: Branch Unit; XU: FX Execution Unit; FXU: FX Execution Unit; LSU: L/S Execution Unit

Figure 4.2.10: Instruction flow in the PPE [4.2.2.4]
Processor components of the Cell BE (8)

Instruction issue at pipeline stage IS2
• Forwards up to two PowerPC or vector/SIMD multimedia extension instructions per cycle from the IS2 pipeline stage for execution, to the VSU (VMX/FPU) issue queue (up to two instr./cycle) or to the BRU, LSU and FXU execution units (up to one instr./cycle per execution unit).
• Any issue combinations are allowed, except two instructions to the same unit, with a few restrictions (see Figure 4.2.11 for the valid issue combinations). Note that the valid resp. invalid issue combinations result from the underlying microarchitecture.
• Instructions are issued in each cycle from the same thread.
• Instruction issue can be stalled at the IS2 pipeline stage for various reasons, like invalid issue combinations or a full VSU issue queue.

Instruction issue from the VSU (VMX/FPU) issue queue
• Forwards up to two VMX or FPU instructions to the respective execution units. Note that instructions kept in the issue queue are already prearranged for execution, i.e. they obey the issue restrictions summarized in Figure 4.2.11.

Processor components of the Cell BE (9)

Figure 4.2.11: Valid issue combinations (designated as pink squares; rows: older instruction, columns: younger instruction) [4.2.2.4]
Type 1 instructions: VXU simple, VXU complex, VXU FP and FPU arithmetic instructions.
Type 2 instructions: VXU load, VXU store, VXU permute, FPU load and FPU store instructions.

Processor components of the Cell BE (10)

Figure 4.2.12: Pipeline stages of the PPE [4.2.2.3]

Processor components of the Cell BE (11)

EIB data ring for internal communication [4.2.2.2]
• four 16-byte data rings, supporting multiple transfers,
• 96 B/cycle peak bandwidth,
• over 100 outstanding requests,
• 300+ GByte/s @ 3.2 GHz.

Processor components of the Cell BE (12)

SPE [4.2.2.2]
• SPEs are optimized for data-rich operation,
• SPEs are allocated by the PPE,
• SPEs are not intended to run an OS.

Main components:
a) SPU (Synergistic Processing Unit)
b) MFC (Memory Flow Controller)
c) LS (Local Store)
d) AUC (Atomic Unit and Cache)

Processor components of the Cell BE (13)

a) SPU

Overview
• Dual-issue superscalar RISC core supporting basically a 128-bit SIMD ISA.
• The SIMD ISA provides FX, FP and logical operations on 2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit and 128x1-bit data.
• In connection with the MFC, the SPU also supports a set of commands for
  - performing DMA transfers,
  - interprocessor messaging and
  - synchronization.
• The SPU executes instructions from the LS (256 KB).
• Instructions reference data from the 128x128-bit unified register file.
• The register file fetches/delivers data from/to the LS by load/store instructions.
• The SPU moves instructions and data between the main memory and the local store by requesting DMA transfers from its MFC (up to 16 outstanding DMA requests are allowed).

Processor components of the Cell BE (14)

SPU: even pipe, odd pipe, LS, MFC

Figure 4.2.13: Block diagram of the SPU [4.2.2.3]

Processor components of the Cell BE (15)

Main components of the SPU

Instruction issue unit – instruction line buffer
• Fetches 32 instructions per LS request from the LS into the instruction line buffer. Instruction fetching is supported by hardware prefetching.
• Prefetching requires 15 cycles to fill the instruction line buffer.
• Fetched instructions are decoded and issued (up to two instructions per cycle) according to the issue rules.

Register file
• Unified register file of 128 registers, each 128 bits wide.

Result forwarding and staging
• Instructions are staged in an operand staging network for up to 6 additional cycles, so that all execution units write their results into the register file in the same pipeline stage (see Figure 4.2.19).

Processor components of the Cell BE (16)

Execution units

The execution units are organised into two pipelines.
• The even pipeline includes
  o the fixed-point unit and
  o the floating-point unit.
• The odd pipeline includes
  o the channel unit,
  o the branch unit,
  o the load/store unit and
  o the permute unit.

Processor components of the Cell BE (17)

Basic operation of the SPU

Instruction issue
• The SPU issues up to two instructions per cycle from a 2-instruction wide issue window, called the fetch group.
• Fetch groups are aligned to doubleword boundaries, i.e. the first instruction is at an even and the second one at an odd word address (words are 4 bytes long).
• An instruction becomes issuable when no register dependencies or resource conflicts (e.g. busy execution units) exist.
• Instructions are issued in program order, that is
  - if the first instruction of a fetch group can be issued to the even pipeline and the second instruction to the odd pipeline, both instructions are issued in the same cycle,
  - in all other cases instruction issue needs two cycles, such that the instructions are issued in program order to the pertaining pipeline (see Figure 4.2.14).
• Register or resource conflicts stall instruction issue.
• A new fetch group is loaded after both instructions of the current fetch group have been issued.

Processor components of the Cell BE (18)

Figure 4.2.14: Instruction issue example [4.2.2.4]
(Assuming that instruction issue is not constrained by register or resource conflicts)
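The pairing rule of Figure 4.2.14 reduces to a simple predicate over the fetch group (a sketch in which the pipe assignment is collapsed into a single flag per instruction):

#include <stdio.h>

typedef struct { const char *name; int even_pipe; } insn;

/* A fetch group is a doubleword-aligned pair. Both instructions issue in
   the same cycle only if the first goes to the even pipe and the second
   to the odd pipe; otherwise the group needs two cycles, preserving
   program order. */
static int cycles_for_group(insn a, insn b)
{
    return (a.even_pipe && !b.even_pipe) ? 1 : 2;
}

int main(void)
{
    insn fma = {"fma", 1}, lqd = {"lqd", 0}, fa = {"fa", 1};
    printf("{fma, lqd}: %d cycle(s)\n", cycles_for_group(fma, lqd)); /* dual issue */
    printf("{lqd, fa}:  %d cycle(s)\n", cycles_for_group(lqd, fa));  /* two cycles */
    return 0;
}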
Processor components of the Cell BE (19)

SPU channels
• An SPU communicates with its associated MFC, as well as (via its MFC) with the PPE, other SPEs and devices (such as a decrementer), through its channels.

MMIO: Memory-Mapped I/O Registers; SLC: SPU Load and Store Unit; SSC: SPU Channel and DMA Unit

Figure 4.2.15: The channel interface between the SPU and the MFC [4.2.2.4]

Processor components of the Cell BE (20)

• SPU channels are unidirectional interfaces for sending commands (such as DMA commands) to the MFC owned by the SPU, or for sending/receiving up to 32-bit long messages between the SPU and the PPE or other SPEs.
• SPU channels are implemented in and managed by the MFC.
• Each channel has
  - a corresponding capacity (maximum number of message entries) and
  - a count (remaining available message entries).
• The channel count
  - decrements whenever a channel instruction (rdch or wrch) is issued, and
  - increments whenever an operation associated with the channel completes.
• A channel count of "0" means
  - empty for read-only channels and
  - full for write-only channels.

Processor components of the Cell BE (21)

• The SPU can read or write its channels by three instructions: the read channel (rdch), write channel (wrch) and read channel count (rchcnt) instructions.

Figure 4.2.16: Assembler instruction mnemonics and their corresponding C-language intrinsics of the channel instructions available for the SPU [4.2.2.4]
(Intrinsics represent in-line assembly code segments in the form of C-language function calls.)

Processor components of the Cell BE (22)

• The channel instructions, or the DMA commands evoked by channel instructions, are enqueued for execution in the MFC for purposes like
  - initiating DMA transfers between the SPE's LS and the main storage,
  - querying DMA and SPU status,
  - sending or receiving up to 32-bit long mailbox messages, primarily between the SPU and the PPE, or
  - sending or receiving up to 32-bit long signal-notification messages between the SPU and the PPE or other SPEs.
• The PPE and other devices in the system, including other SPEs, can also access the channels through the MFC's memory mapped I/O (MMIO) registers and queues, which are visible to software in the main storage space.

Processor components of the Cell BE (23)

Figure 4.2.17: SPE channels and associated MMIO registers (1) [4.2.2.4]

Processor components of the Cell BE (24)

Figure 4.2.18: SPE channels and associated MMIO registers (2) [4.2.2.4]

Processor components of the Cell BE (25)

Figure 4.2.19: Pipeline stages of the SPUs [4.2.2.1]

Processor components of the Cell BE (26)

b) Memory Flow Controller (MFC) [4.2.2.2]

The MFC
• acts as a specialized co-processor for its associated SPU, autonomously executing its own command set, and
• serves as the SPU's interface, via the EIB, to main storage and to other processor elements, such as other SPEs or system devices.

Processor components of the Cell BE (27)

MMIO: Memory-Mapped I/O Registers; SLC: SPU Load and Store Unit; SSC: SPU Channel and DMA Unit

Figure 4.2.20: Block diagram of the MFC [4.2.2.4]

Processor components of the Cell BE (28)

The MFC as a specialized co-processor

It executes three types of commands: DMA commands, DMA list commands and synchronization commands.

• DMA commands (put, get)
  - can be initiated by both the PPE and the SPU,
  - move up to 16 KByte of data between the LS and the main storage,
  - support transfer sizes of 1, 2, 4, 8 or 16 bytes and multiples of 16 bytes,
  - access the main store by using main storage effective addresses,
  - can be tagged with a 5-bit tag (tag group ID) to allow special handling within the tag group, such as enforcing an ordering of the DMA commands.
• DMA list commands (put, get commands with the command modifier l)
  - can be initiated only by the SPU,
  - consist of up to 2 K 8-byte long list elements,
  - each list element specifies a DMA transfer,
  - are used to move data between a contiguous area in the LS and a possibly noncontiguous area in the effective address space (implementing scatter-gather functions between main storage and the LS).

Processor components of the Cell BE (29)

• Synchronization commands
  - are used basically to control the order of storage accesses,
  - include atomic commands (a form of semaphores), send-signal commands and barrier commands.

Operation of the MFC
• The MFC maintains two separate command queues:
  - the 16-entry SPU command queue for commands from the SPU associated with the MFC, and
  - the 8-entry proxy command queue for commands from the PPE, other SPEs and devices.
• The MFC supports out-of-order execution of DMA commands.

Processor components of the Cell BE (30)

The MFC as the interface between the SPU and the main storage, the PPE and other devices
• supports storage protection on the main storage side while performing DMA transfers,
• maintains synchronization between main storage and the LS,
• performs intercore communication functions, such as mailbox and signal-notification messaging with the PPE, other SPEs and devices.

Processor components of the Cell BE (31)

Intercore communication tools of the MFC
• three mailboxes, primarily intended for holding up to 32-bit long messages from/to the SPE:
  - one four-deep mailbox for receiving mailbox messages and
  - two one-deep mailboxes for sending mailbox messages,
• two signal notification channels for receiving signals, sent basically by the PPE.

Processor components of the Cell BE (32)

Figure 4.2.21: Contrasting mailboxes and signals [4.2.2.4]
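A minimal round trip through the mailboxes might look as follows (a sketch: the SPU-side intrinsics are those of spu_mfcio.h, while the PPE-side calls spe_write_in_mbox()/spe_stat_out_mbox()/spe_read_out_mbox() are assumed libspe 1.x names, in the style of the programming example later in this section):

/* SPU side: waits for a 32-bit message from the PPE in its inbound
   mailbox and sends an incremented reply via its outbound mailbox. */
#include <spu_mfcio.h>

int main(unsigned long long spe_id, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int v = spu_read_in_mbox();  /* blocks until a message arrives */
    spu_write_out_mbox(v + 1);            /* 32-bit reply towards the PPE   */
    return 0;
}

/* PPE side (fragment, placed after spe_create_thread(); the call names
   are libspe 1.x style assumptions):
 *
 *   spe_write_in_mbox(spe_id, 41);             // 32-bit message to the SPE
 *   while (spe_stat_out_mbox(spe_id) == 0) ;   // poll until a reply is queued
 *   unsigned int reply = spe_read_out_mbox(spe_id);
 */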
Processor components of the Cell BE (33)

c) Local Store [4.2.2.2]
• Single-port SRAM cells.
• Executes DMA reads/writes and instruction prefetches via 128-byte wide read/write ports.
• Executes instruction fetches and loads/stores via 128-bit read/write ports.
• Asynchronous, coherent DMA commands are used to move instructions and data between the local store and system memory.
• DMA transfers between the LS and the main storage are executed by the SMF's DMA unit.
• A 128-byte long DMA read or write requires 16 processor cycles to forward the data on the EIB.

Processor components of the Cell BE (34)

d) The Atomic Update and Cache unit [4.2.2.2]

The Atomic Unit
• executes atomic operations (a form of mutual-exclusion (mutex) operations) invoked by the MFC,
• supports page table lookups, and
• maintains cache coherency by supporting snoop operations.

The Atomic Cache provides six 128-byte cache lines of data to support atomic operations and page table accesses.

Processor components of the Cell BE (35)

Broadband Interface Controller (BIC) [4.2.2.2]
• provides a wide connection to external devices,
• two configurable interfaces (50+ GB/s @ 5 Gbps), with a configurable number of bytes and coherent (BIF) and/or I/O (IOIFx) protocols,
• supports two virtual channels per interface,
• supports multiple system configurations.

Memory Interface Controller (MIC)
• dual XDR controller (25.6 GB/s @ 3.2 Gbps),
• ECC support,
• suspend-to-DRAM support.

Multithreading the PPE (1)

Scheduling of PPE threads

Thread scheduling depends on
• the thread states,
• the thread priorities,
• the single threaded or dual threaded mode of execution.

Multithreading the PPE (2)

1. Thread states
• privilege states,
• suspended/enabled state,
• blocked/not blocked state.

a) Privilege states
• Hypervisor state
  - most privileged,
  - allows running a meta OS that manages logical partitions in which multiple OS instances can run,
  - some system operations require the initiating thread to be in hypervisor state.
• Supervisor state
  - the state in which an OS instance is intended to run.
• Problem state (user state)
  - the state in which an application is intended to run.

Multithreading the PPE (3)

(HV: Hypervisor, PR: Problem)

Figure 4.2.22: Bits of the Machine State Register (MSR) defining the privilege state of a thread [4.2.2.4]

Multithreading the PPE (4)

b) Suspended/enabled state
• A thread in the hypervisor state can change its state from enabled to suspended.
• Two bits of the Control Register (CTRL[TE0], [TE1]) define whether a thread is in the suspended or enabled state.

c) Blocked/stalled state
• Blocking
  - occurs at the instruction dispatch stage, if the thread selection rule favours the other thread, or due to a special "nop" instruction,
  - stops only one of the two threads.
• Stalling
  - occurs at the instruction issue stage due to dependencies,
  - stops both threads.
  - For very long latency conditions, such as L1 cache misses or divide instructions, stalling both threads is avoided by
    -- flushing the instructions younger than the stalled instruction,
    -- refetching the instructions starting with the stalled instruction, and
    -- stalling the thread at the dispatch stage until the stall condition is removed; the other thread can then continue to dispatch.

Multithreading the PPE (5)

2. Thread priorities
• determine the dispatch priority,
• four priority levels: thread disabled, low priority, medium priority, high priority,
• priority levels are specified by a 2-bit field (TP field) of the TSRL register (Thread Status Register Local),
• software, in particular OS software, sets thread priorities (according to the throughput requirements of the programs running in the threads); e.g. a foreground/background thread priority scheme can be set to favor one thread over the other when allocating instruction dispatch slots,
• a thread must be in the hypervisor or supervisor state to set its priority to high.

Multithreading the PPE (6)

Usual thread priority combinations

The combination high priority thread/low priority thread is not expected to be used, as in this case the PPE would never dispatch instructions from the low priority thread unless the high priority thread was unable to dispatch.

Figure 4.2.23: Usual thread priority combinations [4.2.2.4]

Multithreading the PPE (7)

Example (1): scheduling in case of the medium priority/medium priority setting

Basic scheduling rules:
• The PPE attempts to utilize all available dispatch slots.
• Thread scheduling is fair (round robin scheduling).
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread is allowed to dispatch, even if it was selected for dispatch on the previous attempt.

Note: the same scheduling applies when both threads are set to high priority.

Figure 4.2.24: Thread scheduling when both priorities are set to medium [4.2.2.4]

Multithreading the PPE (8)

Example (2): scheduling in case of the low priority/medium priority setting

Basic scheduling rules:
• The PPE attempts to utilize most available dispatch slots for the medium priority thread (this setting is appropriate for running a low-priority program in the background).
• Assuming a duty cycle of 5 (TSRL[DISP_COUNT] = 5), instructions from thread 1 are dispatched on four out of five cycles, while instructions from thread 0 are dispatched only on one out of five cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread is allowed to dispatch, even if it was selected for dispatch on the previous attempt.

Figure 4.2.25: Thread scheduling when one thread runs at medium priority and the other at low priority [4.2.2.4]

Multithreading the PPE (9)

Example (3): scheduling in case of the low priority/low priority setting

Basic scheduling rules:
• The PPE attempts to dispatch only once every duty cycle (TSCR[DISP_COUNT]) cycles. (With high values of DISP_COUNT the PPE will mostly idle, which reduces power consumption and heat production while keeping both threads alive.)
• Thread scheduling is fair (round robin scheduling).
• Assuming a duty cycle of 5, both threads are scheduled only once every 5 cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread is allowed to dispatch, even if it was selected for dispatch on the previous attempt.

Figure 4.2.26: Thread scheduling when both priorities are set to low [4.2.2.4]
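The low priority/medium priority duty-cycle behaviour of Example (2) can be reproduced with a simple counter (a sketch that reduces the DISP_COUNT semantics to "the low priority thread gets one dispatch slot per duty cycle"):

#include <stdio.h>

int main(void)
{
    const int duty = 5;   /* duty cycle, cf. TSRL[DISP_COUNT] = 5 */

    for (int cyc = 0; cyc < 10; cyc++) {
        /* thread 0: low priority, thread 1: medium priority;
           thread 0 gets one out of every 'duty' dispatch cycles */
        int thread = (cyc % duty == 0) ? 0 : 1;
        printf("cycle %d: dispatch from thread %d\n", cyc, thread);
    }
    return 0;
}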
Multithreading the PPE (10)

3. Single threaded/dual threaded mode of execution
• In single threaded mode all resources are allocated to a single thread; this reduces the turnaround time of the thread.
• Software can change the operating mode of the PPE between single threaded and dual threaded mode only in the hypervisor state.

Multithreading the PPE (11)

Software controlled thread behaviour

Software can use various schemes to control thread behaviour, including
• enabling and suspending a thread,
• setting thread priorities to control the instruction dispatch policy,
• executing a special nop instruction to cause temporary dispatch blocking,
• switching the state of the PPE between single threaded and multithreaded mode.

Multithreading the PPE (12)

Core enhancements for multithreading

• Duplicated architectural states for:
  - 32 GPRs,
  - 32 FPRs,
  - 32 Vector Registers (VRs),
  - Condition Register (CR),
  - Count Register (CTR),
  - Link Register (LR),
  - FX Exception Register (XER),
  - FP Status and Control Register (FPSCR),
  - Vector Status and Control Register (VSCR),
  - Decrementer (DEC).

Multithreading the PPE (13)

• Duplicated microarchitectural states for:
  - the Branch History Table (BHT) with global branch history (to allow independent and simultaneous branch prediction for both threads),
  - internal registers associated with exception and interrupt handling, such as the Machine State Register (MSR), the Machine Status Save/Restore Registers (SRR0, SRR1), the Hypervisor Machine Status Save/Restore Registers (HSRR0, HSRR1), the FP Status and Control Register (FPSCR) etc. (to allow concurrent exception and interrupt handling).

Multithreading the PPE (14)

• Duplicated queues and arrays:
  - Segment lookaside buffer (SLB),
  - Instruction buffer queue (Ibuf) (to allow each thread to dispatch regardless of any dispatch stall in the other thread),
  - Link stack queue.

• Shared resources:
  - the hardware execution units,
  - the instruction fetch control (because the I$ has only one read port, so fetching must alternate between the threads every cycle),
  - the virtual memory mapping (as both threads always execute in the same logical partitioning context),
  - most large arrays and queues, such as caches, that consume a significant amount of chip area.

Multithreading the PPE (15)

The programming model assumes the choice of an appropriate SPU configuration.

Basic SPU configurations:
• application specific SPU accelerators,
• multi-stage pipeline SPU configuration, or
• parallel-stages SPU configuration.

Multithreading the PPE (16)

Application specific SPU accelerators [4.2.2.5]

Multithreading the PPE (17)

Multi-stage SPU pipeline configuration [4.2.2.5]

Programming models (1)

Parallel-stages SPU configuration [4.2.2.5]

Programming models (2)

Basic approach for creating an application
• The programmer chooses the appropriate SPU configuration according to the features of the application, such as graphics processing, audio processing, MPEG encoding/decoding, encryption/decryption.
• The programmer writes/uses SPU "libraries" for either application specific SPU accelerators, a multi-stage pipeline SPU configuration or a parallel-stages SPU configuration, e.g. for:
  - a main application in the PPE that invokes SPU bound services by creating SPU threads,
  - RPC like function calls,
  - I/O device like interfaces (FIFO/command queue).
• One or more SPUs cooperate in the presumed SPU configuration to execute the tasks required.
Programming models (3)

• Acceleration is provided by OS or application libraries.
• Application portability is maintained with platform specific libraries.

Programming models (4)

Example

Aim: showing the cooperation between the PPE and an SPE.

Program
• Actual goal: to calculate the distance travelled in a car.
• It asks for: the elapsed time and the speed.

Program structure
• There are two program codes, one for the PPE and one for the SPE.
• The PPE does the user input, then it calls the SPE executable, which calculates the distance and returns with the result.
• The result is then given to the user by the PPE.

Programming models (5)

The PPE loads the program and the data into the main store; the data are copied from the main store to the LS (mfc_get), and the PPE notifies the SPE of the work to be done (spe_create_thread); SPE 1 ... SPE n access the data in their Local Stores.

Programming models (6)

Execution of the SPE thread (SPE 1 ... SPE n with Local Store 1 ... Local Store n).

Programming models (7)

The results are copied from the LS to the main store (mfc_put), the SPE notifies the PPE that the "job is finished" by sending a message, and the PPE loads the results from the main store.

Programming models (8)

PPE program (it defines the data structure passed to the SPE task, reads the input data, creates the SPE thread and waits for it to finish, then outputs the result; calculate_distance_handle refers to the external SPE program on the next slide):

#include <stdio.h>
#include <libspe.h>

extern spe_program_handle_t calculate_distance_handle;

typedef struct {
    float speed;       // input parameter
    float num_hours;   // input parameter
    float distance;    // output parameter
    float padding;     // pad the struct to a multiple of 16 bytes
} program_data;

int main()
{
    program_data pd __attribute__((aligned(16)));   // aligned for transfer

    printf("Enter the speed in miles/hr: ");
    scanf("%f", &pd.speed);
    printf("Enter the number of hours you have been driving: ");
    scanf("%f", &pd.num_hours);

    // create the SPE thread and wait for it to finish
    speid_t spe_id = spe_create_thread(0, &calculate_distance_handle,
                                       &pd, NULL, -1, 0);
    spe_wait(spe_id, NULL, 0);

    printf("The distance travelled is %f miles.\n", pd.distance);
    return 0;
}

Programming models (9)

SPE program (it defines the same data structure used to communicate with the PPE, copies the data from the main store to the LS, calculates the result and copies it back):

#include <spu_mfcio.h>

typedef struct {
    float speed;       // input parameter
    float num_hours;   // input parameter
    float distance;    // output parameter
    float padding;     // pad the struct to a multiple of 16 bytes
} program_data;

int main(unsigned long long spe_id, unsigned long long program_data_ea,
         unsigned long long env)
{
    program_data pd __attribute__((aligned(16)));
    int tag_id = 0;

    // copy the data from the main store to the LS and wait for completion
    mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();

    // calculate the result
    pd.distance = pd.speed * pd.num_hours;

    // copy the data from the LS to the main store and wait for completion
    mfc_put(&pd, program_data_ea, sizeof(program_data), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();

    return 0;
}

Implementation of the Cell BE (1)

Implementation alternatives

Figure 4.2.27: Cell system configuration options [4.2.2.3]

Implementation of the Cell BE (2)

Figure: Cell BE Blade Roadmap
Source: Brochard L., „A Cell History," Cell Workshop, April 2006, http://www.irisa.fr/orap/Constructeurs/Cell/Cell%20Short%20Intro%20Luigi.pdf

Implementation of the Cell BE (3)

Motherboard of the Cell Blade (QS20)

Figure 4.2.28: Motherboard of the Cell Blade (QS20) [4.2.2.5]

References Cell BE

[4.2.2.1] Gschwind M., „Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[4.2.2.2] Hofstee P., „Tutorial: Hardware and Software Architectures for the Cell Broadband Engine processor," IBM Corp., September 2005, http://www.crest.gatech.edu/conferences/cases2005/pdf/Cell-tutorial.pdf
[4.2.2.3] Kahle J. A. et al., „Introduction to the Cell multiprocessor," IBM J. Res. & Dev., Vol. 49, 2005, pp. 584-604, http://www.research.ibm.com/journal/rd/494/kahle.pdf
[4.2.2.4] Cell Broadband Engine Programming Handbook, Vers. 1.1, Apr. 2007, IBM Corp.
[4.2.2.5] Cell BE Overview, Course code: L1T1H1-02, May 2006, IBM Corp.

4.3 SMT multithreaded processors

4.3.1. Intel Pentium 4
4.3.2. Alpha 21464 (V8)
4.3.3. IBM POWER5

4.3. Simultaneously multithreaded processors

Thread scheduling in multithreaded cores: coarse grained MT cores, fine grained MT cores, SMT cores.

4.3.1. Intel Pentium 4 (1)

Intel designates SMT as Hyperthreading (HT). It was introduced in the Northwood based DP and MP server cores in 2/2002 and 3/2002, respectively (the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.

Additions for implementing MT:
• Duplicated architectural state, including
  - the instruction pointer,
  - the general purpose regs.,
  - the control regs.,
  - the APIC (Advanced Programmable Interrupt Controller) regs.,
  - some machine state regs.

4.3.1. Intel Pentium 4 (2)

Figure 4.3.1: Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires the duplication of additional miscellaneous pointers and control logic, but these are too small to point out.
Source: Koufaty D. and Marr D. T., „Hyperthreading Technology in the Netburst Microarchitecture," IEEE Micro, Vol. 23, No. 2, March-April 2003, pp. 56-65

4.3.1. Intel Pentium 4 (3)

• Further enhancements to support MT (thread microstate):
  - the TC entries (trace cache) are tagged,
  - the BHB (Branch History Buffer) is duplicated,
  - the Global History Table is tagged,
  - the RAS (Return Address Stack) is duplicated,
  - the rename tables are duplicated,
  - the ROB is tagged.

Additional die area required for MT: less than 5 %.

Single thread/dual thread modes: to prevent single thread performance degradation, the partitioned resources are recombined in single thread mode.

4.3.1. Intel Pentium 4 (4)

Figure 4.3.2: SMT pipeline in Intel's Pentium 4/HT
Source: Marr T. T. et al., „Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal, Vol. 06, Issue 01, Febr. 14, 2002, pp. 4-16
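The two options listed in the Introduction, duplicating a structure per thread vs. tagging a shared one, differ only in where the thread identity lives; a sketch with assumed field layouts:

#include <stdint.h>

/* Duplicated resource: one full instance per thread (e.g. the RAS). */
struct ras { uint32_t entries[16]; int top; };
struct ras ras_per_thread[2];

/* Tagged shared resource: a single instance whose entries each carry
   their owner, as with the tagged trace cache or ROB entries above
   (the 126-entry size matches the Pentium 4 ROB; the field layout is
   illustrative). */
struct rob_entry {
    uint8_t  thread_id;   /* tag: which logical processor owns the entry */
    uint8_t  ready;
    uint32_t uop;
};
struct rob_entry rob[126];

int main(void) { return 0; }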
4.3.1. Intel Pentium 4 (4)
Single-thread/dual-thread modes:
To prevent single-thread performance degradation, the partitioned resources are recombined in single-thread mode.

4.3.2. Alpha 21464 (V8) (1)
8-way superscalar, scheduled for 2003 but canceled in June 2001 in favour of the Itanium line. In 2001 all Alpha intellectual property rights were sold to Intel.
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for the PC and the architectural registers, by increasing the sizes of the merged GPR and FPR architectural and rename register files (see the sketch after Figure 4.3.3):

            Alpha 21264    Alpha 21464
  GPRs          80             512
  FPRs          80             512

• Providing replicated (4 x) thread microstates for the register maps.
Additional core area needed for SMT: ~ 6 %.
Source: Preston R.P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading," Proc. ISSCC, 2002.

4.3.2. Alpha 21464 (V8) (2)
Figure 4.3.3: SMT pipeline in the Alpha 21464 (V8): Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; the PCs and register maps are per thread, while the Icache, Dcache and register files are shared.
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors," http://www.compaq.com
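To make the merged register file idea concrete, the following minimal C sketch (illustrative only, not the 21464's actual circuitry) keeps one rename map per thread but a single shared pool of 512 physical registers, handed out from a common free list.

#include <stdint.h>
#include <assert.h>

#define NUM_THREADS    4
#define ARCH_REGS     32    /* architectural integer registers per thread  */
#define PHYS_REGS    512    /* merged architectural + rename register file */

/* One rename map per thread (the replicated microstate)... */
static uint16_t map[NUM_THREADS][ARCH_REGS];

/* ...but a single shared pool of physical registers. */
static uint16_t free_list[PHYS_REGS];
static int      free_top;

static void rename_init(void)
{
    /* Physical registers 0..127 initially hold the four threads'
       committed states; the remaining 384 start out free. */
    for (int t = 0; t < NUM_THREADS; t++)
        for (int r = 0; r < ARCH_REGS; r++)
            map[t][r] = (uint16_t)(t * ARCH_REGS + r);

    free_top = 0;
    for (int p = NUM_THREADS * ARCH_REGS; p < PHYS_REGS; p++)
        free_list[free_top++] = (uint16_t)p;
}

/* Renaming a destination: grab a free physical register from the
   shared pool and point this thread's map entry at it. */
static uint16_t rename_dest(int tid, int arch_reg)
{
    assert(free_top > 0);               /* out of free registers: stall dispatch */
    uint16_t phys = free_list[--free_top];
    map[tid][arch_reg] = phys;
    return phys;
}

/* Reading a source operand: look it up through the thread's own map. */
static uint16_t rename_src(int tid, int arch_reg)
{
    return map[tid][arch_reg];
}

With four threads, 4 x 32 = 128 physical registers are pinned down by committed state alone; the remaining ~384 serve as rename registers for whichever threads happen to have instructions in flight, which is exactly why an SMT design merges the files rather than replicating them four times.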
4.3.3. IBM POWER5 (1)
POWER5 enhancements vs. the POWER4:
• on-chip memory controller,
• separate L3/memory attachment,
• dual-threaded (SMT) cores.

4.3.3. IBM POWER5 (2)
Figure 4.3.5: POWER4 and POWER5 system structures
Source: Kalla R., Sinharoy B., Tendler J.M., „IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, Vol. 24, No. 2, March-April 2004, pp. 40-47.

4.3.3. IBM POWER5 (3)
Figure 4.3.7: Microarchitecture of IBM's POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology," IBM Corporation, 2003

4.3.3. IBM POWER5 (4)
Figure 4.3.8: IBM POWER5 chip
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology," IBM Corporation, 2003

4.3.3. IBM POWER5 (5)
Core enhancements for multithreading:
• Providing duplicated thread states for the PC and the architectural registers, by increasing the sizes of the merged GPR and FPR architectural and rename register files:

            POWER4    POWER5
  GPRs        80        120
  FPRs        72        120

• Providing duplicated thread microstates for: the Return Address Stack and the Group Completion unit (ROB).
• Providing increased (in fact duplicated) sizes for scarce or sensitive resources, such as the Instruction Buffer and the Store Queue.
Additional core area needed for SMT: ~ 10 %.

4.3.3. IBM POWER5 (6)
Figure 4.3.9: SMT pipeline of IBM's POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology," IBM Corporation, 2003

4.3.3. IBM POWER5 (7)
Unbalanced execution of threads (an enhancement of the single-mode/dual-mode thread execution model):
• threads have 8 priority levels (0...7), controlled by HW/SW;
• the decode rate of each thread is controlled according to its priority (see the sketch at the end of this section).
Figure 4.3.12: Unbalanced execution of threads in IBM's POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology," IBM Corporation, 2003

4.3.3. IBM POWER5 (8)
Development effort:
• Concept phase: ~ 10 persons for ~ 4 months
• High-level design phase: ~ 50 persons for ~ 6 months
• Implementation phase: ~ 200 persons for 12-18 months
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology," IBM Corporation, 2003
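How a priority difference can steer the decode rate is easiest to see in a small model. The following C sketch allocates the decode cycles of a fixed round between the two threads in proportion to their priority difference; it is an invented, illustrative rule only, since the actual POWER5 allocation policy differs in detail.

#include <stdio.h>

#define LEVELS 8   /* POWER5 threads have priority levels 0..7 */

/* Illustrative model only (the real POWER5 rule differs in detail):
   out of one round of decode cycles, give thread A a share that
   grows linearly with its priority advantage over thread B. */
static int decode_cycles_for_a(int prio_a, int prio_b, int round_len)
{
    int diff  = prio_a - prio_b;                       /* -7 .. +7 */
    int share = (round_len * (diff + LEVELS)) / (2 * LEVELS);
    if (share < 1)             share = 1;              /* model never starves a thread */
    if (share > round_len - 1) share = round_len - 1;
    return share;
}

int main(void)
{
    const int round_len = 16;   /* decode cycles per allocation round */
    const int pairs[][2] = { {7, 7}, {4, 4}, {7, 4}, {7, 1}, {1, 6} };

    /* Print the split of a 16-cycle decode round for a few priority pairs:
       equal priorities give the balanced 8/16, a widening gap shifts
       decode cycles toward the favoured thread. */
    for (unsigned i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
        int a = decode_cycles_for_a(pairs[i][0], pairs[i][1], round_len);
        printf("priorities %d vs %d -> thread A decodes in %2d of %d cycles\n",
               pairs[i][0], pairs[i][1], a, round_len);
    }
    return 0;
}

This is the kind of behaviour Figure 4.3.12 illustrates: as the priority gap widens, execution shifts from the balanced split toward the favoured thread.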