Cell Broadband Processor
Daniel Bagley, Meng Tan

Agenda
• General intro
• History of development
• Technical overview of the architecture
• Detailed technical discussion of components
• Design choices
• Other processors like the Cell
• Programming for the Cell

History of Development
Sony PlayStation 2
• Announced March 1999
• Released March 2000 in Japan
• 128-bit "Emotion Engine"
• 294 MHz MIPS CPU
• Single-precision FP optimizations
• 6.2 GFLOPS

History Continued
• Partnership between Sony, Toshiba, and IBM
• Summer of 2000 – high-level development talks
• Initial goal of 1000x PS2 power
• March 2001: Sony-IBM-Toshiba design center opened
• $400M investment

Overall Goals for Cell
• High performance in multimedia apps
• Real-time performance
• Power consumption
• Cost
• Available by 2005
• Avoid memory latency issues associated with control structures

The Cell Itself
• PowerPC-based main core (PPE)
• Multiple SPEs
• On-die memory controller
• Inter-core transport bus
• High-speed I/O

Cell Die Layout

Cell Implementation
• Cell is an architecture
• Preliminary PS3 implementation:
  • 1 PPE
  • 7 SPEs (1 disabled to increase yield)
  • 221 mm² die size on a 90 nm process
  • Clocked at 3–4 GHz
  • 256 GFLOPS single precision @ 4 GHz

Why a Cell Architecture
• Follows a trend in computing architecture
• Natural extension of dual- and multi-core
• Extremely low hardware overhead
• Software controllable
• Specialized hardware more useful for multimedia

Possible Uses
• PlayStation 3 (obviously)
• Blade servers (IBM)
  • Amazing single-precision FP performance
  • Scientific applications
• Toshiba HDTV products

Power Processing Element
• PowerPC instruction set with AltiVec
• Used for general-purpose computing and for controlling the SPEs
• Simultaneous multithreading
• Separate 32 KB L1 caches and a unified 512 KB L2 cache

PPE (cont.)
• Slow but power-efficient implementation of the PowerPC instruction set
• Two-issue, in-order instruction fetch
• Conspicuous lack of an instruction window
• Compare to conventional PowerPC implementations (G5)
• Performance depends on SPE utilization

Synergistic Processing Element (SPE)
• Specialized hardware
• Meant to be used in parallel (7 on the PS3 implementation)
• On-chip memory (256 KB)
• No branch prediction
• In-order execution
• Dual issue

SPE Architecture
• 14.5 mm² on a 90 nm process
• 128 registers (128 bits wide)
  • Instructions assume 4 × 32-bit operands
• Variant of the VMX instruction set, modified for 128 registers
• On-chip memory is NOT a cache

SPE Execution
• Dual issue, in-order
• Seven execution units
• Vector logic: 8 single-precision operations per cycle
• Significant performance hit for double precision

SPE Execution Diagram

SPE Local Storage Area
• NOT a cache
• 256 KB, 4 × 64 KB ECC single-port SRAM
• Completely private to each SPE
• Directly addressable by software
• Can be used as a cache, but only with software controls
• No tag bits or any extra hardware

SPE LS Scheduling
• Software-controlled DMA to and from main memory
• Scheduling is a HUGE problem
  • Done primarily in software
  • IBM predicts 80–90% utilization in the ideal case
• Request queue handles 16 simultaneous requests
  • Up to 16 KB per transfer
  • Priority: DMA, load/store, fetch
• Fetch/execute parallelism

SPE Control Logic
• Very little control logic in comparison; represents a shift in focus
• Complete lack of branch prediction
  • Software branch prediction
  • Loop unrolling
  • 18-cycle misprediction penalty
• Software-controlled DMA

SPE Pipeline
• Little ILP, and thus little control logic
• Dual issue
• Simple commit unit (no reorder buffer or other complexities)
• Same execution units for FP and integer

SPE Summary
• Essentially a small vector computer
• Based on the AltiVec/VMX ISA
  • Extensions for DMA and LS management
  • Extended for a 128 × 128-bit register file
• Uniquely suited to real-time applications
• Extremely fast for certain FP operations
• Offloads a large amount of work onto the compiler and software
Element Interconnect Bus
• 4 concentric rings connecting all Cell elements
• 128-bit-wide interconnects

EIB (cont.)
• Designed to minimize coupling noise
• Rings of data travel in alternating directions
• Buffers and repeaters at each SPE boundary
• Architecture can be scaled up, at the cost of increased bus latency

EIB (cont.)
• Total bandwidth of ~200 GB/s
• EIB controller located physically in the center of the chip, between the SPEs
• Controller reserves channels for each individual data transfer request
• Implementation allows SPEs to be added horizontally

Memory Interface
• Rambus XDR memory keeps the Cell at full utilization
• 3.2 Gb/s data rate per pin on the XDR interface
• Cell uses dual-channel XDR with four devices and 16-bit-wide buses to achieve 25.6 GB/s total memory bandwidth

Input/Output Bus
• Rambus FlexIO bus
• I/O interface consists of 12 unidirectional byte lanes
• Each lane supports 6.4 GB/s of bandwidth
• 7 outbound lanes and 5 inbound lanes

Design Choices
• In-order execution
  • Abandoning ILP (ILP has delivered only a 10–20% increase per generation)
  • Reducing control logic
  • Real-time responsiveness
• Cache design
  • Software-configured local store on the SPEs
  • Standard L2 cache on the PPE

Cell Programming Issues
• No Cell compiler exists that manages utilization of the SPEs at compile time
• SPEs do not natively support context switching; it must be OS-managed
• SPEs are vector processors, not efficient for general-purpose computation
• The PPE and SPEs use different instruction sets

Cell Programming (cont.)
Functional Offload Model
• Simplest model for Cell programming
• Optimize existing libraries for SPE computation
• Requires no rebuild of the main application logic, which runs on the PPE

Cell Programming (cont.)
Device Extension Model
• Take advantage of SPE DMA
• Use SPEs as interfaces to external devices

Cell Programming (cont.)
Computational Acceleration Model
• Traditional supercomputing methods using Cell
• Shared-memory or message-passing paradigm for accelerating inherently parallel math operations
• Can replace compute-intensive math libraries without rewriting applications

Cell Programming (cont.)
Streaming Model
• Use the Cell processor as one large programmable pipeline
• Partition algorithms into logically sensible steps; execute each step separately, in series, on a separate processor

Cell Programming (cont.)
Asymmetric Thread Runtime Model
• Abstract the Cell architecture away from the programmer
• The OS schedules different threads onto the processors

Sample Performance
• Demonstration physics engine for a real-time game
• http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
• 182:1 compute-to-DMA ratio on the SPEs
• For the right tasks, the Cell architecture can be extremely efficient