Cell Processor (Cell Broadband Engine Architecture)
Mark Budensiek

Background
• Joint collaboration of IBM/Sony/Toshiba (STI)
  - First implementation of the architecture in 2005
• Goal: develop a new, next-generation processor
  - Initially for the PlayStation 3
  - Other multimedia applications (Blu-ray, HDTV)
  - Server systems
  - Supercomputers

Synergistic Processing Element
(Figure not reproduced.)

Power Processor Element (PPE)
• The PPE is a 64-bit "Power Architecture" core
  - Capable of running POWER or PowerPC binaries
  - Acts as the controller for the 8 SPEs

Element Interconnect Bus
• Connects the various on-chip elements
  - PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
• Data-ring structure with the control of a bus
  - 4 unidirectional rings; 2 rings run in the opposite direction to the other 2
  - Worst-case latency is therefore only half the distance around the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)

Synergistic Processing Elements
• An SPE is composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller.
  - The SPU is a SIMD, RISC-based processor (3.2 GHz)
  - The SPU's ISA is a cross between VMX and the PS2's Emotion Engine.
• Single Instruction Multiple Data (SIMD) organization
  - Multiple processing elements perform the same operation on multiple data items simultaneously.
• Statically scheduled (the compiler plays a big role)
  - No dynamic branch prediction hardware (relies on compiler-generated hints)
• Each SPE consists of:
  - 128 x 128-bit register file
  - Local Store (SRAM)
  - DMA unit
  - FP, LD/ST, Permute, and Branch units (each pipelined)

SPE Architecture
(Figure not reproduced. Copyright: IBM)

SPU Architecture Overview
• 128 general-purpose registers (each 128 bits)
• Support for 16-bit (halfword) and 32-bit (word) signed integers and 8-bit unsigned integers
• Support for single-precision (32-bit) and double-precision (64-bit) floating-point data
• No condition register
• Local storage
  - SPU loads and stores transfer quadwords between GPRs and local storage.
  - Storage size can vary, but the address space is limited to 4 GB.
• Channel interface to external devices
  - Data moves between the GPRs and the channel interface
  - Up to 128 channels
• Supports up to 128 special-purpose registers

Data Layout in Registers
• The leftmost word (bytes 0, 1, 2, and 3) of a register is called the preferred slot.
• When instructions use or produce scalar operands or addresses, the values are in the preferred slot.
• A set of store-assist instructions is available to help store bytes, halfwords, words, and doublewords.

SPE Local Store
• Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
  - Holds both instructions and data
  - Visible to the PPE and can be addressed directly
  - Does not operate like a cache
• Data and instructions are transferred between the LS and system memory, or another SPE's LS, using the DMA unit
  - 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
  - DMA transactions are coherent

SPU ISA Instructions
• 32 bits in length
• 6 basic instruction formats (encoding diagrams not reproduced)
  - RR instruction format
  - RRR instruction format
  - RI7 instruction format
  - RI10 instruction format
  - RI16 instruction format
  - RI18 instruction format

Types of Instructions
• Memory – Load/Store
• Constant-Formation
• Integer and Logical
• Shift and Rotate
• Compare, Branch, Halt
• Hint for Branch
• Floating Point
• Control
• Channel

Memory – Load/Store Instructions
• The local storage address space is (up to) 2^32 bytes = 4 GB.
• Local storage is byte-addressed.
• Load/store instructions combine operands from one or two registers and/or an immediate value to form the effective address of the memory operand.
• Only aligned 16-byte quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and are assumed to be zero.
Memory – Load/Store Instructions
• Example: Load Quadword (RR format)

Constant-Formation Instructions
• Load immediate values into the target register
• Example: Immediate Load Word

Integer and Logical Instructions
• Full complement of arithmetic functions, e.g. Add, Subtract, Multiply, Generate Carry, Generate Borrow, Average, Sum, ...
• Logical functions: And, Or, Xor, Nand, Nor, Equivalent, ...
• Both register and immediate instruction formats
• Examples: Add Word; And

Shift and Rotate Instructions
• Examples: Shift Left Halfword; Rotate Halfword

Compare, Branch, and Halt Instructions
• Conditional branch
  - No condition code register
  - Uses a GPR value, usually set by a compare instruction
  - The register value is set to all 1's or all 0's based on the compare result
  - Logical compare instructions treat the operands as unsigned integers
• Halt instructions
  - Stop execution when the tested condition is met
  - The stop is not precise. As a result, execution cannot generally be restarted.
• Examples: Compare Equal Word; Branch If Not Zero Word; Halt If Greater Than

SPU ISA
• Purpose: achieve high performance on critical workloads for game, media, and broadband systems.
• Key SPU workloads:
  - Graphics pipeline, which includes subdivision and rendering
  - Stream processing, which includes encoding, decoding, encryption, and decryption
  - Modeling, which includes game physics
• Implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because they require half the power and half the chip area for equivalent performance.

SPU ISA and the 4 Principles
1. Simplicity favors regularity
  - All instructions are the same length.
  - All immediate instructions follow a similar format (fields in a common location).
  - Register-type instructions can vary in format depending on the number of registers used.
2. Smaller is faster
  - Register block is 128 x 128 bits
  - Large number of GPRs and SPRs
3. Make the common case fast
  - 32-bit instructions
  - Single-precision floating-point calculations
4. Good design demands good compromises
  - Large register size facilitates SIMD computations

Summary (of Cell)
• The Cell processor architecture is optimized for digital media and entertainment.
• Facilitates convergence between supercomputing and entertainment, driven by the desire for realism.
• Enables new classes of applications.

Programming the Cell is challenging
Issues
• Dividing the program among the different cores
• Generating code for the 8 SPEs in a different instruction set than for the PowerPC core
• Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
• The SPU must use coherent DMA transfers to move data between its local store and system memory

Compiling and Binding of a Program on Cell
(Figure not reproduced. Copyright: IBM)

Questions?