Cell Processor

(Cell Broadband Engine Architecture)
Mark Budensiek
1/21
Background
• Joint collaboration of IBM/Sony/Toshiba (STI)
 First implementation of the architecture in 2005
• Goal: develop a new, next-generation processor
 Initially for the PlayStation 3
 Other multimedia applications (Blu-ray, HDTV)
 Server systems
 Supercomputers
2/21
Synergistic Processing Element
3/21
Power Processor Element (PPE)
• The PPE is a 64-bit "Power Architecture" core
 Capable of running POWER or PowerPC binaries
 Acts as the controller for the 8 SPEs
4/21
Element Interconnect Bus
• Connects the various on-chip elements
 PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
• Data-ring structure with the control of a bus
 4 unidirectional rings; 2 run in the opposite direction to the other 2
 Worst-case latency is therefore only half the distance around the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)
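The ring figures above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes 12 ring stops (PPE, 8 SPEs, MIC, and two I/O interfaces; the slide does not give the number of I/O interfaces), and it ignores arbitration and contention. The function names are made up for this example.

```python
CORE_CLOCK_HZ = 3.2e9   # approximate core clock quoted on the slide
RING_WIDTH_BYTES = 16   # each ring is 16 bytes wide
RING_STOPS = 12         # assumption: PPE + 8 SPEs + MIC + 2 I/O interfaces


def ring_peak_bandwidth(core_clock_hz=CORE_CLOCK_HZ):
    """Peak bytes per second of one ring running at half the core clock."""
    return RING_WIDTH_BYTES * (core_clock_hz / 2)


def hops(src, dst, stops=RING_STOPS):
    """Shortest hop count when data may travel in either ring direction."""
    clockwise = (dst - src) % stops
    return min(clockwise, stops - clockwise)


def worst_case_hops(stops=RING_STOPS):
    """Largest shortest-path distance between any two ring stops."""
    return max(hops(0, dst, stops) for dst in range(stops))


if __name__ == "__main__":
    print(ring_peak_bandwidth() / 1e9, "GB/s per ring")
    print(worst_case_hops(), "hops worst case")
```

Because rings run in both directions, a transfer never has to travel more than halfway around, which is what `worst_case_hops` computes.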
5/21
Synergistic Processing Elements
• An SPE is composed of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC).
 The SPU is a SIMD, RISC-based processor (3.2 GHz)
 The SPU's ISA is a cross between VMX and the PS2's Emotion Engine.
• Single Instruction Multiple Data (SIMD) organization
 Multiple processing elements perform the same operation on multiple data items simultaneously.
• Statically scheduled (the compiler plays a big role)
 No dynamic branch prediction hardware either (relies on compiler-generated hints)
• Each SPE consists of:
 128 x 128-bit register file
 Local Store (SRAM)
 DMA unit
 FP, LD/ST, Permute, and Branch units (each pipelined)
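To make the SIMD organization concrete, the sketch below models one 128-bit register as a list of equal-width lanes and applies a single operation to every lane. It is a plain-Python illustration, not SPU code; `lanes` and `simd_apply` are names invented for this example.

```python
REGISTER_BITS = 128    # width of one SPU register
REGISTER_COUNT = 128   # number of registers in the register file


def lanes(element_bits):
    """How many elements of the given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def simd_apply(op, a, b):
    """Apply the same operation to every pair of lanes (one 'instruction')."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


# Total register file size in bytes: 128 registers x 16 bytes each.
REGISTER_FILE_BYTES = REGISTER_COUNT * REGISTER_BITS // 8

if __name__ == "__main__":
    print(lanes(32), "word lanes,", lanes(16), "halfword lanes")
    print(simd_apply(max, [1, 8, 3, 4], [5, 2, 7, 0]))
```

One register holds 4 words, 8 halfwords, or 16 bytes, so a single instruction does 4, 8, or 16 operations at once.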
6/21
SPE Architecture
7/21
Copyright: IBM
SPU Architecture Overview
• 128 General Purpose Registers (each 128 bits)
• Support for 16-bit (halfword) and 32-bit (word) signed integers and 8-bit unsigned integers.
• Support for single-precision (32-bit) and double-precision
(64-bit) floating-point data.
• No condition register.
• Local storage: SPU loads and stores transfer quadwords between GPRs and local storage. The storage size can vary, but the address space is limited to 4 GB.
• Channel interface to external devices; data moves between the GPRs and the channel interface
 Up to 128 channels
• Supports up to 128 special-purpose registers
8/21
Data Layout in Registers
The leftmost word (bytes 0, 1, 2, and 3) of a register is
called the preferred slot. When instructions use or produce
scalar operands or addresses, the values are in the preferred
slot. A set of store assist instructions is available to help
store bytes, halfwords, words, and doublewords.
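A small sketch of the preferred slot, assuming the register is laid out as 16 big-endian bytes with word 0 in bytes 0 to 3. The helper names are hypothetical and exist only for this illustration.

```python
import struct


def make_register(words):
    """Pack four 32-bit words into 16 bytes; word 0 lands in bytes 0-3."""
    if len(words) != 4:
        raise ValueError("a 128-bit register holds exactly four words")
    return struct.pack(">4I", *words)


def preferred_slot(register):
    """Return the scalar held in the leftmost word (bytes 0, 1, 2, 3)."""
    return struct.unpack(">I", register[0:4])[0]


if __name__ == "__main__":
    reg = make_register([0x1000, 2, 3, 4])
    # A scalar address or operand is read from the leftmost word only.
    print(hex(preferred_slot(reg)))
```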
9/21
SPE Local Store
• Each SPE has local on-chip memory, a.k.a. Local Store (LS)
 Instruction and Data store
 Visible to PPE and can be addressed directly
 Does not operate like cache
• Data and instructions are transferred between the LS and system memory or another SPE's LS using the DMA unit
 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
 DMA transactions are coherent
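The block-wise movement described above can be sketched as a copy loop. This is a simplified model, assuming 128-byte blocks and plain byte buffers standing in for system memory and the local store; `dma_get` is a made-up name, not a real MFC call.

```python
DMA_BLOCK_BYTES = 128  # block size quoted on the slide


def dma_get(local_store, ls_addr, main_memory, mem_addr, size):
    """Copy `size` bytes from main memory into the local store, one block
    at a time, and return how many blocks were moved."""
    blocks = 0
    for offset in range(0, size, DMA_BLOCK_BYTES):
        chunk = min(DMA_BLOCK_BYTES, size - offset)
        src = main_memory[mem_addr + offset:mem_addr + offset + chunk]
        local_store[ls_addr + offset:ls_addr + offset + chunk] = src
        blocks += 1
    return blocks


if __name__ == "__main__":
    main_memory = bytes(range(256)) * 2   # 512 bytes of sample data
    local_store = bytearray(1024)         # toy local store
    print(dma_get(local_store, 0, main_memory, 0, 300), "blocks moved")
```

Unlike a cache, nothing arrives in the local store unless the program asks for it, which is why the copy is explicit here.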
10/21
SPU ISA Instructions
• 32 bits in length
• 6 basic instruction formats
RR Instruction Format:
RRR Instruction Format:
RI7 Instruction Format:
11/21
SPU ISA Instructions (cont)
RI10 Instruction Format:
RI16 Instruction Format:
RI18 Instruction Format:
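The format diagrams did not survive on these slides, so here is a rough sketch of how the six formats divide a 32-bit instruction. The field widths and ordering are recalled from the public SPU ISA document and should be treated as assumptions to verify against it. The opcode value in the usage line is a placeholder, not a real SPU opcode.

```python
# Fields listed left to right (most significant bits first).
FORMATS = {
    "RR":   [("op", 11), ("rb", 7), ("ra", 7), ("rt", 7)],
    "RRR":  [("op", 4), ("rt", 7), ("rb", 7), ("ra", 7), ("rc", 7)],
    "RI7":  [("op", 11), ("i7", 7), ("ra", 7), ("rt", 7)],
    "RI10": [("op", 8), ("i10", 10), ("ra", 7), ("rt", 7)],
    "RI16": [("op", 9), ("i16", 16), ("rt", 7)],
    "RI18": [("op", 7), ("i18", 18), ("rt", 7)],
}


def encode(fmt, **fields):
    """Pack named field values into one 32-bit instruction word."""
    word = 0
    for name, width in FORMATS[fmt]:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(fmt, word):
    """Split a 32-bit instruction word back into its named fields."""
    out = {}
    shift = 32
    for name, width in FORMATS[fmt]:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


if __name__ == "__main__":
    word = encode("RR", op=1, rb=3, ra=2, rt=1)  # op=1 is a placeholder
    print(f"{word:032b}", decode("RR", word))
```

Each register field is 7 bits wide because 7 bits select one of 128 registers, and every format adds up to exactly 32 bits.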
12/21
Types of Instructions
• Memory – Load/Store
• Constant-Formation
• Integer and Logical
• Shift and Rotate
• Compare, Branch, Halt
• Hint for Branch
• Floating Point
• Control
• Channel
13/21
Memory – Load/Store Instructions
• Size of the local storage address space is (up to) 2^32 bytes = 4 GB
• Local storage is byte-addressed
• Load/store instructions combine operands from one or two registers and/or an immediate value to form the effective address of the memory operand.
• Only aligned 16-byte quadwords can be loaded and stored. Therefore, the rightmost 4 bits of an effective address are always ignored and are assumed to be zero.
14/21
Memory – Load/Store Instructions
Example:
Load Quad-word (RR format)
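Since the slide's figure is missing, here is a hedged sketch of what an RR-form quadword load does: add the two preferred-slot words, drop the low 4 bits, and fetch 16 bytes. The local store is a toy `bytearray` whose size is assumed to be a power of two, and the function names are illustrative.

```python
def effective_address(ra_word, rb_word, ls_size):
    """Sum two preferred-slot words, wrap to the local store size, and
    force the low 4 bits to zero (quadword alignment)."""
    return (ra_word + rb_word) & (ls_size - 1) & ~0xF


def load_quadword(local_store, ra_word, rb_word):
    """Return the aligned 16-byte quadword containing the addressed byte."""
    addr = effective_address(ra_word, rb_word, len(local_store))
    return bytes(local_store[addr:addr + 16])


if __name__ == "__main__":
    local_store = bytearray(range(256)) * 4   # 1024-byte toy local store
    # 0x20 + 0x07 = 0x27, which is rounded down to the quadword at 0x20.
    print(load_quadword(local_store, 0x20, 0x07).hex())
```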
15/21
Constant-Formation Instruction
• Loads immediate values into the target register
Example:
Immediate Load Word
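As a sketch of Immediate Load Word, based on my reading of the SPU ISA: the 16-bit immediate is sign-extended to 32 bits and the result is written to all four word slots of the target register. The register is modeled as a list of four unsigned words, and the function names are invented here.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def immediate_load_word(i16):
    """Return the four word slots of the target register after the load."""
    word = sign_extend_16(i16) & 0xFFFFFFFF   # keep as unsigned 32-bit
    return [word] * 4


if __name__ == "__main__":
    print([hex(w) for w in immediate_load_word(5)])
    print([hex(w) for w in immediate_load_word(0xFFFF)])  # i.e. -1
```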
16/21
Integer and Logical Instructions
• Full complement of arithmetic functions, e.g. Add, Subtract, Multiply, Generate Carry, Generate Borrow, Average, Sum, …
• Logical functions: And, Or, XOR, Nand, Nor, Equivalent, …
• Both register and immediate instruction formats
Examples:
Add Word
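A minimal model of Add Word, assuming each register is a list of four unsigned 32-bit words: the four additions are independent, and each sum wraps around at 32 bits with no carry into the neighboring word.

```python
MASK32 = 0xFFFFFFFF


def add_word(ra, rb):
    """Add corresponding 32-bit words; each lane wraps on overflow."""
    if len(ra) != 4 or len(rb) != 4:
        raise ValueError("each register holds exactly four words")
    return [(x + y) & MASK32 for x, y in zip(ra, rb)]


if __name__ == "__main__":
    # The last lane overflows and wraps without disturbing the others.
    print(add_word([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1]))
```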
17/21
Integer and Logical Instructions (cont)
And
18/21
Shift and Rotate Instructions
Shift Left Halfword
19/21
Shift and Rotate Instructions
Rotate Halfword
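The halfword shift and rotate can be contrasted in a short sketch. It follows my understanding of the SPU ISA: each halfword lane takes its own count from the matching lane of the second register, using the low 5 bits for the shift (counts above 15 give zero) and the low 4 bits for the rotate. Registers are modeled as lists of 16-bit values, and the function names are illustrative.

```python
MASK16 = 0xFFFF


def shift_left_halfword(ra, rb):
    """Shift each halfword left by its own count; zeros fill from the right."""
    out = []
    for value, count_src in zip(ra, rb):
        count = count_src & 0x1F              # low 5 bits of the lane
        out.append((value << count) & MASK16 if count < 16 else 0)
    return out


def rotate_halfword(ra, rb):
    """Rotate each halfword left by its own count; no bits are lost."""
    out = []
    for value, count_src in zip(ra, rb):
        count = count_src & 0xF               # low 4 bits of the lane
        value &= MASK16
        out.append(((value << count) | (value >> (16 - count))) & MASK16)
    return out


if __name__ == "__main__":
    print([hex(v) for v in shift_left_halfword([0x8001, 0x00FF], [1, 8])])
    print([hex(v) for v in rotate_halfword([0x8001, 0x1234], [1, 4])])
```

The shift discards the bits pushed out on the left, while the rotate feeds them back in on the right.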
20/21
Compare, Branch, and Halt Instructions
• Conditional branch
- No condition code register
- Uses a GPR value, usually set by a compare instruction
- The register value is set to all 1's or all 0's based on the compare result
- Logical compare instructions treat the operands as unsigned integers
• Halt instructions
- Stop execution when the tested condition is met
- The stop is not precise; as a result, execution cannot generally be restarted.
21/21
Compare, Branch, and Halt Instructions (cont)
Compare Equal Word
Branch if not Zero Word
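The compare-then-branch pairing can be sketched as follows. The compare writes an all-ones or all-zeros mask into each word slot, and the branch tests only the preferred (leftmost) word. This model simplifies the branch to take an absolute target and a 4-byte instruction step; the function names are illustrative.

```python
ALL_ONES = 0xFFFFFFFF


def compare_equal_word(ra, rb):
    """Per-word compare: all 1's where the words match, all 0's elsewhere."""
    return [ALL_ONES if x == y else 0 for x, y in zip(ra, rb)]


def branch_if_not_zero_word(rt, pc, target):
    """Return the next program counter: `target` if the preferred-slot
    word is nonzero, otherwise the following instruction."""
    return target if rt[0] != 0 else pc + 4


if __name__ == "__main__":
    mask = compare_equal_word([7, 1, 2, 3], [7, 9, 2, 0])
    print([hex(m) for m in mask])
    print(hex(branch_if_not_zero_word(mask, 0x100, 0x200)))
```

The mask register takes the place of a condition code register, so any GPR can hold a branch condition.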
22/21
Compare, Branch, and Halt Instructions (cont)
Halt If Greater Than
23/21
SPU ISA
Purpose is to achieve high performance on critical
workloads for game, media, and broadband
systems.
Key SPU Workloads:
• Graphics pipeline, which includes subdivision and rendering
• Stream processing, which includes encoding, decoding, encryption, and decryption
• Modeling, which includes game physics
Implementations of the SPU ISA achieve better performance-to-cost ratios than general-purpose processors because they require half the power and half the chip area for equivalent performance.
24/21
SPU ISA and the 4 Principles
1. Simplicity favors regularity
- All instructions are the same length. All immediate instructions follow a similar format (fields in a common location). Register-type instructions can vary in format depending on the number of registers used.
- Register block is 128 x 128 bits
2. Smaller is faster
3. Make the common case fast
- Large number of GPRs and SPRs
- 32-bit instructions
- Single-precision floating-point calculations
4. Good design demands good compromises
- Large register size facilitates SIMD computations
25/21
Summary (of Cell)
• Cell processor architecture is optimized for digital
media and entertainment
• Facilitating convergence between supercomputing
and entertainment – desire for realism.
• Enables new classes of applications.
26/21
Programming the cell is challenging
Issues
• Dividing the program among the different cores
• Writing code for the 8 SPEs in a different instruction set than for the PowerPC core
• Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
• An SPU must use coherent DMA transfers to its local store to access system memory
27/21
Compiling and Binding a Program on Cell
28/21
Copyright: IBM
Questions?
29/21