Cell Broadband Processor
Daniel Bagley, Meng Tan

History of Development
▪ Sony PlayStation 2
• Announced March 1999
• Released March 2000 in Japan
• 128-bit “Emotion Engine”
• 294 MHz MIPS CPU
• Single-precision FP optimizations
• 6.2 GFLOPS
History Continued
▪ Partnership between Sony, Toshiba, and IBM
▪ Summer of 2000 – High-level development talks
▪ Initial goal of 1000x PS2 power
▪ March 2001, Sony-IBM-Toshiba design center opened
▪ $400M investment
Overall Goals for Cell
▪ High performance in multimedia apps
▪ Real-time performance
▪ Power consumption
▪ Cost
▪ Available by 2005
▪ Avoid memory latency issues associated with control structures
The Cell itself
▪ PowerPC-based main core (PPE)
▪ Multiple SPEs
▪ On-die memory
▪ Inter-core transport bus
▪ High-speed I/O
Cell Die Layout
Cell Implementation
▪ Cell is an architecture
▪ Preliminary PS3 implementation
• 1 PPE
• 8 SPEs (1 disabled to increase yield, leaving 7 usable)
• 221 mm² die size on a 90 nm process
• Clocked at 3-4 GHz
• 256 GFLOPS single precision @ 4 GHz
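The 256 GFLOPS figure can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not an official derivation. It assumes 8 SPEs are counted in the peak figure, 4 single-precision lanes per 128-bit register, and a fused multiply-add counted as 2 operations per lane. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(n_spe, clock_ghz, lanes=4, ops_per_lane=2):
    """Peak single-precision GFLOPS summed over the SPEs (PPE ignored).

    lanes and ops_per_lane are assumptions: 4 x 32-bit lanes per
    128-bit register, multiply-add counted as 2 operations.
    """
    return n_spe * lanes * ops_per_lane * clock_ghz


full_chip = peak_gflops(8, 4.0)    # all 8 SPEs at 4 GHz
ps3_usable = peak_gflops(7, 4.0)   # with 1 SPE disabled
print(full_chip, ps3_usable)
```

Under these assumptions the full chip reaches the quoted peak, while a part with one SPE disabled tops out proportionally lower.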
Why a Cell Architecture
▪ Follows a trend in computing
▪ Natural extension of dual and multicore
▪ Extremely low hardware overhead
▪ Software controllable
▪ Specialized hardware more useful for
Possible Uses
▪ Blade servers (IBM)
• Amazing single-precision FP
• Scientific applications
▪ Toshiba HDTV
Power Processing Element
▪ PowerPC instruction set with AltiVec
▪ Used for general-purpose computing and controlling SPEs
▪ Simultaneous multithreading
▪ Separate 32 KB L1 caches and unified 512 KB L2 cache
PPE (cont.)
▪ Slow but power-efficient PowerPC instruction set implementation
▪ Two-issue, in-order instruction fetch
▪ Conspicuous lack of instruction
▪ Compare to conventional PowerPC implementations (G5)
▪ Performance depends on SPE
Synergistic Processing Element (SPE)
▪ Specialized hardware
▪ Meant to be used in
• (7 on PS3)
▪ On-chip memory (256 KB)
▪ No branch prediction
▪ In-order execution
▪ Dual issue
SPE Architecture
▪ 0.99 µm² on 90 nm process
▪ 128 registers (128 bits wide)
• Instructions assumed to be 4x 32-bit
▪ Variant of VMX instruction set
• Modified for 128 registers
▪ On-chip memory is NOT a cache
SPE Execution
▪ Dual issue, in-order
▪ Seven execution units
▪ Vector logic
▪ 8 single-precision operations per cycle
▪ Significant performance hit for double precision
SPE Execution Diagram
SPE Local Storage Area
▪ NOT a cache
▪ 256 KB, 4 x 64 KB ECC single port
▪ Completely private to each SPE
▪ Directly addressable by software
▪ Can be used as a cache, but only with software controls
▪ No tag bits or any other extra hardware
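To illustrate what "used as a cache with software controls" can mean, here is a minimal sketch of a direct-mapped software cache. It is not IBM's implementation: the class name, the 128-byte line size, and the 64-line capacity are all assumptions chosen for illustration, and a plain slice copy stands in for a DMA transfer. The point is that the tags live in ordinary software-visible storage, since the hardware provides none.

```python
LS_SIZE = 256 * 1024   # local store size from the slide
LINE_SIZE = 128        # assumed line size in bytes
NUM_LINES = 64         # assumed: 8 KB of local store set aside as cache
assert NUM_LINES * LINE_SIZE <= LS_SIZE


class SoftwareCache:
    """Direct-mapped cache managed entirely by software (hypothetical)."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.tags = [None] * NUM_LINES   # tags kept in software, not hardware
        self.lines = [bytes(LINE_SIZE)] * NUM_LINES
        self.misses = 0

    def read_byte(self, addr):
        line_addr = addr // LINE_SIZE
        slot = line_addr % NUM_LINES
        if self.tags[slot] != line_addr:
            # Miss: software fetches the line (stand-in for a DMA get).
            start = line_addr * LINE_SIZE
            self.lines[slot] = self.main_memory[start:start + LINE_SIZE]
            self.tags[slot] = line_addr
            self.misses += 1
        return self.lines[slot][addr % LINE_SIZE]


memory = bytes(i % 256 for i in range(64 * 1024))
cache = SoftwareCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 127, 128, 8192, 0)]
print(values, cache.misses)
```

Addresses 0 and 8192 map to the same slot in this sketch, so the final read of address 0 misses again. Every such check and refill costs instructions on the SPE, which is the price of having no hardware tags.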
SPE LS Scheduling
▪ Software-controlled DMA
▪ DMA to and from main memory
▪ Scheduling a HUGE problem
• Done primarily in software
• IBM predicts 80-90% usage ideally
▪ Request queue handles 16 simultaneous requests
• Up to 16 KB transfer each
• Priority: DMA, L/S, Fetch
▪ Fetch / execute parallelism
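The queue limits above shape how software has to plan its transfers. The sketch below splits a large transfer into requests of at most 16 KB and groups them into batches of at most 16 outstanding requests. The function `plan_dma` and the 300 KB example size are hypothetical, and no real DMA API is modeled.

```python
MAX_TRANSFER = 16 * 1024   # per-request limit from the slide
QUEUE_DEPTH = 16           # simultaneous requests from the slide


def plan_dma(total_bytes):
    """Split a transfer into (offset, size) requests of at most
    MAX_TRANSFER bytes, grouped into batches that fit the queue."""
    requests = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_TRANSFER, total_bytes - offset)
        requests.append((offset, size))
        offset += size
    return [requests[i:i + QUEUE_DEPTH]
            for i in range(0, len(requests), QUEUE_DEPTH)]


# Example: 300 KB streamed through the local store in batches.
batches = plan_dma(300 * 1024)
print([len(b) for b in batches])
```

In a double-buffered design, the SPE would compute on one batch while the next is in flight, which is one way to read the "fetch / execute parallelism" bullet.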
SPE Control Logic
▪ Very little in comparison
▪ Represents shift in focus
▪ Complete lack of branch prediction
• Software branch prediction
• Loop unrolling
• 18-cycle penalty
▪ Software-controlled DMA
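A rough cost model shows why loop unrolling matters when a branch costs 18 cycles. This is a simplified sketch, not a cycle-accurate SPE model: it assumes every taken loop-back branch pays the full penalty (no branch hints), the final fall-through is free, and the loop body cost per iteration is unchanged by unrolling. The function name and the example numbers are hypothetical.

```python
BRANCH_PENALTY = 18   # cycles, from the slide


def loop_cycles(iterations, body_cycles, unroll=1):
    """Estimated cycles for a counted loop under the simplified model."""
    if iterations % unroll != 0:
        raise ValueError("iterations must be a multiple of unroll")
    trips = iterations // unroll
    taken_branches = trips - 1        # last trip falls through
    return iterations * body_cycles + taken_branches * BRANCH_PENALTY


rolled = loop_cycles(1024, body_cycles=4)
unrolled = loop_cycles(1024, body_cycles=4, unroll=8)
print(rolled, unrolled)
```

With a short loop body, branch penalties dominate the rolled loop in this model, so unrolling by 8 removes most of the cost.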
SPE Pipeline
▪ Little ILP, and thus little control logic
▪ Dual issue
▪ Simple commit unit (no reorder buffer or other
▪ Same execution unit for FP/int
SPE Summary
▪ Essentially a small vector computer
▪ Based on AltiVec/VMX ISA
• Extensions for DMA and LS management
• Extended for 128 x 128-bit register file
▪ Uniquely suited for real-time applications
▪ Extremely fast for certain FP operations
▪ Offload a large amount on to compiler /
Element Interconnect Bus
▪ 4 concentric rings connecting all Cell elements
▪ 128-bit wide interconnects
EIB (cont.)
▪ Designed to minimize coupling noise
▪ Rings of data traveling in alternating directions
▪ Buffers and repeaters at each SPE
▪ Architecture can be scaled up with increased bus latency
EIB (cont.)
▪ Total bandwidth at ~200 GB/s
▪ EIB controller located physically in center of chip between SPEs
▪ Controller reserves channels for each individual data transfer request
▪ Implementation allows for SPE extension horizontally
Memory Interface
▪ Rambus XDR memory to keep Cell at full utilization
▪ 3.2 Gbps data bandwidth per pin on each device connected to XDR interface
▪ Cell uses dual-channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
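The total memory bandwidth follows from the per-pin rate, bus width, and device count. The sketch below assumes the 3.2 Gbps figure is a per-pin data rate and that all four 16-bit devices transfer in parallel. The function name is hypothetical.

```python
def xdr_bandwidth_gbytes(gbps_per_pin=3.2, bus_width_bits=16, devices=4):
    """Aggregate XDR bandwidth in GB/s (assumes per-pin rate, 8 bits per byte)."""
    total_gbits = gbps_per_pin * bus_width_bits * devices
    return total_gbits / 8


bandwidth = xdr_bandwidth_gbytes()
print(f"{bandwidth:.1f} GB/s")
```

Under these assumptions the product is 3.2 x 16 x 4 = 204.8 Gbps, or 25.6 GB/s.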
Input / Output Bus
▪ Rambus FlexIO bus
▪ I/O interface consists of 12 unidirectional byte lanes
▪ Each lane supports 6.4 GB/s
▪ 7 outbound lanes and 5 inbound lanes
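The lane counts and per-lane rate give the aggregate I/O bandwidth directly. A small sketch of that arithmetic, using only the numbers on the slide (the function name is hypothetical):

```python
def flexio_bandwidth(lanes_out=7, lanes_in=5, gbytes_per_lane=6.4):
    """Return (outbound, inbound, total) FlexIO bandwidth in GB/s."""
    outbound = lanes_out * gbytes_per_lane
    inbound = lanes_in * gbytes_per_lane
    return outbound, inbound, outbound + inbound


outbound, inbound, total = flexio_bandwidth()
print(f"out {outbound:.1f}, in {inbound:.1f}, total {total:.1f} GB/s")
```

The asymmetric split means the chip can send more than it can receive, with the two directions summing to the 12-lane total.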
Design Choices
▪ In-order execution
• Abandoning ILP
• ILP – 10-20% increase per generation
• Reducing control logic
• Real-time responsiveness
▪ Cache design
• Software configuration on SPE
• Standard L2 cache on PPE
Cell Programming Issues
▪ No Cell compiler in existence to manage utilization of SPEs at compile time
▪ SPEs do not natively support context switching. Must be OS managed.
▪ SPEs are vector processors. Not efficient for general-purpose computation.
▪ PPEs and SPEs use different instruction sets
Cell Programming (cont.)
▪ Functional Offload Model
▪ Simplest model for Cell programming
▪ Optimize existing libraries for SPE
▪ Requires no rebuild of main application logic, which runs on PPE
Cell Programming (cont.)
▪ Device Extension Model
▪ Take advantage of SPE DMA
▪ Use SPEs as interfaces to external devices
Cell Programming (cont.)
▪ Computational Acceleration Model
▪ Traditional supercomputing methods using Cell
▪ Shared memory or message passing paradigm for accelerating inherently parallel math operations
▪ Can overwrite intensive math libraries without rewriting
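As a rough illustration of the computational acceleration idea, the sketch below partitions a data-parallel math operation (a*x + y over a vector) across a pool of workers. Python threads stand in for SPEs here. This is an analogy under that assumption, not Cell code, and `parallel_saxpy` and `NUM_WORKERS` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8   # stand-in for the number of SPEs (assumption)


def saxpy_chunk(args):
    """Compute a*x + y for one contiguous chunk of the vectors."""
    a, xs, ys = args
    return [a * x + y for x, y in zip(xs, ys)]


def parallel_saxpy(a, xs, ys, workers=NUM_WORKERS):
    """Split the vectors into chunks and process each on its own worker."""
    step = max(1, -(-len(xs) // workers))   # ceiling division, at least 1
    chunks = [(a, xs[i:i + step], ys[i:i + step])
              for i in range(0, len(xs), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(saxpy_chunk, chunks))
    return [v for part in parts for v in part]


result = parallel_saxpy(2.0, list(range(20)), [1.0] * 20)
print(result[:4])
```

The caller sees an ordinary library function, while the partitioning and dispatch stay hidden inside it.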
Cell Programming (cont.)
▪ Streaming Model
▪ Use Cell processor as one large programmable pipeline
▪ Partition algorithms into logically sensible steps. Execute each separately, in serial, on separate SPEs
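The streaming idea can be sketched as a chain of stages, each of which would run as a kernel on its own processing element. The example below models only the data flow, with lazy iterators standing in for the hand-off between stages. The stage functions and `run_pipeline` are hypothetical.

```python
def run_pipeline(stages, items):
    """Pass every item through the stages in order.

    Each stage stands in for a kernel on a separate processing element;
    map() chains them lazily, so items flow through one at a time.
    """
    stream = iter(items)
    for stage in stages:
        stream = map(stage, stream)
    return list(stream)


def scale(x):
    return x * 2


def offset(x):
    return x + 1


def clamp(x):
    return min(x, 6)


output = run_pipeline([scale, offset, clamp], [0, 1, 2, 3, 4])
print(output)
```

On real hardware the throughput of such a pipeline is set by its slowest stage, so the partitioning into steps needs to be balanced.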
Cell Programming (cont.)
▪ Asymmetric Thread Runtime Model
▪ Abstract Cell architecture away from
▪ Use the OS to run a different thread on each processor
Sample Performance
▪ Demonstration physics engine for real-time game
▪ http://www.research.ibm.com/cell/w
▪ 182 compute-to-DMA ratio on SPEs
▪ For the right tasks, Cell architecture can be extremely efficient.