INF5062:
Programming asymmetric multi-core processors
29/8 - 2008
Disclaimer
Asymmetric and heterogeneous multi-core processors
− The terminology is still developing
− Asymmetric
• Multi-core chips
• Entirely identical instruction set
• Asymmetry in speeds, frequencies, power consumption, etc.
− Heterogeneous
• Multi-core chips
• Heterogeneous instruction sets
“Heterogeneous” is nowadays better; the course name will change next year
Overview
Course topic and scope
Background for the use of, and parallel processing on, heterogeneous multi-core processors
Examples of heterogeneous architectures
People
Håvard Espeland (TA) email: haavares @ ifi
Pål Halvorsen email: paalh @ ifi
Carsten Griwodz email: griff @ ifi
About INF5062: Topic & Scope
Content: The course gives …
− … an overview of heterogeneous multi-core processors in general and three variants in particular
(architectures and use)
− … an introduction to working with heterogeneous multi-core processors
• Intel IXP 2400 network processor card
• nVIDIA’s G80 family of GPUs and the CUDA programming framework
• The Cell Broadband Engine Architecture
− … some ideas of how to use/program heterogeneous multi-core processors
(regular and guest lectures)
About INF5062: Topic & Scope
Tasks:
An important part of the course is lab assignments in which the students program each of the three examples of heterogeneous multi-core processors:
1. On the Intel IXP
• Protocol statistics – download and run wwpingbump to give processor, interface and protocol statistics, and then extend it
2. On the Cell processor
• Video encoding – download Motion JPEG compression software and improve its performance by using the Cell processor’s SPEs
3. On the nVIDIA graphics cards
• Video encoding – the same goal as above, but exploit the parallelism of the G80 architecture
About INF5062: Exam
Prerequisite – mandatory assignments:
− Lab assignment 2: solve task 2 or 3
− Present it to the class
2 graded assignments (counting 33% each):
− Lab assignment 1: solve task 1
• Deliver code
• Make a demonstration to the class
• Explain your design and code
− Lab assignment 3: solve tasks 2 and 3
• Deliver code
• Demonstrate to the class the lab assignment that was not shown for the mandatory assignment
• Explain your design and code
Final oral exam (counting 33%): early December 2008
− Content of the lectures
− Content of lab assignments
− The experimental platforms used
− Your own code
Available Resources
Resources will be placed at
− http://www.ifi.uio.no/~griff/INF5062
− Login: inf5062
− Password: ixp
− Manuals, papers, code examples, …
Background and Motivation:
Motivation: Intel View
Soon >1 billion transistors integrated
Clock frequency can still increase
Future applications will demand TIPS (tera-instructions per second)
Power? Heat?
Motivation
“Future applications will demand TIPS”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
→ Distribute workload on multiple cores
Background and Motivation:
Symmetric Multi-Core Processors
[Die photo: Intel Dual-Core Xeon]
[Die photo: AMD Phenom X4]
[Die photo: Sun UltraSPARC]
Intel Multi-Core Processors
Symmetric multi-processors allow multi-threaded applications to achieve higher performance with less die area and power consumption than single-core processors, as the sketch below illustrates
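As a minimal illustration (not from the slides) of distributing a data-parallel workload across the cores of a symmetric multi-core processor, here is a POSIX-threads sketch in C; the array size and thread count are arbitrary choices:

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4            /* e.g., one thread per core */
    #define N_ELEMS   (1 << 20)

    static float data[N_ELEMS];

    /* Each thread scales its own contiguous slice of the array. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long chunk = N_ELEMS / N_THREADS;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            data[i] *= 2.0f;
        return NULL;
    }

    int main(void) {
        pthread_t tid[N_THREADS];
        for (long t = 0; t < N_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < N_THREADS; t++)
            pthread_join(tid[t], NULL);
        printf("first element: %f\n", data[0]);
        return 0;
    }

On a symmetric chip every thread runs the same code on identical cores; the hard part, as the next slide notes, is that real applications rarely decompose this cleanly.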
Symmetric Multi-Core Processors
Good
− Growing computational power
Problematic
− Growing die sizes
− Some cores used much more than others
− Individual cores frequently unused
− Many core parts frequently unused
Why not spread the load better?
− Functions exist only once per core
− Parallel programming is hard
→ Asymmetric multi-core processors
Asymmetric Multi-Core Processors
Asymmetric multi-processors consume power and provide increased computational power only on demand
[Figure: highly parallel, moderately parallel and sequential workloads mapped onto different mixes of cores]
Homogeneous Multi-Core Processors
Operating systems scale only to a limited number of threads
Where does the increase in core numbers stop?
How many applications need more than 64 threads?
Performance? Software limits are the issue
Application-specific engines?
Intel Academic Forum 2006: YES
But: Programming model?????
Motivation
“Future applications will demand TIPS”
“Think platform beyond a single processor”
“Exploit concurrency at multiple levels”
“Power will be the limiter due to complexity and leakage”
→ Distribute workload on multiple cores
+ simple processors are easier to program
+ consume less energy
→ heterogeneous multi-core processors
Background and Motivation:
Co-Processors
The original IBM PC included a socket for an Intel 8087 floating point co-processor (FPU)
− 50-fold speed up of floating point operations
Intel kept the co-processor up to i486
− 486DX contained an optimized i487 block
− Still separate pipeline (pipeline flush when starting and ending use)
− Communication over an internal bus
Commodore Amiga was one of the earlier machines that used multiple processors
− Motorola 680x0 main processor
− Blitter (block image transferrer - moving data, fill operations, line drawing, performing boolean operations)
− Copper (Co-Processor - change address for video RAM on the fly)
− And finally: the IBM PowerPC relegated the 680x0 to a co-processor job
Graphics Processing Units (GPUs)
GPU: a dedicated graphics rendering device
[Figure: 2D and 3D GPUs attached to the bus connector & memory hub]
First GPUs:
− 80s: early 2D operations (Amiga and Atari used a blitter; the Amiga also had the copper)
− 90s: 3D hardware for game consoles like the PlayStation and N64; 3dfx Voodoo 3D add-on cards for PCs
Graphics Processing Units (GPUs)
[Figure: modern GPU attached to the bus connector & memory hub]
New powerful GPUs, e.g., the Nvidia GeForce GTX 280:
− 30 cores @ 400 MHz
− 1.296 GHz shader clock
− 1 GB memory, memory BW: 141+ GBps
Similar for other manufacturers …
General Purpose Computing on GPU
The
− high arithmetic precision
− extremely parallel nature
− optimized, special-purpose instructions
− available resources
− …
… of the GPU allow general, non-graphics related operations to be performed on the GPU
Generic computing workload is off-loaded from the CPU to the GPU (see the sketch below)
More generically:
Heterogeneous multi-core processing
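A minimal sketch (not from the slides; the kernel, sizes and names are invented for illustration) of off-loading a generic computation to the GPU with CUDA:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Trivial data-parallel kernel: one thread per array element. */
    __global__ void scale(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        cudaMalloc((void **)&d, bytes);                  /* allocate on GPU   */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* CPU -> GPU        */
        scale<<<(n + 255) / 256, 256>>>(d, n);           /* launch the kernel */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* GPU -> CPU        */

        printf("h[0] = %f\n", h[0]);                     /* expect 2.0 */
        cudaFree(d); free(h);
        return 0;
    }

The CPU only orchestrates transfers and launches; the data-parallel work itself runs on the GPU's many simple cores.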
Background and Motivation:
nVIDIA G92
nVIDIA GTX 280 (latest and greatest)
− 1.4 billion transistors
− 240 shaders
− 512 bit memory bus (GDDR3)
− 141.7 GB/s memory bandwidth
− 933 Gflops
− PCI Express 2.0
nVIDIA G92
Stream Multiprocessor
− Per TPC (3 clusters)
• 16 kB cache (in TEX)
− Per SM
• 16 kB level 1 cache
• 64 kB shared memory (usage sketched below)
− Global
• 256 kB level 2 cache
Number of stream multiprocessors
− 1 - Quadro NVS 130M
− 16 - GeForce 8800 Ultra / GTX
− 30 - GeForce GTX 280
− 4x30 - Tesla S1070
[Figure: Streaming Multiprocessor (SM) block diagram: instruction fetch, instruction L1 cache, thread/instruction dispatch, shared memory (L1 fill), eight SPs (SP0–SP7) with register files (RF0–RF7), two SFUs, constant L1 cache, texture load path, load from / store to memory, and work/control/results flows]
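As a hedged sketch (not from the slides) of how a thread block running on one SM uses that SM's on-chip shared memory, here is a standard CUDA block-wise reduction; the block size of 256 is an arbitrary choice and n is assumed to be a multiple of it:

    /* Per-block sum staged in the SM's shared memory. */
    __global__ void block_sum(const float *in, float *out) {
        __shared__ float buf[256];      /* allocated in the SM's shared memory */
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();
        /* Tree reduction within the block; halves the active threads each step. */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];   /* one partial sum per block */
    }
    /* Launched e.g. as block_sum<<<n / 256, 256>>>(d_in, d_out);
       blocks are scheduled across the available SMs. */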
Memory Bandwidth for CPU and GPU
[Graph: memory bandwidth development for CPUs vs. GPUs]
Marketed as GPGPUs
Background and Motivation:
STI (Sony, Toshiba, IBM) Cell
Motivation for the Cell
− Cheap processor
− Energy efficient
− For games and media processing
− Short time-to-market
Conclusion
− Use a multi-core chip
− Design around an existing, power-efficient design
− Add simple cores specific for game and media processing requirements
STI (Sony, Toshiba, IBM) Cell
Cell is a 9-core processor
− combining a light-weight general-purpose processor with multiple co-processors into a coordinated whole
− Power Processing Element (PPE)
• conventional Power processor
• not supposed to perform all operations itself, acting like a controller
• running conventional OSes
• 16 KB instruction/data level 1 cache
• 512 KB level 2 cache
STI (Sony, Toshiba, IBM) Cell
− Synergistic Processing
Elements (SPE)
• specialized co-processors for specific types of code, i.e., very high performance vector processors
• local stores
• can do general purpose operations
• the PPE can start, stop, interrupt and schedule processes running on an SPE (see the sketch below)
− Element Interconnect Bus (EIB)
• internal communication bus
• connects on-chip system elements:
PPE & SPEs
the memory controller (MIC)
two off-chip I/O interfaces
• 25.6 GBps each way
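A hedged sketch of the PPE starting a program on one SPE via the libspe2 API; the embedded program handle spu_prog is a placeholder name, and error handling is omitted:

    #include <libspe2.h>

    /* SPE program handle, embedded at link time (e.g., via ppu-embedspu);
       the name "spu_prog" is a placeholder for this sketch. */
    extern spe_program_handle_t spu_prog;

    int main(void) {
        spe_context_ptr_t spe = spe_context_create(0, NULL); /* one SPE context */
        spe_program_load(spe, &spu_prog);                    /* load SPE binary */

        unsigned int entry = SPE_DEFAULT_ENTRY;
        /* Blocks until the SPE program stops; the PPE acts as controller. */
        spe_context_run(spe, &entry, 0, NULL, NULL, NULL);

        spe_context_destroy(spe);
        return 0;
    }

A real application would create one context per SPE and run them from separate PPE threads so the SPEs execute concurrently.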
STI (Sony, Toshiba, IBM) Cell
− memory controller
• Rambus XDRAM interface to
Rambus XDR memory
• dual channels at 12.8 GBps each, i.e., 25.6 GBps in total
− I/O controller
• Rambus FlexIO interface which can be clocked independently
• dual configurable channels
• maximum ~ 76.8 GBps
STI (Sony, Toshiba, IBM) Cell
− Cell has in essence traded running everything at moderate speed for the ability to run certain types of code at high speed
− used for example in
• Sony PlayStation 3 :
3.2 GHz clock
7 SPEs for general operations
1 SPE for security functions of the OS
• Toshiba home cinema :
decoding of 48 HDTV MPEG streams
dozens of thumbnail videos simultaneously on screen
• IBM blade centers :
3.2 GHz clock
Linux ≥ 2.6.11
Background and Motivation:
Review of General Data Path on Conventional Computer Hardware Architectures
[Figure: sending, receiving and forwarding data paths through the host: the application in user space; transport (TCP/UDP), the communication system and the link in kernel space]
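To make this data path concrete, here is a minimal user-space UDP forwarder in C (not from the slides; addresses and ports are placeholders): every forwarded packet is copied from kernel to user space and back, which is the overhead that off-loading to a network processor avoids:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        /* Receive on port 5062, forward to 10.0.0.2:5063 (placeholders). */
        int in = socket(AF_INET, SOCK_DGRAM, 0);
        int out = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in local = { 0 }, dst = { 0 };
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons(5062);
        bind(in, (struct sockaddr *)&local, sizeof(local));

        dst.sin_family = AF_INET;
        inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);
        dst.sin_port = htons(5063);

        char buf[2048];
        for (;;) {
            /* kernel -> user space copy */
            ssize_t n = recv(in, buf, sizeof(buf), 0);
            if (n < 0) break;
            /* user space -> kernel copy, then out on the link */
            sendto(out, buf, n, 0, (struct sockaddr *)&dst, sizeof(dst));
        }
        close(in); close(out);
        return 0;
    }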
IXA: Internet Exchange Architecture
IXA
− a broad term to describe the Intel network architecture
− HW & SW, control- & data plane
IXP : Internet Exchange Processor
− processor that implements IXA
− IXP1200 is the first IXP chip
(4 versions)
− IXP2xxx has now replaced the first version
IXA: Internet Exchange Architecture
IXP1200 basic features
− 1 embedded 232 MHz StrongARM
− 6 packet 232 MHz µengines
− onboard memory
− 4 x 100 Mbps Ethernet ports
− multiple, independent busses
− low-speed serial interface
− interfaces for external memory and I/O busses
− …
IXA: Internet Exchange Architecture
IXP2400 basic features
− 1 embedded 600 MHz XScale
− 8 packet 600 MHz µengines
− onboard memory
− 3 x 1 Gbps Ethernet ports
− multiple, independent busses
− low-speed serial interface
− interfaces for external memory and I/O busses
− …
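As a rough, hypothetical sketch of the kind of fast-path loop a µengine runs (not real IXP code: rx_packet, ip_proto and tx_packet are invented stand-ins for Intel's microengine C primitives), counting packets per IP protocol as in the first lab task:

    #include <stdint.h>

    /* Hypothetical fast-path helpers; stand-ins for the IXP's real
       receive/transmit and header-access primitives. */
    extern int     rx_packet(uint8_t *buf, int len);   /* returns packet length */
    extern uint8_t ip_proto(const uint8_t *buf);       /* IP protocol field     */
    extern void    tx_packet(const uint8_t *buf, int len);

    static uint64_t proto_count[256];   /* per-protocol packet statistics */

    void fast_path(void) {
        uint8_t buf[2048];
        for (;;) {
            int len = rx_packet(buf, sizeof(buf));
            if (len <= 0)
                continue;
            proto_count[ip_proto(buf)]++;   /* e.g., 6 = TCP, 17 = UDP */
            tx_packet(buf, len);            /* forward without involving the CPU */
        }
    }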
IXP1200 Architecture
RISC processor:
- StrongARM running Linux
- control, higher layer protocols and exceptions
- 232 MHz
Microengines:
- low-level devices with a limited set of instructions
- transfers between memory devices
- packet processing
- 232 MHz
Access units:
- coordinate access to external units
Scratchpad:
- on-chip memory
- used for IPC and synchronization
[Diagram: the embedded RISC CPU (StrongARM) and six microengines sit on multiple independent internal buses, connected to the SRAM access unit (SRAM bus: SRAM, FLASH, memory-mapped I/O), the SDRAM access unit (DRAM bus: DRAM), the PCI access unit (PCI bus), the IX access unit (IX bus) and the scratch memory]
IXP2400 Architecture
RISC processor:
- StrongARM → XScale
- 233 MHz → 600 MHz
Microengines:
- 6 → 8
- 233 MHz → 600 MHz
- form the fast path for transfers
Coprocessors:
- 4 timers
- general purpose I/O pins
- external JTAG connections (in-circuit tests)
- …
Slowport:
- shared interface to external units
- used for FlashRom during bootstrap
Receive/transmit buses:
- shared bus → separate buses
[Diagram: the embedded RISC CPU (XScale) and eight microengines sit on multiple independent internal buses, connected to the SRAM access unit (with coprocessor), the SDRAM access unit (DRAM bus: DRAM), the PCI access unit (PCI bus), the MSF access unit (separate receive and transmit buses), the scratch memory, and the slowport access unit (FLASH)]
Background and Motivation:
The End: Summary
Heterogeneous multi-core processors are already everywhere
Challenge: programming
− Need to know the capabilities of the system
− Different abilities in different cores
− Memory bandwidth
− Memory sharing efficiency
− Need new methods to program the different components
Next time: how to start programming the Intel IXP