Lecture 11: Multi-Core and GPU
 Multi-core computers
 Multithreading
 GPUs
 General Purpose GPUs
Multi-Core System

 Integration of multiple processor cores on a single chip.
  To provide a cheap parallel computer solution.
  To increase the computation power in a PC platform.
 A multi-core processor is a special kind of multiprocessor:
  All processors are on the same chip  it is therefore also called a Chip Multiprocessor.
  MIMD: different cores execute different code (threads) operating on different data.
  A shared-memory multiprocessor: all cores share the same memory via some cache system (see the sketch below).
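To make the shared-memory point concrete, here is a minimal host-side sketch (not from the lecture; plain C++ that also compiles as CUDA host code) in which two threads communicate through a single variable in the shared address space; std::atomic stands in for the hardware cache-coherence support mentioned above.

```
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<long> counter{0};            // one variable, visible to every core/thread

    auto work = [&counter] {                 // both threads run this code on shared data
        for (int i = 0; i < 1000000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t1(work), t2(work);          // two threads, ideally scheduled on two cores
    t1.join();
    t2.join();

    // Cache coherence guarantees both cores end up agreeing on the final value.
    std::printf("counter = %ld\n", counter.load());   // prints 2000000
    return 0;
}
```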
Why Multi-Core?

 Limitations of single-core architectures:
  High power consumption due to high clock rates (ca. 2-3% power increase per 1% performance increase).
  Heat generation (cooling is expensive).
  Limited parallelism (instruction-level parallelism only).
  Increased design time and complexity due to the complex methods needed to increase ILP.
 Many new applications are multithreaded and thus well suited for multi-core.
  Ex. multimedia applications.
 General trend in computer architecture: a shift towards more parallelism.
 Cache coherency circuits are much faster when implemented within a single chip.
 Smaller physical size than an SMP.
A Simplified Multi-Core Model
[Figure: four cores, each with its own split L1 instruction and data caches, connected by a shared on-chip bus; main memory sits outside the chip.]
Alternative Models
Intel Core i7

 Four (or six) identical x86 processor cores.
 Each core has its own split L1 cache and unified L2 cache.
 A shared L3 cache speeds up inter-processor communication.
 High-speed links between processor chips.
 Up to 4 GHz clock frequency.
Superscalar vs. Multi-Core
[Figure: a 6-way superscalar processor compared with a multi-core processor built from 4 identical 2-way superscalar cores (total degree of parallelism: 8).]
 The complexity of the issue logic grows quadratically with the issue width.
 Multi-core processors can run at a higher clock rate.
Single Core vs. Multi Core
[Figure: normalized performance vs. the initial Intel® Pentium® 4 processor, from 2000 to 2008+. The single-core curve reaches roughly 3X, while the multi-core curve reaches roughly 10X. Source: Intel]
Intel Polaris
 80 cores with teraFLOPS (10^12 FLOPS) performance on a single chip (the first chip to do so).
 Mesh network-on-a-chip.
 Frequency target at 5 GHz.
 Workload-aware power management:
  Peak of 1.01 teraFLOPS at 62 watts.
  Peak power efficiency of 19.4 gFLOPS/Watt (almost 10X better than supercomputers).
  Chip voltage & frequency control.
  Instructions to make any core sleep or wake as applications demand.
 Short design time (regularity).
[Die photo: a 21.72 mm × 12.64 mm die with PLL, TAP and I/O areas; each of the 80 tiles is 2.0 mm × 1.5 mm.]
Chip summary: technology 65 nm, 1 poly, 8 metal (Cu); transistors 100 million (full chip), 1.2 million per tile; die area 275 mm² (full chip), 3 mm² per tile; 8390 C4 bumps.
Innovations on Intel’s Polaris

 Rapid design – the tiled design approach allows designers to use smaller cores that can easily be repeated across the chip.
 Network-on-a-chip – the cores are connected in a 2D mesh network that implements message passing.
  This scheme is much more scalable.
 Fine-grain power management:
  The individual computing engines and data routers in each core can be activated or put to sleep based on the performance required by the applications.
Lecture 11: Multi-Core and GPU
 Multi-core computers
 Multithreading
 GPUs
 General Purpose GPUs
Thread-Level Parallelism (TLP)

 This is parallelism at a coarser level than instruction-level parallelism (ILP).
 The instruction stream is divided into smaller streams (threads) to be executed in parallel.
 A thread has its own instructions and data (see the sketch below).
  It may be part of a parallel program or an independent program.
  Each thread has all the state (instructions, data, PC, etc.) needed to execute.
 Multi-core architectures can exploit TLP efficiently.
  They use multiple instruction streams to improve the throughput of computers that run several programs.
 TLP is more cost-effective to exploit than ILP.
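As a minimal illustration of a thread having its own instructions and data (an assumed sketch, not part of the lecture), the C++ example below runs two threads, each summing its own half of an array in its own instruction stream; on a multi-core machine the two streams can run on different cores.

```
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Each thread executes its own instruction stream over its own slice of the data.
static void sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi, long& out) {
    out = std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
}

int main() {
    std::vector<int> data(1000, 1);
    long s1 = 0, s2 = 0;

    std::thread t1(sum_range, std::cref(data), 0, 500, std::ref(s1));
    std::thread t2(sum_range, std::cref(data), 500, 1000, std::ref(s2));
    t1.join();
    t2.join();

    std::printf("total = %ld\n", s1 + s2);   // prints 1000
    return 0;
}
```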
Multithreading Approaches

 Interleaved (fine-grained)
  The processor deals with several thread contexts at a time.
  Threads are switched at each clock cycle (hardware support needed).
  If a thread is blocked, it is skipped.
  Hides the latency of both short and long pipeline stalls.
 Blocked (coarse-grained)
  A thread executes until an event causes a delay (e.g., a cache miss).
  Relieves the need for very fast thread switching.
  No slowdown for ready-to-go threads.
 Simultaneous multithreading (SMT)
  Instructions are issued simultaneously from multiple threads to the execution units of a superscalar processor.
 Chip multiprocessing
  Each processor handles separate threads.
Scalar Processor Case
[Figure: instruction issue over time on a scalar processor; a cache miss stalls the pipeline until the missing data arrives.]
Scalar Processor Approaches

 Single-threaded scalar
  Simple pipeline, no multithreading.
 Interleaved multithreaded scalar
  The easiest form of multithreading to implement.
  Threads are switched at each clock cycle.
  Pipeline stages are kept close to fully occupied.
  The hardware needs to switch thread context between cycles.
 Blocked multithreaded scalar
  A thread executes until a latency event occurs that would stop the pipeline.
  The processor then switches to another thread.
Superscalar Case
Superscalar Approaches

 Classical superscalar
  No multithreading.
 Interleaved multithreading superscalar
  In each cycle, as many instructions as possible are issued from a single thread.
  Delays due to thread switches are eliminated.
  The number of instructions issued in a cycle is limited by dependencies.
 Blocked multithreaded superscalar
  Instructions from one thread at a time are executed.
  Blocked multithreading is used.
SMT and Chip Multiprocessing

 Simultaneous multithreading (SMT):
  Multiple instructions are issued at a time.
  One thread may fill all horizontal slots.
  Instructions from two or more threads may be issued to fill all horizontal slots.
  With enough threads, the maximum number of instructions can be issued on every cycle.
• Best efficiency!
SMT and Chip Multiprocessing

 Chip multiprocessing:
  Multiple processors on one chip.
  Each can itself be a superscalar processor with multiple instruction issue (e.g., 2-issue).
  Each processor is assigned a thread (or a process).
• Control is simplified!
Multithreading Paradigms
[Figure: execution time (vertically) vs. four functional units (FU1–FU4, horizontally) for five paradigms: conventional superscalar (single-threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP). Colors distinguish threads 1–5 and unused issue slots.]
Programming for Multi-Core

 There must be many threads or processes:
  Multiple applications running on the same machine.
• Multi-tasking is becoming very common.
• OS software tends to run many threads as part of its normal operation.
  An application may also have multiple threads.
• In most cases, it must be specifically written or compiled for this (see the sketch below).
 The OS scheduler should map threads to different cores:
  To balance the workload; or
  To avoid hot spots due to heat generation.
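A minimal sketch of an application written for multi-core (an assumption, not from the slides): it asks the C++ runtime how many hardware threads the machine offers and spawns one worker per hardware thread; the OS scheduler then maps the workers onto the cores.

```
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // Number of hardware threads (cores, or cores × SMT contexts); may return 0 if unknown.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                       // fall back to an assumed core count

    std::vector<std::thread> workers;
    for (unsigned id = 0; id < n; ++id)
        workers.emplace_back([id] {
            // Per-thread work goes here; the OS decides which core runs this thread.
            std::printf("worker %u running\n", id);
        });

    for (auto& w : workers)
        w.join();
    return 0;
}
```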
Lecture 11: Multi-Core and GPU
 Multi-core computers
 Multithreading
 GPUs
 General Purpose GPUs
Graphics Processing Unit  Why?

 Graphics applications are:
  Rendering of 2D or 3D images with complex optical effects;
  Highly computation intensive;
  Massively parallel;
  Data-stream based.
 General-purpose processors (CPUs) are:
  Designed to handle huge data volumes;
  Serial, performing one operation at a time;
  Control-flow based;
  Highly flexible, but not very well adapted to graphics.
 GPUs are the solution!
Graphics Processing Example
[Figure: a graphics pipeline. Input from the CPU enters through the host interface and flows through vertex processing, triangle setup and pixel processing to the memory interface, which has four 64-bit channels to memory. The programmable stages (vertex and pixel processing) are highlighted in yellow.]
Graphics Processing Feature
1) Process pixels in parallel


 Large degree of data parallelism:
  2.1M pixels per frame (HDTV) => lots of work.
  All pixels are independent => no synchronization needed.
  Lots of spatial locality => regular memory access.
 Great speedups can be achieved (see the sketch below).
  Limited only by the amount of hardware.
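A hedged CUDA sketch of the "one thread per pixel" idea (the kernel name and the brightness operation are invented for illustration): every pixel gets its own thread, no synchronization is needed, and neighbouring threads access neighbouring memory.

```
// One GPU thread per pixel: independent work, regular memory accesses.
__global__ void brighten(unsigned char* image, int width, int height, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (x < width && y < height) {
        int idx = y * width + x;                     // row-major pixel index
        int v = image[idx] + delta;
        image[idx] = (unsigned char)(v > 255 ? 255 : v);   // saturate at white
    }
}

// Illustrative launch: one 16×16 thread block per 16×16 tile of pixels.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_image, width, height, 20);
```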
Graphics Processing Feature
2) Focus on throughput, not latency

 Each pixel can take a long time to process, as long as we process many of them at the same time.
 Great scalability:
  Lots of simple parallel processors.
  Low clock speed.
[Figure: a latency-optimized processor (fast, serial) vs. a throughput-optimized processor (slow, parallel).]
GPU Examples
[Figure: die photos of the Nvidia G80 and the AMD 5870, annotated with their common traits: lots of small parallel SIMD processors, limited interconnect, limited memory, lots of memory controllers, and very small caches.]
GPU Examples
[Figure: the same Nvidia G80 and AMD 5870 dies, with the hardware thread schedulers and the fixed-function logic highlighted.]
CPU vs. GPU Philosophy
[Figure: floorplans of a 4-core CPU (each core with L1, L2 and a branch predictor) next to an 8×8 array of simple GPU cores (each with a small instruction cache and local memory).]
 4 massive CPU cores: big caches, branch predictors, out-of-order, multiple-issue, speculative execution, …
  2 IPC per core, 8 IPC in total @ 3 GHz.
 8×8 = 64 simple GPU cores: no big caches, in-order, single-issue, …
  1 IPC per core, 64 IPC in total @ 1.5 GHz.
Source: David Black-Schaffer
CPU vs. GPU Chips
 Intel Nehalem 4-core: 130 W, 263 mm²; 106 gFLOPS (single precision); big caches (8 MB); out-of-order execution; 0.8 gFLOPS/W.
 AMD Radeon 5870: 188 W, 334 mm²; 2720 gFLOPS (single precision); small caches (<1 MB); hardware thread scheduling; 14.5 gFLOPS/W.
Source: David Black-Schaffer
CPU vs. GPU Architecture

 CPUs are latency-optimized:
  Each thread runs as fast as possible, but there are only a few threads.
 GPUs are throughput-optimized:
  Each thread may take a long time, but there are thousands of threads.
 CPUs have a few massive cores; GPUs have hundreds of simple cores.
 CPUs excel at irregular, control-intensive work:
  Lots of hardware for control, few ALUs.
 GPUs excel at regular, computation-intensive work:
  Lots of ALUs for computation, little hardware for control.
The best is to combine both architectures in the same machine!
Lecture 11: Multi-Core and GPU
 Multi-core computers
 Multithreading
 GPUs
 General Purpose GPUs
GPU – Specialized Processor

 Very efficient for:
  Fast parallel floating-point processing.
  SIMD (Single Instruction, Multiple Data) operations.
  High computation per memory access.
 Not as efficient for:
  Double-precision computations.
  Logical operations on integer data.
  Random-access, memory-intensive operations.
  Branching-intensive operations.
GPU Instruction Flow

 GPU control units fetch one instruction per cycle and share it among 8 processor cores.
  SIMD (AMD GPU = 4-way; Intel's Larrabee = 16-way).
 What if they don't all want the same instruction?
  Divergent execution, due to branches, for example.
[Figure: a group of simple cores sharing one instruction cache (I$) and local memory (LM).]
Divergent Execution
[Figure: eight threads (t0–t7) execute the instruction sequence 1, 2, if, 3, el(se), 4, 5, 6 in lockstep for the branch "if (…) do 3 else do 4". Over cycles 0–8 the control unit fetches 1, 2, if, 3, el, 5, 4, 6, 5: threads t0–t6 take the if-path (instruction 3) while t7 takes the else-path (instruction 4), so t7 stalls while the others execute, and extra cycles are spent fetching and executing the else-path for t7 alone.]
Divergent execution can dramatically hurt performance! (A CUDA sketch follows below.)
Source: David Black-Schaffer
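A small hedged CUDA example of such a divergent branch (invented for illustration, not the code behind the figure): threads with even and odd indices within the same warp want different instructions, so the hardware serializes the two paths and the warp pays for both.

```
__global__ void divergent(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of one warp (32 consecutive threads) disagree here,
    // so the "then" and "else" paths are executed one after the other.
    if (tid % 2 == 0)
        out[tid] = out[tid] * 2.0f;   // plays the role of "do 3" above
    else
        out[tid] = out[tid] + 1.0f;   // plays the role of "do 4"

    // A condition that is uniform per warp, e.g. (tid / 32) % 2 == 0,
    // would not diverge: all 32 threads of a warp take the same path.
}
```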
Reduce Branch Divergence

 Software-based optimizations can be used.
 Iteration delaying (a simplified sketch follows below):
  Targets a divergent branch enclosed by a loop.
  Executes the loop iterations that take the same branch direction.
  Delays those that take the other direction until later iterations.
  Executes the delayed operations in future iterations where more threads take the same direction.
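The sketch below is a heavily simplified, assumed illustration of the flavour of iteration delaying, not the algorithm from the lecture or the original paper: in each loop iteration a thread only processes its item if it agrees with the warp-majority branch direction, and postpones disagreeing items; here the postponed items are simply drained after the main loop rather than merged into later iterations. It assumes blockDim.x is a multiple of 32, every thread loops over the same n, and the two branch bodies are stand-ins.

```
__global__ void iteration_delaying(const int* cond, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int delayA = 0, delayB = 0;                      // items this thread postpones

    for (int i = 0; i < n; ++i) {
        bool wantA = cond[tid * n + i] != 0;         // direction this item wants
        unsigned votes = __ballot_sync(0xffffffffu, wantA);
        bool majorityA = __popc(votes) >= 16;        // direction most of the warp wants

        if (wantA == majorityA) {
            // Agrees with the majority: execute now, the warp stays converged.
            out[tid] += majorityA ? 1.0f : 2.0f;     // placeholder "then"/"else" bodies
        } else {
            // Disagrees: postpone the item instead of diverging in this iteration.
            if (wantA) ++delayA; else ++delayB;
        }
    }

    // Drain the postponed items (still divergent, but over far fewer iterations).
    for (int i = 0; i < delayA; ++i) out[tid] += 1.0f;
    for (int i = 0; i < delayB; ++i) out[tid] += 2.0f;
}
```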
GPGPU

 Make the GPU more general by adding more programmable components.
 Implement the GPU and a (multi-core) CPU in the same machine.
 Provide hardware support on the GPU for non-graphics applications.
 Include multiple parallelism schemes in the same architecture.
NVIDIA Tesla Architecture
GPU Architecture Features

 Massively parallel: 1000s of processors (today).
 Cost efficiency:
  Graphics drives demand up and supply up, leading to low prices.
 Power efficiency:
  Fixed-function hardware = area- and power-efficient.
  No speculation: more processing, less leaky cache.
  SIMD computation.
 Computing power = frequency × transistors  (Moore's law)².
  GPUs: 1.7X (pixels) to 2.3X (vertices) annual growth.
  CPUs: 1.4X annual growth.
Performance Improvement
[Figure: peak GFLOPS from 9/2002 to 7/2009. The NVIDIA GPU curve (NV30, NV40, G70, G80, GT200, T12) rises from near zero to well over 1000 GFLOPS, while the Intel CPU curve (3 GHz dual-core Pentium 4, 3 GHz Core 2 Duo, 3 GHz quad-core Xeon, Westmere) stays far below it.]
CUDA Programming Language

 CUDA = Compute Unified Device Architecture.
 Minimal extensions to familiar C/C++.
 Heterogeneous serial/parallel programming model:
  Serial code executes on the CPU.
  Parallel kernels execute across a set of parallel threads on the GPU (see the sketch below).
 It enables general-purpose GPU computing.
  It also maps well to multi-core CPUs.
 It enables heterogeneous systems (i.e., CPU + GPU).
  The CPU is good at serial computation, the GPU at parallel computation.
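A minimal hedged sketch of the serial/parallel split (the kernel name and sizes are invented for illustration): the `__global__` kernel below is executed by many GPU threads in parallel, one array element per thread, while ordinary serial host code chooses the launch configuration.

```
// Parallel kernel: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

// Serial host code picks the launch configuration and starts the kernel:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;      // enough threads to cover all n elements
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```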
General CUDA Steps

 Copy data from the CPU to the GPU.
 Perform the SIMD computation (kernel) on the GPU.
 Copy the data back from the GPU to the CPU.
 Execution on the host doesn't wait for the kernel to finish on the GPU (kernel launches are asynchronous).
 General rules:
  Minimize data transfer between the CPU and the GPU.
  Maximize the number of threads on the GPU.
(A host-side sketch of these steps follows below.)
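A hedged host-side sketch of these steps (assumed example code, reusing the hypothetical vecAdd kernel from the previous slide): allocate device memory, copy the inputs to the GPU, launch the kernel, and copy the result back. The final cudaMemcpy also acts as the synchronization point, since the launch itself returns immediately.

```
#include <cuda_runtime.h>
#include <vector>

// Host-side driver for the vecAdd kernel sketched on the previous slide.
// Assumes c has already been resized to a.size().
void run_vec_add(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& c)
{
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    // 1. Copy data from CPU to GPU.
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // 2. Perform the SIMD computation (kernel) on the GPU.
    //    The launch is asynchronous: the host continues without waiting.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 3. Copy the result back from GPU to CPU (this call waits for the kernel to finish).
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```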
Summary


 All computers are now parallel computers!
 Multi-core processors represent an important new trend in computer architecture:
  Decreased power consumption and heat generation.
  Minimized wire lengths and interconnect latencies.
 They enable true thread-level parallelism with great energy efficiency and scalability.
 Graphics requires a lot of computation and a huge amount of bandwidth  GPUs.
 GPUs are moving into general-purpose computing, since they deliver huge performance at low power.
 Massively parallel computing is becoming a commodity technology.