Parallel Computer Architecture Concepts Outline

Parallel Computer Architecture Concepts
TDDD93 Lecture 1
Christoph Kessler
PELAB / IDA, Linköping University, Sweden
2015
Lecture 1: Parallel Computer Architecture Concepts
• Parallel computer, multiprocessor, multicomputer
• SIMD vs. MIMD execution
• Shared memory vs. Distributed memory architecture
• Interconnection networks
• Parallel architecture design concepts
  - Instruction-level parallelism
  - Hardware multithreading
  - Multi-core and many-core
  - Accelerators and heterogeneous systems
  - Clusters
• Implications for programming and algorithm design
Traditional Use of Parallel Computing: in HPC
(Figure: a parallel computer)
Example: Weather Forecast (very simplified…)
• 3D space discretization (cells)
• Time discretization (steps)
• Start from current observations (sent from weather stations)
• Simulation step by evaluating weather model equations
State variables per cell:
• Air pressure
• Temperature
• Humidity
• Sun radiation
• Wind direction
• Wind velocity
• …
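To make the structure concrete, here is a minimal sketch (not from the lecture) of such a time-stepping simulation: a 3D grid of cells, each updated once per step from its neighbors' state. The grid size, the single state value per cell and the update rule are placeholders, not the actual weather model.

    // Minimal time-stepping sketch over a 3D cell grid (illustrative only):
    // each step computes a new state for every interior cell from its neighbors.
    #include <vector>

    const int NX = 64, NY = 64, NZ = 32, STEPS = 100;
    inline int idx(int x, int y, int z) { return (z * NY + y) * NX + x; }

    int main() {
        std::vector<double> cur(NX * NY * NZ, 15.0);   // e.g. temperature per cell
        std::vector<double> next(cur);
        for (int t = 0; t < STEPS; ++t) {
            for (int z = 1; z < NZ - 1; ++z)
                for (int y = 1; y < NY - 1; ++y)
                    for (int x = 1; x < NX - 1; ++x)
                        // placeholder "model equation": average of the 6 neighbors
                        next[idx(x,y,z)] = (cur[idx(x-1,y,z)] + cur[idx(x+1,y,z)] +
                                            cur[idx(x,y-1,z)] + cur[idx(x,y+1,z)] +
                                            cur[idx(x,y,z-1)] + cur[idx(x,y,z+1)]) / 6.0;
            cur.swap(next);
        }
        return 0;
    }

Because all cells within one time step are independent of each other, the inner loops can be distributed over many processing units, which is exactly where the parallel computer comes in.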
Parallel Computing for HPC Applications
• High Performance Computing
  - Much computational work (in FLOPs, floating-point operations)
  - Often, large data sets
  - E.g. climate simulations, particle physics, engineering, sequence matching or protein docking in bioinformatics, …
• Single-CPU computers and even today's multicore processors cannot provide such massive computation power
• Aggregate LOTS of computers → clusters (e.g., NSC Triolith)
  - Need scalable algorithms
  - Need to exploit multiple levels of parallelism
Parallel Computer Architecture Concepts
Classification of parallel computer architectures:
• by control structure
• by memory organization
  - in particular, Distributed memory vs. Shared memory
• by interconnection network topology

Classification by Control Structure
(Figure: operations op1 … op4 issued under common vs. separate control; cf. SIMD vs. MIMD execution)
Classification by Memory Organization

Interconnection Networks (1)
(Figure: processors P connected via routers/switches R)
Interconnection Networks (2): Simple Topologies
(Figures: simple topologies connecting processors P, e.g. fully connected)

Interconnection Networks (3): Hypercube
• Inductive definition: a d-dimensional hypercube consists of two (d-1)-dimensional hypercubes with corresponding nodes connected pairwise.
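As an aside (not on the slide), the inductive definition is equivalent to saying that two nodes are neighbors exactly when their binary IDs differ in one bit, which immediately gives node degree and longest path log2 p. A small sketch:

    // Hypercube neighbors: in a d-dimensional hypercube, node i is connected
    // to the d nodes whose IDs differ from i in exactly one bit.
    #include <cstdio>

    int main() {
        const int d = 3;                          // dimension; p = 2^d = 8 nodes
        for (int node = 0; node < (1 << d); ++node) {
            std::printf("node %d:", node);
            for (int bit = 0; bit < d; ++bit)
                std::printf(" %d", node ^ (1 << bit));   // flip one bit -> one neighbor
            std::printf("\n");
        }
        return 0;
    }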
More about Interconnection Networks
• Fat-Tree, Butterfly, … (see TDDC78)
• Switching and routing algorithms
• Discussion of interconnection network properties
  - Cost (#switches, #lines)
  - Scalability (asymptotically, cost grows not much faster than #nodes)
  - Node degree
  - Longest path (→ latency)
  - Accumulated bandwidth
  - Fault tolerance (node or switch failure)
  - …

Instruction-Level Parallelism (1): Pipelined Execution Units
• SIMD computing with pipelined vector units
  - e.g., vector supercomputers: Cray (1970s, 1980s), Fujitsu, …
Instruction-Level Parallelism (2): VLIW and Superscalar
• Multiple functional units in parallel
• 2 main paradigms:
  - VLIW (very long instruction word) architecture
    - Parallelism is explicit, programmer-/compiler-managed (hard)
  - Superscalar architecture
    - Sequential instruction stream
    - Hardware-managed dispatch
    - Power + area overhead
• ILP in applications is limited
  - Typically < 3...4 instructions can be issued simultaneously
  - Due to control and data dependences in applications (see the small example below)
• Solution: multithread the application and the processor
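A small illustration (not from the slides) of how data dependences limit ILP: in the first loop every addition needs the result of the previous one, so the operations cannot be issued in parallel; the second loop has independent iterations and is easy to exploit for superscalar, VLIW or SIMD execution.

    // Data dependences vs. independent operations (illustrative sketch).
    double chained_sum(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += a[i];            // each add depends on the previous value of s
        return s;
    }

    void independent_add(const double* a, const double* b, double* c, int n) {
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];   // iterations are independent of each other
    }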
Hardware Multithreading
• The core holds several hardware thread contexts and switches between them, e.g. when one thread stalls on a data dependence or a memory access, so the functional units are kept busy.

SIMD Instructions
• "Single Instruction stream, Multiple Data streams": a single thread of control flow
  - Restricted form of data parallelism: apply the same primitive operation (a single instruction) in parallel to multiple data elements stored contiguously
  - SIMD units use long "vector registers", each holding multiple data elements
• Common today
  - MMX, SSE, SSE2, SSE3, …
  - Altivec, VMX, SPU, …
• Performance boost for operations on shorter data types
• Area- and energy-efficient
• Code has to be rewritten (SIMDized) by programmer or compiler
• Does not help (much) for memory bandwidth
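As a concrete illustration (a sketch, not from the slides), SSE intrinsics in C++ expose such vector registers directly; one _mm_add_ps instruction performs four float additions at once. The sketch assumes n is a multiple of 4 and the arrays are 16-byte aligned.

    #include <xmmintrin.h>
    #include <cstdio>

    // One SSE instruction adds four packed floats at a time.
    void add4(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);           // load 4 contiguous floats
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));  // 4 additions in 1 instruction
        }
    }

    int main() {
        alignas(16) float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
        add4(a, b, c, 8);
        std::printf("%g %g\n", c[0], c[7]);           // prints 9 9
        return 0;
    }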
The Memory Wall
• Performance gap CPU – memory
• Memory hierarchy
• Increasing cache sizes shows diminishing returns
  - Costs power and chip area
    - GPUs spend the area instead on many simple cores with little memory
  - Relies on good data locality in the application
• What if there is no / little data locality?
  - Irregular applications, e.g. sorting, searching, optimization...
• Solution: spread out / overlap memory access delay
  - Programmer/compiler: prefetching, on-chip pipelining, SW-managed on-chip buffers
  - Generally: hardware multithreading, again!

Moore's Law
• Exponential increase in transistor density (since 1965)
(Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović)
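To see why data locality matters (an illustrative sketch, not from the slides): both functions below compute the same sum over an N x N row-major array, but the second traverses it with stride N and therefore misses the cache far more often.

    #include <vector>

    double sum_row_major(const std::vector<double>& a, int N) {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                s += a[i * N + j];     // unit stride: good locality
        return s;
    }

    double sum_col_major(const std::vector<double>& a, int N) {
        double s = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                s += a[i * N + j];     // stride N: poor locality
        return s;
    }

    int main() {
        int N = 2048;
        std::vector<double> a(N * N, 1.0);
        double s1 = sum_row_major(a, N), s2 = sum_col_major(a, N);
        return (s1 == s2) ? 0 : 1;     // same result, very different run time
    }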
Moore's Law vs. Clock Frequency
• #Transistors / mm² still growing exponentially according to Moore's Law
• Clock speed flattening out at ~3 GHz (since ca. 2003)

The Power Issue
• Power = static (leakage) power + dynamic (switching) power
• Dynamic power ~ Voltage² × Clock frequency, where clock frequency is approx. ~ voltage
  → Dynamic power ~ Frequency³
• Total power ~ #processors
• More transistors + limited frequency ⇒ more cores
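Spelled out (a sketch of the standard argument, ignoring constant factors and static power):

    P_{\mathrm{dyn}} \propto V^2 \cdot f, \qquad f \propto V
    \quad\Rightarrow\quad P_{\mathrm{dyn}} \propto f^3

So halving the clock frequency cuts dynamic power to roughly 1/8; two cores at half frequency can deliver about the original aggregate throughput (if the work parallelizes) for about a quarter of the dynamic power. This is the arithmetic behind "more transistors + limited frequency ⇒ more cores".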
Conclusion: Moore's Law Continues, But ...

Single-Processor Performance Scaling
(Figure: log2 speedup of single-processor performance over time, broken down into contributions from RISC/CISC CPI, pipelining, device speed and parallelism; limits: RISC ILP and clock rate; throughput increased ca. 55%/year historically, an assumed increase of ca. 17%/year still possible; process nodes 90 nm, 65 nm, 45 nm, 32 nm, 22 nm. Source: Doug Burger, UT Austin 2005; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.)
Solution for CPU Design: Multicore + Multithreading
• Single-thread performance does not improve any more
  - ILP wall
  - Memory wall
  - Power wall
• ... but we can put more cores on a chip
  - And hardware-multithread the cores to hide memory latency
  - All major chip manufacturers produce multicore CPUs today

Main features of a multicore system
• There are multiple computational cores on the same chip.
• The cores might have (small) private on-chip memory modules and/or access to on-chip memory shared by several cores.
• The cores have access to a common off-chip main memory.
• There is a way by which these cores communicate with each other and/or with the environment.
Standard CPU Multicore Designs
• Standard desktop/server CPUs have a few cores with shared off-chip main memory
  - On-chip cache (typically 2 levels)
    - L1 cache: mostly core-private
    - L2 cache: often shared by groups of cores
  - Memory access interface shared by all or groups of cores
• Caching → multiple copies of the same data item
• Writing to one copy (only) causes inconsistency
• Shared memory coherence mechanism to enforce automatic updating or invalidation of all copies around (see the false-sharing sketch below)
→ More about shared-memory architecture, caches, data locality, consistency issues and coherence protocols in TDDC78/TDDD56
(Figure: cores with private L1$, an L2$, an interconnect / memory interface, and off-chip main memory (DRAM))

Some early dual-core CPUs (2004/2005)
(Figures: IBM Power5 (2004) with two SMT cores, AMD Opteron dual-core (2005), Intel Xeon dual-core (2005); each shows two cores P0/P1 with per-core L1 instruction and data caches, L2 cache(s), memory controller(s) and main memory.)
Legend: $ = "cache", L1$ = "level-1 instruction cache", D1$ = "level-1 data cache", L2$ = "level-2 cache" (unified)
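As an aside (an illustrative sketch, not from the slides), the coherence mechanism also has performance consequences that programmers can observe: if two threads repeatedly write two variables that happen to share a cache line, the line ping-pongs between the cores ("false sharing"). Padding the variables onto separate lines, as below, avoids it; removing the alignas specifiers typically makes the program noticeably slower.

    #include <thread>

    struct Counters {
        alignas(64) long a = 0;   // 64 = assumed cache-line size; each counter
        alignas(64) long b = 0;   // now lives in its own line -> no false sharing
    };

    int main() {
        Counters c;
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b++; });
        t1.join();
        t2.join();
        return 0;
    }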
SUN/Oracle SPARC-T Niagara (8 cores)
• Niagara T1 (2005): 8 cores, 32 HW threads
• Niagara T2 (2008): 8 cores, 64 HW threads
• Niagara T3 (2010): 16 cores, 128 HW threads
• T5 (2012): 16 cores, 128 HW threads
(Figure: Niagara T1 with 8 cores P0 … P7, each with private L1 instruction and data caches, a shared L2$, four memory controllers and four main memory banks)

SUN / Oracle SPARC-T5 (2012)
• 28 nm process, 16 cores x 8 HW threads, L3 cache on-chip
• On-die accelerators for common encryption algorithms
Scaling Up: Network-On-Chip
• Cache-coherent shared memory (hardware-controlled) does not scale well to many cores
  - power- and area-hungry
  - signal latency across whole chip
  - not well predictable access times
• Idea: NCC-NUMA (non-cache-coherent, non-uniform memory access)
  - physically distributed on-chip [cache] memory,
  - on-chip network connecting PEs or coherent "tiles" of PEs,
  - global shared address space,
  - but software responsible for maintaining coherence
• Examples: STI Cell/B.E., Tilera TILE64, Intel SCC, Kalray MPPA

Example: Cell/B.E. (IBM/Sony/Toshiba 2006)
• An on-chip network (four parallel unidirectional rings) interconnects the master core, the slave cores and the main memory interface
• LS = local on-chip memory, PPE = master, SPE = slave
Towards Many-Core CPUs...
• Tilera TILE64 (2007): 64 cores, 8x8 2D-mesh on-chip network
  - 1 tile: VLIW processor P + cache C + router R; memory controllers and I/O at the edges of the mesh (image simplified)

Towards Many-Core Architectures
• Intel second-generation many-core research processor: 48-core (in-order x86) SCC, "Single-chip Cloud Computer", 2010
  - No longer fully cache coherent over the entire chip
  - MPI-like message passing over 2D mesh network on chip
  (Source: Intel)
Kalray MPPA-256
• 16 tiles with 16 VLIW compute cores each, plus 1 control core per tile
• Message passing network on chip
• Virtually unlimited array extension by clustering several chips
• 28 nm CMOS technology
• Low power dissipation, typically 5 W

Intel Xeon Phi (since late 2012)
• Up to 61 cores, 244 HW threads, 1.2 Tflops peak performance
• Simpler x86 (Pentium) cores (x 4 HW threads), with 512-bit wide SIMD vector registers
• Can also be used as a coprocessor, instead of a GPU
"General-purpose" GPUs
• Main GPU providers for laptop/desktop: Nvidia, AMD (ATI), Intel
• Example: NVIDIA's 10-series GPU (Tesla, 2008) has 240 cores
  - Each core has a floating-point / integer unit, a logic unit, a move/compare unit and a branch unit
  - Cores managed by thread manager
    - Thread manager can spawn and manage 30,000+ threads
    - Zero-overhead thread switching
• Nvidia Tesla C1060: 933 GFlops
(Images removed)

Nvidia Fermi (2010): 512 cores
(Figure: one Fermi C2050 GPU consists of several "shared-memory multiprocessors" (SMs) and an L2 cache; each SM contains an I-cache, scheduler and dispatch units, a register file, 32 streaming processors (cores), load/store units, special function units, and 64K configurable L1 cache / shared memory; one streaming processor (SP) contains an FPU and an integer unit.)
GPU Architecture Paradigm
• Optimized for high throughput
  - In theory, ~10x to ~100x higher throughput than a CPU is possible
• Massive hardware multithreading hides memory access latency
• Massive parallelism
• GPUs are good at data-parallel computations
  - multiple threads executing the same instruction on different data, preferably located adjacently in memory

The future will be heterogeneous!
Need 2 kinds of cores, often on the same chip:
• For non-parallelizable code:
  Parallelism only from running several serial applications simultaneously on different cores (e.g. on a desktop: word processor, email, virus scanner, … not much more; or on cloud servers)
  → Few (ca. 4-8) "fat" cores (power-hungry, area-costly, caches, out-of-order issue, …) for high single-thread performance
• For well-parallelizable code:
  → run on hundreds of simple cores (power- and area-efficient, GPU-/SCC-like)
Heterogeneous / Hybrid Multi-/Manycore
Key concept: master-slave parallelism, offloading
• A general-purpose CPU (master) processor controls the execution of slave processors by submitting tasks to them and transferring operand data to the slaves' local memory
  → the master offloads computation to the slaves
• Slaves often optimized for heavy throughput computing
  - The master could do something else while waiting for the result, or switch to a power-saving mode
• Master and slave cores might reside on the same chip (e.g., Cell/B.E.) or on different chips (e.g., most GPU-based systems today)
• Slaves might have access to off-chip main memory (e.g., Cell) or not (e.g., today's GPUs)

Heterogeneous / Hybrid Multi-/Manycore Systems
• Cell/B.E.
• GPU-based system
(Figure: the CPU offloads heavy computation to the GPU; operand data are transferred between main memory and device memory)
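One common way to express such offloading from C++ is OpenMP target directives; the sketch below (not from the slides, with a hypothetical helper name) maps the operand data to the device, runs the data-parallel loop there, and copies the result back. It needs a compiler with OpenMP offload support; without one, the loop simply runs on the host.

    #include <vector>

    void scale_offload(std::vector<float>& x, float a) {
        float* p = x.data();
        long n = (long)x.size();
        // Map the array to the device (slave), run the loop there, map it back.
        #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
        for (long i = 0; i < n; ++i)
            p[i] *= a;
    }

    int main() {
        std::vector<float> v(1 << 20, 1.0f);
        scale_offload(v, 2.0f);      // master submits the task, slave computes
        return (v[0] == 2.0f) ? 0 : 1;
    }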
Multi-GPU Systems
• Connect one or a few general-purpose (CPU) multicore processors, with shared off-chip memory, to several GPUs
• Increasingly popular in high-performance computing
  - Cost- and (quite) energy-effective if the offloaded computation fits the GPU architecture well
(Figure: multicore CPUs with L2 caches and shared main memory (DRAM), connected to several GPUs)

Reconfigurable Computing Units
• FPGA (Field Programmable Gate Array)
Example: Beowulf-class PC Clusters
• Capability cluster (fast network for parallel applications) with off-the-shelf CPUs (Xeon, Opteron, …)

Cluster Example: Triolith (NSC, 2012 / 2013)
• Final configuration: 1200 HP SL230 compute nodes, each equipped with 2 Intel E5-2660 (2.2 GHz Sandy Bridge) processors with 8 cores each
• 19200 cores in total
• Theoretical peak performance of 338 Tflops
• Mellanox Infiniband network
(Image: NSC Triolith; source: NSC)
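On a cluster such as Triolith the nodes do not share memory, so cooperation is expressed with message passing, typically MPI (covered in TDDC78). A minimal sketch (not from the slides): each rank computes a partial sum and MPI_Reduce combines the results on rank 0.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local = 0;
        for (long i = rank; i < 1000000; i += size)   // cyclic partition of the work
            local += i;

        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("sum = %ld\n", total);
        MPI_Finalize();
        return 0;
    }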
The Challenge
• Today, basically all computers are parallel computers!
  - Single-thread performance stagnating
  - Dozens, hundreds of cores
  - Hundreds, thousands of hardware threads
  - Heterogeneous (core types, accelerators)
  - Data locality matters
  - Clusters for HPC require message passing
• Utilizing more than one CPU core requires thread-level parallelism
• One of the biggest software challenges: exploiting parallelism
  - Every programmer will eventually have to deal with it
  - All application areas, not only traditional HPC
    - General-purpose, graphics, games, embedded, DSP, …
  - Affects HW/SW system architecture, programming languages, algorithms, data structures …
  - Parallel programming is more error-prone (deadlocks, races, further sources of inefficiencies; see the race sketch below)
    - and thus more expensive and time-consuming

Can't the compiler fix it for us?
• Automatic parallelization?
  - At compile time:
    - Requires static analysis, which is not effective for pointer-based languages
    - Needs programmer hints / rewriting ...
    - OK for a few benign special cases: (Fortran) loop SIMDization, extraction of instruction-level parallelism, …
  - At run time (e.g. speculative multithreading):
    - High overheads, not scalable
• More about parallelizing compilers in TDDD56 + TDDC78
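To make the "races" point above concrete, a minimal sketch (not from the slides): the two threads race on the plain counter, so updates can be lost (formally this is undefined behavior, shown here deliberately), while the std::atomic counter always ends up at the expected value; a mutex would work as well.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    int main() {
        long unsafe = 0;                 // plain variable: concurrent ++ is a data race
        std::atomic<long> safe{0};       // atomic variable: increments are not lost
        auto work = [&] {
            for (int i = 0; i < 100000; ++i) { unsafe++; safe++; }
        };
        std::thread t1(work), t2(work);
        t1.join(); t2.join();
        std::printf("unsafe=%ld safe=%ld\n", unsafe, safe.load());  // safe == 200000
        return 0;
    }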
The Challenge (cont.)
• Bad news 1: Many programmers (also less skilled ones) will need to use parallel programming in the future
• Bad news 2: There will be no single uniform parallel programming model as we were used to in the old sequential times
  → Several competing general-purpose and domain-specific languages and their concepts will co-exist
• How to write future-proof parallel programs?

And worse yet, ...
• A lot of variations/choices in hardware
  - Many will have performance implications
  - No standard parallel programming model
    - portability issue
• Understanding the hardware will make it easier to make programs get high performance
  - Performance-aware programming gets more important, also for single-threaded code
  - Adaptation leads to the portability issue again
What we learned in the past…
• Sequential von Neumann model: programming, algorithms, data structures, complexity
• Sequential / few-threaded languages: C/C++, Java, ...
  - not designed for exploiting massive parallelism
• Analysis / cost model: sequential time as a function of problem size n, e.g. T(n) = O(n log n)

… and what we need now
• Parallel programming!
• Parallel algorithms and data structures
• Analysis / cost model: parallel time, work, cost; scalability
  - e.g. T(n,p) = O((n log n)/p + log p), where n is the problem size and p the number of processing units used
• Performance-awareness: data locality, load balancing, communication
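As a rough illustration (not on the slide) of what such a cost model gives, the speedup implied by the two formulas above is

    S(n,p) = \frac{T(n)}{T(n,p)} = \frac{n \log n}{(n \log n)/p + \log p}

which stays close to p as long as (n log n)/p is much larger than log p: for n = 10^6 and p = 64, (n log n)/p is about 3·10^5 while log p = 6, so near-linear speedup. For fixed n, however, the speedup levels off as p keeps growing, which is exactly what the scalability analysis is meant to expose.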
Questions?