Multicores, Multiprocessors, and Clusters

Hiroaki Kobayashi
7/18/2012
Contents
- Introduction
- The Difficulty of Creating Parallel Processing Programs
- Shared Memory Multiprocessors
- Clusters and Other Message-Passing Multiprocessors
- Cache Coherence on Shared-Memory Multiprocessors
  - Snooping protocol for cache coherence on shared-memory multiprocessors
  - Directory-based protocol for cache coherence on message-passing multiprocessors
- SISD, MIMD, SIMD, SPMD, and Vector
- Introduction to Multiprocessor Network Topologies
- Multiprocessor Benchmarks
- Summary
Introduction (1/2)
- The goal of multiprocessors is to create powerful computers simply by connecting many existing smaller ones.
  - Replace large, inefficient processors with many smaller, efficient processors
  - Scalability, availability, power efficiency
- Job-level (or process-level) parallelism
  - High throughput for independent jobs
- Parallel processing program
  - A single program runs on multiple processors
  - Short turnaround time
Introduction (2/2)
- A cluster is composed of microprocessors housed in many independent servers or PCs
  - Now very popular for search engines, Web servers, email servers, and database servers in data centers
  [Figure: cluster system at the Cyberscience Center]
- Multicore/manycore microprocessors
  - Chips with multiple processors (cores)
  - Seeking higher clock rates and improved CPI on a single processor has become difficult because of the power problem
  - The number of cores is expected to double every two years
  [Figure: Intel Nehalem-EP processor with 4 cores]
Programmers who care about performance must become parallel programmers!
Concurrency vs. Parallelism
Hardware/Software Categorization
- Sequential software, serial hardware: Matrix Multiply written in C, compiled for and running on an Intel Pentium 4
- Sequential software, parallel hardware: Matrix Multiply written in C, compiled for and running on an Intel Xeon E5345 (Clovertown)
- Concurrent software, serial hardware: Windows Vista operating system running on an Intel Pentium 4
- Concurrent software, parallel hardware: Windows Vista operating system running on an Intel Xeon E5345 (Clovertown)
Challenge: making effective use of parallel hardware for sequential or concurrent software
Difficulty of Creating Parallel Processing Programs
- Parallel software is the problem
- Need to get a significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
  - Uniprocessor design techniques such as superscalar and out-of-order execution take advantage of ILP, normally without the involvement of the programmer
- Difficulties in the design of parallel programs
  - Partitioning (as equally as possible)
  - Coordination/scheduling for load balancing
  - Synchronization and communication overheads
The difficulties get worse as the number of processors increases.
Speedup Challenge
Amdahl's Law:
- The sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- The sequential part must be only 0.1% of the original time
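A minimal C sketch of this calculation (not from the original slides; the function names are illustrative): it solves Amdahl's Law for the parallel fraction needed to reach a target speedup, reproducing the 0.999 figure above.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/p) for parallel fraction f and p processors. */
    static double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    /* Solve for the parallel fraction f needed to reach a target speedup s on p processors. */
    static double required_fraction(double s, int p) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / p);
    }

    int main(void) {
        int p = 100;
        double s = 90.0;
        double f = required_fraction(s, p);
        printf("Fraction that must be parallelizable: %.4f\n", f);        /* ~0.999 */
        printf("Check: speedup with that fraction on %d CPUs: %.1f\n", p, speedup(f, p));
        return 0;
    }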
Scaling Example (1/2)
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speedup from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes the load can be balanced across processors
Scaling Example (2/2)
- What if the matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assumes the load is balanced
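Both scaling examples can be reproduced with a short C sketch (illustrative only; it models time in units of t_add as 10 sequential adds plus the n × n matrix adds divided over p processors):

    #include <stdio.h>

    /* Time in units of t_add: 10 scalar adds stay sequential, n*n matrix adds split over p CPUs. */
    static double time_units(int n, int p) {
        return 10.0 + (double)(n * n) / p;
    }

    int main(void) {
        int sizes[] = {10, 100};
        int procs[] = {1, 10, 100};
        for (int i = 0; i < 2; i++) {
            int n = sizes[i];
            double t1 = time_units(n, 1);
            for (int j = 0; j < 3; j++) {
                int p = procs[j];
                double sp = t1 / time_units(n, p);
                printf("n=%3d, p=%3d: time=%7.1f t_add, speedup=%5.1f (%.0f%% of p)\n",
                       n, p, time_units(n, p), sp, 100.0 * sp / p);
            }
        }
        return 0;
    }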
Strong vs Weak Scaling
Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
- Strong scaling: problem size fixed
  - As in the example above
- Weak scaling: problem size proportional to the number of processors
  - 10 processors, 10 × 10 matrix: Time = 20 × t_add
  - 100 processors, 32 × 32 matrix: Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example
Speed-up Challenge: Balancing Load
- In the previous example, each processor has 1% of the work to achieve a speedup of 91 on the larger problem with 100 processors.
- If one processor's load is higher than all the rest, show the impact on speedup. Calculate for 2% and 5%.
  - If one processor has 2% of the 10,000 additions, the other 99 share the remaining 9,800:
    - Execution time = Max(9800t/99, 200t/1) + 10t = 210t
    - Speedup drops to 10,010t/210t = 48
  - If one processor has 5% of the 10,000 additions, the other 99 share the remaining 9,500:
    - Execution time = Max(9500t/99, 500t/1) + 10t = 510t
    - Speedup drops to 10,010t/510t = 20
Shared Memory Multiprocessors
- SMP: shared memory multiprocessor
  - Hardware provides a single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)
[Figure: uniform memory access (UMA) multiprocessor]
Non-Uniform Memory Access Multiprocessor
[Figure: NUMA multiprocessor with a single address space]
- Lower latency to nearby memory
- Higher scalability
- More effort needed for efficient parallel programming
Synchronization and Lock on SMP
- Synchronization: the process of coordinating the behavior of two or more processes, which may be running on different processors.
  - As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data.
- Lock: a synchronization device that allows access to data to only one processor at a time.
  - Other processors interested in the shared data must wait until the original processor unlocks the variable.
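As a concrete illustration, a minimal POSIX-threads sketch (not from the slides) in which a mutex acts as the lock protecting a shared counter:

    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* only one thread may update at a time */
            shared_counter++;
            pthread_mutex_unlock(&lock);   /* threads waiting on the lock may proceed */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", shared_counter);   /* 400000, thanks to the lock */
        return 0;
    }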
Example: Sum Reduction (1/2)
- Sum 100,000 numbers on a 100-processor UMA machine
  - Each processor has an ID: 0 ≤ Pn ≤ 99
  - Partition: 1,000 numbers per processor
  - Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half of the processors add pairs, then a quarter, ...
  - Need to synchronize between reduction steps
Example: Sum Reduction (2/2)
    half = 100;
    repeat
        synch();
        if (half%2 != 0 && Pn == 0)
            sum[0] = sum[0] + sum[half-1];
            /* Conditional sum needed when half is odd;
               Processor 0 gets the missing element */
        half = half/2;   /* dividing line on who sums */
        if (Pn < half)
            sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
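A runnable shared-memory version of this reduction, sketched with POSIX threads (assumptions: a pthread barrier stands in for synch(), and the sizes are scaled down to 4 threads × 1000 elements):

    #include <pthread.h>
    #include <stdio.h>

    #define P 4            /* number of "processors" (threads); the slide uses 100 */
    #define N 1000         /* elements per thread */

    static double A[P * N];
    static double sum[P];
    static pthread_barrier_t barrier;   /* plays the role of synch() */

    static void *reduce(void *arg) {
        int Pn = (int)(long)arg;
        sum[Pn] = 0;
        for (int i = N * Pn; i < N * (Pn + 1); i++)   /* local partial sum */
            sum[Pn] += A[i];
        int half = P;
        do {
            pthread_barrier_wait(&barrier);           /* wait for the previous step */
            if (half % 2 != 0 && Pn == 0)             /* odd count: P0 picks up the extra */
                sum[0] += sum[half - 1];
            half /= 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (int i = 0; i < P * N; i++) A[i] = 1.0;
        pthread_barrier_init(&barrier, NULL, P);
        for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, reduce, (void *)i);
        for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
        printf("total = %.0f\n", sum[0]);             /* expect 4000 */
        return 0;
    }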
Message-Passing Multiprocessors
- Each processor has a private physical address space
- Hardware sends/receives messages between processors
- Multicomputers, clusters, networks of workstations (NOW)
- Suited for job-level parallelism and applications with little communication
  - Web search, mail servers, file servers, ...
Loosely-Coupled Clusters
- Network of independent computers
  - Each has private memory and its own OS
  - Connected using the I/O system, e.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - cf. processor/memory bandwidth on an SMP
Sum Reduction on Message-Passing Multiprocessor (1/2)
- Sum 100,000 numbers on 100 processors
- First distribute 1,000 numbers to each processor
  - Then compute partial sums:

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + AN[i];

- Reduction
  - Half of the processors send, the other half receive and add
  - Then a quarter send, a quarter receive and add, ...
Sum Reduction on Message-Passing Multiprocessor (2/2)
- Given send() and receive() operations:

    limit = 100; half = 100;  /* 100 processors */
    repeat
        half = (half+1)/2;    /* send vs. receive dividing line */
        if (Pn >= half && Pn < limit)
            send(Pn - half, sum);
        if (Pn < (limit/2))
            sum = sum + receive();
        limit = half;         /* upper limit of senders */
    until (half == 1);        /* exit with final sum */

  - send/receive also provide synchronization
  - Assumes send/receive take similar time to addition
Grid Computing
- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home
  - Also called volunteer computing
  - 5+ million computers in 200+ countries delivered 257 TFLOP/s for the search for extraterrestrial intelligence in 2006
  - http://www.planetary.or.jp/setiathome/home_japanese.html
Parallelism and Memory Hierarchies: Cache Coherence
- Suppose two CPU cores share a physical address space
  - Write-through caches

    Time step | Event               | CPU A's cache | CPU B's cache | Memory
    0         |                     |               |               | 0
    1         | CPU A reads X       | 0             |               | 0
    2         | CPU B reads X       | 0             | 0             | 0
    3         | CPU A writes 1 to X | 1             | 0             | 1

Inconsistency: after step 3, CPU B's cache still holds the old value 0.
Cache Coherence Protocol
- Operations performed by caches in multiprocessors to ensure coherence
  - Migration of data to local caches
    - Reduces bandwidth demand on shared memory
  - Replication of read-shared data
    - Reduces contention for access
- Snooping protocols
  - Each cache monitors bus reads/writes
- Directory-based protocols
  - Caches and memory record the sharing status of blocks in a directory
Snoop-based and Directory-based Systems
[Figure: snoop-based system (broadcast bus) vs. directory-based system]
Invalidating Snooping Protocols
- A cache gets exclusive access to a block when it is to be written
  - Broadcasts an invalidate message on the bus
  - A subsequent read in another cache misses
    - The owning cache supplies the updated value
  - If two processors attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated.
    - This enforces write serialization.

    CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory
                        |                  |               |               | 0
    CPU A reads X       | Cache miss for X | 0             |               | 0
    CPU B reads X       | Cache miss for X | 0             | 0             | 0
    CPU A writes 1 to X | Invalidate for X | 1             |               | 0
    CPU B reads X       | Cache miss for X | 1             | 1             | 1
Coherence Misses
- Coherence misses: a new class of cache misses due to sharing data on a multiprocessor
- True sharing
  - A write to shared data by one processor invalidates the entire block containing that data in other processors' caches; a miss then occurs when another processor accesses a modified word in that block.
- False sharing
  - The same invalidation of the entire block occurs, but the miss happens even when another processor accesses a different word that merely lies in the same block.
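A hedged C sketch of false sharing (the layout and loop counts are illustrative): two threads update different words that happen to sit in the same cache block, so each write invalidates the other core's copy even though no value is truly shared.

    #include <pthread.h>
    #include <stdio.h>

    /* Both counters fit in the same 64-byte cache block, so a write to one
       invalidates the other core's cached copy of the whole block. */
    struct { long a; long b; } counters = {0, 0};

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) counters.a++;   /* invalidates the block holding b */
        return NULL;
    }
    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) counters.b++;   /* invalidates the block holding a */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", counters.a, counters.b);
        /* Padding each counter to its own cache block removes the coherence
           misses and typically makes this run noticeably faster. */
        return 0;
    }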
Contributing Causes of Memory Cycles on an SMP
Classifications Based on Instruction and Data Streams
- Flynn's Classification (1966)

                                  Single data stream        Multiple data streams
    Single instruction stream     SISD: Intel Pentium 4     SIMD: SSE instructions of x86
    Multiple instruction streams  MISD: no examples today   MIMD: Intel Xeon e5345

- SPMD: Single Program, Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors
SIMD
- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with a different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications
- Weakest for case or switch statements
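A hedged example of SIMD in C using the x86 SSE intrinsics mentioned above (four single-precision adds per instruction; alignment and loop-tail handling are omitted for brevity):

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    /* c[i] = a[i] + b[i], four floats at a time; n is assumed to be a multiple of 4. */
    static void add_sse(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 elements of a */
            __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 elements of b */
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one SIMD add, 4 results */
        }
    }

    int main(void) {
        float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
        add_sse(a, b, c, 8);
        for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);   /* prints 9 eight times */
        printf("\n");
        return 0;
    }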
Vector Processors
- Highly pipelined function units
- Stream data from/to vector registers to the units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of a vector of double
- Significantly reduces instruction-fetch bandwidth
Characteristics of Vector Architecture
- Vector registers
  - Each vector register is a fixed-length bank holding a single vector and must have at least two read ports and one write port.
- Vector functional units
  - Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards, both from conflicts for the functional unit and from conflicts for register accesses.
- Vector load-store unit
  - This is a vector memory unit that loads or stores a vector to or from memory. Vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of 1 word per clock cycle, after an initial latency.
[Figure: pipelined FP add unit, reading from VR1 and VR2 and writing to VR3]
Example: DAXPY (Y = a × X + Y)
- Conventional MIPS code:

          l.d    $f0,a($sp)       ;load scalar a
          addiu  r4,$s0,#512      ;upper bound of what to load
    loop: l.d    $f2,0($s0)       ;load x(i)
          mul.d  $f2,$f2,$f0      ;a × x(i)
          l.d    $f4,0($s1)       ;load y(i)
          add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
          s.d    $f4,0($s1)       ;store into y(i)
          addiu  $s0,$s0,#8       ;increment index to x
          addiu  $s1,$s1,#8       ;increment index to y
          subu   $t0,r4,$s0       ;compute bound
          bne    $t0,$zero,loop   ;check if done

- Vector MIPS code:

          l.d     $f0,a($sp)      ;load scalar a
          lv      $v1,0($s0)      ;load vector x
          mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
          lv      $v3,0($s1)      ;load vector y
          addv.d  $v4,$v2,$v3     ;add y to product
          sv      $v4,0($s1)      ;store the result
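For reference, the same DAXPY computation written as a plain C loop (64 elements to match one vector register; illustrative only):

    /* y[i] = a * x[i] + y[i] for 64 double-precision elements. */
    void daxpy(double a, const double *x, double *y) {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }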
Vector vs. Scalar
- Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of the absence of loop-carried dependences
    - Reduced checking in hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
- More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology
History of GPU
- Early video cards (VGA controllers)
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally on high-end computers (e.g., SGI)
  - Moore's Law ⇒ lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization
Graphics in the System
[Figure: system block diagram; 16 GB/s link shown]
Graphics Rendering Pipeline
GPU Architecture
- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)
Trend in Performance between CPU and GPU
Example: NVIDIA Tesla
[Figure: Tesla GPU with 16 streaming multiprocessors, each containing 8 streaming processors]
Peak performance: 345.6 GFLOP/s = 16 multiprocessors × 8 SPs × 2 FLOPs × 1.35 × 10^9 cycles/s
Example: NVIDIA Tesla
- Streaming processors (SPs)
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - Executed in parallel, SIMD style
    - 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, ...
Classifying GPUs
- GPUs don't fit nicely into the SIMD/MIMD model
  - Conditional execution in a thread allows an illusion of MIMD
    - But with performance degradation
    - Need to write general-purpose code with care

                                      Static: Discovered      Dynamic: Discovered
                                      at Compile Time         at Runtime
    Instruction-Level Parallelism     VLIW                    Superscalar
    Data-Level Parallelism            SIMD or Vector          Tesla Multiprocessor
Interconnection Networks
- Network topologies
  - Arrangements of processors, switches, and links
[Figure: example topologies: bus, ring, 2D mesh, N-cube (N = 3), fully connected]
Multi-Stage Networks
Network Characteristics
- Performance
  - Latency per message (unloaded network)
  - Throughput
    - Link bandwidth: peak data transfer rate per link
    - Total network bandwidth: collective data transfer rate of all links in the network
    - Bisection bandwidth: the bandwidth between two equal parts of a multiprocessor; this measure is for a worst-case split of the multiprocessor
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon
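A small C sketch comparing these measures for two of the topologies above (the ring and fully-connected formulas are the standard results, stated here as an assumption rather than taken from the slide):

    #include <stdio.h>

    /* For P nodes and per-link bandwidth B:
       Ring:            P links in total; a bisection cut crosses 2 links.
       Fully connected: P*(P-1)/2 links; a bisection cut crosses (P/2)*(P/2) links. */
    int main(void) {
        int P = 64;
        double B = 1.0;   /* normalize per-link bandwidth to 1 */
        printf("Ring:            total = %6.0f, bisection = %6.0f\n", P * B, 2 * B);
        printf("Fully connected: total = %6.0f, bisection = %6.0f\n",
               P * (P - 1) / 2 * B, (P / 2) * (P / 2) * B);
        return 0;
    }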
Parallel Benchmarks (1/2)
- Linpack: matrix linear algebra
  - DAXPY routine
  - Weak scaling
  - www.top500.org
- SPECrate: parallel runs of SPEC CPU programs
  - Job-level parallelism (no communication between them)
  - Throughput metric
  - Weak scaling
- SPLASH: Stanford Parallel Applications for Shared Memory
  - Mix of kernels and applications, strong scaling
- NAS (NASA Advanced Supercomputing) suite
  - Computational fluid dynamics kernels
  - Weak scaling
- PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  - Multithreaded applications using Pthreads and OpenMP
  - Weak scaling
Concluding Remarks
- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- Many reasons for optimism
  - Changing software and application environment
  - Chip-level multiprocessors with lower-latency, higher-bandwidth interconnect
- An ongoing challenge for computer architects!