Chapter 7 Introduction

advertisement
Chapter 7
Multicores,
p
, and
Multiprocessors,
Clusters
„
Goal: connecting multiple computers
to get higher
g e performance
pe o a ce
„
„
„
High
g throughput
oug pu for
o independent
depe de jobs
Parallel processing program
„
„
Multiprocessors
Scalability,
y, availability,
y, power
p
efficiencyy
Job-level (process-level) parallelism
„
„
§9.1 Intro
oduction
Introduction
Single program run on multiple processors
Multicore microprocessors
„
Chips with multiple processors (cores)
Chapter 7 — Multicores, Multiprocessors, and Clusters — 2
Hardware and Software
„
Hardware
„
„
„
Software
„
„
„
Serial: e
e.g.,
g Pentium 4
Parallel: e.g., quad-core Xeon e5345
Sequential: e.g., matrix multiplication
Concurrent: e.g., operating system,
multithreaded matrix multiplication
Sequential/concurrent software can run on
serial/parallel hardware
„
Challenge:
g making
g effective use of p
parallel
hardware
Chapter 7 — Multicores, Multiprocessors, and Clusters — 3
„
„
Parallel software is the problem
Need to get significant performance
improvement
„
„
Otherwise,
Oth
i
just
j t use a ffaster
t uniprocessor,
i
since it’s easier!
Diffi lti
Difficulties
„
„
„
Partitioning
Coordination
Communications overhead
§7.2 The Difficulty of Creatin
ng Parallel Processin
ng Programs
Parallel Programming
Chapter 7 — Multicores, Multiprocessors, and Clusters — 4
Amdahl’s Law
„
„
„
Sequential part can limit speedup
Example: 100 processors
processors, 90× speedup?
„
Tnew = Tparallelizable/100 + Tsequential
„
Speedup
„
Solving: Fparallelizable = 0.999
1
(1 Fpparallelizable ) Fpparallelizable /100
90
Need sequential part to be 0
0.1%
1% of original
time
Chapter 7 — Multicores, Multiprocessors, and Clusters — 5
Scaling Example
„
Workload: sum of 10 scalars, and 10 × 10 matrix
sum
„
„
„
Single processor: Time = (10 + 100) × tadd
10 processors
„
„
„
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
Speedup = 110/20 = 5.5 (55% of potential)
100 processors
„
„
„
S
Speed
d up ffrom 10 tto 100 processors
Time = 10 × tadd + 100/100 × tadd = 11 × tadd
Speedup = 110/11
0/ = 10
0 ((10%
0% o
of po
potential)
e a)
Assumes load can be balanced across
processors
Chapter 7 — Multicores, Multiprocessors, and Clusters — 6
Scaling Example (cont)
„
„
„
What if matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × tadd
dd
10 processors
„
„
„
100 processors
„
„
„
Time = 10 × tadd
dd + 10000/10 × tadd
dd = 1010 × tadd
dd
Speedup = 10010/1010 = 9.9 (99% of potential)
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
Chapter 7 — Multicores, Multiprocessors, and Clusters — 7
Strong vs Weak Scaling
„
Strong scaling: problem size fixed
„
„
As in example
Weak scaling: problem size proportional to
n mber of processors
number
„
10 processors, 10 × 10 matrix
„
„
100 processors, 32 × 32 matrix
„
„
Ti
Time
= 20 × tadd
Ti
Time
= 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
Chapter 7 — Multicores, Multiprocessors, and Clusters — 8
„
SMP: shared memory multiprocessor
„
„
„
„
Hardware provides single physical
address space for all processors
Synchronize shared variables using locks
Cache coherence issue
Memory access time
„
UMA (uniform) vs
vs. NUMA (nonuniform)
§7.3 Sha
ared Memo
ory Multiprrocessors
Shared Memory
Chapter 7 — Multicores, Multiprocessors, and Clusters — 9
„
Multi-cores processors:
Core
Core
Cache
Cache
§7.3 Sha
ared Memo
ory Multiprrocessors
Shared Memory
Chapter 7 — Multicores, Multiprocessors, and Clusters — 10
„
„
Each processor has private physical
address
add
ess space
Hardware sends/receives messages
p
between processors
§7.4 Clussters and O
Other Mes
ssage-Pas
ssing Multiprocessors
Message Passing
Chapter 7 — Multicores, Multiprocessors, and Clusters — 11
Loosely Coupled Clusters
„
Network of independent computers
„
„
Each has private memory and OS
Connected using I/O system
„
„
Suitable for applications with independent tasks
„
„
„
E.g., Ethernet/switch, Internet
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
„
„
Administration cost (prefer virtual machines)
Low interconnect bandwidth
„
c f processor/memory bandwidth on an SMP
c.f.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 12
Grid Computing
„
Separate computers interconnected by
long-haul networks
„
„
„
Can make use of idle time on PCs
„
„
E.g., Internet connections
Work units farmed out,
out results sent back
E.g., SETI@home, World Community Grid
Other parallel computers
„
Supercomputers,
p
p
, vector computers,
p
, GPUs,,
cells, etc.
Chapter 7 — Multicores, Multiprocessors, and Clusters — 13
„
„
Goal: higher performance by using multiple
p
processors
Difficulties
„
„
„
Manyy reasons for optimism
„
„
„
Developing
p gp
parallel software
Devising appropriate architectures
§7.13 Co
oncluding R
Remarks
Concluding Remarks
Changing software and application environment
Chip-level multiprocessors with lower latency,
hi h b
higher
bandwidth
d idth iinterconnect
t
t
An ongoing challenge for computer architects,
software/tool/compiler developers/researchers!
Chapter 7 — Multicores, Multiprocessors, and Clusters — 14
Download