Multiprocessors: Large vs. Small Scale

Small-Scale MIMD Designs
– Memory: centralized with uniform memory access time (UMA) and bus interconnect
– Examples: SPARCcenter

Large-Scale MIMD Designs
– Memory: distributed with non-uniform memory access time (NUMA) and scalable interconnect
– Examples: Cray T3D, Intel Paragon, CM-5
Communication Models

Shared Memory
– Communication via a shared address space
– Advantages:
  • Ease of programming
  • Lower latency
  • Easier to use hardware-controlled caching

Message Passing
– Processors have private memories and communicate via explicit messages
– Advantages:
  • Less hardware, easier to design
  • Focuses attention on costly non-local operations (see the sketch below)
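A minimal message-passing sketch in C, assuming MPI (the slides do not name a particular library): each rank owns a private memory, and data moves only through explicit send/receive calls.

/* Message-passing sketch (MPI assumed; illustrative only).
 * Each rank has its own private memory; data is exchanged only
 * through explicit, visible communication calls. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;  /* exists only in rank 0's private memory */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* costly non-local operation */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);
    }

    MPI_Finalize();
    return 0;
}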

Communication Properties

Bandwidth
– Need high bandwidth in communication
– Limited by the network, the memory, and the processor

Latency
– Affects performance, since the processor must wait
– Affects ease of programming: how do we overlap communication and computation?

Latency Hiding
– How can a mechanism help hide latency?
– Examples: overlap a message send with computation, prefetch (see the sketch below)
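A latency-hiding sketch in C, assuming non-blocking MPI calls (an assumption; the slides do not prescribe a mechanism): the send is started, independent computation proceeds while the message is in flight, and completion is awaited only at the end.

/* Overlap a message send with computation (illustrative sketch). */
#include <mpi.h>

void exchange_and_compute(double *msg, int n, int dest,
                          double *local, int m) {
    MPI_Request req;

    /* Start the send, but do not wait for it to finish. */
    MPI_Isend(msg, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    /* Useful work that does not touch msg runs while the
     * message travels through the network. */
    for (int i = 0; i < m; i++)
        local[i] = local[i] * 2.0 + 1.0;

    /* Any latency not hidden by the computation is paid here. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}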
Small-Scale Shared Memory

Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce latency of access
– Valuable for both private data and shared data

What about cache consistency?
The Problem of Cache Coherency

– Value of X in memory is 1
– CPU A reads X – its cache now contains 1
– CPU B reads X – its cache now contains 1
– CPU A stores 0 into X
  • CPU A's cache contains 0
  • CPU B's cache still contains the stale value 1
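The same scenario as a hypothetical two-thread C program (not from the slides): the two threads play the roles of CPU A and CPU B on the shared variable X. On real hardware the cache-coherence protocol invalidates or updates CPU B's cached copy, so the final read returns 0; without coherence, CPU B's cache could keep returning the stale 1.

/* Coherency scenario sketch: X starts at 1, "CPU A" stores 0,
 * "CPU B" reads.  Depending on timing, CPU B's first read may see
 * 1 or 0; the read after both joins must see 0 because the hardware
 * keeps the caches coherent. */
#include <pthread.h>
#include <stdio.h>

int X = 1;                        /* value of X in memory is 1 */

void *cpu_a(void *arg) {          /* CPU A: reads X, then stores 0 */
    int seen = X;
    printf("CPU A saw X = %d, now storing 0\n", seen);
    X = 0;
    return NULL;
}

void *cpu_b(void *arg) {          /* CPU B: reads X into its cache */
    printf("CPU B saw X = %d\n", X);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, cpu_b, NULL);
    pthread_create(&a, NULL, cpu_a, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("final X = %d\n", X);  /* 0 on a coherent machine */
    return 0;
}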
Multicore Systems

Multicore Computers (chip multiprocessors)
– Combine two or more processors (cores) on a single piece of silicon
– Each core consists of an ALU, registers, pipeline hardware, and L1 instruction and data caches
– Multithreading is used
Pollack's Rule

– Performance increase is roughly proportional to the square root of the increase in complexity:
    performance ∝ √complexity
– Power consumption increase is roughly linearly proportional to the increase in complexity:
    power consumption ∝ complexity
Pollack's Rule

  complexity   power   performance
       1         1          1
       4         4          2
      25        25          5

100s of low-complexity cores, each operating at very low power

Ex: Four small cores
  complexity   power   performance
     4 x 1     4 x 1        4
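The table's arithmetic, worked out in a short C sketch (hypothetical code; it simply applies the idealized model performance = √complexity, power = complexity from the previous slide, and assumes the four small cores scale perfectly):

/* Pollack's Rule arithmetic: one big core vs. four small cores. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double big = 4.0;                    /* one core with 4x complexity   */
    double small = 1.0, n = 4.0;         /* four cores with 1x complexity */

    printf("one big core : power %.0f, performance %.1f\n",
           big, sqrt(big));              /* power 4, performance 2 */
    printf("four small   : power %.0f, performance %.1f\n",
           n * small, n * sqrt(small));  /* power 4, performance 4 */
    return 0;
}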
Increasing CPU Performance

Manycore Chip
– Composed of hybrid cores:
  • Some general purpose
  • Some graphics
  • Some floating point
Exascale Systems
– Board composed of multiple manycore chips sharing memory
– Rack composed of multiple boards
– A room full of these racks
– Millions of cores
– Exascale systems (10^18 Flop/s)
Moore's Law Reinterpreted
– Number of cores per chip doubles every 2 years
– Number of threads of execution doubles every 2 years
Shared Memory MIMD

[Figure: several processors (P) connected over a bus to a shared memory]
– Single address space
– All processes have access to the pool of shared memory
Shared Memory MIMD

[Figure: multiple control units (CU) and processing elements (PE), each fetching its own instructions and data from the shared memory]
– Each processor executes different instructions asynchronously, using different data (see the sketch below)
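A minimal shared-address-space sketch in C with POSIX threads (hypothetical code, not from the slides): the threads run asynchronously and communicate simply by reading and writing the same variable, with a mutex for synchronization.

/* Shared-memory MIMD sketch: N threads share one address space and
 * update a common counter; a mutex provides the synchronization. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

long counter = 0;                                  /* lives in the shared address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);                 /* synchronize the "processors" */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);            /* 4 * 100000 = 400000 */
    return 0;
}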
Symmetric Multiprocessors (SMP)

[Figure: processors, each with its own L1 and L2 caches, sharing main memory and I/O over a system bus]
– MIMD
– Shared memory
– UMA
Symmetric Multiprocessors (SMP)

Characteristics:
– Two or more similar processors
– Processors share the same memory and I/O facilities
– Processors are connected by a bus or other internal connection scheme, such that memory access time is the same for each processor
– All processors share access to I/O devices
– All processors can perform the same functions
– The system is controlled by the operating system
Symmetric Multiprocessors (SMP)

Operating system:
– Provides tools and functions to exploit the parallelism
– Schedules processes or threads across all of the processors (see the sketch below)
– Takes care of:
  • scheduling of threads and processes on processors
  • synchronization among processors
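Normally the operating system places threads on the SMP's processors automatically; as one concrete illustration, Linux also lets a program pin a thread to a particular processor. The sketch below uses pthread_setaffinity_np, a GNU/Linux extension, so it is an assumption on top of the slides rather than anything they specify.

/* Pin the calling thread to processor 0 on Linux (GNU extension).
 * Without such a call, the OS scheduler is free to run the thread
 * on any of the SMP's processors. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                              /* allow only processor 0 */

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
    else
        printf("thread pinned to processor 0\n");
    return 0;
}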
Multicore Computers

Dedicated L1 Cache (ARM11 MPCore)
[Figure: CPU cores 1..n, each with its own L1-I and L1-D caches; L2 cache, main memory, and I/O]
Multicore Computers

Dedicated L2 Cache (AMD Opteron)
[Figure: CPU cores 1..n, each with its own L1-I/L1-D and its own L2 cache; main memory and I/O are shared]
Multicore Computers

Shared L2 Cache (Intel Core Duo)
[Figure: CPU cores 1..n, each with its own L1-I and L1-D caches; a single shared L2 cache, main memory, and I/O]
Multicore Computers

Shared L3 Cache (Intel Core i7)
[Figure: CPU cores 1..n, each with its own L1-I/L1-D and L2 caches; a shared L3 cache, main memory, and I/O]
Multicore Computers

Advantages of a shared L2 cache:
– Reduced overall miss rate
  • A thread on one core may cause a frame to be brought into the cache; a thread on another core can then access the same location, which is already cached
– Data shared by multiple cores is not replicated
– The amount of shared cache allocated to each core may be dynamic
– Interprocessor communication is easy to implement

Advantages of a dedicated L2 cache:
– Each core can access its private cache more rapidly

L3 cache:
– As the amount of memory and the number of cores grow, an L3 cache provides better performance
Multicore Computers

On-chip interconnects:
– Bus
– Crossbar

Off-chip communication (CPU-to-CPU or I/O):
– Bus-based
Multicore Computers

Multithreading
– A multithreaded processor provides a separate PC for each thread (hardware multithreading)
– Implicit multithreading
  • Concurrent execution of multiple threads extracted from a single sequential program
– Explicit multithreading
  • Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
Multicore Computers

Explicit Multithreading
– Fine-grained multithreading (interleaved multithreading)
  • Processor deals with two or more thread contexts at a time
  • Switches from one thread to another at each clock cycle
– Coarse-grained multithreading (blocked multithreading)
  • Instructions of a thread are executed sequentially until an event that causes a delay (e.g. a cache miss) occurs
  • This event causes a switch to another thread
– Simultaneous multithreading (SMT)
  • Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
  • Thread-level parallelism is combined with instruction-level parallelism (ILP)
– Chip multiprocessing (CMP)
  • Each processor of a multicore system handles separate threads
[Figure: Coarse-grained, Fine-grained, Simultaneous Multithreading (SMT), and CMP approaches]
GPUs (Graphics Processing Units)

Characteristics of GPUs
– GPUs are accelerators for CPUs
– SIMD
– GPUs have many parallel processors and many concurrent threads (i.e., 10 or more cores; 100s or 1000s of threads per core)
– The CPU-GPU combination is an example of heterogeneous computing
– GPGPU (general-purpose GPU): using a GPU to perform applications traditionally handled by the CPU (see the sketch below)
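To make the SIMD, data-parallel flavor concrete, here is a plain C SAXPY loop (a hypothetical example, not from the slides). On a GPU, each iteration would be handled by one of the hundreds or thousands of concurrent threads, with groups of threads executing the same instruction on different data; a GPGPU version would express the loop body as a kernel launched over n threads (e.g., in CUDA or OpenCL).

/* SAXPY: y = a*x + y.  Every iteration applies the same operation to
 * different data -- exactly the pattern a SIMD-style GPU executes
 * with one thread per element.  Host-side C version for illustration. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)      /* on a GPU: i would be the thread index */
        y[i] = a * x[i] + y[i];
}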