Lecture 9: MIMD Architectures

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Clusters

Zebo Peng, IDA, LiTH
TDTS 08 – Lecture 9
Introduction

- MIMD: a set of general-purpose processors connected together.
- In contrast to a SIMD computer, a MIMD computer can execute different programs on different processors.
  - Flexibility!
[Figure: many processors (P1, P2, ...) and memory modules (m1, m2, ...) connected in an irregular pattern, illustrating irregular task parallelism.]
Introduction (Cont'd)

- MIMD processors work asynchronously, and don't have to synchronize with each other.
- At any time, different processors may be executing different instructions on different pieces of data.
- They can be built from commodity (off-the-shelf) microprocessors with relatively little effort.
- They are also highly scalable, provided that an appropriate memory organization is used.
- Most current parallel computers are built based on the MIMD architecture.
SIMD vs MIMD

- A SIMD computer requires less hardware than a MIMD computer (single control unit).
- However, SIMD processors are specially designed, and tend to be expensive and to have long design cycles.
- Not all applications are naturally suited to SIMD processors.
- Conceptually, MIMD computers also cover the SIMD case:
  - have all processors execute the same program (single program, multiple data streams - SPMD), as sketched below.
- On the other hand, SIMD architectures are more power efficient, since one instruction controls many computations.
  - For regular vector computation, it is still more appropriate to use a SIMD machine (e.g., implemented in GPUs).
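To make SPMD concrete, here is a minimal sketch in C using MPI (an added example, not from the original slides): every processor runs the same program, and its rank decides which slice of the data it works on.

```c
/* spmd.c - minimal SPMD sketch: one program, many data streams.
 * Build (assuming an MPI installation): mpicc spmd.c -o spmd
 * Run: mpirun -np 4 ./spmd
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this processor's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processors */

    /* Same program everywhere; the rank selects this node's slice of
       the data, giving SIMD-like data parallelism on a MIMD machine. */
    int n = 1000;                          /* hypothetical total workload */
    int chunk = n / size;
    int first = rank * chunk;
    int last  = (rank == size - 1) ? n : first + chunk;

    long local_sum = 0;
    for (int i = first; i < last; i++)
        local_sum += i;                    /* stand-in for real computation */

    long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```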
MIMD Processor Classification

- Centralized Memory: shared memory located at a centralized location, usually consisting of several interleaved modules, at the same distance from any processor.
  - Symmetric Multiprocessor (SMP)
  - Uniform Memory Access (UMA)
- Distributed Memory: memory is distributed to each processor, improving scalability.
  - Message Passing Architectures: no processor can directly access another processor's memory.
  - Hardware Distributed Shared Memory (DSM): memory is distributed, but the address space is shared.
    - Non-Uniform Memory Access (NUMA)
  - Software DSM: a layer of the OS built on top of a message-passing multiprocessor to give a shared-memory view to the programmer.
MIMD with Shared Memory

[Figure: control units CU1..CUn feed instruction streams (IS) to processing elements PE1..PEn, which exchange data streams (DS) with a single shared memory. The shared memory is the bottleneck.]

- Tightly coupled, not very scalable.
- Typically called a multi-processor.
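As an added illustration (not from the slides), a shared-memory MIMD program can be sketched in C with POSIX threads: all processors (threads) see the same memory, so they communicate simply by reading and writing a shared variable, with a mutex serializing the updates.

```c
/* shared_mem.c - threads communicating through shared memory.
 * Build: cc shared_mem.c -o shared_mem -lpthread
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                 /* shared data object */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* serialize the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* prints 400000 */
    return 0;
}
```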
MIMD with Distributed Memory

[Figure: control units CU1..CUn feed instruction streams (IS) to processing elements PE1..PEn, each with its own local memory LM1..LMn; the nodes exchange data streams (DS) over an interconnection network.]

- Loosely coupled with message passing, scalable.
- Typically called a multi-computer.
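A hedged sketch of how two nodes of such a machine might communicate, again in C with MPI's standard point-to-point calls: node 0 sends a value, node 1 receives it, and no memory is shared.

```c
/* msg_pass.c - two nodes exchanging a message, no shared memory.
 * Build: mpicc msg_pass.c -o msg_pass
 * Run:   mpirun -np 2 ./msg_pass
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;   /* lives only in node 0's local memory */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double y;          /* node 1 gets its own copy via the network */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %.2f\n", y);
    }

    MPI_Finalize();
    return 0;
}
```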
Shared-Address-Space Platforms

- Part (or all) of the memory is accessible to all processors.
- Processors interact by modifying data objects stored in this shared address space.

[Figure: processors (P) and memory modules (M) attached to an interconnection network.]

- UMA (uniform memory access): the time taken by a processor to access any memory word is identical.
- NUMA, otherwise.
Multi-Computer Systems

- A multi-computer system is a collection of computers interconnected by a message-passing network.
  - Clusters.
- Each processor is an autonomous computer:
  - with its own local memory;
  - there is no longer a shared address space;
  - they communicate with each other through the message-passing network.
- Can be easily built with off-the-shelf microprocessors.
MIMD Design Issues

- Design decisions related to an MIMD machine are very complex, since they involve both architectural and software issues.
- The most important issues include:
  - Processor design (functionality and capability)
  - Physical organization
  - Interconnection structure
  - Inter-processor communication protocols
  - Memory hierarchy
  - Cache organization and coherency
  - Operating system design
  - Parallel programming languages
  - Application software techniques
Lecture 9: MIMD Architectures

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Clusters
Symmetric Multiprocessors (SMP)

- A set of similar processors of comparable capacity.
- All processors can perform the same functions (symmetric).
- Processors are connected by a bus or other interconnections.
- Processors share the same memory.
  - A single memory or a pool of memory modules.
- Memory access time is the same for each processor.
- All processors share access to I/O.
  - Either through the same channels or through different channels giving paths to the same devices.
- Controlled by an integrated operating system.
SMP Example

[Figure: an example SMP organization.]
SMP Advantages

- High performance
  - if similar work can be done in parallel (e.g., scientific computing).
- Good availability
  - Since all processors can perform the same functions, failure of a single processor does not stop the system.
- Supports incremental growth
  - Users can enhance performance by adding additional processors.
- Scalability
  - Vendors can offer a range of products based on different numbers of processors.
SMP based on Shared Bus

[Figure: processors with local caches attached to a shared bus and a shared memory; the shared bus is critical for performance.]
Shared Bus – Pros and Cons

- Advantages:
  - Simple and cheap.
  - Flexibility: easy to expand the system by attaching more processors.
- Disadvantages:
  - Performance is limited by the bus bandwidth.
  - Each processor should have a local cache:
    - to reduce the number of bus accesses;
    - this can lead to problems with cache coherence, which should be solved in hardware (to be discussed later); an illustration of the hazard follows below.
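To give a feel for why coherence matters (an added sketch, not from the slides): when two processors repeatedly write variables that happen to share a cache line, the coherence hardware must bounce that line between their caches, which shows up as a large slowdown known as false sharing.

```c
/* false_sharing.c - two counters in one cache line force the coherence
 * protocol to bounce the line between the processors' caches.
 * Build: cc -O2 false_sharing.c -o fs -lpthread
 */
#include <pthread.h>
#include <stdio.h>

struct counters {
    volatile long a;   /* written only by thread 0 */
    /* char pad[64];      uncomment to give each counter its own
                          cache line and observe the speedup */
    volatile long b;   /* written only by thread 1 */
};

static struct counters c;

static void *bump_a(void *x)
{
    (void)x;
    for (long i = 0; i < 100000000L; i++) c.a++;
    return NULL;
}

static void *bump_b(void *x)
{
    (void)x;
    for (long i = 0; i < 100000000L; i++) c.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", c.a, c.b);
    return 0;
}
```

The two threads never touch each other's variable, yet on a typical SMP the unpadded version runs several times slower than the padded one, purely because of coherence traffic.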
Multi-Port Memory SMP

- Direct access of memory modules by several processors.
- Better performance:
  - each processor has a dedicated path to each memory module.
- All the ports can be active simultaneously,
  - with the restriction that only one write can be performed into a memory location at a time.
Multi-Port Memory SMP (Cont'd)

[Figure: processors with local caches directly connected to the ports of shared memory modules.]

- Hardware logic is required to resolve conflicts (parallel writes):
  - priorities are permanently assigned to each memory port.
- Portions of memory can be configured as private to one or more processors (which also improves security).
- A write-through cache policy is required to alert other processors of memory updates.
Operating System Issues

- The OS should manage the processor resources so that the user perceives them as a single system.
- It should appear as a single-processor multiprogramming system.
- In both SMP and uniprocessor systems, several processes may be active at one time.
  - The OS is responsible for scheduling their execution and allocating resources.
- A user may construct applications with multiple processes without regard to whether a single processor or multiple processors will be available.
- The OS supports reliability and fault tolerance.
  - Graceful degradation.
IBM S/390 Mainframe SMP

- Processor unit (PU):
  - CISC microprocessor
  - Frequently used instructions hardwired
  - 64 KB unified L1 cache with 1-cycle access time
IBM S/390 Mainframe SMP (Cont'd)

- L2 cache:
  - 384 KB
IBM S/390 Mainframe SMP (Cont'd)

- Bus switching network adapter (BSN):
  - includes 2 MB of L3 cache
IBM S/390 Mainframe SMP (Cont'd)

- Memory card:
  - 8 GB per card
Lecture 9: MIMD Architectures

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Clusters
Memory Access Approaches

- Uniform memory access (UMA), as in SMP:
  - All processors have access to all parts of memory.
  - Access time to all regions of memory is the same.
  - Access time for different processors is the same.
- Non-uniform memory access (NUMA):
  - All processors have access to all parts of memory.
  - A processor's access time differs depending on the region of memory.
  - Different processors access different regions of memory at different speeds.
Motivation for NUMA

- SMP has a practical limit to the number of processors.
  - Bus traffic limits it to between 16 and 64 processors.
  - A multi-port memory has a limited number of ports (ca. 10).
- In cluster computers, each node has its own memory.
  - Applications do not see a large global memory.
  - Coherence is maintained by software, not hardware (slow).
- NUMA retains the SMP flavour while making large-scale multiprocessing possible.
  - E.g., the Silicon Graphics Origin NUMA machine has 1024 processors (≥ 1 teraFLOPS peak performance).
- The objective is to provide a transparent system-wide memory while allowing nodes to have their own bus or internal interconnections.
A Typical NUMA Organization

[Figure: a NUMA machine built from several nodes connected by an interconnection network.]

- Each node is, in general, an SMP.
- Each node has its own main memory.
- Each processor has its own L1 and L2 cache.
A Typical NUMA Organization
 Nodes
are connected by
some networking facility.
 Each
processor sees a
single addressable
memory space:
Each memory location
has a unique systemwide address.
Zebo Peng, IDA, LiTH
28
TDTS 08 – Lecture 9
14
2015-11-27
A Typical NUMA Organization (Cont'd)

- On a load (e.g., "Load X"), the memory request order is:
  1. L1 cache (local to the processor)
  2. L2 cache (local to the processor)
  3. Main memory (local to the node)
  4. Remote memory (going through the interconnection network)
- All of this is done automatically and transparently to the processor,
  - but with very different access times!
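Placement can also be steered explicitly in software. Below is a minimal sketch using the Linux libnuma API, assuming a machine with at least two NUMA nodes: the thread is pinned to node 0 and touches one buffer allocated locally and one allocated on a remote node; timing the two loops would expose the non-uniform access times the slide describes.

```c
/* numa_demo.c - touch local vs. remote memory on a NUMA machine.
 * Build: cc numa_demo.c -o numa_demo -lnuma   (Linux, libnuma)
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define SIZE (64UL * 1024 * 1024)

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with >= 2 nodes\n");
        return 1;
    }

    numa_run_on_node(0);                      /* run this thread on node 0 */

    char *near = numa_alloc_onnode(SIZE, 0);  /* memory local to our node  */
    char *far  = numa_alloc_onnode(SIZE, 1);  /* memory on a remote node   */
    if (!near || !far)
        return 1;

    /* Timing these two loops (e.g., with clock_gettime) would show
       the difference between local and remote access times. */
    memset(near, 0, SIZE);
    memset(far, 0, SIZE);

    numa_free(near, SIZE);
    numa_free(far, SIZE);
    return 0;
}
```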
NUMA – Pros and Cons

- Better performance at higher levels of parallelism than SMP.
- However, performance can break down if there is too much access to remote memory, which can be avoided by:
  - designing better L1 and L2 caches, to reduce memory accesses
    - needs good temporal locality in the software;
  - exploiting spatial locality: if the software has good spatial locality, the data needed by an application will reside in a small number of frequently used pages, which can be initially moved into the local memory;
  - enhancing virtual memory by including in the OS a page-migration mechanism that moves a VM page to a node that is frequently using it.
- The design of the OS thus becomes an important issue for NUMA machines.
Lecture 9: MIMD Architectures

- Introduction and classification
- Symmetric multiprocessors
- NUMA architecture
- Clusters
Loosely Coupled MIMD - Clusters

Cluster: a set of computers connected over a high-bandwidth local area network, and used as a parallel computer.

- A group of interconnected stand-alone computers.
- Work together as a unified resource.
- Each computer is called a node.
  - A node can also be a multiprocessor itself, such as an SMP.
- Message passing is used for communication between nodes.
  - There is no longer a shared memory space.
- NOW (Network of Workstations): a homogeneous cluster.
- Grid: computers connected over a wide-area network.
  - Cloud computing: a large number of computers connected via a real-time communication network such as the Internet.
Cluster Benefits

- Absolute scalability:
  - clusters with a very large number of nodes are possible.
- Incremental scalability:
  - a user can start with a small system and expand it as needed, without having to go through a major upgrade.
- High availability:
  - fault tolerance by nature.
- Superior price/performance:
  - can be built with cheap off-the-shelf nodes;
  - supercomputing-class off-the-shelf components are available.
- Amazon, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of users every day.
Cluster Configurations

The simplest classification of clusters is based on whether the nodes share access to disks.

- Cluster with no shared disk:
  - interconnection is by a high-speed link:
    - a LAN, possibly shared with other non-cluster computers, or
    - a dedicated interconnection facility.
Cluster Configurations (Cont'd)

- Shared-disk cluster:
  - still connected by a message link;
  - the disk subsystem is directly linked to multiple computers;
  - the disks should be made fault-tolerant, to avoid a single point of failure in the system.
- RAID (Redundant Array of Independent Disks) allows simultaneous access to data from multiple drives, using multiple disk arrays and data distributed over multiple disks.
IBM's Blue Gene Supercomputer

- A parallel supercomputer using a massive number of PowerPC processors and supporting a large memory space.
  - 65,536 processors give 360 teraFLOPS of performance (BG/L).
  - Has led the TOP500 and Green500 (energy efficiency) lists several times.
- Comes with standard compilers and a message-passing environment.
Blue Gene/L Architecture

[Figure: the hierarchical organization of the Blue Gene/L system.]
Blue Gene/Q

- 64-bit PowerPC A2 processor with 4-way simultaneous multithreading, running at 1.6 GHz.
- The processor cores are linked by a crossbar switch to a 32 MB L2 cache.
- 16 processor cores are used for computing, and a 17th core for the OS.
- An 18th core is used as a redundant spare in case one of the other cores is permanently damaged.
- The 18-core chip has a peak performance of 204.8 GFLOPS.
- A Blue Gene/Q system delivers 20 PFLOPS (the world's fastest supercomputer in June 2012) while drawing ca. 7.9 megawatts of power.
- One of the greenest supercomputers, at over 2 GFLOPS per watt (compare Tianhe-2: 17.6 MW for 33.86 PFLOPS).
Parallelizing Computation

How to make a single application execute in parallel on a cluster:

- Parallelizing compiler:
  - determines at compile time which parts can be executed in parallel.
- Parallelized application:
  - the application is written from scratch to be parallel;
  - message passing is used to move data between nodes;
  - hard to program, but gives the best end results.
- Parametric computing:
  - the application consists of repeated executions of the same algorithm over different sets of data;
  - e.g., simulation with different scenarios;
  - needs tools to organize, run, and manage the jobs (a minimal sketch follows below).
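As an added illustration of parametric computing (my sketch, not from the slides), the same MPI program can be launched on every node, with the rank selecting which parameter set, i.e., which scenario, that node simulates:

```c
/* parametric.c - same algorithm, different data sets per node.
 * Build: mpicc parametric.c -o parametric
 * Run:   mpirun -np 4 ./parametric
 */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical simulation: one run per parameter value. */
static double simulate(double param)
{
    double acc = 0.0;
    for (int i = 1; i <= 1000000; i++)
        acc += param / i;                /* stand-in for real work */
    return acc;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every node runs the identical algorithm on its own scenario. */
    double scenarios[] = { 0.5, 1.0, 2.0, 4.0 };
    double result = simulate(scenarios[rank % 4]);
    printf("node %d: result = %f\n", rank, result);

    MPI_Finalize();
    return 0;
}
```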
Google Applications

- Search engines require a large amount of computation per request.
- A single query on Google (on average):
  - reads hundreds of megabytes of data;
  - takes tens of billions of CPU cycles.
- A peak request stream on Google:
  - thousands of queries per second;
  - requires supercomputing capability to handle.
- Queries can be easily parallelized:
  - different queries can run on different processors;
  - a single query can also use multiple processors.
Google Cluster of PCs

- A cluster of off-the-shelf PCs → a cheap solution.
  - More than 15,000 nodes form a cluster.
  - Rather than a smaller number of high-end servers.
- Important design considerations:
  - price-performance ratio;
  - energy efficiency.
- Several clusters are available worldwide:
  - to handle catastrophic failures (earthquakes, power failures).
- Parametric computing:
  - the same algorithm: PageRank, which determines the relative importance of a web page (sketched below);
  - different sets of data.
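For reference, here is a tiny serial sketch of the PageRank power iteration (an added example, not from the slides; the 3-page link graph and the damping factor 0.85 are illustrative):

```c
/* pagerank.c - power iteration for a tiny 3-page web graph. */
#include <stdio.h>

#define N 3
#define DAMPING 0.85

int main(void)
{
    /* link[i][j] = 1 if page j links to page i */
    int link[N][N] = { {0, 1, 1},
                       {1, 0, 0},
                       {1, 1, 0} };
    int outdeg[N] = {2, 2, 1};           /* out-degree of each page */

    double rank[N] = {1.0 / N, 1.0 / N, 1.0 / N};

    for (int iter = 0; iter < 50; iter++) {
        double next[N];
        for (int i = 0; i < N; i++) {
            double in = 0.0;
            for (int j = 0; j < N; j++)
                if (link[i][j])
                    in += rank[j] / outdeg[j];  /* share of j's rank */
            next[i] = (1.0 - DAMPING) / N + DAMPING * in;
        }
        for (int i = 0; i < N; i++)
            rank[i] = next[i];
    }

    for (int i = 0; i < N; i++)
        printf("page %d: %.4f\n", i, rank[i]);
    return 0;
}
```

In the cluster setting, the same iteration is run over different partitions of the web graph on different nodes, which is exactly the parametric pattern described above.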
Summary

- The two most common MIMD architectures are symmetric multiprocessors and clusters.
  - SMP is an example of a tightly coupled system with shared memory.
  - A cluster is a loosely coupled system with distributed memory.
- More recently, non-uniform memory access (NUMA) systems are also becoming popular.
  - More complex in terms of design and operation.
- Several organization styles can also be integrated into a single system.
  - E.g., an SMP as a node inside a NUMA machine.
- MIMD architectures are highly scalable; they can therefore be built with a huge number of microprocessors.