Introduction
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Why Parallel Systems?
• Increased execution speed
  - Main motivation for many applications
• Improved fault-tolerance and reliability
  - Multiple resources provide improved fault tolerance and reliability
• Expandability
  - Problem scaleup
• New applications
Three Metrics
• Speedup
  - Problem size is fixed
  - Adding more processors should reduce time
  - Speedup on n processors, S(n), is
        S(n) = (Time on 1-processor system) / (Time on n-processor system)
  - Linear speedup if S(n) = a·n for 0 < a ≤ 1
  - Perfectly linear speedup if a = 1 (see the sketch below)
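To make the definition concrete, here is a minimal Python sketch (not from the slides; the run times are hypothetical) that computes S(n) from measured execution times for a fixed-size problem and compares it with the perfectly linear value.

```python
# Speedup on n processors: S(n) = time on 1 processor / time on n processors.
def speedup(t1, tn):
    return t1 / tn

# Hypothetical measured run times (seconds) for the same, fixed-size problem
times = {1: 100.0, 2: 52.0, 4: 27.0, 8: 14.5}

for n, tn in times.items():
    s = speedup(times[1], tn)
    print(f"n={n}: S(n) = {s:.2f}   (perfectly linear would be {n})")
```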
Three Metrics
(cont’d)
• Scaleup
  - Problem size increases with system size
  - Scaleup on n processors, C(n), is
        C(n) = (Small problem time on 1-processor system) / (Larger problem time on n-processor system)
  - Linear scaleup if C(n) = b for 0 < b ≤ 1
  - Perfectly linear scaleup if b = 1
Three Metrics
(cont’d)
• Efficiency
  - Defined as the average utilization of the n processors
  - Efficiency of n processors, E(n), is related to speedup (see the sketch below):
        E(n) = S(n) / n
  - If efficiency remains 1 as we add more processors, we get perfectly linear speedups
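A minimal Python sketch (not from the slides; the speedups are hypothetical) relating efficiency to speedup via E(n) = S(n)/n; efficiency staying at 1 as n grows corresponds to perfectly linear speedup.

```python
# Efficiency: E(n) = S(n) / n, the average utilization of the n processors.
def efficiency(s_n, n):
    return s_n / n

# Hypothetical speedups S(n) measured on 2, 4, and 8 processors
for n, s_n in {2: 1.9, 4: 3.6, 8: 6.4}.items():
    print(f"n={n}: S(n) = {s_n}   E(n) = {efficiency(s_n, n):.2f}")
# Efficiency drops below 1 as n grows, i.e. the speedup is sublinear.
```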
Three Metrics
(cont’d)
[Figure: S(n) versus number of processors n, showing perfectly linear, linear, and sublinear speedup curves]
Three Metrics
(cont’d)
[Figure: C(n) versus problem size & processors, showing linear and sublinear scaleup curves]
Example 1: QCD Problem
• Quantum chromodynamics (QCD)
  - Predicts the mass of the proton…
  - Requires approximately 3×10^17 operations
• On a Cray-1-like system with 100 Mflops
  - Takes about 100 years (3×10^17 / 10^8 ≈ 3×10^9 seconds ≈ 95 years)
  - Still takes 10 years on a Pentium that is 10 times faster
• IBM built a special system (GF-11)
  - Provides about 11 Gflops peak (sustained about 7 Gflops)
  - QCD problem takes only a year or so!
Example 2: Factoring of RSA-129
• Factoring of a 129-digit number (RSA-129) into two primes
• RSA stands for
  - Ronald Rivest of MIT
  - Adi Shamir of the Weizmann Institute of Science, Israel
  - Leonard Adleman of USC
• In 1977 they announced a new cryptographic scheme
  - Known as the RSA public-key system
• "Cryptography is a never-ending struggle between code makers and code breakers." (Adi Shamir)
Example 2: Factoring of RSA-129 (cont'd)
• RSA-129 =
  1143816257578888676692357799761466120102182
  9672124236256256184293570693524573389783059
  7123563958705058989075147599290026879543541
• The two primes are
  3490529510847650949147849619903898133417764638493387843990820577
  ×
  32769132993266709549961988190834461413177642967992942539798288533
• Solved in April 1994 (challenge posted in September 1993)
  - Needs over 10^17 operations
  - 0.1% of the Internet was used
  - 100% of the Internet would have solved the problem in 3 hours
Example 3: Toy Story
• The production of Toy Story
  - 140,000 frames to render for the movie
  - Requires about 10^17 operations
    - Same as the RSA-129 problem
  - Used dozens of SUN workstations
    - Each SUN about 10 MIPS
Example 4: N-Body Problem
• Simulates the motion of N particles under the influence of mutual force fields based on an inverse-square law
  - Materials science, astrophysics, etc. all require a variant of the N-body problem
  - Doubling the physical accuracy seems to require four times the computation
    - Implications for scaleup (see the sketch below)
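A minimal Python sketch (not from the slides) of the O(N^2) pairwise force calculation at the heart of the N-body problem; one reason the work grows so quickly is that the number of interacting pairs grows as N(N-1)/2, so doubling the particle count roughly quadruples the work.

```python
def pairwise_forces(positions, masses, G=1.0, eps=1e-9):
    """Accumulate the inverse-square-law force on each particle (2-D)."""
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):                   # N*(N-1)/2 interacting pairs
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            r2 = dx * dx + dy * dy + eps            # softened squared distance
            f = G * masses[i] * masses[j] / r2      # inverse-square magnitude
            r = r2 ** 0.5
            fx, fy = f * dx / r, f * dy / r
            forces[i][0] += fx; forces[i][1] += fy
            forces[j][0] -= fx; forces[j][1] -= fy  # Newton's third law
    return forces

print(pairwise_forces([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], [1.0, 1.0, 1.0]))
```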
Peak Speeds
Machine                  Mflops
Cray-1                      160
Cray C90                  1,000
Cray T90                  2,200
Pentium 4 (3070 MHz)      3,070
Athlon XP 1900+           3,200
Itanium                   6,400
Sony PlayStation 2        6,300
Applications of Parallel Systems
• A wide variety
  - Scientific applications
  - Engineering applications
  - Database applications
  - Artificial intelligence
  - Real-time applications
  - Speech recognition
  - Image processing
Applications of Parallel Systems
(cont'd)
• Scientific applications
  - Weather forecasting
  - QCD
  - Blood flow in the heart
  - Molecular dynamics
  - Evolution of galaxies
• Most problems rely on basic operations of linear algebra
  - Solving linear equations
  - Finding eigenvalues
Applications of Parallel Systems
(cont'd)
• Weather forecasting
  - Needs to solve general circulation model equations
  - Computation is carried out on a 3-dimensional grid that partitions the atmosphere
    - A fourth dimension is added: time
    - Number of time steps in the simulation
  - With a grid spacing of 270 miles, a 24-hour forecast needs about 100 billion data operations
Applications of Parallel Systems
(cont'd)
• Weather forecasting (cont'd)
  - On a 100-Mflops processor, a 24-hour forecast takes about 1.5 hours
  - Want more accuracy?
    - Halve the grid spacing and halve the time step
    - Involves 2^4 = 16 times more processing (2× in each of the three spatial dimensions and 2× in time)
    - On a 100-Mflops processor, the 24-hour forecast now takes 24 hours!
  - To complete it in 1.5 hours, we need a system 16 times faster
Applications of Parallel Systems
(cont'd)
• Engineering applications (VLSI design)
  - Circuit simulation
    - Detailed simulation of electrical characteristics
  - Placement
    - Automatic positioning of blocks on a chip
  - Wiring
    - Automated placement of wires to form the desired connections
    - Done after the previous placement step
Applications of Parallel Systems
(cont'd)
• Artificial intelligence
  - Production systems have three components
    - Working memory
      - Stores the global database: facts or data about the world being modeled
    - Production memory
      - Stores the knowledge base: a set of production rules
    - Control system
      - Chooses which rule should be applied
Applications of Parallel Systems
(cont'd)
• Artificial intelligence (cont'd)
  - Example (f(X,Y): X is the father of Y; m(X,Y): X is the mother of Y; gf(X,Z): X is a grandfather of Z)
    - Working memory:
        f(curt,elaine)   f(dan,pat)      f(pat,john)
        f(sam,larry)     f(larry,dan)    f(larry,doug)
        m(elaine,john)   m(marian,elaine)
        m(peg,dan)       m(peg,doug)
    - Production memory:
        1. gf(X,Z) ← f(X,Y), f(Y,Z)
        2. gf(X,Z) ← f(X,Y), m(Y,Z)
Applications of Parallel Systems
(cont’d)
Query: A grandchild of Sam
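A minimal Python sketch (not from the slides) of evaluating this query by forward chaining over the working memory with the two production rules; the fact and rule names follow the example above, read as f = father-of, m = mother-of, gf = grandfather-of.

```python
father = {("curt", "elaine"), ("dan", "pat"), ("pat", "john"),
          ("sam", "larry"), ("larry", "dan"), ("larry", "doug")}
mother = {("elaine", "john"), ("marian", "elaine"),
          ("peg", "dan"), ("peg", "doug")}

def grandfather_facts():
    """Apply both production rules exhaustively (forward chaining)."""
    gf = set()
    for (x, y) in father:
        # Rule 1: gf(X,Z) <- f(X,Y), f(Y,Z)
        gf |= {(x, z) for (y2, z) in father if y2 == y}
        # Rule 2: gf(X,Z) <- f(X,Y), m(Y,Z)
        gf |= {(x, z) for (y2, z) in mother if y2 == y}
    return gf

# Query: a grandchild of Sam
print(sorted(z for (x, z) in grandfather_facts() if x == "sam"))   # ['dan', 'doug']
```

In a parallel production system, each rule could be matched against the working memory on its own processor (OR-parallelism), as the next slide describes.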
Applications of Parallel Systems
(cont'd)
• Artificial intelligence (cont'd)
  - Sources of parallelism
    - Assign each production rule its own processor
      - Each can search the working memory for pertinent facts in parallel with all the other processors
    - AND-parallelism
      - Synchronization is involved
    - OR-parallelism
      - Abort the other searches if one is successful
Applications of Parallel Systems
(cont'd)
• Database applications
  - Relational model
    - Uses tables to store data
  - Three basic operations
    - Selection: selects tuples that satisfy a specified condition
    - Projection: selects certain specified columns
    - Join: combines data from two tables
Applications of Parallel Systems
(cont'd)
• Database applications (cont'd)
  - Sources of parallelism
    - Within a single query (intra-query parallelism)
      - Horizontally partition relations into P fragments
      - Each processor independently works on its own fragment (see the sketch below)
    - Among queries (inter-query parallelism)
      - Execute several queries concurrently
      - Exploit common subqueries
      - Improves query throughput
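A minimal Python sketch (not from the slides; the relation, predicate, and helper names are made up for illustration) of intra-query parallelism: the relation is horizontally partitioned into P fragments and each worker applies the same selection predicate to its own fragment.

```python
from concurrent.futures import ProcessPoolExecutor

employees = [("alice", 52000), ("bob", 48000), ("carol", 91000),
             ("dave", 61000), ("erin", 39000), ("frank", 77000)]

def select_high_salary(fragment, threshold=60000):
    """Selection: keep tuples whose salary exceeds the threshold."""
    return [t for t in fragment if t[1] > threshold]

def parallel_select(relation, P=3):
    # Horizontal partitioning: round-robin assignment of tuples to P fragments
    fragments = [relation[i::P] for i in range(P)]
    with ProcessPoolExecutor(max_workers=P) as pool:
        partial_results = pool.map(select_high_salary, fragments)
    # The union of the partial results answers the whole query
    return [t for part in partial_results for t in part]

if __name__ == "__main__":
    print(parallel_select(employees))   # dave, carol, frank
```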
Flynn’s Taxonomy
• Based on the number of instruction and data streams
  - Single-Instruction, Single-Data stream (SISD)
    - Uniprocessor systems
  - Single-Instruction, Multiple-Data stream (SIMD)
    - Array processors
  - Multiple-Instruction, Single-Data stream (MISD)
    - Not really useful
  - Multiple-Instruction, Multiple-Data stream (MIMD)
Flynn’s Taxonomy
• MIMD systems
  - Most popular category
  - Shared-memory systems
    - Also called multiprocessors
    - Sometimes called tightly-coupled systems
  - Distributed-memory systems
    - Also called multicomputers
    - Sometimes called loosely-coupled systems
Another Taxonomy
• Parallel systems
  - Synchronous
    - Vector
    - Array
    - SIMD
    - Systolic
  - Asynchronous
    - MIMD
    - Dataflow
SIMD Architecture
• Multiple actors, single script (see the sketch below)
• SIMD comes in two flavours
  - Array processors
    - Large number of simple processors
    - Operate on small amounts of data (bits, bytes, words, …)
    - Illiac IV, Burroughs BSP, Connection Machine CM-1
  - Vector processors
    - Small number (< 32) of powerful, pipelined processors
    - Operate on large amounts of data (vectors)
    - Cray 1 (1976), Cray X/MP (mid 1980s, 4 processors), Cray Y/MP (1988, 8 processors), Cray 3 (1989, 16 processors)
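A minimal Python/NumPy sketch (illustrative only, not from the slides) of the SIMD idea: one operation (the "single script") applied to many data elements (the "multiple actors") at once, contrasted with an element-by-element loop.

```python
import numpy as np

b = np.arange(100_000, dtype=np.float64)
c = np.arange(100_000, dtype=np.float64)

# SISD-style: one instruction stream touching one data element at a time
a_loop = np.empty_like(b)
for i in range(len(b)):
    a_loop[i] = b[i] + c[i]

# SIMD/vector-style: a single vector operation over all elements;
# NumPy dispatches it to vectorized (often hardware-SIMD) code.
a_vec = b + c

assert np.array_equal(a_loop, a_vec)
```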
SIMD Architecture (cont'd)
[Figure: SIMD architecture]
Shared-Memory MIMD
• Two major classes
  - UMA
    - Uniform memory access
    - Typically bus-based
    - Limited to small systems
  - NUMA
    - Non-uniform memory access
    - Uses a MIN-based interconnection
    - Expandable to medium-sized systems
Shared-Memory MIMD (cont'd)
[Figures: shared-memory MIMD organization, UMA, and NUMA]
Shared-Memory MIMD (cont'd)
• Examples
  - SGI Power Onyx
  - Cray C90
  - IBM SP2 node
• Symmetric Multi-Processing (SMP)
  - Special case of shared-memory MIMD
  - Identical processors share the memory
Distributed-Memory MIMD
• Typically uses message passing (see the sketch below)
• Interconnection network is static
  - Point-to-point network
• System scales up to thousands of nodes
  - Intel TFLOPS system consists of 9000+ processors
• Similar to cluster systems
• Popular architecture for large parallel systems
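A minimal Python sketch (illustrative only, not from the slides) of the message-passing style used on distributed-memory machines: the two processes share no variables and communicate only by explicit send and receive over a channel.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()               # explicit receive
    conn.send(sum(data))             # explicit send of the partial result
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(list(range(10)))   # ship the data to the worker
    print(parent_end.recv())           # 45
    p.join()
```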
Distributed-Memory MIMD (cont'd)
[Figures: distributed-memory MIMD organization]
Hybrid Systems
[Figure: Stanford DASH]
Distributed Shared Memory
• Advantages of shared-memory MIMD
  - Relatively easy to program
    - Global shared-memory view
  - Fast communication & data sharing
    - Via the shared memory
    - No physical copying of data
  - Load distribution is not a problem
Distributed Shared Memory (cont'd)
• Disadvantages of shared-memory MIMD
  - Limited scalability
    - UMA can scale to tens of processors
    - NUMA can scale to hundreds of processors
  - Expensive network
Distributed Shared Memory (cont'd)
• Advantages of distributed-memory MIMD
  - Good scalability
    - Can scale to thousands of processors
  - Inexpensive
    - Uses a static interconnection network (relatively speaking)
    - Cheaper to build
    - Can use off-the-shelf components
Distributed Shared Memory (cont'd)
• Disadvantages of distributed-memory MIMD
  - Not easy to program
    - Must deal with explicit message-passing
  - Slow network
  - Expensive data copying
    - Done by message passing
  - Load distribution is an issue
Distributed Shared Memory (cont'd)
• DSM is proposed to combine the advantages of these two types of systems
  - Uses distributed-memory MIMD hardware
  - A software layer gives the appearance of shared memory to the programmer
    - A memory read, for example, is transparently converted to a message send and reply
  - Example: TreadMarks from Rice
Distributed Shared Memory (cont'd)
[Figure: distributed shared memory organization]
Cluster Systems
• Built with commodity processors
• Cost-effective
  - Often use existing resources
  - Take advantage of technological advances in commodity processors
  - Not tied to a single vendor
• Generic components mean
  - Competitive prices
  - Multiple sources of supply
Cluster Systems
(cont'd)
• Several types
  - Dedicated set of workstations (DoW)
    - Specifically built as a parallel system
    - Represents one extreme
    - Dedicated to parallel workload
      - No serial workload
    - Closely related to distributed-memory MIMD
      - Communication network latency tends to be high
      - Example: Fast Ethernet
Cluster Systems
(cont'd)
• Several types (cont'd)
  - Privately-owned workstations (PoW)
    - Represents the other extreme
    - All workstations are privately owned
      - Idea is to harness unused processor cycles for parallel workload
    - Receives local jobs from owners
      - Local jobs must receive higher priority
    - Workstations might be dynamically removed from the pool
      - Owner shutting down/resetting the system, keyboard/mouse activity
Cluster Systems
(cont'd)
• Several types (cont'd)
  - Community-owned workstations (CoW)
    - All workstations are community-owned
      - Example: workstations in a graduate lab
    - In the middle between DoW and PoW
      - In PoW, a workstation could be removed when there is owner activity
      - Not so in CoW systems: the parallel workload continues to run
    - Resource management should take these differences into account
Cluster Systems
(cont'd)
• Beowulf
  - Uses PCs for parallel processing
  - Closely resembles a DoW
    - Dedicated PCs (no scavenging of processor cycles)
    - A private system network (not a shared one)
    - Open design using public-domain software and tools
  - Also known as PoPC (Pile of PCs)
Cluster Systems
(cont'd)
• Beowulf (cont'd)
  - Advantages
    - Systems not tied to a single manufacturer
      - Multiple vendors supply interchangeable components
      - Leads to better pricing
    - Technology tracking is straightforward
    - Incremental expandability
      - Configure the system to match user needs
      - Not limited to fixed, vendor-configured systems
Cluster Systems
(cont'd)
• Beowulf (cont'd)
  - Example system
    - Linux NetworX designed what was then the largest and most powerful Linux cluster
      - Delivered to Lawrence Livermore National Lab (LLNL) in 2002
      - Uses 2,304 Intel 2.4 GHz Xeon processors
      - Peak rating: 11.2 Tflops
      - Aggregate memory: 4.6 TB
      - Aggregate disk space: 138.2 TB
      - Ranked 5th fastest supercomputer in the world
ASCI System
[Figure: ASCI system]
Dataflow Systems
• Different from control flow
  - Availability of data determines which instruction should be executed
• Example: A = (B + C) * (D - E)  (see the sketch below)
  - On a von Neumann machine
    - Takes 6 instructions, with a sequential dependency:
        add   B,C
        store T1
        sub   D,E
        store T2
        mult  T1,T2
        store A
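A minimal Python sketch (illustrative only, not from the slides) of the dataflow view of A = (B + C) * (D - E): the add and the subtract have no data dependency on each other, so they can fire in parallel; the multiply fires only once both of its operand tokens are available.

```python
from concurrent.futures import ThreadPoolExecutor

def dataflow_eval(B, C, D, E):
    with ThreadPoolExecutor(max_workers=2) as pool:
        t1 = pool.submit(lambda: B + C)   # fires as soon as B and C are available
        t2 = pool.submit(lambda: D - E)   # fires independently of t1
        # The multiply node waits for both tokens T1 and T2
        return t1.result() * t2.result()

print(dataflow_eval(2, 3, 10, 4))   # (2 + 3) * (10 - 4) = 30
```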
Dataflow Systems
(cont'd)
• Addition & subtraction can be done in parallel
• Dataflow supports fine-grain parallelism
  - Causes implementation problems
  - To overcome these difficulties, hybrid architectures have been proposed
Dataflow Systems
(cont'd)
• Manchester dataflow machine
  - Data flows around the ring
  - Matching unit arranges data into sets of matched operands
    - Released to obtain the instruction from the instruction store
  - Any new data produced is passed around the ring
Interconnection Networks
• A critical component in many parallel systems
• Four design issues
  - Mode of operation
  - Control strategy
  - Switching method
  - Topology
Interconnection Networks
(cont'd)
• Mode of operation
  - Refers to the type of communication used
  - Asynchronous
    - Typically used in MIMD
  - Synchronous
    - Typically used in SIMD
  - Mixed
Interconnection Networks
(cont'd)
• Control strategy
  - Refers to how routing is achieved
  - Centralized control
    - Can cause scalability problems
    - Reliability is an issue
    - Non-uniform node structure
  - Distributed control
    - Uniform node structure
    - Improved reliability
    - Improved scalability
Interconnection Networks
(cont'd)
• Switching method
  - Two basic types
    - Circuit switching
      - A complete path is established
      - Good for large data transmissions
      - Causes problems at high loads
    - Packet switching
      - Uses the store-and-forward method
      - Good for short messages
      - High latency
Interconnection Networks
(cont'd)
• Switching method (cont'd)
  - Wormhole routing
    - Uses pipelined transmission
      - Avoids the buffering problem of packet switching
    - A complete (virtual) circuit is established as in circuit switching
      - Avoids some of the problems associated with circuit switching
    - Extensively used in current systems
Interconnection Networks
(cont'd)
• Network topology
  - Static topology
    - Links are passive and static
    - Cannot be reconfigured to provide direct connections
    - Used in distributed-memory MIMD systems
  - Dynamic topology
    - Links can be reconfigured dynamically to provide direct connections
    - Used in SIMD and shared-memory MIMD systems
Interconnection Networks
(cont'd)
• Dynamic networks
  - Crossbar
    - Very expensive
    - Limited to small sizes
  - Shuffle-exchange
    - Single-stage
    - Multistage
      - Also called a MIN (multistage interconnection network)
Interconnection Networks
(cont'd)
[Figure: crossbar network]
Interconnection Networks
(cont'd)
• Shuffle-exchange networks
  - Use a switching box
    - Gives the capability to dynamically reconfigure the network
    - Different types of switches: 2-function, 4-function
  - Connections between stages follow the shuffle pattern
    - Perfect shuffle (see the sketch below)
      - Think of how you shuffle a deck of cards
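A minimal Python sketch (illustrative only, not from the slides) of the perfect-shuffle connection pattern for N = 2^k lines: input i is routed to the position given by a one-bit left rotation of i's k-bit binary address, exactly like interleaving the two halves of a deck of cards.

```python
def perfect_shuffle(i, k):
    """Destination of input i in a perfect shuffle of N = 2**k lines."""
    n = 1 << k
    return ((i << 1) | (i >> (k - 1))) & (n - 1)   # rotate the k-bit address left by 1

k = 3                                              # N = 8 lines
print([perfect_shuffle(i, k) for i in range(1 << k)])
# [0, 2, 4, 6, 1, 3, 5, 7]  -> input 1 goes to 2, input 4 goes to 1, and so on
```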
Interconnection Networks
(cont'd)
[Figure: 2-function and 4-function switches]
Interconnection Networks
(cont'd)
[Figure: perfect shuffle]
Interconnection Networks
(cont'd)
[Figure: single-stage shuffle-exchange network with buffers; all outputs and inputs are connected this way]
Interconnection Networks
(cont'd)
[Figure: multistage interconnection network (MIN)]
Interconnection Networks
(cont'd)
[Figure: IBM SP2 switch]
Interconnection Networks
(cont'd)
• Static interconnection networks
  - Complete connection
    - One extreme
    - High cost, low latency
  - Ring network
    - The other extreme
    - Low cost, high latency
  - A variety of networks lie between these two extremes
Interconnection Networks
(cont'd)
[Figures: complete connection, ring, and chordal ring networks]
Interconnection Networks
(cont'd)
[Figure: tree networks]
Interconnection Networks
(cont'd)
[Figure: hypercube networks of dimension 1, 2, and 3 (see the sketch below)]
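A minimal Python sketch (illustrative only, not from the slides) of the d-dimensional hypercube topology: there are 2^d nodes, and two nodes are directly linked exactly when their binary labels differ in one bit, so every node has d neighbours.

```python
def hypercube_neighbours(node, d):
    """Return the labels of the d neighbours of `node` in a d-cube."""
    return [node ^ (1 << bit) for bit in range(d)]

d = 3                                    # the 3-d cube from the figure
for node in range(1 << d):
    neighbours = [f"{n:0{d}b}" for n in hypercube_neighbours(node, d)]
    print(f"{node:0{d}b} -> {neighbours}")
# e.g. node 000 is linked to 001, 010, and 100
```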
Interconnection Networks
(cont'd)
[Figure: a hierarchical network]
Future Parallel Systems
• Special-purpose systems
  + Very efficient
  + Relatively simple
  - Narrow domain of applications
• May be cost-effective
  - Depends on the application
Future Parallel Systems
(cont'd)
• General-purpose systems
  + Cost-effective
  + Wide range of applications
  - Decreased speed
  - Decreased hardware utilization
  - Increased software requirements
Future Parallel Systems
(cont'd)
• In favour of special-purpose systems
  - Harold Stone argues
    - The major advantage of general-purpose systems is that they are economical due to their wide area of applicability
    - The economics of computer systems is changing rapidly because of VLSI
      - This makes special-purpose systems economically viable
Future Parallel Systems
(cont'd)
• In favour of both types of systems
  - Gajski argues
    - The problem space is constantly expanding
    - Special-purpose systems can only be designed to solve "mature" problems
    - There are always new applications for which no "standardized" solution exists
      - For these applications, general-purpose systems are useful
Performance
• Amdahl's law
  - Serial fraction of a program: a
  - Parallel fraction: 1 - a
  - Execution time on n processors:
        T(n) = T(1)·a + T(1)·(1 - a)/n
  - Speedup (Amdahl's law; see the sketch below):
        S(n) = n / (a·n + (1 - a))
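A minimal Python sketch (illustrative only, not from the slides) of Amdahl's law, S(n) = n / (a·n + (1 - a)), reproducing the table on the next slide up to rounding.

```python
def amdahl_speedup(n, a):
    """Speedup on n processors when a fraction `a` of the work is serial."""
    return n / (a * n + (1 - a))

for a in (0.01, 0.10, 0.25):
    print(f"a = {a:.0%}:",
          [round(amdahl_speedup(n, a), 2) for n in (10, 20, 30, 40, 50, 100)])
# a = 1%:  [9.17, 16.81, 23.26, 28.78, 33.56, 50.25]
# a = 10%: [5.26, 6.9, 7.69, 8.16, 8.47, 9.17]
# a = 25%: [3.08, 3.48, 3.64, 3.72, 3.77, 3.88]
```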
Performance (cont'd)

Amdahl's law speedups S(n) for serial fraction a:

  n      a = 1%    a = 10%    a = 25%
  10      9.17      5.26       3.08
  20     16.81      6.90       3.48
  30     23.26      7.69       3.64
  40     28.76      8.16       3.72
  50     33.56      8.47       3.77
 100     50.25      9.17       3.88
Performance
(cont'd)
• Gustafson-Barsis law
  - Obtained a speedup of 1000 on a 1024-node nCUBE/10
    - For the problem, a values ranged from 0.4% to 0.8%
    - Won the Gordon Bell prize in 1988
  - Amdahl's law predicts a speedup of only 201 to 112!
    - It assumes that (1 - a) is independent of n
  - Gustafson-Barsis instead lets the problem scale up with the system, holding the parallel run time fixed:
        T(1) = a + (1 - a)·n
        T(n) = a + (1 - a) = 1
  - Speedup (see the sketch below):
        S(n) = T(1)/T(n) = n - (n - 1)·a
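A minimal Python sketch (illustrative only, not from the slides) of the Gustafson-Barsis scaled speedup, S(n) = n - (n - 1)·a, where the problem grows with the system so that the parallel run time stays fixed.

```python
def gustafson_speedup(n, a):
    """Scaled speedup on n processors with serial fraction a."""
    return n - (n - 1) * a

# The 1024-node nCUBE/10 result: a between 0.4% and 0.8%
print(round(gustafson_speedup(1024, 0.004), 1))   # 1019.9 (close to the reported 1000)
print(round(gustafson_speedup(1024, 0.008), 1))   # 1015.8
for a in (0.01, 0.10, 0.25):
    print(f"a = {a:.0%}:",
          [round(gustafson_speedup(n, a), 2) for n in (10, 20, 30, 40, 50, 100)])
# a = 1%:  [9.91, 19.81, 29.71, 39.61, 49.51, 99.01]
# a = 10%: [9.1, 18.1, 27.1, 36.1, 45.1, 90.1]
# a = 25%: [7.75, 15.25, 22.75, 30.25, 37.75, 75.25]
```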
Performance (cont'd)

Gustafson-Barsis scaled speedups S(n) for serial fraction a:

  n      a = 1%    a = 10%    a = 25%
  10      9.91      9.1        7.75
  20     19.81     18.1       15.25
  30     29.71     27.1       22.75
  40     39.61     36.1       30.25
  50     49.51     45.1       37.75
 100     99.01     90.1       75.25