Session 26: Parallel Processing 2
Course: H0344/Organisasi dan Arsitektur Komputer
Year: 2005
Version: 1/1
Learning Outcomes
By the end of this session, students are expected to be able to:
• Explain the working principles of parallel processing
Outline
• Multiple Processor Organization
• Symmetric Multiprocessors
• Cache Coherence and the MESI Protocol
• Clusters
• Non-uniform Memory Access
• Vector Computation
Cache Coherence and the MESI Protocol
The cache coherence problem: multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result. Approaches to the problem (a sketch of the directory approach follows below):
• Software solutions
• Hardware solutions (cache coherence protocols)
  • Directory protocols
  • Snoopy protocols
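To make the directory approach concrete, here is a minimal sketch (not from the slides) of the per-line bookkeeping a directory protocol might keep: a global line state plus a bit vector of sharers. The names (dir_entry_t, NUM_PROCS, handle_write_request) are illustrative assumptions only.

/* Hypothetical sketch of directory-protocol bookkeeping; all names are
 * illustrative, not taken from the lecture. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PROCS 8

typedef enum { LINE_UNCACHED, LINE_SHARED, LINE_EXCLUSIVE } line_state_t;

typedef struct {
    line_state_t state;   /* global state of this memory line               */
    uint8_t      sharers; /* bit i set => cache of processor i holds a copy */
} dir_entry_t;

/* On a write request from processor p, the directory invalidates every
 * other cached copy before granting exclusive ownership to p. */
static void handle_write_request(dir_entry_t *e, int p)
{
    for (int i = 0; i < NUM_PROCS; i++)
        if (i != p && (e->sharers & (1u << i)))
            printf("send invalidate to cache %d\n", i);
    e->sharers = (uint8_t)(1u << p);
    e->state   = LINE_EXCLUSIVE;
}

int main(void)
{
    dir_entry_t line = { LINE_SHARED, 0x0B }; /* cached by processors 0, 1, 3 */
    handle_write_request(&line, 1);           /* invalidates caches 0 and 3   */
    return 0;
}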
Cache Coherence and the MESI Protocol
MESI cache line states:
• Modified (M): the cache line is valid; the memory copy is out of date; no copies exist in other caches; a write to this line does not go to the bus.
• Exclusive (E): the cache line is valid; the memory copy is valid; no copies exist in other caches; a write to this line does not go to the bus.
• Shared (S): the cache line is valid; the memory copy is valid; copies may exist in other caches; a write to this line goes to the bus and updates the cache.
• Invalid (I): the cache line is not valid; copies may exist in other caches; a write to this line goes directly to the bus.
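Read as a lookup, the table boils down to two questions per state. The sketch below (my encoding, not from the slides) captures the two columns that drive protocol behavior; the helper names are illustrative.

/* Hedged sketch: the last two columns of the MESI table as C predicates. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID } mesi_state_t;

/* Per the table: only writes to Shared or Invalid lines go to the bus. */
static bool write_goes_to_bus(mesi_state_t s)
{
    return s == MESI_SHARED || s == MESI_INVALID;
}

/* Per the table: only a Modified line means main memory is out of date. */
static bool memory_copy_stale(mesi_state_t s)
{
    return s == MESI_MODIFIED;
}

int main(void)
{
    printf("write to Shared line goes to bus: %d\n", write_goes_to_bus(MESI_SHARED));
    printf("memory stale for Exclusive line: %d\n", memory_copy_stale(MESI_EXCLUSIVE));
    return 0;
}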
Cache Coherence and the MESI Protocol
MESI state transition diagram (figure): (a) the line in the cache at the initiating processor, (b) the line in the snooping cache. The states are Modified, Exclusive, Shared, and Invalid; transitions are labeled RH (read hit), RMS (read miss, shared), RME (read miss, exclusive), WH (write hit), WM (write miss), SHR (snoop hit on a read), and SHW (snoop hit on a write or read-with-intent-to-modify).
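As a rough companion to part (a) of the diagram, the sketch below encodes only the processor-initiated transitions; snooper-side transitions (SHR, SHW) and the actual bus signalling are omitted, and all names are my own.

/* Simplified sketch of the initiating-processor side of the MESI diagram.
 * Snooper-side transitions (SHR, SHW) and bus traffic are omitted. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;
typedef enum { READ_HIT, READ_MISS_SHARED, READ_MISS_EXCLUSIVE,
               WRITE_HIT, WRITE_MISS } cpu_event_t;

static mesi_t next_state(mesi_t s, cpu_event_t e)
{
    switch (e) {
    case READ_HIT:            return s;         /* RH: state unchanged               */
    case READ_MISS_SHARED:    return SHARED;    /* RMS: another cache supplies line  */
    case READ_MISS_EXCLUSIVE: return EXCLUSIVE; /* RME: no other cached copy         */
    case WRITE_MISS:          return MODIFIED;  /* WM: read-with-intent-to-modify    */
    case WRITE_HIT:           return MODIFIED;  /* WH: Shared first invalidates other copies */
    }
    return s;
}

int main(void)
{
    return next_state(SHARED, WRITE_HIT) == MODIFIED ? 0 : 1;
}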
Clusters
Four benefits that can be achieved with clustering:
• Absolute scalability
• Incremental scalability
• High availability
• Superior price/performance
Clusters
Cluster configurations (figure): (a) standby server with no shared disk: two servers, each with its own processors (P), memory (M), and I/O, connected only by a high-speed message link; (b) shared disk: the two servers are connected by a high-speed message link and are also both cabled, through their I/O subsystems, to a shared RAID disk array.
Clustering methods: benefits and limitations

• Passive standby
  Description: A secondary server takes over in case of primary server failure.
  Benefits: Easy to implement.
  Limitations: High cost, because the secondary server is unavailable for other processing tasks.

• Active secondary
  Description: The secondary server is also used for processing tasks.
  Benefits: Reduced cost, because secondary servers can be used for processing.
  Limitations: Increased complexity.

• Separate servers
  Description: Separate servers have their own disks; data are continuously copied from the primary to the secondary server.
  Benefits: High availability.
  Limitations: High network and server overhead due to copying operations.

• Servers connected to disks
  Description: Servers are cabled to the same disks, but each server owns its disks; if one server fails, its disks are taken over by the other server.
  Benefits: Reduced network and server overhead due to elimination of copying operations.
  Limitations: Usually requires disk mirroring or RAID technology to compensate for the risk of disk failure.

• Servers share disks
  Description: Multiple servers simultaneously share access to the disks.
  Benefits: Low network and server overhead; reduced risk of downtime caused by disk failure.
  Limitations: Requires lock manager software; usually used with disk mirroring or RAID technology.
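To illustrate the passive standby method, here is a minimal single-process simulation (my own, not from the slides): a "primary" thread publishes a heartbeat timestamp and a "standby" thread takes over once the heartbeat goes stale. A real cluster would use network heartbeats, shared storage, and fencing. Compile with -pthread.

/* Toy simulation of passive-standby failover via a heartbeat timeout. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_long last_heartbeat;   /* seconds since epoch              */
#define HEARTBEAT_TIMEOUT 3          /* declare the primary dead after 3 s */

static void *primary(void *arg)
{
    (void)arg;
    for (int i = 0; i < 3; i++) {    /* primary "fails" after three beats */
        atomic_store(&last_heartbeat, (long)time(NULL));
        puts("primary: heartbeat");
        sleep(1);
    }
    puts("primary: crashed (stops sending heartbeats)");
    return NULL;
}

static void *standby(void *arg)
{
    (void)arg;
    for (;;) {
        sleep(1);
        long age = (long)time(NULL) - atomic_load(&last_heartbeat);
        if (age > HEARTBEAT_TIMEOUT) {
            puts("standby: heartbeat stale, taking over as primary");
            return NULL;
        }
    }
}

int main(void)
{
    atomic_store(&last_heartbeat, (long)time(NULL));
    pthread_t p, s;
    pthread_create(&p, NULL, primary, NULL);
    pthread_create(&s, NULL, standby, NULL);
    pthread_join(p, NULL);
    pthread_join(s, NULL);
    return 0;
}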
Clusters
Operating system design issues:
• Failure management
• Load balancing (a minimal dispatch sketch follows below)
• Parallel computation, which can be provided through:
  • A parallelizing compiler
  • Parallelized applications
  • Parametric computing
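As a toy illustration of load balancing (not from the slides), the sketch below shows the simplest policy a cluster front end might use: round-robin assignment of incoming requests to nodes. NUM_NODES and pick_node are hypothetical names.

/* Round-robin request dispatch across cluster nodes; illustrative only. */
#include <stdio.h>

#define NUM_NODES 4

static int next_node = 0;

static int pick_node(void)                 /* round-robin policy */
{
    int n = next_node;
    next_node = (next_node + 1) % NUM_NODES;
    return n;
}

int main(void)
{
    for (int request = 0; request < 6; request++)
        printf("request %d -> node %d\n", request, pick_node());
    return 0;
}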
Non-uniform Memory Access
• Uniform memory access (UMA)
• Non-uniform memory access (NUMA)
• Cache-coherent NUMA (CC-NUMA)
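On a NUMA machine the placement of data matters, because a processor reaches memory on its own node faster than memory on a remote node. The sketch below is an assumption-laden example using Linux's libnuma (link with -lnuma); the calls shown (numa_available, numa_max_node, numa_alloc_onnode, numa_free) are standard libnuma routines, but the program itself is only illustrative.

/* Hedged sketch: explicit memory placement on a NUMA node with libnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        puts("NUMA not available on this system");
        return 0;
    }
    printf("memory nodes: %d\n", numa_max_node() + 1);

    /* Place a buffer on node 0: accesses from CPUs attached to node 0 are
     * local, accesses from other nodes cross the interconnection network. */
    size_t sz = 1 << 20;
    char *buf = numa_alloc_onnode(sz, 0);
    if (buf) {
        memset(buf, 0, sz);
        numa_free(buf, sz);
    }
    return 0;
}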
Non-uniform Memory Access
CC-NUMA organization (figure): the system is built from N nodes. Node 1 contains processors 1-1 through 1-m, each with its L1 cache, plus main memory 1, a directory, and I/O; nodes 2 through N are organized the same way (processors 2-1 … 2-m up to N-1 … N-m, main memories 2 … N, each node with its own directory and I/O). The nodes are joined by an interconnection network.
Vector computation
Matrix multiplication C = A × B, coded three ways:

(a) Scalar processing
      DO 100 I = 1, N
        DO 100 J = 1, N
          C(I, J) = 0.0
          DO 100 K = 1, N
            C(I, J) = C(I, J) + A(I, K) * B(K, J)
100   CONTINUE

(b) Vector processing
      DO 100 I = 1, N
        C(I, J) = 0.0 (J = 1, N)
        DO 100 K = 1, N
          C(I, J) = C(I, J) + A(I, K) * B(K, J) (J = 1, N)
100   CONTINUE

(c) Parallel processing
      DO 50 J = 1, N - 1
        FORK 100
50    CONTINUE
      J = N
100   DO 200 I = 1, N
        C(I, J) = 0.0
        DO 200 K = 1, N
          C(I, J) = C(I, J) + A(I, K) * B(K, J)
200   CONTINUE
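For comparison with (a) and (c), here is a rough C rendering (my own, not part of the lecture) of the same computation: a scalar triple loop, and the same loop parallelized over independent columns with an OpenMP pragma, which plays the role of the FORK in (c). Compile with a flag such as -fopenmp.

/* Rough C rendering of (a) scalar and (c) parallel matrix multiplication. */
#define N 64
static double A[N][N], B[N][N], C[N][N];

static void matmul_scalar(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

static void matmul_parallel(void)
{
    #pragma omp parallel for          /* one task per column, as with FORK */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void)
{
    matmul_scalar();
    matmul_parallel();
    return 0;
}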
Vector computation
Approaches to vector computation (figure): (a) pipelined ALU: operands flow from memory through input registers into a single pipelined ALU and back to memory through an output register; (b) parallel ALUs: input registers feed several ALUs working side by side, with results collected in an output register and written back to memory.
Vector computation
Pipelined processing of vector floating-point operations (figure): each element addition zi = xi + yi passes through four stages: C (compare exponents), S (shift significand), A (add significands), and N (normalize). (a) Pipelined ALU: successive element pairs (x1, y1), (x2, y2), … enter the pipeline one behind the other, so the four stages work on different elements at the same time and one result emerges per stage time once the pipeline is full. (b) Four parallel ALUs: the element pairs (x1, y1) through (x12, y12) are distributed across four ALUs, each of which carries out the full C-S-A-N sequence and delivers results z1 through z12 four at a time.
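A back-of-the-envelope way to see the difference (my own idealized model, one time unit per stage and no stalls): with a single four-stage pipeline the i-th result completes at time 4 + (i - 1), while with four non-pipelined parallel ALUs results complete in batches of four every four time units.

/* Idealized timing sketch: 4-stage pipelined ALU vs. four parallel ALUs. */
#include <stdio.h>

#define STAGES 4   /* C, S, A, N */
#define ALUS   4

int main(void)
{
    for (int i = 1; i <= 8; i++) {
        int pipelined = STAGES + (i - 1);                 /* fill once, then one per unit */
        int parallel  = STAGES * ((i + ALUS - 1) / ALUS); /* batches of four              */
        printf("element %2d: pipelined ALU t=%2d, four parallel ALUs t=%2d\n",
               i, pipelined, parallel);
    }
    return 0;
}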
Vector computation
Example: multiplying two vectors of 50 complex numbers, with real and imaginary parts held in AR/AI, BR/BI, and CR/CI:

      DO 100 J = 1, 50
        CR(J) = AR(J) * BR(J) - AI(J) * BI(J)
100     CI(J) = AR(J) * BI(J) + AI(J) * BR(J)

(a) Storage to storage (3 cycles per operation)
    AR(J) * BR(J) → T1(J)
    AI(J) * BI(J) → T2(J)
    T1(J) - T2(J) → CR(J)
    AR(J) * BI(J) → T3(J)
    AI(J) * BR(J) → T4(J)
    T3(J) + T4(J) → CI(J)
    TOTAL: 18 cycles

(b) Register to register (1 cycle per operation)
    AR(J) → V1(J)
    BR(J) → V2(J)
    V1(J) * V2(J) → V3(J)
    AI(J) → V4(J)
    BI(J) → V5(J)
    V4(J) * V5(J) → V6(J)
    V3(J) - V6(J) → V7(J)
    V7(J) → CR(J)
    V1(J) * V5(J) → V8(J)
    V4(J) * V2(J) → V9(J)
    V8(J) + V9(J) → V0(J)
    V0(J) → CI(J)
    TOTAL: 12 cycles
Vector computation
The same loop, coded two more ways:

(c) Storage to register (1 cycle per operation)
    AR(J) → V1(J)
    V1(J) * BR(J) → V2(J)
    AI(J) → V3(J)
    V3(J) * BI(J) → V4(J)
    V2(J) - V4(J) → V5(J)
    V5(J) → CR(J)
    V1(J) * BI(J) → V6(J)
    V3(J) * BR(J) → V7(J)
    V6(J) + V7(J) → V8(J)
    V8(J) → CI(J)
    TOTAL: 10 cycles

(d) Compound instruction (1 cycle per operation)
    AR(J) → V1(J)
    V1(J) * BR(J) → V2(J)
    AI(J) → V3(J)
    V2(J) - V3(J) * BI(J) → V2(J)
    V2(J) → CR(J)
    V1(J) * BI(J) → V4(J)
    V4(J) + V3(J) * BR(J) → V5(J)
    V5(J) → CI(J)
    TOTAL: 8 cycles
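For reference, the same computation in plain C (my own sketch, not from the lecture). With auto-vectorization enabled, a modern compiler generates vector loads, multiplies, and fused multiply-adds for this loop, which is the spirit of forms (b) through (d).

/* The DO 100 loop above in plain C; illustrative sketch only. */
#include <stdio.h>

#define N 50

static void complex_multiply(const double AR[], const double AI[],
                             const double BR[], const double BI[],
                             double CR[], double CI[])
{
    for (int j = 0; j < N; j++) {
        CR[j] = AR[j] * BR[j] - AI[j] * BI[j];
        CI[j] = AR[j] * BI[j] + AI[j] * BR[j];
    }
}

int main(void)
{
    double AR[N], AI[N], BR[N], BI[N], CR[N], CI[N];
    for (int j = 0; j < N; j++) {
        AR[j] = j; AI[j] = 1.0; BR[j] = 2.0; BI[j] = j;
    }
    complex_multiply(AR, AI, BR, BI, CR, CI);
    printf("CR[1] = %f, CI[1] = %f\n", CR[1], CI[1]);
    return 0;
}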