FDA125 APP Topic X: Realization of PRAMs.
C. Kessler, IDA, Linköpings Universitet, 2007.
Lecture 3 – Emulation of PRAMs; SB-PRAM architecture
SB-PRAM overview
Hashing shared memory addresses
Parallel prefix computation on a tree
Multiprefix computation
Ranade’s emulation algorithm
SB-PRAM architecture
SB-PRAM system software tools and simulator
Exact barrier implementation in SB-PRAM assembler

SB-PRAM: A realization of the PRAM model in hardware
based on the “Fluent Machine” emulation approach [Ranade et al. ’87, ’88]
cost-efficient, scalable [Abolhassan/Keller/Paul ’90, ’91]
F physical processors
multithreading: each physical processor simulates V = c log2 F PRAM processors (vPs)
pipelined butterfly network: F log2 F switches with simple ALU and memory
P = F V memory modules; write conflicts resolved by combining
on-the-fly parallel reductions and multiprefix in the network
university prototypes:
  (1) F = 16, V = 32 (finished 1998)
  (2) F = 64, V = 32 (finished 2000)
1991 ASIC design, 8 MHz → 250 kFLOPS (also memory bandwidth) per vP
SB-PRAM: A realization of the PRAM model in hardware (cont.)
[Figure labels: PHYSICAL/VIRTUAL PROCS. (F, V), BUTTERFLY NETWORK (bidirectional), MEMORY MODULES, WRITE ACCESS]
Concurrent accesses to the same memory location:
  requests meet at a butterfly switch
  combine requests (depending on request type and priority)
  for reading requests: on the way back, split up replies again
Special End-of-Step packets mark the end of one PRAM step

Distributed shared memory by hashing of addresses [PPP 3.4]
Map m shared memory addresses over p disjoint memory modules of size $m' = m/p$ each.
Hash functions
  $h_1 : \{0, \dots, m-1\} \to \{0, \dots, p-1\}$ gives the module address
  $h_2 : \{0, \dots, m-1\} \to \{0, \dots, m'-1\}$ gives the local address
Bad access sequence:
  a lot of requests go to the same module (not the same location)
  → overloaded, maybe request queue overflow
Prob(access sequence bad) is very low → choose a random hash function
  $h_1(x) = \left( \left( \sum_{i=0}^{\zeta-1} a_i x^i \right) \bmod P \right) \bmod p$
with $a_i$ randomly chosen, P an appropriate prime (see [PPP 3.4])
On SB-PRAM: linear hash function $h_1(x) = a x \bmod p$, default: a = 1 [EK93]
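
The two hash functions above can be tried out with a small Python sketch. This is an illustration only, not SB-PRAM code: the helper names, the degree zeta = 3, and the prime P = 1000003 are assumptions made for the example. It counts how many requests of a stride-p access sequence land on each module.

```python
import random

def make_polynomial_hash(zeta, P, p, rng):
    """Return h1(x) = ((sum_{i<zeta} a_i * x**i) mod P) mod p with random a_i."""
    coeffs = [rng.randrange(P) for _ in range(zeta)]  # a_0 .. a_{zeta-1}

    def h1(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, reduced mod P at every step
            acc = (acc * x + a) % P
        return acc % p

    return h1

def make_linear_hash(a, p):
    """Return the linear hash h1(x) = a*x mod p."""
    return lambda x: (a * x) % p

def module_loads(h1, addresses, p):
    """Count how many of the given addresses land on each of the p modules."""
    loads = [0] * p
    for x in addresses:
        loads[h1(x)] += 1
    return loads

p = 8                                        # number of memory modules (illustrative)
stride_access = list(range(0, 64, p))        # addresses 0, 8, 16, ..., 56
linear_loads = module_loads(make_linear_hash(1, p), stride_access, p)
# P = 1000003 is a prime picked only for this example (assumption)
poly_hash = make_polynomial_hash(zeta=3, P=1_000_003, p=p, rng=random.Random(0))
poly_loads = module_loads(poly_hash, stride_access, p)
print(linear_loads)  # with a = 1, all 8 requests hit module 0
print(poly_loads)
```

With a = 1 the stride-p sequence is a bad access sequence in the sense above: every request goes to the same module although all locations differ. How evenly the random polynomial spreads them depends on the coefficients drawn.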
Parallel prefix on the SB-PRAM
Global sum: $s = s + \sum_{i=1}^{p} a_i$
Prefix sums: $y_j = s + \sum_{i=1}^{j-1} a_i$, $j = 1, \dots, p$
Parallel prefix “on” shared memory location s
  virtually, by adding up all contributions $a_i$ to s in sequential order
  can be implemented using a binary tree rooted at s (slightly suboptimal variant of odd-even prefix)
This binary tree is embedded in the SB-PRAM network (request paths towards the memory module hosting s).
[Figure: binary tree over SHARED MEMORY location s, annotated with partial sums and partial prefix sums]

Multiprefix on the SB-PRAM
Multiprefix: Multiple parallel prefix operations on different shared memory locations s1, s2, ... can be processed simultaneously.
Example:
  SHARED MEMORY before:   0xe40: 0    0xf3c: 4
  P0: mpadd( 0xf3c, 1 );  returns 4
  P1: mpadd( 0xe40, 2 );  returns 0
  P2: mpadd( 0xe40, 3 );  returns 2
  P3: mpadd( 0xf3c, 4 );  returns 5
  P4: mpadd( 0xe40, 5 );  returns 5
  SHARED MEMORY after:    0xe40: 10   0xf3c: 9
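
The semantics of the mpadd example can be checked with a short Python sketch. It is a sequential model under stated assumptions, not the network implementation: `multiprefix_add` applies the requests in processor order, and `tree_prefix_add` is a hypothetical recursive helper that mirrors the binary combining tree for a single location. Both function names are made up for this illustration.

```python
def multiprefix_add(memory, requests):
    """Apply (address, value) requests in processor order.

    Each request returns the old content of its cell plus all contributions
    of lower-numbered processors to the same cell; the cell ends up with the sum.
    """
    results = []
    for addr, value in requests:
        results.append(memory[addr])
        memory[addr] += value
    return results

def tree_prefix_add(s, contributions):
    """Prefix sums on one location s via a binary tree of partial sums."""
    n = len(contributions)
    if n == 0:
        return [], s
    prefixes = [0] * n

    def visit(lo, hi, offset):
        # offset: value seen by the leftmost leaf of this subtree;
        # the return value is the partial sum of contributions[lo:hi]
        if hi - lo == 1:
            prefixes[lo] = offset
            return contributions[lo]
        mid = (lo + hi) // 2
        left = visit(lo, mid, offset)
        right = visit(mid, hi, offset + left)
        return left + right

    total = visit(0, n, s)
    return prefixes, s + total

memory = {0xe40: 0, 0xf3c: 4}
requests = [(0xf3c, 1), (0xe40, 2), (0xe40, 3), (0xf3c, 4), (0xe40, 5)]
returned = multiprefix_add(memory, requests)
print(returned)                      # [4, 0, 2, 5, 5]
print(memory[0xe40], memory[0xf3c])  # 10 9
print(tree_prefix_add(0, [2, 3, 5])) # ([0, 2, 5], 10), the 0xe40 requests alone
```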
On SB-PRAM: for integer addition, maximum, bitwise AND, bitwise OR

Ranade’s simulation algorithm
[Ranade’87,’88,’91]
+ very simple
+ requires only constant-sized queues to store access requests
+ slowdown factor only O(log p)
+ can be made optimally efficient
+ randomization employed only to distribute the shared memory across the memory modules
+ pipelined butterfly network as the underlying communication network → scalable
+ multiprefix on-the-fly [Ranade et al.’88]
+ cost-efficient implementation: SB-PRAM [Abolhassan/Keller/Paul’91]

Prerequisite: pipelined merging of sorted access sequences
Simpler case: 1 memory module
[Figure: tree of merge processors P1 to P7, each running mergeStreams with inputs left and right and output out; P4 reads psrt[0] and psrt[1], P5 reads psrt[2] and psrt[3], P6 reads psrt[4] and psrt[5], P7 reads psrt[6] and psrt[7]]
Tree of p − 1 merge processors working synchronously in parallel
Initial arrays (addresses of memory requests) sorted in increasing order.
Per global step, each comparator moves the larger operand upwards, the other waits → FIFO queues along edges needed.
Combining: Combine requests with identical addresses.
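
The merging and combining idea can be sketched in Python as follows. This is a simplified, non-pipelined model: it merges whole lists at once instead of moving one request per global step, it emits requests in increasing address order, and it combines requests by adding their values. All three are assumptions of this sketch, and the names `merge_streams` and `merge_tree` are chosen for illustration.

```python
def merge_streams(left, right):
    """Merge two address-sorted lists of (address, value) requests.

    Requests with identical addresses are combined into one request
    whose value is the sum of the combined values.
    """
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        take_left = j == len(right) or (i < len(left) and left[i][0] <= right[j][0])
        if take_left:
            addr, val = left[i]
            i += 1
        else:
            addr, val = right[j]
            j += 1
        if out and out[-1][0] == addr:
            out[-1] = (addr, out[-1][1] + val)  # combine with the previous request
        else:
            out.append((addr, val))
    return out

def merge_tree(streams):
    """Merge pairwise, level by level, like a binary tree of merge processors.

    Assumes at least one input stream.
    """
    while len(streams) > 1:
        streams = [
            merge_streams(streams[k], streams[k + 1]) if k + 1 < len(streams) else streams[k]
            for k in range(0, len(streams), 2)
        ]
    return streams[0]

leaves = [
    [(3, 1), (4, 1)],
    [(4, 1), (18, 1)],
    [(15, 1), (17, 1)],
    [(27, 1), (30, 1)],
]
merged = merge_tree(leaves)
print(merged)  # [(3, 1), (4, 2), (15, 1), (17, 1), (18, 1), (27, 1), (30, 1)]
```

The two requests for address 4 leave the tree as a single combined request, so the memory module sees each address at most once per step.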
Fluent Abstract Machine
Arrange p = F (log F + 1) PRAM processors on an F × (log F + 1) bidirectional butterfly network
[Figure: butterfly network with columns col 0, col 1, ..., col log F and rows row 1, row 2, ..., row F; node labels 0,1  1,1  2,1  3,1  0,2  1,2  0,3]

Routing of messages
keep streams sorted by increasing addresses
artificial Ghost messages to keep routing in flux
[Figure: three panels (Step 1, Step 2, Step 3) showing the request streams at Node (0,0), Node (1,0), Node (2,0), and Node (1,1); labels: combine, Ghost-4, Ghost-15, EndOfStep msgs]
Simulating one PRAM time step
Phase 1: Processor ⟨c, r⟩ sends the request to processor ⟨0, r⟩.
Phase 2: Processor ⟨0, r⟩ sends the request to processor ⟨l, r′⟩.
Phase 3: Processor ⟨l, r′⟩ sends the request to processor ⟨c′, r′⟩.
Phase 4: Processor ⟨c′, r′⟩ sends the reply to processor ⟨l, r′⟩.
Phase 5: Processor ⟨l, r′⟩ sends the reply to processor ⟨0, r⟩.
Phase 6: Processor ⟨0, r⟩ sends the reply to processor ⟨c, r⟩.

Simulating several PRAM time steps
[Figure: read request from ⟨0, 0⟩ traced through the six phases; direction, route, and stage of a node s at level c in each phase:]
  Phase 1: backward   ⟨c, r⟩ → ⟨0, r⟩      stage(s) = l − c
  Phase 2: forward    ⟨0, r⟩ → ⟨l, r′⟩     stage(s) = l + c
  Phase 3: backward   ⟨l, r′⟩ → ⟨c′, r′⟩   stage(s) = 3l − c
  Phase 4: forward    ⟨c′, r′⟩ → ⟨l, r′⟩   stage(s) = 3l + c
  Phase 5: backward   ⟨l, r′⟩ → ⟨0, r⟩     stage(s) = 5l − c
  Phase 6: forward    ⟨0, r⟩ → ⟨c, r⟩      stage(s) = 5l + c
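
The stage numbers of the six phases follow a simple pattern: a node s at level c has stage l − c, 3l − c, 5l − c in the backward phases 1, 3, 5 and stage l + c, 3l + c, 5l + c in the forward phases 2, 4, 6. The Python sketch below encodes this reading of the slide (the function name `stage` and the closed form are assumptions of this example) and checks that consecutive phases agree at the column where the sweep turns around.

```python
def stage(phase, c, l):
    """Stage of a node at level (column) c during the given phase, 1 <= phase <= 6.

    Backward phases 1, 3, 5: l - c, 3l - c, 5l - c.
    Forward phases 2, 4, 6:  l + c, 3l + c, 5l + c.
    """
    if not 1 <= phase <= 6:
        raise ValueError("phase must be in 1..6")
    if phase % 2 == 1:
        return phase * l - c
    return (phase - 1) * l + c

l = 4  # number of butterfly levels, illustrative value
for phase in range(1, 7):
    print(phase, [stage(phase, c, l) for c in range(l + 1)])

# Backward sweeps end at column 0, forward sweeps end at column l;
# the next phase starts there with the same stage number.
turnaround_ok = all(
    stage(ph, 0, l) == stage(ph + 1, 0, l) if ph % 2 == 1
    else stage(ph, l, l) == stage(ph + 1, l, l)
    for ph in range(1, 6)
)
print(turnaround_ok)  # True
```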
If the hash function h1 chosen turns out to be bad:
  choose a new h1 and rehash the memory in parallel, in O((m/p) log p) time with high probability.
  The expected cost of timeout + rehashing is to be balanced by the expected simulation time for t steps.
Theorem [Ranade’87]
An arbitrary t-step program for a p-processor Combining CRCW PRAM with t ≥ m/p can be simulated on a p-processor butterfly in O(t log p) time with high probability as p → ∞ and/or t → ∞.
The size of the memory required at each butterfly node is O(m/p).
Making Ranade’s emulation algorithm cost-optimal
Efficiency of Ranade’s algorithm: Ω(1/log p).
Improvement: Each physical processor simulates log p PRAM processors.
  → Phases 3 and 4 become superfluous
  → Phases 1 and 6 can be replaced by linear sorting arrays [PPP 4.2.1]
[Figure: network stages annotated Phase 1, Phase 2, Phase 3/4, Phase 5, and Phase 6, with blocks labeled Proc., Sort, M, and Mem. Mod.]

SB-PRAM: switch design
[Figure: block structure of a routing switch, with blocks labeled Dir. Queue, Logic + Arithm., and FIFO buffer; further labels: Routing Switch, ϕ, log ϕ]
Program design flow; system software tools
Fork95 Program (example.c)
  → fcc (compiler only) → Assembler Program (example.s)
  → prass → Object Module (example.o)
  → plink → Executable File (a.out)
  → loader → Simulator pramsim or SB-PRAM Machine
Configuration file: .ldrc

pramsim [Bosch/Franziskus’94]
pramsim [optional parameters] [executable file]
uses file .pramsimrc
  -M, --globmem:    shared memory size in words
  -P, --progmem:    program memory size in words
  -v, --virtProz:   # vP per pP
  -p, --physProz:   # pP
  --net-er:         warning on concurrent read
  --net-ew:         warning on concurrent write
Prompt: PRAM P0 = (p0, v0)>
Commands:
  init F V     (Re)Initialize with F pP, V vP
  g            Start or continue the program
  t VAL        Trace. Execute 1 or VAL steps
  help         Print help text
  r PROC       Show all registers of vP PROC
  c VAL        Show value VAL as decimal, hex, float, bin
  d MEM        Disassemble the memory range MEM
  m MEM        Show the memory area MEM as hex
  k MEM        Show the memory area MEM as hex and ascii
  break VAL    Set breakpoint at address VAL
  q            Quit from simulator
PRAMOS – Syscalls
SB-PRAM operating system PRAMOS [Grün/Rauber/Röhrig’95]
no direct support for synchronous execution at user level
Fork runtime system only uses the syscalls (esp., host file system I/O)

  No.  type  name           parameters (C declaration) / description
  0    int   open           char *file, int mode, flags
  1    int   read           int fd, void *buf, int num
  2    int   write          int fd, void *buf, int num
  3    int   close          int fd
  4    int   lseek          int fd, int offst, int orign
  5    int   sys_std_open   int fd; open stdin/-out/-err
  6    int   sys_abort      int pc, int reason
  7    int   sys_getnr      get my physical processor ID
  8    int   sys_exit       call OS program exit routine
  9    int   sys_getct      get the global counter
  12   int   sys_getbas     get BASE register
  13   void  sys_putbas     int base; write BASE register

Self-restoring exact barrier on SB-PRAM

  _barrier:
          bmc      0                  /*continue at modulo=0*/
          getlo    -1,par2            /*load constant -1      0*/
          syncadd  par2,gps,1         /*atomic decrement      1*/
  FORKLIB_SYNCLOOP:
          ldg      gps,1,r30          /*load sync cell        0*/
          getlo    1,par1             /*load constant 1       1*/
          add      r30,0,r30          /*compare sync cell     0*/
          bne      FORKLIB_SYNCLOOP   /*all procs there?      1*/
          ldg      gps,1,r30          /*sync:cmp Sync,        0*/
          syncadd  par1,gps,1         /*restore sync cell     1*/
          add      r30,0,r30          /*compare with 0,       0*/
          bne      FORKLIB_SYNCHRON   /*late wave skips nops*/
          nop                         /*early wave delayed 0*/
          nop                         /*early wave delayed 1*/
  FORKLIB_SYNCHRON:
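
Why the barrier is exact and self-restoring can be illustrated with a toy step-by-step model in Python. This is a sketch under loud assumptions, not a model of the real machine: every instruction takes one step, loads and branches take effect immediately (no delay slots), the sync cell initially holds the number of processors, and every processor enters at an even step (the effect of bmc 0). The function name and the opcode constants are made up for this example.

```python
(GETLO_M1, SYNCADD_M1, LDG_LOOP, GETLO_1, ADD_LOOP, BNE_LOOP,
 LDG_EXIT, SYNCADD_P1, ADD_EXIT, BNE_EXIT, NOP0, NOP1, SYNCHRON) = range(13)

def simulate_barrier(arrival_steps):
    """Return (step at which each processor reaches FORKLIB_SYNCHRON, final sync cell)."""
    assert all(t % 2 == 0 for t in arrival_steps), "bmc 0 aligns entry to even steps"
    p = len(arrival_steps)
    cell = p                       # assumed initial value of the sync cell
    pc = [GETLO_M1] * p            # next instruction of each processor
    r30 = [None] * p               # last value loaded from the sync cell
    exit_step = [None] * p
    step = 0
    while any(e is None for e in exit_step):
        delta = 0                  # combined syncadd contributions of this step
        for i in range(p):
            if step < arrival_steps[i] or exit_step[i] is not None:
                continue
            op = pc[i]
            if op == SYNCHRON:
                exit_step[i] = step
                continue
            nxt = op + 1
            if op == SYNCADD_M1:
                delta -= 1                     # atomic decrement on arrival
            elif op == SYNCADD_P1:
                delta += 1                     # restore the sync cell
            elif op in (LDG_LOOP, LDG_EXIT):
                r30[i] = cell                  # reads happen only on even steps
            elif op == BNE_LOOP and r30[i] != 0:
                nxt = LDG_LOOP                 # not all processors there yet
            elif op == BNE_EXIT and r30[i] != 0:
                nxt = SYNCHRON                 # late wave skips the two nops
            pc[i] = nxt
        cell += delta              # syncadds happen only on odd steps
        step += 1
    return exit_step, cell

exits, final_cell = simulate_barrier([0, 2, 4, 10, 6])
print(exits)       # all processors leave at the same step
print(final_cell)  # the sync cell is back at 5
```

In this model the polling loop takes four steps, so waiting processors notice the zero in two waves that are two steps apart. The early wave still reads 0 after the loop and executes the two nops; the late wave reads the partly restored cell and skips them, so both waves reach FORKLIB_SYNCHRON together.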