FDA125 APP Topic X: Realization of PRAMs. C. Kessler, IDA, Linköpings Universitet, 2007.

Lecture 3 – Emulation of PRAMs; SB-PRAM architecture

  - SB-PRAM overview
  - Hashing shared memory addresses
  - Parallel prefix computation on a tree
  - Multiprefix computation
  - Ranade's emulation algorithm
  - SB-PRAM architecture
  - SB-PRAM system software tools and simulator
  - Exact barrier implementation in SB-PRAM assembler

SB-PRAM: A realization of the PRAM model in hardware

  - based on the "Fluent Machine" emulation approach [Ranade et al. '87, '88]
  - cost-efficient, scalable [Abolhassan/Keller/Paul '90, '91]
  - $F$ physical processors
  - multithreading: each physical processor simulates $V = c \log_2 F$ PRAM processors (virtual processors, vPs), $P = F V$
  - pipelined butterfly network: $F \log_2 F$ switches with simple ALU and memory
  - memory modules; write conflicts resolved by combining on the fly
  - parallel reductions and multiprefix in the network
  - university prototypes: (1) $F = 16$, $V = 32$ (finished 1998); (2) $F = 64$, $V = 32$ (finished 2000)
  - 1991 ASIC design, 8 MHz → 250 kFLOPS (also memory bandwidth) per vP

SB-PRAM: A realization of the PRAM model in hardware (cont.)

[Figure: physical/virtual processors, bidirectional butterfly network, memory modules; a write access is shown.]

Concurrent accesses to the same memory location:
  - requests meet at a butterfly switch
  - combine requests (depending on request type and priority)
  - for reading requests: on the way back, split up the replies again

Special End-of-Step packets mark the end of one PRAM step.

Distributed shared memory by hashing of addresses [PPP 3.4]

Map $m$ shared memory addresses over $p$ disjoint memory modules of size $m' = m/p$ each.

Hash functions:
  $h_1 : \{0, \dots, m-1\} \to \{0, \dots, p-1\}$ gives the module address
  $h_2 : \{0, \dots, m-1\} \to \{0, \dots, m'-1\}$ gives the local address

Bad access sequence: a lot of requests go to the same module (not the same location) → module overloaded, maybe request queue overflow.

Prob(access sequence bad) is very low → choose a random hash function

  $h_1(x) = \left( \left( \sum_{i=0}^{\zeta-1} a_i x^i \right) \bmod P \right) \bmod p$

with the $a_i$ randomly chosen and $P$ an appropriate prime (see [PPP 3.4]).

On SB-PRAM: linear hash function $h_1(x) = a x \bmod p$, default: $a = 1$ [EK93].

Parallel prefix on the SB-PRAM

  Global sum:   $s = s + \sum_{i=1}^{p} a_i$
  Prefix sums:  $y_j = s + \sum_{i=1}^{j-1} a_i$,  $j = 1, \dots, p$

Parallel prefix "on" shared memory location $s$:
  - virtually, by adding up all contributions $a_i$ to $s$ in sequential order
  - can be implemented using a binary tree rooted at $s$ (slightly suboptimal variant of odd-even prefix)

This binary tree is embedded in the SB-PRAM network (request paths towards the memory module hosting $s$).

[Figure: binary tree rooted at shared memory location s, labeled with partial sums and partial prefix sums.]

Multiprefix on the SB-PRAM

Multiprefix: multiple parallel prefix operations on different shared memory locations $s_1, s_2, \dots$ can be processed simultaneously.

  Shared memory before:  0xe40: 0    0xf3c: 4

  P0:  mpadd( 0xf3c, 1 );   returns 4
  P1:  mpadd( 0xe40, 2 );   returns 0
  P2:  mpadd( 0xe40, 3 );   returns 2
  P3:  mpadd( 0xf3c, 4 );   returns 5
  P4:  mpadd( 0xe40, 5 );   returns 5

  Shared memory after:   0xe40: 10   0xf3c: 9

On SB-PRAM: for integer addition, maximum, bitwise AND, bitwise OR.
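The mpadd example on the multiprefix slide can be checked with a small sequential reference model. This is only a sketch of the semantics described above (contributions to each location are added up as if in sequential processor order, and each processor receives the value it found), not of the combining network. The function name `multiprefix` and the `op` parameter are hypothetical and not part of any SB-PRAM or Fork API.

```python
import operator


def multiprefix(memory, requests, op=operator.add):
    """Sequential reference model of a multiprefix step.

    memory   -- dict mapping shared address -> current value
    requests -- list of (address, operand), one per processor, in processor order
    op       -- combining operator (assumed associative), e.g. operator.add or max

    Returns (results, new_memory): results[j] is the value processor j gets back,
    i.e. the cell content before its own contribution was applied.
    """
    new_memory = dict(memory)          # do not modify the caller's memory image
    results = []
    for address, operand in requests:  # processor order = prefix order
        old = new_memory[address]
        results.append(old)
        new_memory[address] = op(old, operand)
    return results, new_memory


# Data taken from the slide: two cells, five processors P0..P4.
memory = {0xe40: 0, 0xf3c: 4}
requests = [(0xf3c, 1), (0xe40, 2), (0xe40, 3), (0xf3c, 4), (0xe40, 5)]

results, final_memory = multiprefix(memory, requests)
print(results)                                         # [4, 0, 2, 5, 5]
print({hex(a): v for a, v in final_memory.items()})    # {'0xe40': 10, '0xf3c': 9}
```

Passing `op=max`, `operator.and_` or `operator.or_` gives the corresponding model for the other multiprefix operators listed for the SB-PRAM.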
Ranade's simulation algorithm [Ranade '87, '88, '91]

  + very simple
  + requires only constant-sized queues to store access requests
  + slowdown factor only $O(\log p)$
  + can be made optimally efficient
  + randomization employed only to distribute the shared memory across the memory modules
  + pipelined butterfly network as the underlying communication network → scalable
  + multiprefix on the fly [Ranade et al. '88]
  + cost-efficient implementation: SB-PRAM [Abolhassan/Keller/Paul '91]

Prerequisite: pipelined merging of sorted access sequences

Simpler case: 1 memory module.

[Figure: binary tree of merge processors P1 to P7, each running mergeStreams with inputs left and right and output out; the leaf-level processors P4 to P7 read the sorted arrays psrt[0] to psrt[7].]

Tree of $p - 1$ merge processors working synchronously in parallel.
Initial arrays (addresses of memory requests) sorted in increasing order.
Per global step, each comparator moves the larger operand upwards, the other waits → FIFO queues along the edges needed.
Combining: combine requests with identical addresses.

Fluent Abstract Machine

Arrange $p = F (\log F + 1)$ PRAM processors on an $F \times (\log F + 1)$ bidirectional butterfly network.

[Figure: butterfly network with columns col 0 to col log F and rows row 1 to row F; nodes are labeled by (column, row).]

Routing of messages

  - keep streams sorted by increasing addresses
  - artificial Ghost messages to keep routing in flux
  - EndOfStep messages
  - requests with identical addresses are combined

[Figure: Steps 1 to 3 of routing address streams through nodes (0,0), (1,0), (2,0) and (1,1), with combine operations and ghost messages Ghost-4 and Ghost-15.]

Simulating one PRAM time step

  Phase 1: Processor $\langle c, r \rangle$ sends the request to processor $\langle 0, r \rangle$.
  Phase 2: Processor $\langle 0, r \rangle$ sends the request to processor $\langle l, r' \rangle$.
  Phase 3: Processor $\langle l, r' \rangle$ sends the request to processor $\langle c', r' \rangle$.
  Phase 4: Processor $\langle c', r' \rangle$ sends the reply to processor $\langle l, r' \rangle$.
  Phase 5: Processor $\langle l, r' \rangle$ sends the reply to processor $\langle 0, r \rangle$.
  Phase 6: Processor $\langle 0, r \rangle$ sends the reply to processor $\langle c, r \rangle$.

Stage number of a node $s$ at level $c$ in each phase:

  Phase   Direction   stage(s)
  1       backward    $l - c$
  2       forward     $l + c$
  3       backward    $3l - c$
  4       forward     $3l + c$
  5       backward    $5l - c$
  6       forward     $5l + c$

[Figure: route of a read request through the six phases.]

Simulating several PRAM time steps

If the chosen hash function $h_1$ turns out to be bad: choose a new $h_1$ and rehash the memory in parallel, in time $O((m/p) \log p)$ with high probability.
The expected value for timeout + rehashing is to be balanced by the expected simulation time for $t$ steps.

Theorem [Ranade '87]
An arbitrary $t$-step program for a $p$-processor Combining CRCW PRAM with $t \ge m/p$ can be simulated on a $p$-processor butterfly in $O(t \log p)$ time with high probability as $p \to \infty$ and/or $t \to \infty$.
The size of the memory required at each butterfly node is $O(m/p)$.

Making Ranade's emulation algorithm cost-optimal

Efficiency of Ranade's algorithm: $\Omega(1/\log p)$.
Improvement: each physical processor simulates $\log p$ PRAM processors.
  → Phases 3 and 4 become superfluous
  → Phases 1 and 6 can be replaced by linear sorting arrays [PPP 4.2.1]

[Figure: processors (Proc.) and memory modules (Mem. Mod.) connected through the phases, with sorting arrays (Sort) in Phases 1 and 6.]

SB-PRAM: switch design

[Figure: block structure of a routing switch, with direction queue (Dir. Queue), logic and arithmetic units (Logic + Arithm.) and FIFO buffers.]
Program design flow; system software tools

  Fork95 Program            example.c
      | fcc (compiler only)
  Assembler Program         example.s
      | prass
  Object Module             example.o
      | plink
  Executable File           a.out
      | loader
  Simulator pramsim  /  SB-PRAM Machine

  Configuration file: .ldrc

pramsim [Bosch/Franziskus '94]

  pramsim [optional parameters] [executable file]

  uses file .pramsimrc

  Option             Meaning
  -M, --globmem      shared memory size in words
  -P, --progmem      program memory size in words
  -v, --virtProz     # vP per pP
  -p, --physProz     # pP
  --net-er           warning on concurrent read
  --net-ew           warning on concurrent write

  Prompt:  PRAM P0 = (p0, v0)>

  Command       Meaning
  init F V      (Re)initialize with F pP, V vP
  g             Start or continue the program
  t VAL         Trace. Execute 1 or VAL steps
  help          Print help text
  r PROC        Show all registers of vP PROC
  c VAL         Show value VAL as decimal, hex, float, bin
  d MEM         Disassemble the memory range MEM
  m MEM         Show the memory area MEM as hex
  k MEM         Show the memory area MEM as hex and ASCII
  break VAL     Set breakpoint at address VAL
  q             Quit from simulator

PRAMOS – Syscalls

SB-PRAM operating system PRAMOS [Grün/Rauber/Röhrig '95]
  - no direct support for synchronous execution at user level
  - Fork runtime system only uses the syscalls (esp., host file system I/O)

  No.  type  name          parameters (C declaration)        remark
  0    int   open          char *file, int mode, flags
  1    int   read          int fd, void *buf, int num
  2    int   write         int fd, void *buf, int num
  3    int   close         int fd
  4    int   lseek         int fd, int offst, int orign
  5    int   sys_std_open  int fd                            open stdin/-out/-err
  6    int   sys_abort     int pc, int reason
  7    int   sys_getnr                                       get my physical processor ID
  8    int   sys_exit                                        call OS program exit routine
  9    int   sys_getct                                       get the global counter
  12   int   sys_getbas                                      get BASE register
  13   void  sys_putbas    int base                          write BASE register

Self-restoring exact barrier on SB-PRAM

  _barrier:
          bmc     0                    /* continue at modulo=0     */
          getlo   -1,par2              /* load constant -1       0 */
          syncadd par2,gps,1           /* atomic decrement       1 */
  FORKLIB_SYNCLOOP:
          ldg     gps,1,r30            /* load sync cell         0 */
          getlo   1,par1               /* load constant 1        1 */
          add     r30,0,r30            /* compare sync cell      0 */
          bne     FORKLIB_SYNCLOOP     /* all procs there?       1 */
          ldg     gps,1,r30            /* sync: cmp Sync,        0 */
          syncadd par1,gps,1           /* restore sync cell      1 */
          add     r30,0,r30            /* compare with 0,        0 */
          bne     FORKLIB_SYNCHRON     /* late wave skips nops   1 */
          nop                          /* early wave delayed     0 */
          nop                          /* early wave delayed     1 */
  FORKLIB_SYNCHRON:
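The self-restoring barrier listing can be explored with a small cycle-by-cycle model. The sketch below is a simplified Python model, not SB-PRAM code, and rests on assumptions that the slides do not state: every virtual processor executes one instruction per cycle, `bmc 0` lets each processor enter on an even cycle, a load sees the cell as it was before any update issued in the same cycle, and the sync cell initially holds the number of participating processors. The instruction indices and the name `simulate_barrier` are hypothetical.

```python
# Positions of the relevant instructions in the listing, counted from the
# first instruction after "bmc 0" (hypothetical encoding for this model).
DEC, POLL, LOOP_BRANCH, RECHECK, RESTORE, SKIP_BRANCH, DONE = 1, 2, 5, 6, 7, 9, 12


def simulate_barrier(entry_cycles, max_cycles=10_000):
    """Return (exit_cycles, final_sync_cell) for processors entering the
    barrier at the given even cycles."""
    if any(c % 2 for c in entry_cycles):
        raise ValueError("model assumes entry on even cycles (bmc 0)")
    p = len(entry_cycles)
    sync_cell = p                      # assumption: cell holds the group size
    pc = [0] * p                       # per-processor instruction index
    r30 = [None] * p                   # per-processor copy of the sync cell
    exit_cycle = [None] * p
    cycle = 0
    while None in exit_cycle:
        if cycle > max_cycles:
            raise RuntimeError("barrier model did not terminate")
        delta = 0                      # syncadd updates issued in this cycle
        for i in range(p):
            if cycle < entry_cycles[i] or exit_cycle[i] is not None:
                continue
            op = pc[i]
            if op in (POLL, RECHECK):  # ldg gps,1,r30
                r30[i] = sync_cell
            elif op == DEC:            # syncadd with -1
                delta -= 1
            elif op == RESTORE:        # syncadd with +1
                delta += 1
            if op == LOOP_BRANCH and r30[i] != 0:
                pc[i] = POLL           # not everybody has arrived: poll again
            elif op == SKIP_BRANCH and r30[i] != 0:
                pc[i] = DONE           # late wave skips the two nops
            else:
                pc[i] = op + 1
            if pc[i] == DONE:
                exit_cycle[i] = cycle + 1
        sync_cell += delta             # updates become visible next cycle
        cycle += 1
    return exit_cycle, sync_cell


exits, cell = simulate_barrier([0, 2, 4, 10])
print(exits, cell)
```

In this model the polling loop takes four cycles, so processors whose entry cycles differ by 2 modulo 4 detect the zero two cycles apart. The two nops delay the earlier group, and for the entry cycles above all four processors reach `FORKLIB_SYNCHRON` in the same cycle with the cell restored to 4.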