Multi-Threaded Architectures
Sima, Fountain and Kacsuk, Chapter 16
CSE462
David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley 1997.

Memory and Synchronization Latency
The scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays. The overall solution is well known: do something else whilst waiting.
Remote memory accesses
– Much slower than local accesses
– Varying delay, depending on
• Network traffic
• Memory traffic

Processor Utilization
Utilization = P/T
– P: time spent processing
– T: total time
Equivalently, utilization = P/(P + I + S)
– I: time spent waiting on other tasks
– S: time spent switching tasks
For example, with P = 60, I = 30 and S = 10 cycles, utilization is 60/100 = 0.6; multithreading tries to fill I (and shrink S) with useful work from other threads.

Basic ideas – Multithreading
Fine grain – task switch every cycle
Coarse grain – task switch every n cycles
[Figure: execution timelines – fine grain interleaves threads cycle by cycle; coarse grain runs a thread until it blocks and pays a task-switch overhead at each switch.]

Design Space
Multi-threaded architectures span four axes:
– Computational model: von Neumann (sequential control flow); hybrid von Neumann/dataflow; parallel control flow based on parallel control operators; parallel control flow based on control tokens
– Granularity: fine grain; coarse grain
– Memory organization: physical shared memory; distributed shared memory; cache-coherent distributed shared memory
– Number of threads per processor: small (4–10); middle (10–100); large (over 100)

Classification of multi-threaded architectures
– Von Neumann based architectures: HEP; Tera; MIT Alewife & Sparcle
– Hybrid von Neumann/dataflow architectures:
• Macro dataflow architectures: MIT Hybrid Machine; McGill MGDA & SAM
• Decoupled architectures: USC; EM-4
• RISC-like architectures: P-RISC; *T

Computational Models

Sequential control flow (von Neumann)
Flow of control and data are separated.
Instructions are executed sequentially (or at least with sequential semantics – see Chapter 7).
Control flow is changed with JUMP/GOTO/CALL instructions.
Data are stored in rewritable memory – the flow of data does not affect the execution order.

Sequential Control Flow Model
R = (A - B) * (B + 1):
L1: - A B m1
L2: + B 1 m2
L3: * m1 m2 R
Control flows from L1 to L2 to L3.

Dataflow
Control is tied to data.
An instruction "fires" when its data are available – otherwise it is suspended.
The order of instructions in the program has no effect on the execution order (cf. von Neumann).
No shared rewritable memory – write-once semantics.
Code is stored as a dataflow graph; data are transported as tokens.
Parallelism occurs if multiple instructions can fire at the same time – this needs a parallel processor.
Nodes are self-scheduling.
[Figures: dataflow graph for R = (A - B) * (B + 1) – the - and + nodes may fire in either order (arbitrary execution order) or simultaneously (parallel execution).]
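The firing rule can be made concrete. Below is a minimal C sketch (not from the text; all names are illustrative) of self-scheduling nodes for R = (A - B) * (B + 1): each node buffers incoming tokens and fires only when both operand slots are filled, so the arrival order of tokens, not program order, determines execution order.

/* Minimal sketch of the dataflow firing rule for R = (A - B) * (B + 1).
 * Each node holds two operand slots and fires only when both tokens
 * have arrived; names are illustrative, not from any real machine. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    double operand[2];
    bool   present[2];   /* has a token arrived in this slot? */
    char   op;           /* '-', '+' or '*' */
} Node;

/* Deliver a token to a node; fire when both operands are present. */
static bool send_token(Node *n, int slot, double value, double *result) {
    n->operand[slot] = value;
    n->present[slot] = true;
    if (n->present[0] && n->present[1]) {
        *result = (n->op == '-') ? n->operand[0] - n->operand[1]
                : (n->op == '+') ? n->operand[0] + n->operand[1]
                :                  n->operand[0] * n->operand[1];
        return true;    /* node fired */
    }
    return false;       /* suspended until the other token arrives */
}

int main(void) {
    Node sub = {.op = '-'}, add = {.op = '+'}, mul = {.op = '*'};
    double A = 35, B = 12, t = 0, R = 0;
    /* Tokens may arrive in any order; the data drive the schedule. */
    send_token(&sub, 0, A, &t);
    send_token(&sub, 1, B, &t);        /* fires: t = A - B = 23 */
    send_token(&mul, 0, t, &R);        /* mul still waiting on slot 1 */
    send_token(&add, 0, B, &t);
    send_token(&add, 1, 1, &t);        /* fires: t = B + 1 = 13 */
    if (send_token(&mul, 1, t, &R))
        printf("R = %g\n", R);         /* (35 - 12) * (12 + 1) = 299 */
    return 0;
}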
Implementation
The dataflow model requires a very different execution engine.
– Data must be stored in a special matching store.
– Instructions must be triggered when both operands are available.
– Parallel operations must be scheduled to processors dynamically – we don't know a priori when they become available.
– Instruction operands are pointers, consisting of an instruction address and an operand number.

Dataflow model of execution
Destinations are written as instruction/operand-number pairs:
L1: compute B → L2/2, L3/1
L2: - A _ → L4/1
L3: + _ 1 → L4/2
L4: * _ _ → L6/1

Parallel Control flow
Sometimes called macro dataflow – data flows between blocks of sequential code.
Has the advantages of both dataflow and von Neumann:
• Context switch overhead is reduced
• The compiler can schedule instructions statically
• No fast matching store is needed
Requires additional control instructions – FORK/JOIN.

Macro Dataflow (Hybrid Control/Dataflow)
R = (A - B) * (B + 1):
L1: FORK L4
L2: - A B m1
L3: GOTO L5
L4: + B 1 m2
L5: JOIN 2
L6: * m1 m2 R
Control flows along two parallel paths (L2 and L4), which synchronize at the JOIN.

Issues for Hybrid dataflow
Blocks of sequential instructions need to be large enough to absorb the overheads of context switching.
Data memory is the same as in MIMD machines:
– It can be partitioned or shared
– Synchronization instructions are required • semaphores, test-and-set
Control tokens are required to synchronize threads.

Some examples

Denelcor HEP
Designed to tolerate latency in memory, with fine-grain interleaving of threads:
– The processor pipeline contains 8 stages
– Each time step a new thread enters the pipeline
– Threads are taken from the Process Status Word (PSW) queue
– After a thread is taken from the PSW queue, its instruction and operands are fetched
– When an instruction is executed, another entry is placed on the PSW queue
– Threads are interleaved at the instruction level

Denelcor HEP (memory latency)
Memory latency toleration is solved with the Scheduler Function Unit (SFU):
– Memory words are tagged as full or empty
– Attempting to read an empty word suspends the current thread; its PSW entry is moved to the SFU
– When the data is written, the entry is taken from the SFU and placed back on the PSW queue

Synchronization on the HEP
All registers have a Full/Empty/Reserved bit.
Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter.
Thread synchronization is therefore busy-wait – but other threads can run in the meantime.

HEP Architecture
[Figure: HEP processor – PSW queue, matching unit and program memory feed increment control and operand fetch (operand hands 1 and 2) over the registers; the SFU and function units 1..N connect to/from data memory.]

HEP configuration
Up to 16 processors and up to 128 data memories, connected by a high-speed switch.
Limitations:
– Threads can have only 1 outstanding memory request
– Thread synchronization puts bubbles in the pipeline
– A maximum of 64 threads causes problems for software • need to throttle loops
– If parallelism is lower than 8, full utilisation is not possible
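The HEP's full/empty tagging can be sketched in software. The following hedged C sketch emulates the semantics with pthreads (names are illustrative; the real HEP requeues the suspended thread's PSW entry in hardware rather than blocking a software thread):

/* Sketch of HEP-style full/empty-bit synchronization, emulated with
 * pthreads purely to illustrate the semantics. */
#include <pthread.h>
#include <stdio.h>

typedef struct {
    double          value;
    int             full;     /* the full/empty tag bit */
    pthread_mutex_t lock;
    pthread_cond_t  changed;
} TaggedWord;

/* Read waits until the word is full, then empties it (consuming read). */
double fe_read(TaggedWord *w) {
    pthread_mutex_lock(&w->lock);
    while (!w->full)                      /* thread suspended; others run */
        pthread_cond_wait(&w->changed, &w->lock);
    double v = w->value;
    w->full = 0;
    pthread_cond_broadcast(&w->changed);
    pthread_mutex_unlock(&w->lock);
    return v;
}

/* Write waits until the word is empty, then fills it. */
void fe_write(TaggedWord *w, double v) {
    pthread_mutex_lock(&w->lock);
    while (w->full)
        pthread_cond_wait(&w->changed, &w->lock);
    w->value = v;
    w->full = 1;
    pthread_cond_broadcast(&w->changed);
    pthread_mutex_unlock(&w->lock);
}

static TaggedWord word = { 0, 0, PTHREAD_MUTEX_INITIALIZER,
                           PTHREAD_COND_INITIALIZER };

static void *producer(void *arg) {
    (void)arg;
    fe_write(&word, 42.0);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    printf("read %g\n", fe_read(&word));  /* suspends until the write */
    pthread_join(t, NULL);
    return 0;
}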
MIT Alewife Processor
– 512 processors in a 2-dim mesh
– Sparcle processor
– Physically distributed memory, logically shared memory
– Hardware-supported cache coherence
– Hardware-supported user-level message passing
– Multi-threading

Threading in Alewife
– Coarse-grained multithreading
– The pipeline works on a single thread as long as no remote memory access or synchronization is required
– Can exploit register optimization in the pipeline
– Integrates multi-threading with hardware-supported cache coherence

The Sparcle Processor
– An extension of the SUN SPARC architecture
– Tolerant of memory latency
– Fine-grained synchronisation
– Efficient user-level message passing

Fast context switching
In SPARC: 8 overlapping register windows.
Used in Sparcle in pairs to represent 4 independent, non-overlapping contexts
– Three for user threads
– One for traps and message handlers
Each context contains 32 general-purpose registers and
– PSR (Processor State Register)
– PC (Program Counter)
– nPC (next Program Counter)
Thread states
– Active
– Loaded • state stored in registers – can become active
– Ready • not suspended and not loaded
– Suspended
Thread switching
– Fast if one thread is active and the other is loaded
– Still needs to flush the pipeline (cf. HEP)

Sparcle Architecture
[Figure: four register frames 0:R0–0:R31 … 3:R0–3:R31, each with its own PSR, PC and nPC; the CP pointer selects the active thread.]

MIT Alewife and Sparcle
[Figure: Alewife node – Sparcle processor with a 64-Kbyte cache, main memory, FPU and CMMU, linked by a 4-byte-wide path to the network router. NR = network router; CMMU = communication & memory management unit; FPU = floating-point unit.]

[Figure 16.10: Thread states in Sparcle – global register frames G0–G7 and four PC/PSR frames, with CP pointing at the active thread; loaded threads live in register frames, while unloaded threads wait in ready and suspended queues in memory.]
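A rough C sketch of why Sparcle's switch is fast, assuming the four-context organization above (the structure and names are illustrative, not Sparcle's actual mechanism): each context keeps its own registers, PSR, PC and nPC, so switching amounts to moving the context pointer to another loaded frame rather than saving and restoring state.

/* Illustrative sketch (not Sparcle code) of fast context switching among
 * four loaded hardware contexts. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t regs[32];   /* 32 general-purpose registers */
    uint32_t psr, pc, npc;
    int      loaded;     /* is this thread's state held in registers? */
} Context;

static Context frame[4];   /* 0..2: user threads, 3: traps/handlers */
static int     cp = 0;     /* context pointer: selects the active thread */

/* On a remote-memory miss, rotate to the next loaded user context.
 * No register save/restore is needed; that is what makes it fast. */
static int context_switch(void) {
    for (int i = 1; i <= 3; i++) {
        int next = (cp + i) % 3;        /* rotate over user contexts 0..2 */
        if (frame[next].loaded) { cp = next; return cp; }
    }
    return cp;   /* nothing else loaded: keep running the current thread */
}

int main(void) {
    frame[0].loaded = frame[1].loaded = frame[2].loaded = 1;
    printf("active: %d\n", cp);
    printf("after switch: %d\n", context_switch());
    printf("after switch: %d\n", context_switch());
    return 0;
}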
[Figure 16.11: Structure of a typical static dataflow PE – fetch unit, instruction queue, function units 1..N, with an activity store and update unit connected to/from other PEs.]

[Figure 16.12: Structure of a typical tagged-token dataflow PE – token queue, matching unit with matching store, fetch unit with instruction/data memory, function units 1..N, and an update unit sending tokens to other PEs.]

[Figure 16.13: Organization of the I-structure storage – each data-storage slot carries presence bits (A = absent, P = present, W = waiting); a waiting slot holds deferred read tags (e.g. tag X, tag Z, tag Y) instead of a datum, and a present slot holds the datum.]

[Figure 16.14(a, b): Coding in explicit token-store architectures – tokens <35, <FP, IP>> and <12, <FP, IP>> arrive at the - node, which fires and emits <23, <FP, IP+1>> and <23, <FP, IP+2>> to the + and * nodes.]

[Figure 16.14(c): Instruction memory at IP holds SUB 2 (+1, +2), ADD 3 (+2), MUL 4 (+7); frame memory slots FP+2..FP+4 hold presence bits and matched operands (35; 23; 23), so an instruction fires when the presence bit shows its partner operand has already arrived.]

[Figure 16.15: Structure of a typical explicit token-store dataflow PE – a fetch unit computes an effective address, presence bits in the frame memory control the frame-store operation, and function units 1..N feed a form-token unit that exchanges tokens with other PEs.]

[Figure 16.16: Scale of hybrid von Neumann/dataflow architectures – from pure dataflow through macro dataflow and decoupled hybrid dataflow to RISC-like hybrid and pure von Neumann.]

[Figure 16.17: Structure of a typical macro dataflow PE – matching unit, instruction/frame memory, fetch unit and token queue feeding a function unit with an internal control pipeline (program-counter-based sequential execution) and a form-token unit, connected to/from other PEs.]

[Figure 16.18: Organization of a PE in the MIT Hybrid Machine – PC and FBR drive instruction fetch from instruction memory; an enabled-continuation queue and frame memory (token queue) supply decode and operand fetch; the execution unit works on registers and talks to/from global memory.]

[Figure 16.19: Comparison of (a) SQ and (b) SCB macro nodes built from instructions l1–l6.]

[Figure 16.20: Structure of the USC Decoupled Architecture – clusters pair dataflow graph engines (DFGE) with computation engines (CE) through ready and acknowledge queues (RQ, AQ); graph controllers (GC) and cluster controllers (CC) connect the cluster graph memory to the network's graph and computation virtual spaces.]

[Figure 16.21: Structure of a node in the SAM – APU, SEU, ASU and LEU around the main memory, with fire/done signalling, connected to/from the network.]
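Figure 16.13's presence bits imply a simple protocol: a read of an absent element is deferred, and the single write releases all deferred readers. A hedged C sketch of these write-once semantics follows (illustrative names, not any machine's actual implementation):

/* Sketch of I-structure semantics (cf. Figure 16.13): reads of an
 * absent slot are deferred (state W); the single write satisfies them. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { ABSENT, PRESENT, WAITING } Presence;

typedef struct Reader { int tag; struct Reader *next; } Reader;

typedef struct {
    Presence state;
    double   datum;
    Reader  *deferred;   /* reader tags queued while the slot is empty */
} ISlot;

/* Read: return 1 with the value if present, otherwise defer the reader. */
int istructure_read(ISlot *s, int tag, double *out) {
    if (s->state == PRESENT) { *out = s->datum; return 1; }
    Reader *r = malloc(sizeof *r);
    r->tag = tag;
    r->next = s->deferred;
    s->deferred = r;
    s->state = WAITING;
    return 0;            /* reader suspended until the write arrives */
}

/* Write-once: store the datum and release all deferred readers. */
void istructure_write(ISlot *s, double v) {
    if (s->state == PRESENT) { fprintf(stderr, "double write!\n"); exit(1); }
    s->datum = v;
    for (Reader *r = s->deferred; r; ) {
        Reader *n = r->next;
        printf("resume reader %d with %g\n", r->tag, v);
        free(r);
        r = n;
    }
    s->deferred = NULL;
    s->state = PRESENT;
}

int main(void) {
    ISlot k = { ABSENT, 0, NULL };
    double v;
    if (!istructure_read(&k, 1, &v))   /* deferred: slot still empty */
        istructure_write(&k, 3.14);    /* resumes reader 1 */
    return 0;
}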
[Figure 16.22: Structure of the P-RISC processing element – local memory holds instructions and frames; instruction fetch and operand fetch feed an internal control pipeline (a conventional RISC processor) with load/store function units; the operand store, a token queue and a start unit exchange messages with other PEs' memories.]

[Figure 16.23: Transformation of (a) a dataflow graph into (b) a control flow graph – the +, - and * nodes become sequential threads linked by explicit fork L1 and join operations.]

[Figure 16.24: Structure of a *T node – a network interface with message formatter and message queues; a remote-memory-request coprocessor and a synchronization coprocessor (sIP, sFP, sV1, sV2) feed the data processor (dIP, dFP, dV1, dV2) through a continuation queue of <IP, FP> pairs, all sharing the local memory.]

David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison Wesley 1997.
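To close, a hedged C sketch of the macro dataflow example R = (A - B) * (B + 1) in its fork/join control-flow form (cf. the FORK/JOIN slide and Figure 16.23), with pthreads standing in for the FORK and JOIN instructions; the comments mirror labels L1–L6 of that slide, and this is an illustration rather than code from the book:

/* Fork/join (macro dataflow) version of R = (A - B) * (B + 1). */
#include <pthread.h>
#include <stdio.h>

static double A = 35, B = 12, m1, m2;

static void *thread_L4(void *arg) {   /* forked path: m2 = B + 1 */
    (void)arg;
    m2 = B + 1;
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_L4, NULL);  /* L1: FORK L4 */
    m1 = A - B;                                 /* L2: m1 = A - B */
    pthread_join(t, NULL);                      /* L5: JOIN 2 */
    printf("R = %g\n", m1 * m2);                /* L6: R = m1 * m2 */
    return 0;
}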