Executing a Program on the MIT Tagged-Token Dataflow Architecture*
(This is true black magic)
Presented by: Michael Bauer
ECE 259/CPS 221, Spring Semester 2008, Dr. Lebeck
* Based on “Executing a Program on the MIT Tagged-Token Dataflow Architecture” in IEEE Transactions on Computers, March 1990

Arvind | Pai-Mei
Tagged-Token Dataflow Architecture | Five Point Palm Exploding Heart Technique
Notice they both have only one name and a special power with eight syllables.

Outline
1. Motivation
2. What is Dataflow?
3. Id
4. I-Structures
5. Tokens
6. Id Constructs
7. Putting It Together: Single Dataflow Processor
8. Supporting Multiple Processors
9. Conclusions

Motivation
- Lots of transistors available to do computation
- The von Neumann model doesn’t fully exploit available parallelism
- Parallel computing is hard; dataflow is an alternative source of parallelism
- Memory and communication latency are growing
So what exactly is dataflow?

What is Dataflow?
- Dataflow is a non-von Neumann model of computation
- No concept of instruction order (as specified by a program counter)
- No separation between memory and computation (no longer any concept of load/store ordering)
- A program is specified as a dataflow graph: the movement of data from one operation to the next
- Static dataflow specifies resources at compile time, similar to VLIW*
- Dynamic dataflow performs resource allocation at runtime (the MIT architecture is dynamic dataflow)*
* Definitions from “Dataflow Architectures and Multithreading” in IEEE Micro, August 1994.

Id
Id, ego, superego…
…not that Id
- Id is a functional programming language
- Everything is based on primitives and functions
- No objects (note this is a fundamental problem for dataflow machines)
- Id can be compiled into explicit dataflow graphs
[Figure: dataflow graph for s + A[j] * B[j], with operators as nodes and arcs carrying data between them]
- Dataflow graphs can then be executed by a dataflow machine

I-Structures
Dataflow is inherently stateless. I-structures add some sense of state to aid execution without compromising parallelism.
I-structures are composed of:
1. A tag associated with the I-structure
2. Some number of physical locations to store data
3. A label for each location specifying whether the location is ‘absent’, ‘waiting’, or ‘present’
4. A queue for each location of tokens waiting to access that particular location
The two most important things to remember:
1. I-structures can only be written to once
2. I-structures can be initialized and their tag returned without any data having been written to them; all requests block until a write occurs

Tokens
Tokens represent the propagation of data along edges in the dataflow graph.
Token format: <c.s, v>p
- c – context (specifies which “frame” the token is part of; used to resolve which dynamic invocation of a loop or function call is being referred to)
- s – address of the destination instruction
- v – the actual data
- p – which input to the operation this is (e.g. divide t1 t2)
- c.s – is called the tag of the token
Operations take two tokens and generate a new token:
op: <c.s, v1>l x <c.s, v2>r -> <c.t, (v1 op v2)>

Id Constructs (1): Conditionals
Conditionals are inherently hard for dataflow.
Instead, utilize switch statements and combine operators. Switch blocks only generate tokens on one output.
Note: these are not actual operators, just symbolic (I’m also ignoring the idea of a dataflow graph being well-behaved).

Id Constructs (2): Loops
Loops generate problems for dataflow due to their asynchronous nature.
What happens if ‘s’ tokens race ahead of ‘j’ tokens for different contexts?
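The race above cannot mix up operands, because the context c is part of every token’s tag; what it can do is flood the machine with waiting tokens. A minimal Python sketch of tag-based matching for the rule op: <c.s, v1>l x <c.s, v2>r (all names here are illustrative, not from the paper):

```python
# Sketch of dynamic-dataflow token matching. A token's tag c.s pairs a
# context c (which invocation/iteration) with a destination instruction s;
# the port says which operand slot the token fills.
from collections import namedtuple

Token = namedtuple("Token", "context dest value port")  # <c.s, v>p

class WaitMatchUnit:
    def __init__(self):
        self.waiting = {}  # tag (c, s) -> first-arriving token

    def accept(self, token):
        """Return a matched (left, right) operand pair, or None if waiting."""
        tag = (token.context, token.dest)
        if tag in self.waiting:
            partner = self.waiting.pop(tag)
            left, right = sorted((partner, token), key=lambda t: t.port)
            return (left.value, right.value)
        self.waiting[tag] = token
        return None

wmu = WaitMatchUnit()
# Tokens from loop iterations with contexts 0 and 1: because the context
# is part of the tag, an iteration-1 token racing ahead cannot match an
# iteration-0 partner -- it just sits in the waiting store.
assert wmu.accept(Token(0, "div", 10, 0)) is None
assert wmu.accept(Token(1, "div", 7, 0)) is None   # races ahead, waits
assert wmu.accept(Token(0, "div", 2, 1)) == (10, 2)
```

The sketch shows why correctness survives the race but resources may not: every early token occupies a slot in the waiting store until its partner arrives.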
Use loop throttling to control the rate at which different operations are performed.

Id Constructs (3): Functions
The architecture only supports operations with two inputs. How do you handle an n-argument function?
Idea: partial functions. Recursively use I-structures to represent (n-1)-argument functions until reaching n = 2.
How do we handle recursive function calls? Manager programs generate separate contexts and also allocate I-structures.
Aside: how do we know to release I-structures after a function call returns?
Well-behaved dataflow graphs: every operation must have a token argument on each input and must generate a token output.

Putting It Together: Single Dataflow Processor
1. Look at generated/incoming tokens and try to match them with operations in the Wait-Match Unit (WMU)
2. If all necessary tokens have arrived, fetch the instruction, any constants from memory, and data from I-structures
3. Perform the data operation and compute the tag for the next token
4. Generate the output token and forward it back to the WMU and the network
Where do you see the bottlenecks?

Supporting Multiple Processors
- Have multiple processing elements (PEs) connected by a token-passing network
- Latency can be hidden because other work can occur at a PE before messages arrive
- Problem: how do managers coordinate across PEs?
- Problem: what if the application lacks sufficient parallelism? Multi-threading, anyone?

Conclusion
Dataflow architectures are very good at exploiting parallelism. In practice they suffer from several pathologies*:
1. Associative search to match tags doesn’t scale
2. Resource allocation is difficult as the number of resources increases
3. Handling data structures/objects is very difficult (e.g. SIMD)
4. Can’t get enough memory close enough to the processor
Von Neumann machines act enough like dataflow to perform better (e.g. out-of-order execution, superscalar, branch prediction).
Maybe we can get around some of these:
- Better ways of performing tag matching
- IRAM (Patterson, 1998) to get DRAM on chip
- New languages/programming models
- Modified memory interaction (e.g. use a von Neumann memory model, as in WaveScalar)
* Taken from “Dataflow Architectures and Multithreading” in IEEE Micro, August 1994.
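For reference, the I-structure discipline described earlier (write-once locations with absent/waiting/present labels and a queue of deferred requests per location) can be sketched in a few lines of Python; all names are hypothetical:

```python
# Sketch of a single I-structure location. Each location is write-once:
# a read of an absent location is deferred on a queue of pending readers
# and answered when the one permitted write arrives.
class IStructureSlot:
    def __init__(self):
        self.state = "absent"      # 'absent' | 'waiting' | 'present'
        self.value = None
        self.deferred = []          # readers queued on this location

    def read(self, reader):
        """reader is a callback that receives the value (a result token)."""
        if self.state == "present":
            reader(self.value)
        else:
            self.deferred.append(reader)   # block until the write occurs
            self.state = "waiting"

    def write(self, value):
        if self.state == "present":
            raise RuntimeError("I-structure written twice")
        self.value = value
        self.state = "present"
        for reader in self.deferred:       # wake every blocked reader
            reader(value)
        self.deferred.clear()

slot = IStructureSlot()
results = []
slot.read(results.append)    # arrives before the write: deferred
slot.write(42)               # the single write satisfies pending reads
slot.read(results.append)    # arrives after the write: answered at once
assert results == [42, 42]
```

This is what lets an I-structure's tag be handed out before any data exists: readers simply park on the location, so producers and consumers run in parallel without compromising determinacy.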