The Maxeler AppGallery Dictionary

Version 12/12/2015
Prepared by: Dj. Pesic, D. Veljovic, G. Gaydadjiev, N. Korolija, V. Milutinovic (only for Maxeler internal information)

This dictionary defines all relevant terms found in two different sets of Maxeler documents: the Maxeler Application Gallery (AppGallery.Maxeler.com) and the Maxeler Tutorial Set. First, it concentrates on the essence and performance advantages of the Maxeler dataflow approach. Second, it reviews the support technologies that enable the dataflow approach to achieve its maximum. Third, a one-paragraph definition is given for each relevant term. Fourth, for easy navigation through the dictionary, an index is provided at the end of the document.

A. Introduction

A rule of thumb is that each and every paradigm-shift idea passes through four phases in its lifetime. In phase #1, the idea is ridiculed (people laugh at it). In phase #2, the idea is attacked (some people aggressively try to destroy it). In phase #3, the idea is accepted (most of those who were attacking it now tell everyone that it was their idea, or at least that they were always supportive of it). In the final phase #4, the idea is considered as something that has existed forever (those who played roles in the initial phases are already dead, physically or professionally, by the time the fourth phase starts).

The main question for each paradigm-shift research effort is how to make the first two phases as short as possible. In the rest of this text, this goal is referred to as the "New Paradigm Acceptance Acceleration" goal, or the NPAA goal for short. The Maxeler Application Gallery project and the Maxeler Tutorial Set project are an attempt to achieve the NPAA goal in the case of dataflow supercomputing. The dataflow paradigm exists on several levels of computing abstraction.
The Maxeler Application Gallery project concentrates on the dataflow approach that accelerates critical loops by forming customized execution graphs that map onto a reconfigurable infrastructure (currently FPGA-based). This approach provides considerable speedups over existing control-flow approaches, unprecedented power savings, and a significant size reduction of the overall supercomputer. That said, dataflow supercomputing is still not widely accepted, due to all kinds of barriers, ranging from the NIH syndrome (Not Invented Here), through the 2G2BT syndrome (Too Good To Be True), to the AOC syndrome (Afraid Of Change), all widely present in the high-performance community.

The Maxeler Tutorial Set project concentrates on the educational aspects, using an approach that provides the conditions for a fast learning curve. It focuses on issues such as the compiler, the operating system, programming of kernels, programming of the manager, programming for network applications, debugging, etc. This dictionary collects and defines the terms found in the AppGallery and in the Tutorial Set that are not commonly used, or are new to a dataflow programmer or user.

B. The Dataflow Supercomputing Paradigm Essence

At the time when the von Neumann paradigm for computing was formed, the technology was such that the ratio of arithmetic/logic (ALU) operation latencies over the communication (COMM) delays to memory or another processor, t(ALU)/t(COMM), was extremely large (sometimes argued to be approaching infinity). In his famous lecture notes on computing, the Nobel Laureate Richard Feynman presented the observation that, in theory, ALU operations could be done with zero energy, while communication can never reach zero energy levels, and that the speed and energy of computing could be traded against each other.
In other words, this means that in practice, future technologies will be characterized by an extremely large t(COMM)/t(ALU) (in theory, t(COMM)/t(ALU) approaching infinity), which is exactly the opposite of what was the case in the times of von Neumann. That is why a number of pioneers in dataflow supercomputing accepted the term Feynman Paradigm for the approach utilized by Maxeler computing systems. Feynman never worked in dataflow computing, but his observations (along with his involvement with the Connection Machine design) led many to believe in the great future of the dataflow computing paradigm.

Obviously, when computing technology is characterized by an extremely large t(COMM)/t(ALU), control-flow machines of the multi-core type (like Intel) or the many-core type (like NVidia) can never be as fast as dataflow machines like Maxeler, for one simple reason: the buses of control-flow machines will never be of zero length, while many edges of the execution graph can easily be made of zero length.

The main technology-related question now is: "Where is the ratio of t(ALU) and t(COMM) now?" According to the above-mentioned sources, the energy needed to do an IEEE floating-point double-precision multiplication will be only 10pJ around the year 2020, while the energy needed to move the result from one core to another core of a multi-core or many-core machine is 2000pJ, which represents a factor of 200x. On the other hand, moving the same result over an almost-zero-length edge in a Maxeler dataflow machine may, in some cases, take less than 2pJ, which represents a factor of 0.2x.

Therefore, the time has arrived for the technology to enable the dataflow approach to be effective. Here the term technology refers to the combination of: the hardware technology; the programming paradigm and its support technologies; the compiler technology; and the code analysis, development, and testing technology.
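The two factors quoted above follow from trivial arithmetic. The sketch below simply restates the energy figures from this section (10pJ, 2000pJ, 2pJ) and derives both ratios:

```python
# Energy figures quoted in the text above (projected ~2020 technology).
E_MULT = 10            # pJ: one IEEE double-precision floating-point multiplication
E_CORE_TO_CORE = 2000  # pJ: moving the result between cores of a control-flow machine
E_DFE_EDGE = 2         # pJ: moving the result over an almost-zero-length dataflow edge

# Communication cost relative to the arithmetic operation itself:
print(E_CORE_TO_CORE / E_MULT)  # 200.0 -> the 200x factor
print(E_DFE_EDGE / E_MULT)      # 0.2  -> the 0.2x factor
```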
More light is shed on these technologies in the next section.

C. The Dictionary

Multiscale DataFlow Programming - Maxeler's Multiscale Dataflow Computing is a combination of traditional synchronous dataflow, vector, and array processors. It exploits loop-level parallelism in a spatial, pipelined way, where large streams of data flow through a sea of arithmetic units, connected to match the structure of the compute task.

DataFlow Engine (DFE) - The DataFlow Engine is the main element of the Multiscale DataFlow Programming paradigm. Data streams from memory into the processing chip; within a DFE chip, data are forwarded directly from one arithmetic unit (DataFlow Core) to another, until the chain of processing is complete. Once a dataflow program has processed its streams of data, the dataflow engine can be reconfigured for a new application in less than a second. An FPGA alone is not enough to provide a self-contained and reusable computation resource. It requires logic to connect the device to the host, RAM for bulk storage, interfaces to other buses and interconnects, and circuitry to service the device. This complete system is called a DFE, or Dataflow Engine: http://www.harness-project.eu/.

MaxCompiler - The MaxCompiler generates dataflow implementations that can then be called from the CPU via the SLiC interface. The SLiC (Simple Live CPU) interface is an automatically generated interface to the dataflow program, making it easy to call dataflow engines from the attached CPUs. The high-performance dataflow computing systems are fully programmable using the general-purpose MaxCompiler programming tool suite. Any accelerated application runs on a Maxeler node as a standard Linux executable. Programmers can write new applications using existing dataflow engine configurations by linking the dataflow library file into their code and then calling simple function interfaces.
To create applications exploiting new dataflow engine configurations, MaxCompiler allows an application to be split into three parts: (a) Kernel(s), which implement the computational components of the application in hardware; (b) the Manager configuration, which connects Kernels to the CPU, engine RAM, other Kernels, and other dataflow engines via MaxRing; and (c) the CPU application, which interacts with the dataflow engines to read and write data to the Kernels and engine RAM. MaxCompiler includes tools to support all three steps: the Kernel Compiler, the Manager Compiler, and a software library (accessible from C or Fortran) for bridging between hardware and software.

Programmers develop kernels by writing programs in Java; however, using the tools requires only minimal familiarity with Java. Maxeler provides MaxIDE, an Eclipse-based development environment, to maximize programmer productivity. Once a kernel is written, MaxCompiler transforms it into low-level hardware and generates a hardware dataflow implementation (the .max file), which the developer can link into the CPU application using the standard GNU development tool chain. MaxCompiler provides complete support for debugging during the development cycle, including a high-speed simulator for verifying code correctness before generating a hardware implementation, and the MaxDebug tool for examining the state of running chips.

MaxelerOS - MaxelerOS is the operating system that connects the Maxeler infrastructure with the programming interface. Besides the functionality provided by typical operating systems, it also provides the appropriate interface for accessing the DataFlow engines. The overall system is managed by MaxelerOS, which sits within Linux and also within the Dataflow Engine's manager. MaxelerOS manages data transfer and dynamic optimization at runtime.

Tick - The Tick represents the beginning of a Maxeler cycle.

Cycle - The Cycle represents a Maxeler clock cycle.
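The three-part split described above can be pictured with a plain-Python analogy. All names below are illustrative, not the actual MaxCompiler or SLiC API: the kernel does the per-element arithmetic, the manager wires the streams to the kernel, and the CPU application drives the whole computation.

```python
def kernel(x):
    # Kernel: the computational component (one loop iteration, done in hardware).
    return 2 * x + 1

def manager(input_stream):
    # Manager: connects the input stream to the kernel, collects the output stream.
    return [kernel(v) for v in input_stream]

def cpu_application():
    # CPU application: prepares data, invokes the engine, reads back the results.
    data = [1, 2, 3, 4]
    return manager(data)

print(cpu_application())  # [3, 5, 7, 9]
```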
This cycle usually lasts one order of magnitude longer than the cycle of today's computers, but a Maxeler machine often does two to four orders of magnitude more computations per cycle than today's computers.

MaxIDE - MaxIDE is the Eclipse-based Maxeler integrated development environment. It provides the necessary functionality for programming Maxeler Kernels and the Maxeler Manager, but also the C code.

WebIDE - The WebIDE is the Maxeler web-based integrated development environment. It offers a subset of the functionality offered by MaxIDE, but provides the possibility of working with the integrated development environment from remote computers, via the web.

Kernel - The Kernel represents the code written in a Java-like language, extended with functionality for matching software variables with the hardware beneath.

Manager - The Manager describes the connections between kernels, and the streams connecting kernels to off-chip I/O channels such as the CPU interconnect (e.g., PCI Express, Infiniband), the inter-DFE Maxeler proprietary connections named MaxRing, and the DRAM memory (LMem). As C++ programmers divide the functionality of programs into classes, Maxeler programmers divide the functionality of the accelerated application into kernels. The Maxeler manager is used for connecting kernels. It also specifies the interface between the code written in the supported high-level language (e.g., C++) and the kernels.

CPUCode - The CPUCode is code written in the supported high-level language (e.g., C++). Compared to an application that was not originally meant to run on Maxeler DataFlow engines, this code has the part that should be executed on DataFlow engines replaced with the appropriate calls to Maxeler. These include initializing Maxeler DataFlow engines, setting up scalar variables, and, of course, starting the execution on DataFlow engines and synchronization. The CPUCode contains files that interact with the EngineCode using SLiC.
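The trade-off stated in the Cycle entry above, a clock roughly one order of magnitude slower but orders of magnitude more computation per cycle, can be made concrete with purely illustrative numbers (none of these figures are Maxeler specifications):

```python
cpu_freq_hz = 3e9         # a 3 GHz control-flow core (illustrative)
dfe_freq_hz = 3e8         # a DFE clock one order of magnitude slower
cpu_ops_per_cycle = 4     # a few operations per cycle on a CPU core
dfe_ops_per_cycle = 4000  # thousands of concurrently active arithmetic units

cpu_throughput = cpu_freq_hz * cpu_ops_per_cycle
dfe_throughput = dfe_freq_hz * dfe_ops_per_cycle
print(dfe_throughput / cpu_throughput)  # 100.0 -> net 100x despite the slower clock
```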
EngineCode - The EngineCode includes one or more Kernels and one Manager. Details of the kernels and the Manager are described elsewhere. The EngineCode contains files written in MaxJ.

MaxJ - MaxJ is the Maxeler DataFlow Programming language, an extension of the Java programming language for describing the data choreography within a DFE.

MaxRing - The MaxRing is the inter-DFE Maxeler proprietary connection.

DataFlow Core (DFC) - The DataFlow Core computes only a single type of arithmetic operation (for example, an addition or multiplication) and is thus simple, so thousands can fit on one dataflow engine.

Stream - The Stream represents a continuous flow of data. For example, if a for loop consisting of mutually independent iterations is implemented using Maxeler DataFlow engines, a kernel will execute one iteration of the loop. The input for each of the iterations is given to the kernel in a separate clock cycle, forming a stream of data. The acceleration of algorithm execution is based on streams of data.

Multiscale Dataflow Computing - Multiscale Dataflow Computing is a combination of traditional synchronous dataflow, vector, and array processors. Maxeler exploits loop-level parallelism in a spatial, pipelined way, where large streams of data flow through a sea of arithmetic units, connected to match the structure of the compute task. Small on-chip memories form a distributed register file, with as many access ports as needed to support a smooth flow of data through the chip. Multiscale Dataflow Computing employs dataflow on multiple levels of abstraction: the system level, the architecture level, the arithmetic level, and the bit level. On the system level, multiple dataflow engines are connected to form a supercomputer. On the architecture level, Maxeler decouples memory access from arithmetic operations, while the arithmetic and bit levels provide opportunities to optimize the representation of the data and to balance computation with communication.
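The Stream entry above describes how a for loop with independent iterations becomes a stream flowing through a kernel, one input element per clock cycle. A conceptual sketch in plain Python (the names are illustrative, not MaxJ):

```python
def kernel_body(a, b):
    # One iteration of the original for loop, now the kernel's computation.
    return a * a + b

# Each clock cycle feeds the kernel the next element of every input stream.
a_stream = [1.0, 2.0, 3.0]
b_stream = [0.5, 0.5, 0.5]
out_stream = [kernel_body(a, b) for a, b in zip(a_stream, b_stream)]
print(out_stream)  # [1.5, 4.5, 9.5]
```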
SLiC (Simple Live CPU) - The SLiC interface is an automatically generated interface to the dataflow program, making it easy to call dataflow engines from the attached CPUs.

MaxCompiler Scheduling - MaxCompiler Scheduling refers to the scheduling of the execution on DataFlow engines. The MaxCompiler is responsible for the timing constraints, making sure that each DataFlow engine will have the correct data at its input at the moment of processing. One key advantage of dataflow computing is that one can estimate the performance of a dataflow implementation before actually implementing it, thanks to the static scheduling.

MaxRing - The MaxRing is the inter-DFE Maxeler proprietary connection. The architecture of a Maxeler acceleration system often comprises DataFlow engines attached directly to local memories and to a CPU, but in case more computation power is needed, multiple DataFlow engines may be connected together via the high-bandwidth MaxRing interconnect.

MPC-X - The MPC-X (https://www.maxeler.com/) is a series of dataflow nodes providing multiple dataflow engines as shared resources on the network, allowing them to be used by applications running anywhere in a cluster. "Dataflow as a shared resource" brings the efficiency benefits of virtualization to dataflow computing, maximizing utilization in a multiuser and multi-application environment such as a private or public Cloud service or an HPC cluster. The CPU nodes can utilize as many DFEs as are required for a particular application, and they release DFEs for use by other nodes when not running computation, ensuring all cluster resources are optimally balanced at runtime. Individual MPC-X nodes provide large memory capacities (up to 768GB) and compute performance equivalent to dozens of conventional x86 servers.
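Because the schedule is static, as the MaxCompiler Scheduling entry above notes, a performance estimate needs nothing more than the stream length, the pipeline depth, and the clock frequency. The numbers below are illustrative assumptions, not measured Maxeler values:

```python
n_points = 10_000_000  # elements in the input stream (assumed)
pipeline_depth = 120   # ticks from first input to first output (assumed fill latency)
dfe_freq_hz = 200e6    # assumed DFE clock frequency

# One result leaves the pipeline per tick once the pipeline is full.
total_ticks = n_points + pipeline_depth - 1
runtime_s = total_ticks / dfe_freq_hz
print(round(runtime_s, 3))  # 0.05 -> roughly 50 ms for the whole stream
```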
The MPC-X series enables remote access to DFEs by providing dual FDR/QDR Infiniband connectivity combined with a unique RDMA technology that provides direct transfers from CPU node memory to remote dataflow engines, without inefficient memory copies. Client machines can run standard Linux, using the Maxeler software. Maxeler's software automatically manages the resources within the MPC-X nodes, including the dynamic allocation of dataflow engines to CPU threads and processes, and balances the demands on the cluster at runtime to maximize the overall performance. With a simple Linux RPM installation, any CPU server in the cluster can be quickly upgraded to begin benefiting from dataflow computing.

MPC-X2000 - The MPC-X2000 provides eight MAX4 (maia) dataflow engines with a power consumption comparable to a single high-end server. Each dataflow engine is accessible by any CPU client machine via the Infiniband network, while multiple engines within the same MPC-X node can also communicate directly using the dedicated high-speed MaxRing interconnect.

Latency - The Latency represents the time needed for a kernel to produce the first result, measured from the moment the first input is given.

LMem (Large Memory) - The LMem is a DRAM memory of large capacity (in the Maxeler technology of 2015, up to 48GB). A Dataflow Engine needs to communicate with its LMem (Large Memory, GBs of off-chip memory), the CPUs, and other DFEs. The Manager, in a dataflow program, describes the choreography of data movement between the DFEs, the connected CPUs, and the GBs of data in LMem.

FMem (Fast Memory) - The FMem is an on-chip Static RAM (SRAM) which can hold several MBs of data. DFEs provide two basic kinds of memory: FMem and LMem. FMem (Fast Memory) is an on-chip SRAM that can hold several MBs of data. Off-chip LMem (Large Memory) is implemented using DRAM technology and can hold many GBs of data.
The key to efficient dataflow implementations is to choreograph the data movements so as to maximize the reuse of data while it is on the chip, and to minimize the movement of data in and out of the chip.

Sea of Arithmetic Units (MDP.1)
Data Choreography (MDP.2)
Computing in Time (MDP.2)
Computing in Space (MDP.2)
Linux (MDP.4)
.max (MDP.7)
API (MDP.7)
Header File (MDP.8)
Basic Static SLiC Interface (MDP.8, MDP.12)
Engine Interface (MDP.9)
Advanced Static SLiC Interface (MDP.1)
SLiCcompile Tool (MDP.10)
simutils Directory (MDP.11)
Python Skin (MDP.11)
Streams in Python Skin (MDP.12)
Nested Python Lists (MDP.12)
NumPy Arrays (MDP.12)
R Skin (MDP.13)
R Vector (MDP.13)
R Array (MDP.13)
Installer Files (MDP.14)
SLiC Interface Levels (MDP.18)
Advanced Dynamic SLiC Interface (MDP.18)
.maxj (MDP.20)
Profiling Tools (MDP.22)
Ticks/Second (MDP.25) - The Tick is a unit of time in a DFE; the speed of movement through a dataflow pipeline is expressed in Ticks/Second.
Bandwidth (MDP.25)
Conditionals in Dataflow Computing (MDP.28)
Global Conditionals (MDP.28)
Local Conditionals (MDP.28)
Conditional Loops (MDP.28)
Ternary-If Operator (?:) (MDP.28)
Java Compilation (MDP.31)
Java Run-Time (MDP.31)
Kernel Graph (MDP.39)
MaxCompiler Java Library (MDP.39)
Constructor in Java (MDP.40)
io.input Method (MDP.40)
Simulation Watches (MDP.53, MDP.54)
Simulation Printf (MDP.53)
DFE Printf (MDP.53)
CSV Format (MDP.54)
Java Expression Field (MDP.54)
Debug Directory (MDP.55)
Debug Output (MDP.55)
Graphical Debugger (MDP.60)
Code Snippet (MDP.61)
Stream Status Block (MDP.62, MDP.63, MDP.65)
MaxDebug (MDP.64)
Variables in a Dataflow Program (MDP.71)
Primitive DFE Types (MDP.72)
Composite DFE Types (MDP.72, MDP.75)
Casting (MDP.72, MDP.74)
DFERawBits (MDP.73)
Dataflow Operators (MDP.80)
stream.offset (MDP.87)
Multiplexer (MDP.93)
Static Offset (MDP.89, MDP.94)
Variable Offset (MDP.89, MDP.94)
Dynamic Offset (MDP.89, MDP.93, MDP.94)
Offset Expression (MDP.90)
Stream Hold (MDP.95)
Simple Counters (MDP.103)
Advanced Counter (MDP.104, MDP.105)
Engine Groups (MDP.115)
Sharing Mode (MDP.115)
High Performance Atomic Execution (MDP.116)
Engine Loads (MDP.117)
Engine Interface Parameters (MDP.117, MDP.124)
Error Contexts (MDP.129)
Debug Directories (MDP.132)
SLiC Installer (MDP.132)
BRAM (MDP.143)
Standard Manager (MDP.150)
MAX-UP (MDP.163) - Maxeler University Program.

D. The Index