
The Maxeler AppGallery Dictionary
Version 12/12/2015
Prepared by: Dj. Pesic, D. Veljovic, G. Gaydadjiev, N. Korolija, V. Milutinovic (only for Maxeler internal information)
This dictionary defines all relevant terms found in two different sets of Maxeler documents: the Maxeler Application Gallery
(AppGallery.Maxeler.com) and the Maxeler Tutorial Set. First, it concentrates on the essence and performance advantages
of the Maxeler dataflow approach. Second, it reviews the support technologies that enable the dataflow approach to achieve
its full potential. Third, a one-paragraph definition is given for each relevant term. Fourth, for easy navigation of the
dictionary, an index is provided at the end of the document.
A. Introduction
As a rule, every paradigm-shift idea passes through four phases in its lifetime. In phase #1, the idea is
ridiculed (people laugh at it). In phase #2, the idea is attacked (some people aggressively try to destroy it). In
phase #3, the idea is accepted (most of those who were attacking it now claim that it was their idea, or at least
that they were always supportive of it). In the final phase #4, the idea is regarded as something that has existed forever
(those who played roles in the initial phases are already dead, physically or professionally, by the time the fourth phase starts).
The main question for each paradigm-shift research effort is how to make the first two phases as short as possible. In the
rest of this text, this goal is referred to as the "New Paradigm Acceptance Acceleration" goal, or the NPAA goal, for short.
The Maxeler Application Gallery project and the Maxeler Tutorial Set project are an attempt to achieve the NPAA goal in
the case of dataflow supercomputing. The dataflow paradigm exists on several levels of computing abstraction.
The Maxeler Application Gallery project concentrates on the dataflow approach that accelerates critical loops by forming
customized execution graphs that map onto a reconfigurable infrastructure (currently FPGA-based). This approach
provides considerable speedups over existing control flow approaches, unprecedented power savings, as well as a
significant size reduction of the overall supercomputer. This said, however, dataflow supercomputing is still not widely
accepted, due to all kinds of barriers, ranging from the NIH (Not Invented Here) syndrome, through 2G2BT (Too Good To
Be True), to the AOC (Afraid Of Change) syndrome widely present in the high-performance community.
The Maxeler Tutorial Set project concentrates on the educational aspects, using an approach that provides the conditions for a
fast learning curve. It focuses on issues like the compiler, the operating system, programming of kernels, programming of the
manager, programming for network applications, debugging, etc.
This dictionary collects and defines the terms found in the AppGallery and in the Tutorial Set that are not commonly
used, or are new to a dataflow programmer or user.
B. The Dataflow Supercomputing Paradigm Essence
At the time when the von Neumann paradigm of computing was formed, the technology was such that the ratio of
arithmetic or logic (ALU) operation latencies to communication (COMM) delays to memory or to another processor,
t(ALU)/t(COMM), was extremely large (sometimes argued to be approaching infinity).
In his famous lecture notes on computing, the Nobel Laureate Richard Feynman observed that, in theory,
ALU operations could be done with zero energy, while communications can never reach zero energy levels, and that the speed
and energy of computing could be traded off against each other. In other words, this means that, in practice, future technologies will be
characterized by an extremely large t(COMM)/t(ALU) (in theory, t(COMM)/t(ALU) approaching infinity), which is exactly
the opposite of the situation at the time of von Neumann. That is why a number of pioneers in dataflow
supercomputing agreed to use the term Feynman Paradigm for the approach utilized by Maxeler computing systems.
Feynman never worked in dataflow computing, but his observations (along with his involvement in the Connection Machine
design) led many to believe in the great future of the dataflow computing paradigm.
Obviously, when computing technology is characterized by an extremely large t(COMM)/t(ALU), control flow machines
of the multi-core type (like Intel) or the many-core type (like NVidia) can never be as fast as dataflow machines like
Maxeler, for one simple reason: the buses of control flow machines will never be of zero length, while many edges of
an execution graph can easily be made zero-length.
The main technology-related question now is: "Where is the ratio of t(ALU) and t(COMM) now?"
According to the above-mentioned sources, the energy needed to perform an IEEE floating-point double-precision multiplication
will be only about 10pJ around the year 2020, while the energy needed to move the result from one core to another core of a
multi-core or many-core machine is about 2000pJ, which represents a factor of 200x. On the other hand, moving the same result
over an almost-zero-length edge in a Maxeler dataflow machine may, in some cases, take less than 2pJ, which represents a
factor of 0.2x.
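Restating the figures above as energy ratios (the notation E for energy per operation is introduced here only for illustration, in the style of the t(ALU)/t(COMM) ratio used earlier):

    E(COMM, core-to-core) / E(ALU) = 2000pJ / 10pJ = 200
    E(COMM, graph edge)   / E(ALU) =    2pJ / 10pJ = 0.2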
Therefore, the time has arrived for the technology to enable the dataflow approach to be effective. Here the term
technology refers to the combination of: the hardware technology; the programming paradigm and its support technologies;
the compiler technology; and the code analysis, development, and testing technology. More light is shed on these
technologies in the next section.
C. The Dictionary
Multiscale DataFlow Programming - Maxeler's Multiscale Dataflow Computing is a combination of traditional synchronous
dataflow, vector, and array processors. It exploits loop-level parallelism in a spatial, pipelined way, where large streams of
data flow through a sea of arithmetic units, connected to match the structure of the compute task.
DataFlow Engine (DFE) - The DataFlow Engine is the main element of the Multiscale DataFlow Programming paradigm.
Data streams from memory into the processing chip; within a DFE chip, data are forwarded directly from one arithmetic unit
(DataFlow Core) to another, until the chain of processing is complete. Once a dataflow program has processed its streams of
data, the dataflow engine can be reconfigured for a new application in less than a second. An FPGA alone is not enough
to provide a self-contained and reusable computation resource. It requires logic to connect the device to the host, RAM for
bulk storage, interfaces to other buses and interconnects, and circuitry to service the device. This complete system is called
a DFE or a Dataflow Engine: http://www.harness-project.eu/.
MaxCompiler - The MaxCompiler generates dataflow implementations that can then be called from the CPU via the SLiC
interface. The SLiC (Simple Live CPU) interface is an automatically generated interface to the dataflow program, making it
easy to call dataflow engines from the attached CPUs. The high-performance dataflow computing systems are fully
programmable using the general-purpose MaxCompiler programming tool suite. Any accelerated application runs on a
Maxeler node as a standard Linux executable. Programmers can write new applications using existing dataflow engine
configurations, by linking the dataflow library file into their code and then calling simple function interfaces. To create
applications exploiting new dataflow engine configurations, MaxCompiler allows an application to be split into three parts:
(a) Kernel(s), which implement the computational components of the application in hardware. (b) Manager configuration,
which connects Kernels to the CPU, engine RAM, other Kernels, and other dataflow engines via MaxRing. (c) CPU
application, which interacts with the dataflow engines, to read and write data to the Kernels and engine RAM. MaxCompiler
includes tools to support all three steps: the Kernel Compiler, the Manager Compiler, and a software library (accessible
from C or Fortran) for bridging between hardware and software. Programmers develop kernels by writing programs in Java.
However, using the tools requires only minimal familiarity with Java. Maxeler provides MaxIDE, an Eclipse-based
development environment to maximize programmer productivity. Once written, MaxCompiler transforms user kernels into
low-level hardware and generates a hardware dataflow implementation (the .max file), which the developer can link into
their CPU application using the standard GNU development tool chain. MaxCompiler provides complete support for
debugging during the development cycle, including a high-speed simulator for verifying code correctness before generating
a hardware implementation and the MaxDebug tool for examining the state of running chips.
MaxelerOS - MaxelerOS is the operating system that connects the Maxeler infrastructure with the programming interface.
Besides the functionality provided by typical operating systems, it also provides an appropriate interface for accessing the
DataFlow engines. The overall system is managed by MaxelerOS, which sits within Linux and also within the Dataflow
Engine's manager. MaxelerOS manages data transfer and dynamic optimization at runtime.
Tick - The Tick represents the beginning of a Maxeler cycle.
Cycle - The Cycle represents a Maxeler clock cycle. This cycle usually lasts one order of magnitude longer than the clock cycle of
today's computers, but a Maxeler machine often does two to four orders of magnitude more computations per cycle than today's
computers.
MaxIDE - MaxIDE is the Eclipse-based Maxeler integrated development environment. It provides the functionality necessary
for programming Maxeler Kernels and the Maxeler Manager, but also the C code.
WebIDE - The WebIDE is the Maxeler web-based integrated development environment. It offers a subset of the
functionality offered by MaxIDE, but makes it possible to work with the integrated development environment
from remote computers, via the web.
Kernel - The Kernel represents the code written in a Java-like language (MaxJ), extended with functionality for matching software
variables with the hardware beneath.
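For illustration, a minimal kernel sketch in the style of the Maxeler tutorials is given below; it assumes the MaxCompiler v2 Kernel API (io.input, io.output, DFEVar), and the class name and the stream names "x" and "y" are chosen here only for illustration:

    import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
    import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
    import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

    public class SquareKernel extends Kernel {
        public SquareKernel(KernelParameters parameters) {
            super(parameters);
            // One value of the input stream "x" arrives per tick.
            DFEVar x = io.input("x", dfeFloat(8, 24));
            // The multiplication becomes one arithmetic unit (DataFlow Core) on the chip.
            DFEVar result = x * x;
            // The result leaves on the output stream "y", one value per tick.
            io.output("y", result, dfeFloat(8, 24));
        }
    }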
Manager - The Manager describes the connections between kernels, and the streams connecting kernels to off-chip I/O channels
such as the CPU interconnect (e.g., PCI Express, Infiniband), the inter-DFE Maxeler proprietary connections named MaxRing, and
the DRAM memory (LMem). Just as C++ programmers divide the functionality of programs into classes, Maxeler
programmers divide the functionality of the accelerated application into kernels. The Maxeler manager is used for connecting
kernels. It also specifies the interface between the code written in the supported high-level language (e.g., C++) and the kernels.
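A minimal manager sketch, assuming the MaxCompiler v2 standard Manager API and reusing the illustrative SquareKernel from the Kernel entry above:

    import com.maxeler.maxcompiler.v2.build.EngineParameters;
    import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
    import com.maxeler.maxcompiler.v2.managers.standard.Manager;
    import com.maxeler.maxcompiler.v2.managers.standard.Manager.IOType;

    public class SquareManager {
        public static void main(String[] args) {
            Manager manager = new Manager(new EngineParameters(args));
            Kernel kernel = new SquareKernel(manager.makeKernelParameters("SquareKernel"));
            manager.setKernel(kernel);
            // Connect all kernel streams directly to the CPU (no LMem, no MaxRing in this sketch).
            manager.setIO(IOType.ALL_CPU);
            manager.createSLiCinterface();
            manager.build();
        }
    }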
CPUCode - The CPUCode is the code written in the supported high-level language (e.g., C++). Compared to an application
that was not originally meant to run on Maxeler DataFlow engines, this code has the parts that should be executed on DataFlow
engines replaced with the appropriate calls to Maxeler. These include initializing the Maxeler DataFlow engines, setting up
scalar variables, and, of course, starting the execution on the DataFlow engines and synchronization. The CPUCode contains
files that interact with the EngineCode using SLiC.
EngineCode - The EngineCode includes one or more Kernels and one Manager. Details of kernels and the Manager are
described elsewhere. EngineCode contains files written in MaxJ.
MaxJ - MaxJ is the Maxeler DataFlow Programming language, an extension of the Java programming language, for
describing the data choreography within a DFE.
DataFlow Core (DFC) - The DataFlow Core computes only a single type of arithmetic operation (for example, an addition
or multiplication) and is thus simple, so thousands can fit on one dataflow engine.
Stream - The Stream represents a continuous flow of data. For example, if a for loop consisting of mutually independent
iterations is implemented using Maxeler DataFlow engines, a kernel will execute one iteration of the loop. The input for
each of the iterations is given to the kernel in a separate clock cycle, forming a stream of data. The acceleration of
algorithm execution is based on streams of data.
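As a sketch of how such a loop maps onto a stream, consider the classic three-point moving average from the Maxeler tutorials: stream.offset reads a neighboring element of the stream, replacing the array indexing x[i-1] and x[i+1] of the original CPU loop (class and stream names are illustrative, and boundary handling is omitted):

    import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
    import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
    import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

    public class MovingAverageKernel extends Kernel {
        public MovingAverageKernel(KernelParameters parameters) {
            super(parameters);
            DFEVar x = io.input("x", dfeFloat(8, 24));
            // The values of "x" one tick earlier and one tick later.
            DFEVar prev = stream.offset(x, -1);
            DFEVar next = stream.offset(x, 1);
            // One iteration of the loop body, executed for a new element every tick.
            DFEVar result = (prev + x + next) / 3;
            io.output("y", result, dfeFloat(8, 24));
        }
    }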
Multiscale Dataflow Computing - Multiscale Dataflow Computing is a combination of traditional synchronous
dataflow, vector, and array processors. Maxeler exploits loop-level parallelism in a spatial, pipelined way, where large
streams of data flow through a sea of arithmetic units, connected to match the structure of the compute task. Small on-chip
memories form a distributed register file, with as many access ports as needed to support a smooth flow of data through the
chip. Multiscale Dataflow Computing employs dataflow on multiple levels of abstraction: the system level, the architecture
level, the arithmetic level, and the bit level. On the system level, multiple dataflow engines are connected to form a
supercomputer. On the architecture level, Maxeler decouples memory access from arithmetic operations, while the
arithmetic and the bit levels provide opportunities to optimize the representation of the data and to balance computation with
communication.
SLiC (Simple Live CPU) - The SLiC interface is an automatically generated interface to the dataflow program, making it
easy to call dataflow engines from attached CPUs.
MaxCompiler Scheduling - MaxCompiler Scheduling refers to the scheduling of the execution on DataFlow engines.
The MaxCompiler is responsible for the timing constraints, making sure that each DataFlow engine will have the correct data at
its input at the moment of processing. One key advantage of dataflow computing is that one can estimate the performance
of a dataflow implementation before actually implementing it, thanks to the static scheduling.
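As an illustration of such an estimate (a commonly used first-order model, not a formula taken from the Maxeler documents): for a fully pipelined kernel that accepts one stream element per tick, processing N elements through a pipeline of depth D at clock frequency f takes approximately

    T = (N + D) / f

so for large N the running time approaches N/f, regardless of how complex the loop body is.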
MaxRing - The MaxRing is the inter-DFE Maxeler proprietary connection. The architecture of a Maxeler acceleration
system often comprises DataFlow engines attached directly to local memories and to a CPU, but in case more
computational power is needed, multiple DataFlow engines may be connected together via the high-bandwidth MaxRing
interconnect.
MPC-X - The MPC-X (https://www.maxeler.com/) is a series of dataflow nodes providing multiple dataflow engines as
shared resources on the network, allowing them to be used by applications running anywhere in a cluster. "Dataflow as a
shared resource" brings the efficiency benefits of virtualization to dataflow computing, maximizing utilization in a multi-user
and multi-application environment such as a private or public Cloud service or an HPC cluster. The CPU nodes can utilize
as many DFEs as are required for a particular application, and they release the DFEs for use by other nodes when not running
computation, ensuring that all cluster resources are optimally balanced at runtime. Individual MPC-X nodes provide large
memory capacities (up to 768GB) and compute performance equivalent to dozens of conventional x86 servers. The MPC-X
series enables remote access to DFEs by providing dual FDR/QDR Infiniband connectivity combined with a unique RDMA
technology that provides direct transfers from CPU node memory to remote dataflow engines, without inefficient memory
copies. Client machines can run standard Linux, using the Maxeler software. Maxeler's software automatically manages the
resources within the MPC-X nodes, including the dynamic allocation of dataflow engines to CPU threads and processes, and
balances demands on the cluster at runtime to maximize the overall performance. With a simple Linux RPM installation,
any CPU server in the cluster can be quickly upgraded to begin benefiting from dataflow computing.
MPC-X2000 - The MPC-X2000 provides eight MAX4 (maia) dataflow engines with power consumption comparable to a
single high-end server. Each dataflow engine is accessible by any CPU client machine via the Infiniband network, while
multiple engines within the same MPC-X node can also communicate directly using the dedicated high-speed MaxRing
interconnect.
Latency - The Latency represents the time needed for a kernel to produce the first result, measured from the moment the
first input is given.
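In terms of the pipeline structure, this is, to a first approximation (the numbers below are chosen here only as an example):

    Latency = D / f

where D is the pipeline depth in stages and f is the clock frequency; e.g., D = 100 stages at f = 200MHz gives a latency of 0.5 microseconds.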
LMem (Large Memory) - The LMem is a DRAM memory of large capacity (in the Maxeler technology of 2015, up to
48GB). A Dataflow Engine needs to communicate with its LMem (Large Memory, GBs of off-chip memory), the CPUs, and
other DFEs. The Manager, in a dataflow program, describes the choreography of data movement between the DFEs, the
connected CPUs, and the GBs of data in LMem.
FMem (Fast Memory) - The FMem is an on-chip Static RAM (SRAM) that can hold several MBs of data. DFEs provide
two basic kinds of memory: FMem and LMem. Off-chip LMem (Large Memory) is implemented using DRAM technology
and can hold many GBs of data. The key to efficient dataflow implementations is to choreograph the data movements to
maximize the reuse of data while it is on the chip, and to minimize the movement of data in and out of the chip.
Additional terms, with their page references (MDP.n):
Sea of Arithmetic Units (MDP.1)
Data Choreography (MDP.2)
Computing in Time (MDP.2)
Computing in Space (MDP.2)
Linux (MDP.4)
.max (MDP.7)
API (MDP.7)
Header File (MDP.8)
Basic Static SLiC Interface (MDP.8, MDP.12)
Engine Interface (MDP.9)
Advanced Static SLiC Interface (MDP.1)
SLiCcompile Tool (MDP.10)
simutils Directory (MDP.11)
Python Skin (MDP.11)
Streams in Python Skin (MDP.12)
Nested Python Lists (MDP.12)
NumPy Arrays (MDP.12)
R Skin (MDP.13)
R Vector (MDP.13)
R Array (MDP.13)
Installer Files (MDP.14)
SLiC Interface Levels (MDP.18)
Advanced Dynamic SLiC Interface (MDP.18)
.maxj (MDP.20)
Profiling Tools (MDP.22)
Ticks/Second (MDP.25) - The Tick is a unit of time in a DFE; the speed of movement through a dataflow pipeline is expressed in Ticks/Second.
Bandwidth (MDP.25)
Conditionals in Dataflow Computing (MDP.28)
Global Conditionals (MDP.28)
Local Conditionals (MDP.28)
Conditional Loops (MDP.28)
Ternary-If Operator (?:) (MDP.28)
Java Compilation (MDP.31)
Java Run-Time (MDP.31)
Kernel Graph (MDP.39)
MaxCompiler Java Library (MDP.39)
Constructor in Java (MDP.40)
io.input Method (MDP.40)
Simulation Watches (MDP.53, MDP.54)
Simulation Printf (MDP.53)
DFE Printf (MDP.53)
CSV Format (MDP.54)
Java Expression Field (MDP.54)
Debug Directory (MDP.55)
Debug Output (MDP.55)
Graphical Debugger (MDP.60)
Code Snippet (MDP.61)
Stream Status Block (MDP.62, MDP.63, MDP.65)
MaxDebug (MDP.64)
Variables in a Dataflow Program (MDP.71)
Primitive DFE Types (MDP.72)
Composite DFE Types (MDP.72, MDP.75)
Casting (MDP.72, MDP.74)
DFERawBits (MDP.73)
Dataflow Operators (MDP.80)
stream.offset (MDP.87)
Multiplexer (MDP.93)
Static Offset (MDP.89, MDP.94)
Variable Offset (MDP.89, MDP.94)
Dynamic Offset (MDP.89, MDP.93, MDP.94)
Offset Expression (MDP.90)
Stream Hold (MDP.95)
Simple Counters (MDP.103)
Advanced Counter (MDP.104, MDP.105)
Engine Groups (MDP.115)
Sharing Mode (MDP.115)
High Performance Atomic Execution (MDP.116)
Engine Loads (MDP.117)
Engine Interface Parameters (MDP.117, MDP.124)
Error Contexts (MDP.129)
Debug Directories (MDP.132)
SLiC Installer (MDP.132)
BRAM (MDP.143)
Standard Manager (MDP.150)
MAX-UP (MDP.163) - Maxeler University Program.
D. The Index